We recently migrated our server infrastructure onto a variety of Amazon Web Services, the first step in laying the foundation for massively scalable growth. Having survived this initial transition, I wanted to reflect on how we got here, some things to consider for those following a similar path, and some of what's to come.
Slicehost
The original myGengo service launched on a single Slicehost.com "slice" - their nomenclature for a VPS. The upgrade flexibility and full-control root access were basically what sold us at the time. We launched with one server for pretty much everything - web (Apache), database (MySQL), assets (HTML/CSS, etc.). Even the staging environment was hosted on the same machine to ensure the running environment was identical (!!).
The original system infrastructure went undisturbed for about a year. Our first significant change after that was to create a separate assets server - a completely different machine and domain - allowing browsers to capitalize on simultaneous connections more efficiently, sometimes referred to as 'parallelizing downloads'. It's one of the steps Yahoo! recommends for performance optimization. Since assets don't need cookies, serving them from a separate domain also means browsers don't send cookies with those requests, which potentially reduces some overhead (though admittedly only if you're serving massive traffic).
DB Server
As traffic and translation volume increased, our next significant change came when we moved the database onto its own server. This is probably a common step for nascent start-ups that were unsure of traffic volume when launching. “Detaching” the database server from the web-server allowed us to beef up and optimize the database machine and not worry as much about the web / Apache setup. The separation was also a step towards a load-balanced "front-end" independent of the data it'll be serving.
At this point it became clear that the best ROI to optimize the system would come from rewriting some of the queries, so we spent a good few weeks logging and re-coding core SQL statements. In some cases we managed to speed up integral queries by over 8000%! That's both an indication of the growing pains a start-up goes through and the increase in volume of transactions we were seeing. As developers who've had the opportunity to experience massive growth in short periods of time can attest, initial coding decisions that are made to be efficient for a certain volume of traffic don't always carry to higher volumes.
Moving to EC2
At any rate, after running on Slicehost for two years it was time to transition to an infrastructure that supported essentially just-in-time server scalability - à la Amazon's EC2. Keeping with the separate web and DB setup, we beefed up the database server by adding more memory, created a secondary slave machine that replicated the master DB, and offloaded some of the processor-intensive activities such as DB backups onto the slave. On the web-serving side, we added another instance for a current total of two web-servers, and both are fed traffic from an Elastic Load Balancer (ELB). Finally, we created a nearly identical setup for our staging environment (with only one web-instance and no slave DB), including its own ELB even though it fronts just one web-instance. That matters because having an ELB in front significantly affects how a system runs.
Let me share a couple of considerations we encountered along the way.
Cron Jobs
With two (and in the future quite possibly more) web-servers, there's a question of where to run cron jobs. One approach is to assign one instance to do them all; another is to divide the scheduling across the number of instances. We ultimately decided to let the master DB handle the scheduling of cron jobs (for now).
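For illustration only, here is a minimal Python sketch of the second approach mentioned above, dividing jobs across instances. It is not the setup described in this post (the master DB does the scheduling for now). The instance names, job names, and the functions `owner_for_job` and `should_run` are all hypothetical. The idea is that every instance has the same crontab, and each job checks whether it "owns" the run before doing any work.

```python
import hashlib


def owner_for_job(job_name, instances):
    """Pick exactly one instance for a job using a stable hash of its name.

    Assumption: every instance knows the same list of instance names.
    Sorting makes the result independent of the order of that list.
    """
    if not instances:
        raise ValueError("at least one instance is required")
    ordered = sorted(instances)
    digest = hashlib.sha256(job_name.encode("utf-8")).hexdigest()
    return ordered[int(digest, 16) % len(ordered)]


def should_run(job_name, this_instance, instances):
    """Return True only on the instance that owns the given job."""
    return owner_for_job(job_name, instances) == this_instance


if __name__ == "__main__":
    # Hypothetical names, purely for demonstration.
    web_instances = ["web-1", "web-2"]
    for job in ["send-digest", "rotate-logs"]:
        print(job, "->", owner_for_job(job, web_instances))
```

One trade-off with this kind of scheme: when an instance is added or removed, some jobs change owner, so it suits jobs that are safe to run from any machine.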
Log Aggregation
Each of the web-instances generates its own logs, which can make it difficult to debug issues if you need to review Apache logs (for instance) - i.e., "which log has the error in it?" So we've set up the slave DB instance to pull and aggregate specific system logs. This is also the case for all our other service-related logs, such as for calls to our API [link], etc.
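To make the "which log has the error in it?" problem concrete, here is a rough Python sketch of merging per-instance access logs into one time-ordered stream. It assumes the logs have already been pulled to one machine and use the bracketed timestamp of Apache's common log format (e.g. `[10/Oct/2010:13:55:36 +0000]`). The host names and the `merge_logs` helper are hypothetical and not the actual aggregation scripts.

```python
import heapq
import re
from datetime import datetime

# Matches the first [...] group, which holds the timestamp in common log format.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def parse_time(line):
    """Extract the timezone-aware timestamp from one access-log line."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no timestamp found in line: %r" % line)
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def tagged(host, lines):
    """Yield (timestamp, host, line) tuples for one host's log."""
    for line in lines:
        yield parse_time(line), host, line.rstrip("\n")


def merge_logs(logs_by_host):
    """Merge several logs, each already in time order, into one stream.

    Each output line is prefixed with the host it came from.
    """
    streams = [tagged(host, lines) for host, lines in logs_by_host.items()]
    for _, host, line in heapq.merge(*streams):
        yield "%s %s" % (host, line)
```

Because `heapq.merge` is lazy, this works on open file handles as well as lists, so large logs do not need to fit in memory.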
IP-dependent issues
In a few specific cases our code needs to know where a request is coming from. Due to the load-balancer, the web-instances now think the load-balancer is making all requests. So we’ve had to code some logic to account for this discrepancy.
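As a sketch of what such logic can look like: for HTTP/HTTPS listeners, an ELB passes the original client address along in the `X-Forwarded-For` request header, appending it to any value already present. The Python function below walks that header from the right and returns the first address that is not a trusted proxy. The `client_ip` name and the `10.0.0.0/8` range in the usage example are assumptions for illustration, not the actual code or network layout.

```python
import ipaddress


def client_ip(remote_addr, forwarded_for, trusted_networks):
    """Best-effort original client address behind a load balancer.

    remote_addr      -- address of the direct peer (the balancer, usually)
    forwarded_for    -- raw X-Forwarded-For header value, or None
    trusted_networks -- CIDR strings for proxies we trust (an assumption)
    """
    trusted = [ipaddress.ip_network(net) for net in trusted_networks]

    def is_trusted(addr):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False  # unparseable entries are treated as untrusted
        return any(ip in net for net in trusted)

    # Only believe the header when the direct peer is a trusted proxy;
    # otherwise anyone could spoof it.
    if not forwarded_for or not is_trusted(remote_addr):
        return remote_addr

    hops = [hop.strip() for hop in forwarded_for.split(",") if hop.strip()]
    # The rightmost entries were added by our own proxies, so scan from
    # the right and stop at the first address we do not control.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return remote_addr
```

For example, `client_ip("10.0.1.5", "203.0.113.7", ["10.0.0.0/8"])` returns `"203.0.113.7"`, while a request arriving directly from an untrusted peer keeps its own `remote_addr` regardless of what the header claims.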
What's Next
This recent server migration is a significant step in our move towards an efficiently scalable infrastructure for the high volume of translation content we hope to handle this coming year. There is still much room for progress, including a better asset management framework and a CDN. And of course, not all scalability problems are solved just through hardware :)
I would be remiss not to acknowledge the team players who've helped in this significant migration. In particular, @fvbock for spending an entire Saturday with me!
Outline: How we did it
myGengo first launched on Slicehost
- one server at first; for both production and staging
- created secondary assets server; different domain to capitalize on browsers' simultaneous connections
- separated DB server from web server; beefier DB server
- optimized queries
Big transition to AWS
- load balancer in front of two small instances for web
- :80 / :443 à la ELB
- large instance for master DB
- small instance for slave DB
- striped RAID 0 EBS storage on DB instances
- backups on slave; also off-site backups on server outside AWS environment
Things to consider:
- cron jobs and where to call them
- log aggregation and storage
- IP-reliant issues due to web-instances seeing calls from ELB's internal IP
- e.g. letting PayPal IPNs through basic auth on staging
- scripts for deploying instances and deploying code
- sticky sessions (up to 1 hr) for now due to local image rendering
Next steps: better assets caching; CDN
- go non-sticky sessions
- can't fix everything with hardware of course