Jake Loomis is a VP at Yahoo!, the most visited Web site in the USA: over 600 million users worldwide, 300k+ servers, 100 PB+ of storage. The Yahoo! home page is the entry point, at around 40,000 requests per second. When Yahoo! went down, it made a lot of news. Outages at Amazon or PlayStation have business consequences.
Tip #1: Redundancy for everything
Redundancy goes all the way down to the code and the application features. Understanding your system's failure points is very important: everything can and will fail at some point. Plan redundancy for servers, for network devices, for a colo losing power, for the user database. An ad lookup can fail; the page itself can fail completely (show a static page instead).
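To make the "show a static page instead" idea concrete, here is a minimal sketch of graceful degradation in Python. It is not how Yahoo! does it: the function names, the two callables and the static page content are all hypothetical, chosen only for illustration.

```python
from typing import Callable

# Hypothetical pre-built page served when the dynamic page cannot be rendered.
STATIC_PAGE = "<html><body>Static fallback home page</body></html>"


def render_home_page(fetch_news: Callable[[], str], fetch_ad: Callable[[], str]) -> str:
    """Render the page, degrading step by step instead of failing outright."""
    try:
        news = fetch_news()
    except Exception:
        # The core content is unavailable: serve the static page instead.
        return STATIC_PAGE
    try:
        ad = fetch_ad()
    except Exception:
        # A failed ad lookup must not take the page down: render without the ad.
        ad = ""
    return f"<html><body>{news}{ad}</body></html>"
```

The point of the design is that each dependency has its own failure path, so a non-essential feature never decides whether the user gets a page.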
Tip #2: Error-proofing change
Make changes in a safe environment. Practice how you play: also test in an environment that is very close to production. The QA environment has to be treated the same way as the production environment: same logs, same way of catching issues, etc. Fork production traffic into the staging environment. We did that for Le Grand Club at RDS. You have to be able to recover quickly.
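Forking production traffic into staging can be sketched as a wrapper around the production handler. This is only an illustration under assumptions: the handler signatures and the `Request` type are hypothetical, and a real setup would usually mirror traffic asynchronously (or at the load balancer) so staging cannot slow production down.

```python
from typing import Callable, Dict

Request = Dict[str, str]  # hypothetical minimal request representation
Handler = Callable[[Request], str]


def make_mirroring_handler(production: Handler, staging: Handler) -> Handler:
    """Return a handler that serves from production and copies each request to staging."""

    def handle(request: Request) -> str:
        response = production(request)
        try:
            # Send a copy so staging cannot mutate the production request.
            staging(dict(request))
        except Exception:
            # Staging failures are expected while testing; never surface them to users.
            pass
        return response

    return handle
```

The user only ever sees the production response; the staging result is discarded, and only staging's own logs and monitoring are of interest.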
Tip #3: Global Load balancing
Route traffic to different places in the world, with the ability to serve any market from any data center. If your site depends on a single colo, you have a problem. Being close to the user also offloads traffic and takes care of part of the performance issues.
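The routing decision behind "serve any market from any data center" can be sketched as: pick the closest healthy data center, and fall back to the next closest when one is down. The `DataCenter` class, the market names and the latency figures below are made up for illustration; real global load balancing is typically done with DNS or anycast rather than application code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataCenter:
    name: str
    healthy: bool
    latency_ms: Dict[str, float]  # hypothetical measured latency per market


def pick_data_center(market: str, data_centers: List[DataCenter]) -> DataCenter:
    """Choose the healthy data center with the lowest latency for a market."""
    candidates = [dc for dc in data_centers if dc.healthy]
    if not candidates:
        raise RuntimeError("no healthy data center available")
    # A data center with no measurement for this market is treated as farthest away.
    return min(candidates, key=lambda dc: dc.latency_ms.get(market, float("inf")))
```

With this shape, losing a colo only changes which data center answers; the market is still served.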
Tip #4: Monitor everything
There are huge traffic peaks you can prepare for, such as a big planned event like a royal wedding, but some news breaks unexpectedly. Both can put stress on the infrastructure. At Pheromone, we planned ahead for the days when NHL players are traded: unplugging some features, etc.
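For the unplanned peaks, monitoring has to notice the spike by itself. A minimal sketch, assuming a made-up `TrafficMonitor` class that compares each new request-rate sample with the average of the last few samples (the window size and the factor of 2 are arbitrary choices, not figures from the talk):

```python
from collections import deque


class TrafficMonitor:
    """Flag a sample as a spike when it exceeds a multiple of the recent average."""

    def __init__(self, window: int = 5, spike_factor: float = 2.0) -> None:
        self.samples = deque(maxlen=window)  # most recent requests-per-second samples
        self.spike_factor = spike_factor

    def record(self, requests_per_second: float) -> bool:
        # Compute the baseline before adding the new sample to the window.
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(requests_per_second)
        if baseline is None:
            return False  # nothing to compare with yet
        return requests_per_second > self.spike_factor * baseline
```

A real monitoring stack would track many more signals (errors, latency, capacity); this only shows the baseline comparison.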
Tip #5: Fallback plans in case of failure
What is happening, and how do you handle accidents? Isolate failures so they do not impact users. You may want to drop features to add capacity. Learn from failures and follow up.
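Dropping features to add capacity can be sketched as a small load-shedding rule: switch off optional features, least essential first, until the estimated load fits the capacity. The feature names and their costs below are invented for the example.

```python
from typing import List, Tuple

# Hypothetical optional features, least essential first, with the load each one adds.
OPTIONAL_FEATURES: List[Tuple[str, float]] = [
    ("recommendations", 300.0),
    ("comments", 200.0),
    ("live_scores", 100.0),
]


def features_to_drop(current_load: float, capacity: float) -> List[str]:
    """Return the features to disable so the estimated load fits the capacity."""
    dropped: List[str] = []
    for name, cost in OPTIONAL_FEATURES:
        if current_load <= capacity:
            break
        dropped.append(name)
        current_load -= cost  # disabling the feature frees its share of the load
    return dropped
```

For example, `features_to_drop(1450.0, 1000.0)` returns `["recommendations", "comments"]`: dropping the first feature is not enough, dropping the second one is. If the load still exceeds capacity after every optional feature is off, the function returns all of them and something else has to give.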
Bonus Tip: Talent
“Having people who are talented is key.” Well… a truism.