Friday, October 8, 2010

Etsy.com opens the kimono and talks frankly about outages

You know that when John Allspaw (VP of Ops at Etsy, previously Manager of Operations at Flickr and Infrastructure Architect at Friendster) is involved, you're going to get a unique perspective on things. A few weeks ago Etsy.com was down. John (and his operations department) decided it would be a good opportunity to take what I'll call an "outage bankruptcy" and basically reset expectations. In an extremely detailed and well-thought-out post (titled "Frank Talk about Site Outages") he describes the entire end-to-end process that goes into managing uptime at Etsy. I recommend reading the entire post, but I thought it would be useful to point out the things we can all take away from the experience of one of the most well-respected operations people in the industry:


Metrics
"Today, we gather a little over 30,000 metrics, on everything from CPU usage, to network bandwidth, to the rate of listings and re-listings done by Etsy sellers. Some of those metrics are gathered every 20 seconds, 24 hours a day, 365 days a year. About 2,000 metrics will alert someone on our operations staff (we have an on-call rotation) to wake up in the middle of the night to fix a problem."
Takeaway: Capture data on every part of your infrastructure, and later decide which metrics are leading indicators of problems. He goes on to talk about the importance of external monitoring (outside of your firewall) to measure the actual end-user experience.
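To make that concrete, here's a minimal sketch of "capture everything, alert on a small subset." It assumes a Graphite-style plaintext collector; the host name, metric path, threshold, and 4-core assumption are all hypothetical, and nothing here is taken from Etsy's actual tooling.

```python
import socket
import time

# Hypothetical collector and threshold -- not Etsy's actual setup.
CARBON_HOST, CARBON_PORT = "graphite.example.internal", 2003
ALERT_THRESHOLD = 0.90  # only a small subset of metrics should ever page someone

def read_cpu_usage():
    """Return a rough CPU load fraction (stubbed via /proc/loadavg for this sketch)."""
    with open("/proc/loadavg") as f:
        one_minute_load = float(f.read().split()[0])
    return min(one_minute_load / 4.0, 1.0)  # assumes a 4-core machine

def emit(metric_path, value, timestamp):
    """Send one datapoint using Graphite's plaintext protocol: 'path value timestamp'."""
    line = f"{metric_path} {value:.4f} {int(timestamp)}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode())

while True:
    now = time.time()
    cpu = read_cpu_usage()
    emit("servers.web01.cpu_usage", cpu, now)
    if cpu > ALERT_THRESHOLD:
        # In practice this would go through an on-call/alerting system, not a print().
        print(f"ALERT: cpu_usage={cpu:.2f} exceeded {ALERT_THRESHOLD}")
    time.sleep(20)  # the post mentions some metrics are gathered every 20 seconds
```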
Communication
"When we have an outage or issue that affects a measurable portion of the site’s functionality, we quickly group together to coordinate our response. We follow the same basic approach as most incident response teams. We assign some people to address the problem and others to update the rest of the staff and post to http://etsystatus.com to alert the community. Changes that are made to mitigate the outage are largely done in a one-at-a-time fashion, and we track both our time-to-detect as well as our time-to-resolve, for use in a follow-up meeting after the outage, called a “post-mortem” meeting. Thankfully, our average time-to-detect is on the order of 2 minutes for any outages or major site issues in the past year. This is mostly due to continually tuning our alerting system."
Takeaway: Two important points here. First, communication and collaboration are key to successfully managing issues. Second, and even more interesting, is the need for two teams: one to address the problem and one to communicate status updates both internally and externally. This is often the missing piece at companies, where no updates go out because everyone is busy fixing the problem.
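As a small illustration of the two measurements mentioned in the quote, here's a sketch of tracking time-to-detect and time-to-resolve per incident. The field names and sample timestamps are my own, purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Incident:
    """Minimal incident record for the two measurements mentioned in the quote."""
    started_at: datetime                      # when the problem actually began
    detected_at: Optional[datetime] = None    # when alerting (or a human) noticed it
    resolved_at: Optional[datetime] = None    # when the site was declared stable again
    status_updates: list = field(default_factory=list)  # what was posted publicly, and when

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.detected_at

# Hypothetical incident, for illustration only.
incident = Incident(
    started_at=datetime(2010, 10, 4, 3, 10),
    detected_at=datetime(2010, 10, 4, 3, 12),
    resolved_at=datetime(2010, 10, 4, 4, 5),
)
print(incident.time_to_detect)   # 0:02:00 -- the ~2 minute average Etsy reports
print(incident.time_to_resolve)  # 0:53:00
```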
Post-Mortems
"After any outage, we meet to gather information about the incident. We reconstruct the time-line of events; when we knew of the outage, what we did to fix it, when we declared the site to be stable again. We do a root cause analysis to characterize why the outage happened in the first place. We make a list of remediation tasks to be done shortly thereafter, focused on preventing the root cause from happening again. These tasks can be as simple as fixing a bug, or as complex as putting in new infrastructure to increase the fault-tolerance of the site. We document this process, for use as a reference point in measuring our progress."
Takeaway: Fixing the problem and getting back online is not enough. Make it an automatic habit to schedule a postmortem to do a deep dive into the root cause(s) of the problem, and address not only the immediate bugs but also the deeper issues that led to the root cause. The Five Why's can help here, as can the Lean methodology of investing a proportional number of hours into the most problematic parts of the infrastructure.
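The Five Why's exercise is easy to sketch. The chain of questions and answers below is invented for illustration; it is not taken from Etsy's actual post-mortems.

```python
# An invented Five Why's chain, purely illustrative.
five_whys = [
    ("Why did the site go down?",
     "The primary database ran out of connections."),
    ("Why did it run out of connections?",
     "A new feature held connections open while waiting on a slow query."),
    ("Why was the query slow?",
     "It was missing an index on a large table."),
    ("Why wasn't the missing index caught before the deploy?",
     "The staging dataset is too small to expose slow queries."),
    ("Why is the staging dataset so small?",
     "There is no process for refreshing it from production."),
]

for depth, (question, answer) in enumerate(five_whys, start=1):
    print(f"{depth}. {question}\n   -> {answer}")

# The last answer is the deeper, process-level issue a post-mortem should turn
# into a remediation task -- not just the immediate bug fix (adding the index).
```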
Single Point of Failure Reduction
"As Etsy has grown from a tiny little start-up to the mission-critical service it is today, we’ve had to outgrow some of our infrastructure. One reason we have for this evolution is to avoid depending on single pieces of hardware to be up and running all of the time. Servers can fail at any time, and Etsy.com should be able to keep working if a single server dies. To do that, we have to put our data in multiple places, keep them in sync, and make sure our code can route around any individual failures.
So we’ve been working a lot this year to reduce those “single points of failure,” and to put in redundancy as fast as we safely can. Some of this means being very careful (paranoid) as we migrate data from the single instances to multiple or replicated instances. As you can imagine, it’s a bit of a feat to move that volume of data around while still seeing a peak of 15 new listings per second, all the while not interrupting the site’s functionality."
Takeaway: Reduce single points of failure incrementally. Do what you can in the time you have.
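As one concrete (and heavily simplified) illustration of "route around any individual failures," here's a sketch of a read that falls back across replicas. The host names and the stubbed query function are hypothetical.

```python
import random

# Hypothetical replica pool -- the point is that a read survives any single host dying.
REPLICAS = ["db-replica-01", "db-replica-02", "db-replica-03"]

class ReplicaUnavailable(Exception):
    pass

def query_host(host, sql):
    """Stub standing in for a real database call; raises if the host is 'down'."""
    if random.random() < 0.3:  # simulate a failed server
        raise ReplicaUnavailable(host)
    return f"rows from {host}"

def read_with_failover(sql):
    """Try the replicas in a shuffled order and route around individual failures."""
    last_error = None
    for host in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return query_host(host, sql)
        except ReplicaUnavailable as err:
            last_error = err  # in real life: log it, mark the host unhealthy, move on
    raise RuntimeError("all replicas unavailable") from last_error

print(read_with_failover("SELECT title FROM listings LIMIT 1"))
```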
Change Management and Risk
"For every type of technical change, we have answers to questions like:
  • What problem does the change solve?
  • Has this kind of change happened before? Is there a successful history?
  • When is the change going to start? When is it expected to end?
  • What is the expected effect of this change on the Etsy community? Is a downtime required for the change?
  • What is the rollback plan, if something goes wrong?
  • What test is needed to make sure that the change succeeded?
As with all change, the risk involved and the answers to these questions are largely dependent on the judgment of the person at the helm. At Etsy, we believe that if we understand the likely failures, and if there’s a plan in place to fix any unexpected issues, we’ll make progress.
Just as important, we also track the results of changes. We have an excellent history with respect to the number of successful changes. This is a good record that we plan on keeping."
Takeaway: Be prepared for failure by anticipating worst-case scenarios for every change. Be ready to roll back and respond. More importantly, make sure to track when things go right so you have a realistic measure of risk.
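A change record that answers the questions above can be as simple as a required-fields check. Everything in this sketch (the field names and the example change) is hypothetical, not Etsy's actual process.

```python
# Hypothetical change record mirroring the questions in the quoted checklist.
change = {
    "summary": "Migrate the listings table to a replicated cluster",
    "problem_solved": "Removes a single point of failure on the primary database",
    "precedent": "Same migration pattern was used successfully for another table",
    "start": "2010-10-12 02:00 EDT",
    "expected_end": "2010-10-12 04:00 EDT",
    "community_impact": "No downtime expected; listings may lag by a few seconds",
    "rollback_plan": "Repoint reads at the old primary, which keeps replicating",
    "verification_test": "Row counts match and the listing flow passes a smoke test",
}

# Refuse to schedule the change until every question has an answer.
missing = [key for key, value in change.items() if not value]
if missing:
    raise ValueError(f"change is not ready; missing answers for: {missing}")
print("Change is ready to schedule, with a rollback plan on hand.")
```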
Other takeaways:
  • Declaring "outage bankruptcy" is not the ideal approach. But it is better than simply going along without any authentic communication with your customers throughout a period of instability. Your customers will understand, if you act human.
  • Etsy has been doing a great job keeping customers up to date at http://etsystatus.com/.
  • A glance at the comments on the page shows a few upset customers, but a generally positive response.

Wednesday, October 6, 2010

Foursquare gets transparency

Early Monday morning of this week, Foursquare went down hard.

11 hours later, the #caseofthemondays was over and they were back online. Throughout those 11 hours, users had one of the following experiences:


1. When visiting foursquare.com, they saw:

[screenshot of foursquare.com during the outage]

2. When using the iPhone/Android/Blackberry app, they saw an error telling them the service is down and to try again later.

3. When checking Twitter (the de facto source of downtime information), they saw a lot of people complaining, plus the following tweets from the official @foursquare account (if they thought to check it):

[screenshots of the @foursquare status tweets]

Those were the only options available to a Foursquare user for those 11 hours. An important question we need to answer is whether anyone seriously cared. Are users of consumer services like Foursquare legitimately concerned about downtime? Are they going to leave for competing services, or just quit the whole check-in game? I'd like to believe that 11 hours of downtime matters, but honestly it's too early to tell. This will be a great test of the stickiness and Whuffie that Foursquare has built up.

The way I see it, this is one strike against Foursquare (counting the continued instability they've seen since Monday). They probably won't see a significant impact to their user base. However, if this happens again, and again, and again, the story changes. And as I've argued, downtime is inevitable. Foursquare will certainly go down again. The key is not reducing downtime to zero; it's handling that downtime in a way that avoids giving your competition an opening and, even more importantly, uses it to build trust and loyalty with your users. How do you accomplish this? Transparency.

We've talked about the benefits of transparency, why transparency works, and how to implement it. We saw above how Foursquare handled the pre- and intra-downtime steps (not well), so let's take a look at how they did in the post-downtime phase by reviewing the public postmortems (both of them) they published. As always, let's run them through the gauntlet (sketched as a simple checklist after the breakdown below).



Prerequisites:
  1. Admit failure - Excellent. The entire first paragraph describes the downtime, and how painful it was to users.
  2. Sound like a human - Very much. This has never been a problem for Foursquare. The tone is very trustworthy.
  3. Have a communication channel - Prior to the event, all they had were their Twitter accounts and their API developer forums. As a result of this incident, they have since launched http://status.foursquare.com/ and have promised to keep @4sqsupport updated regularly during any future incidents.
  4. Above all else, be authentic - This may be the biggest thing going for them. 
Requirements:
  1. Start time and end time of the incident - Missing. All we know is that they were down for 11 hours. I don't see this as being critical in this case, but it would have been nice to have.
  2. Who/what was impacted - A bit vague, but the impression was that everyone was impacted.
  3. What went wrong - Extremely well done. I feel very informed, and can sympathize with the situation.
  4. Lessons learned - Again, extremely well done. I love the structure they used: What happened, What we’ll be doing differently – technically speaking, What we’re doing differently – in terms of process. Very effective.
Bonus:
  1. Details on the technologies involved - Yes!
  2. Answers to the Five Why's - No :(
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - Yes!
  4. What others can learn from this experience - Yes!
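For reference, here's the gauntlet above captured as a small reusable checklist, with the Foursquare review plugged in. The structure and scoring are my own shorthand for these criteria, not an official rubric.

```python
# The gauntlet as a checklist; scoring is my own shorthand, not an official rubric.
GAUNTLET = {
    "prerequisites": [
        "Admit failure",
        "Sound like a human",
        "Have a communication channel",
        "Above all else, be authentic",
    ],
    "requirements": [
        "Start time and end time of the incident",
        "Who/what was impacted",
        "What went wrong",
        "Lessons learned",
    ],
    "bonus": [
        "Details on the technologies involved",
        "Answers to the Five Why's",
        "Human elements",
        "What others can learn from this experience",
    ],
}

def score(answers):
    """Count how many gauntlet items a postmortem satisfies, per category."""
    return {
        category: sum(1 for item in items if answers.get(item, False))
        for category, items in GAUNTLET.items()
    }

# Foursquare's postmortem, per the review above: everything covered except two items.
foursquare = {item: True for items in GAUNTLET.values() for item in items}
foursquare["Start time and end time of the incident"] = False
foursquare["Answers to the Five Why's"] = False
print(score(foursquare))  # {'prerequisites': 4, 'requirements': 3, 'bonus': 3}
```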


Other takeaways:

  • Foursquare launched a public health status feed! Check it out at http://status.foursquare.com/.
  • I really like the structure used in this postmortem. It has inspired me to want to create a basic template for postmortems. Stay tuned...
  • Could this be Foursquare's Friendster moment? I hope not. My personal project relies completely on Foursquare.
  • I've come to realize that, in most cases, downtime is less impactful to the long-term success of a business than site performance. Users understand downtime and simply try again later. Slowness eats away at you: you start to hate using the service and jump at any opportunity to use something more fun/fast/pleasant.
Going forward, the big question will be whether Foursquare maintains their new processes, keeps the status blog up to date, and fixes their scalability issues. I for one am rooting for them.