Using the "Upside of Downtime" framework (above) as a guide:
- Prepare: Much room for improvement. The health status feed is hard for the average user or developer to find, and the information it provided was limited. On the plus side, it exists. Twitter was also used to communicate updates, but again with limited information.
- Communicate: Without a strong foundation created by the Prepare step, you don't have much opportunity to excel at the Communicate step. There was an opportunity to use the basic communication channels already in place (status feed, Twitter) more effectively, by communicating throughout the incident with more actionable information, but alas this was not the case. Instead, there was mass speculation about the root cause and the severity. That is exactly what you want to strive to avoid.
- Explain: Let's find out by running the postmortem through our guideline for postmortem communication...
Prerequisites:
- Admit failure - Excellent, almost a textbook admission of failure, without hedging or blaming.
- Sound like a human - Well done. Posted from the personal account of Robert Johnson, Facebook's Director of Engineering, the tone and style were personal and effective.
- Have a communication channel - Could be greatly improved. Making the existing health status page easier to find, more public, and more useful would help in all future incidents. I've covered how Facebook can improve this page in a previous post.
- Above all else, be authentic - No issues here.
Requirements:
- Start time and end time of the incident - Missing.
- Who/what was impacted - Partial. I can understand this being difficult in the case of Facebook, but I would have liked to see more specifics around how many users were affected. On one hand, this is a global consumer service that may not be critical to people's lives. On the other hand, if you treat your users with respect, they'll reward you for it.
- What went wrong - Well done, maybe the best part of the postmortem.
- Lessons learned - Partial. It sounds like many lessons were learned, but they weren't shared directly. I'd love to know what the "design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes" look like (a sketch of what one such pattern might look like follows this list).
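
To make that last point concrete, here is a minimal sketch of one common pattern for taming feedback loops: a client-side circuit breaker with exponential backoff and jitter. This is purely illustrative; the thresholds and the hypothetical `fetch_config` dependency are my assumptions, not Facebook's actual design.

```python
import random
import time


class CircuitBreaker:
    """Stop hammering a failing dependency and retry with backoff.

    Illustrative only: wrapping a hypothetical fetch_config() call with
    these thresholds is an assumption, not Facebook's implementation.
    """

    def __init__(self, failure_threshold=3, base_delay=0.5, max_delay=30.0):
        self.failure_threshold = failure_threshold
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failures = 0
        self.open_until = 0.0  # while "open", calls fail fast

    def call(self, func, *args, **kwargs):
        now = time.monotonic()
        if now < self.open_until:
            # Breaker is open: fail fast instead of adding load to the
            # struggling backend; this is what breaks the feedback loop.
            raise RuntimeError("circuit open; fall back to last known good value")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Back off exponentially, with jitter so thousands of
                # clients don't all retry at the same instant (the
                # transient-spike problem).
                delay = min(self.max_delay,
                            self.base_delay * 2 ** self.failures)
                self.open_until = now + delay * random.uniform(0.5, 1.5)
            raise
        else:
            self.failures = 0
            return result
```

The specific mechanism matters less than the property it provides: clients stop amplifying an outage while the backend recovers, which is presumably the spirit of the patterns the postmortem alluded to.
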
Bonus:
- Details on the technologies involved - No
- Answers to the Five Whys - No
- Human elements (heroic efforts, unfortunate coincidences, effective teamwork, etc.) - No
- What others can learn from this experience - Marginal
Biggest lesson for us to take away: Preparation is key to successfully managing outages, and to using them to build trust with your users.