"Yesterday an unannounced DNS change apparently made our mail server go incognito to the rest of the world. The consequences of this came sneaking over night as the changes propagated through the DNS network. Whammy.How well does such an informal and simple postmortem stack up against the postmortem best practices? Let's find out:
On top of this our upstream internet provider late last night PST (early morning CET) experienced a failure that prevented our servers from reaching external destinations. Web access was not affected but email, widget, targets, basically everything that relied on communication from our servers to the outside world were. Double whammy.
It took too long time to realize that we had two separate issues at hand. We kept focusing on the former as root cause for the latter. And it took unacceptably long to determine that we had a network outage."
Prerequisites:
- Admit failure - Yes, no question.
- Sound like a human - Yes, very much so.
- Have a communication channel - Yes, both the blog and the Twitter account.
- Above all else, be authentic - Yes, extremely authentic.
Requirements:
- Start time and end time of the incident - No.
- Who/what was impacted - Yes, though more detail would have been nice.
- What went wrong - Yes, well done.
- Lessons learned - Not much.
Bonus:
- Details on the technologies involved - No.
- Answers to the Five Whys - No.
- Human elements - Some.
- What others can learn from this experience - Some.
The meat was definitely there. The biggest missing piece is insight into what lessons were learned and what is being done to improve for the future. Mikkel says that "We've learned an important lesson and will do our best to ensure that no 3rd parties can take us down like this again", but the specifics are lacking. The exact time of the start and end of the event would have been useful as well, for those companies wondering whether this explains their issues that day.
It's always impressive to see the CEO of a company put himself out there like this and admit failure. It is (naively) easier to pretend like everything is OK and hope the downtime blows over. In reality, getting out in front of the problem and being transparent, communicating during the downtime (in this case over Twitter), and after the event is over (in this postmortem), are the best things you can do to turn your disaster into an opportunity to increase customer trust.
As it happens, I will be speaking at the upcoming Velocity 2010 conference about this very topic!
Update: Zendesk has put out a more in-depth review of what happened, which includes everything that was missing from the original post (which, as the CEO pointed out in the comments, was meant to be a quick update of what they knew at the time). This new post includes the time frame of the incident, details on what exactly went wrong with the technology, and, most importantly, lessons and takeaways to improve things for the future. Well done.
Thanks for your post, Lenny. We wanted to put something out there as early as possible, but a lot of the technical details were still vague at that point. We will post a more technical postmortem in our support forums later this week. Thanks again. Hope to meet you at Velocity 2010.
Thanks for the additional info Mikkel, I'll be looking forward to that official postmortem (and will update this post with a link to it). You guys are doing really good things over there when it comes to transparency.
Updated post to link to the newly released follow up:
https://support.zendesk.com/entries/144965-operations-follow-up