15

We had a rather serious outage this past week affecting several services, which put us out of compliance with our customer SLAs. Now that everything has been resolved, I am conducting a post-mortem review.

From this review, I would like to come up with an internal document that describes the outage, its effects, our response and the resolution. I want to come up with a fairly standard form for future reuse. I have included my thoughts below, but what other items should be included? If this were a security-related incident, what would you add?

  • Summary: Executive-level summary of the event.
  • Affected Services
  • Impact: What was the impact on our users and SLAs? Was there a cost in dollar terms, missed transactions, lost customers, etc.?
  • Outage Duration: For each affected service, if durations varied.
  • Cause: Including primary and secondary causes.
  • Resolution
  • Timeline of events: Notifications, contact with external vendors, customer notifications, responses, etc.
  • Problems with our response: Did anything not go as planned in our response to the outage? Were the correct people notified? Did vendors meet their contracted obligations?
  • Preventative measures to take: How do we prevent this outage from occurring again or reduce its impact?
  • Detection Method: How well did we detect this outage, and how do we improve detection in the future?
  • Changes to make in future outage responses
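If the form is going to be reused, it may help to keep it as a small data structure so that every post-mortem has the same sections and the per-service outage durations are computed rather than typed by hand. The following is only a minimal sketch under that assumption: the class names `ServiceOutage` and `PostMortem`, the field names, and the plain-text `render` layout are all hypothetical, not an established format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class ServiceOutage:
    """One affected service with its own outage window (hypothetical model)."""
    service: str
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        # Duration is derived from the window, so it cannot drift from the timeline.
        return self.end - self.start


@dataclass
class PostMortem:
    """Sections mirror the bullet list above; every field starts empty."""
    summary: str = ""
    outages: List[ServiceOutage] = field(default_factory=list)
    impact: str = ""
    cause: str = ""
    resolution: str = ""
    timeline: List[str] = field(default_factory=list)
    response_problems: str = ""
    preventative_measures: str = ""
    detection_method: str = ""
    response_changes: str = ""

    def render(self) -> str:
        """Return the report as plain text, one section per heading."""
        lines = [f"Summary: {self.summary}", "Affected Services / Outage Duration:"]
        for outage in self.outages:
            lines.append(f"  {outage.service}: {outage.duration}")
        lines.append(f"Impact: {self.impact}")
        lines.append(f"Cause: {self.cause}")
        lines.append(f"Resolution: {self.resolution}")
        lines.append("Timeline of events:")
        lines.extend(f"  {entry}" for entry in self.timeline)
        lines.append(f"Problems with our response: {self.response_problems}")
        lines.append(f"Preventative measures to take: {self.preventative_measures}")
        lines.append(f"Detection Method: {self.detection_method}")
        lines.append(f"Changes to make in future outage responses: {self.response_changes}")
        return "\n".join(lines)
```

A report would then be filled in with something like `PostMortem(summary="...", outages=[ServiceOutage("mail", start, end)])` and printed with `render()`; the service name and timestamps here are placeholders.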

Try to keep posts down to one item and explanation, and this post can be updated with the top voted answers.

4 Answers

6

Although it could be covered in the Preventative measures to take, I would recommend having a Detection method section that you could use to note what the true symptoms were and how you could detect the problem (faster) if it happens again, ideally using automation.

  • Added to the wiki
    – Doug Luxem
    Aug 20, 2009 at 18:16
2

Looks good. I would only add the following:

Effects/Consequences: What is the consequence of the outage - who was affected, which SLAs were violated (if any), were there any knock-on effects?

1

Affected services and outage duration tell you only part of how bad an outage was. You also want to know what the impact on the business was.

Impact: What effect did this have on users, and how was it perceived? How much money did this cost us (through missed SLAs, lost orders, etc.)?
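For the SLA part of the impact, the arithmetic is simple enough to script. This is only a rough sketch: the function names are hypothetical, and the 30-day period and 99.9% target in the usage note are made-up illustration values, not figures from the outage discussed here. Real SLAs often define the measurement window and exclusions differently.

```python
from datetime import timedelta


def availability_percent(period: timedelta, downtime: timedelta) -> float:
    """Percentage of the measurement period during which the service was up."""
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (1.0 - downtime / period)


def sla_breached(period: timedelta, downtime: timedelta, target_percent: float) -> bool:
    """True if measured availability fell below the agreed target."""
    return availability_percent(period, downtime) < target_percent
```

For example, `sla_breached(timedelta(days=30), timedelta(hours=4), 99.9)` checks a four-hour outage against an assumed 99.9% monthly target.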

  • I like the distinction between affected services and business impact, but I would categorize it as "Business Impact" and not just impact (to draw a distinction between it and the affected services/duration information). Plus it'll draw the eye of management who need to be aware of the business impact, if not all the technical details of what services were impacted...
    – Milner
    Jun 26, 2009 at 17:39
1

Public release & internal release

This is more something for management to decide, but you might want to include what should be released to customers about the outage, or at least your recommendation. Either way, get sign-off from management on the exact wording before releasing anything to customers.

The public release should be included in this document so that anyone in the company knows what they are allowed to tell customers.

  • I think this internal document could be used for generating an external release to customers. Exactly what would be told to customers would be up to our executives and marketing/communications.
    – Doug Luxem
    Jun 21, 2009 at 19:54
