We had a rather serious outage this past week. It affected several services and put us in breach of our SLAs with customers. Now that everything has been resolved, I am conducting a post-mortem review.
From this review, I would like to produce an internal document that describes the outage, its effects, our response, and the resolution. I also want a fairly standard form that we can reuse for future incidents. I have included my thoughts below, but what other items should be included? If this were a security-related incident, what would you add?
- Summary: Executive-level summary of the event.
- Affected Services
- Impact: What was the impact on our users and SLAs? Was there a cost in dollar terms, missed transactions, lost customers, etc.?
- Outage Duration: Listed per affected service if the durations varied.
- Cause: Including primary and secondary causes.
- Resolution
- Timeline of events: Notifications, contact with external vendors, customer notifications, responses, etc.
- Problems with our response: Did anything not go as planned in our response to the outage? Were the correct people notified? Did vendors meet their contracted obligations?
- Preventative measures to take: How do we prevent this outage from occurring again, or reduce its impact?
- Detection Method: How well did we detect this outage, and how do we improve detection in the future?
- Changes to make in future outage responses
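
For the Impact and Outage Duration items, it may help to compute the per-service numbers the same way every time. The following is only a rough sketch: the service names, timestamps, 30-day measurement period, and 99.9% availability target are all made-up assumptions, not details from our actual incident or contracts.

```python
from datetime import datetime, timedelta

# Assumed values for illustration only; substitute your own SLA terms.
SLA_TARGET = 0.999            # hypothetical 99.9% availability target
PERIOD = timedelta(days=30)   # hypothetical SLA measurement period

# Hypothetical outage windows: service -> (start, end)
outages = {
    "web": (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 11, 30)),
    "email": (datetime(2024, 1, 10, 9, 15), datetime(2024, 1, 10, 9, 45)),
}


def outage_report(outages, target=SLA_TARGET, period=PERIOD):
    """Return downtime, availability, and SLA status for each service."""
    report = {}
    for service, (start, end) in outages.items():
        downtime = end - start
        # Dividing one timedelta by another yields a plain float ratio.
        availability = 1 - downtime / period
        report[service] = {
            "downtime": downtime,
            "availability": availability,
            "sla_met": availability >= target,
        }
    return report


report = outage_report(outages)
for service, row in report.items():
    status = "met" if row["sla_met"] else "MISSED"
    print(f"{service}: down {row['downtime']}, "
          f"availability {row['availability']:.4%}, SLA {status}")
```

With these invented numbers, the 2.5-hour web outage misses the target while the 30-minute email outage stays within it, which is the kind of per-service variance the Outage Duration item is meant to capture.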
Please keep each answer to one item and its explanation; I will update this post with the top-voted answers.