Post Mortem Template

Henne Vogelsang edited this page May 28, 2020 · 4 revisions

Whenever something (a deployment, a system update, a configuration change, etc.) on the reference server goes horribly wrong and affects our users through downtime, slowness, data loss or other noticeable problems, in short, every time we cause a situation where our users would ask themselves

"WTF???"

we developers will write, as part of our Site-Reliability strategy, a post mortem report to institutionalize improvement.

We do this to ensure that we:

  1. investigate the root cause of the failure
  2. determine follow-up actions
  3. create a continuous, transparent feedback loop for our fellow OBS teammates, our users and DevOps people in the wider community

We publish these reports on https://openbuildservice.org/categories/deployments/

To write up these reports, we use the following template. We usually start by building the timeline together, then derive the rest from that discussion. Check out the already published reports for inspiration.

< TEMPLATE >

Title: What happened?

A brief summary of what happened

Date: When did this problem happen?

Impact: What was the result of the problem?

Root Causes: Why did this problem happen?

Trigger: What caused this problem to happen?

Resolution: How did you resolve this problem?

Detection: How did you get alerted that the problem happened?

Action Items

Action Item | Owner

Lessons Learned

What went well?

What went wrong?

Where did we get lucky?

Timeline (CEST)

  • 11:15 We got an alert about...

< /TEMPLATE >