Post Mortem Template

Whenever something on the reference server (a deployment, a system update, a configuration change, etc.) goes horribly wrong and affects our users through downtime, slowness, data loss or other noticeable problems, basically every time we cause a situation where our users would ask themselves

"WTF???"

we developers write a post mortem report, as part of our Site-Reliability strategy, to institutionalize improvement.

We do this to ensure we...

investigate the root cause of the failure
determine follow-up actions
create a continuous, transparent feedback loop for our fellow OBS teammates, our users and devops people in the wider community

We publish these reports on https://openbuildservice.org/categories/deployments/

To write up these reports we use the following template. We usually start by building the timeline together, then derive the rest of the report from that conversation. Check out the already published reports for inspiration.

< TEMPLATE >

Title: What happened?

A brief summary of what happened

Date: When did this problem happen?

Impact: What was the result of the problem?

Root Causes: Why did this problem happen?

Trigger: What caused this problem to happen?

Resolution: How did you resolve this problem?

Detection: How did you get alerted that the problem happened?

Action Items

Action Item	Owner

Lessons Learned

Timeline (CEST)
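
To start a new report quickly, the template fields above can be rendered into an empty skeleton. The following is only a minimal sketch: the helper name `render_skeleton`, its parameters and the example values are hypothetical and not part of any OBS tooling; the field names and questions are copied from the template.

```python
from datetime import date

# Field names and questions copied from the template above.
FIELDS = [
    ("Date", "When did this problem happen?"),
    ("Impact", "What was the result of the problem?"),
    ("Root Causes", "Why did this problem happen?"),
    ("Trigger", "What caused this problem to happen?"),
    ("Resolution", "How did you resolve this problem?"),
    ("Detection", "How did you get alerted that the problem happened?"),
]


def render_skeleton(title, summary, happened_on=None):
    """Return a plain-text post mortem skeleton (hypothetical helper)."""
    lines = [f"Title: {title}", "", summary, ""]
    for name, question in FIELDS:
        # Pre-fill the date when it is known; otherwise keep the template question.
        if name == "Date" and happened_on is not None:
            value = happened_on.isoformat()
        else:
            value = question
        lines += [f"{name}: {value}", ""]
    # Remaining sections are left empty for the team to fill in.
    lines += ["Action Items", "", "Action Item\tOwner", "",
              "Lessons Learned", "", "Timeline (CEST)"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_skeleton("Example outage", "A brief summary of what happened",
                          date(2024, 1, 31)))
```

Keeping the questions as default values means an unfilled field stays visible in the draft, so nothing gets silently skipped while the timeline is still being built.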
