Incident Management Protocol
Effective incident management is vital for returning to normal business operation as fast as possible. For it to work, everybody involved in resolving the incident must know their role. Separating the roles makes clear what each role should and should not do, which avoids confusion and chaos about who is responsible for what.
Here is who needs to do what when the service is down or there is another major disruption.
By default the Production-Squad is the incident manager.
However, anybody can declare themselves Incident Manager, especially if they notice that the Production-Squad is not responding for any reason.
The duties of the Incident Manager are:
- Create an incident state document
- Declare the incident to our team channel (:warning: We have an incident going on, follow it here: https://etherpad.opensuse.org/p/....)
- Fulfill all the other roles (Communications & OPS) OR delegate roles to anyone in the team
- Declare the incident as resolved
The duties of Communications are:
- Continuous update of the incident state document
- Continuous update of stakeholders through the communication channels
- Write the Post Mortem Report
The duties of Ops are:
- Stop the bleeding and restore the service
- Find the root-cause
While the incident is unresolved, we keep an Incident State Document (template) up to date on https://etherpad.opensuse.org. We do this to keep the people affected by the incident informed about what is going on.
Once the incident is under control and we understand what happened, we write a Post Mortem Report (template) on our blog. We do this to share what we learned with our community of OBS admins, users and contributors.
It's better to declare an incident early and call it off later than to spin up an incident response team when everything is already messed up by unorganized tinkering.
The incident is closed when the involved services are back to normal operation. Closing it does not depend on the long-term tasks created during the incident response.
Please reply to false alarms (in a Slack thread) explaining why they are false.
Coming up with sentences to use in communication is hard during an incident. Below are some templates; you can find more on the internet.
Title: Open Build Service Service Disruption
We are currently experiencing a service disruption.
Our team is working to identify the root cause and implement a solution.
All build.opensuse.org users may be affected.
You can follow the current state on our incident document: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR
We will come back to you here once we have resolved the incident.
Title: Open Build Service Page Unresponsiveness
The site is currently experiencing a higher than normal amount of load, which may cause pages to be slow or unresponsive.
**ADD_GENERAL_IMPACT** users may be affected.
You can follow the current state on our incident document: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR
We will come back to you here once we have resolved the incident.
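The `**ADD_GENERAL_IMPACT**` marker in the template above has to be replaced before the message goes out. A minimal sketch of how that could be scripted, assuming the template text is copied from this page into a string; the function name `fill_impact` is hypothetical and not part of any OBS tooling:

```python
# Marker used in the template above; everything else here is illustrative.
IMPACT_MARKER = "**ADD_GENERAL_IMPACT**"


def fill_impact(template: str, impact: str) -> str:
    """Replace the impact marker and refuse to return an unfinished message."""
    if IMPACT_MARKER not in template:
        raise ValueError("template has no impact marker to fill in")
    if not impact.strip():
        raise ValueError("impact must not be empty")
    return template.replace(IMPACT_MARKER, impact.strip())


if __name__ == "__main__":
    excerpt = "**ADD_GENERAL_IMPACT** users may be affected."
    print(fill_impact(excerpt, "All build.opensuse.org"))
```

Failing loudly on an empty impact is a deliberate choice here: during an incident it is easy to send a message with the raw marker still in it.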
Our communication channels include:
- Our mailing list [email protected]
- IRC (irc://irc.libera.chat/openSUSE-buildservice)
- OBS Status Messages
- Slack (#help-obs & #team-build-solutions)