|
| 1 | +# Service readiness checklist |
| 2 | + |
| 3 | +This document provides a checklist for determining whether a service is mature enough for general availability. |
| 4 | + |
| 5 | +## Checklist |
| 6 | + |
| 7 | +Legend: `M` - Mandatory, `R` - Recommended |
| 8 | + |
| 9 | +### Documentation |
| 10 | + |
| 11 | +- `M`: README.md in the repository: |
| 12 | + * service description, ownership, links to documentation, |
| 13 | + build and deployment instructions, configuration variables, healthchecks, etc. |
| 14 | +- `R`: Architectural diagram for the service |
| 15 | +- `R`: SLI/SLO/SLA definitions (at least service criticality level: P1/2/3) |
| 16 | + |
| 17 | +### Build and release pipeline |
| 18 | + |
| 19 | +- `M`: GitHub Actions workflows for building the code, running lint checks and executing tests |
| 20 | +- `M`: Artifacts are container images |
| 21 | + |
| 22 | +### Infrastructure |
| 23 | + |
| 24 | +- `M`: All persistent state (if any) is stored in external storage |
| 25 | +- `M`: All components must be designed for HA (e.g. an app should be able to run as |
| 26 | + multiple instances in active/active configuration) |
| 27 | +- `M`: Deployment diagram |
| 28 | +- `M`: Networking requirements are documented: |
| 29 | + - traffic types (HTTP/gRPC/other) |
| 30 | + - expected traffic spikes, need for a static IP address |
| 31 | + - advanced L7 features: WAF, authentication, sticky sessions, request routing, load-balancing algorithms |
| 32 | + - CORS requirements |
| 33 | + |
| 34 | +### Security |
| 35 | + |
| 36 | +- `M`: No secrets in the code |
| 37 | +- `M`: Any housekeeping, one-off, migration or similar tasks must be part of the |
| 38 | + application; the `stage` and `live` environments are not accessible directly. |
| 39 | +- `M`: Externally exposed services must require authentication |
| 40 | +- `M`: Documentation must have answers for the following questions: |
| 41 | + * Is this an internal or external service? |
| 42 | + * Does the service make any outbound connections? If yes, specify destinations. |
| 43 | + * Does the service handle personally identifiable information? |
| 44 | +- `R`: HTTP headers / CORS: |
| 45 | + * `X-Frame-Options`, `Strict-Transport-Security`, `X-XSS-Protection`, |
| 46 | + `X-DNS-Prefetch-Control` |
| 47 | + |
| 48 | +### Operations |
| 49 | + |
| 50 | +- `M`: Application configuration is set via environment variables |
| 51 | +- `M`: Logging satisfies the following requirements: |
| 52 | + * single-line JSON to stdout/stderr, with a `message` or `msg` field at the root level |
| 53 | + * keep the data type of each JSON field consistent; mixed types break log parsing |
| 54 | + * at least two verbosity levels: debug/error; |
| 55 | + * `error`: unexpected error that prevents further processing |
| 56 | + * `warn` : irregular events with defined recovery strategy |
| 57 | + * `info` : major state changes; must log: component started, became operational, |
| 58 | + event/task processed, shutdown started, about to exit |
| 59 | + * `debug`: diagnostic and troubleshooting events |
| 60 | + * global error handler; make sure all errors are logged |
| 61 | + * error response structure adheres to defined standard |
| 62 | + * the `level` field must contain a string (e.g. `error`, `warn`), not a number |
| 63 | + * distributed tracing: |
| 64 | + * the request ID is passed via the `x-request-id` header; propagate it if received, otherwise generate a new one |
| 65 | + * the request ID must be included in request-scoped log entries and in outgoing HTTP requests |
| 66 | +- `M`: Implements APM integration |
| 67 | +- `M`: Healthcheck endpoint; should provide: |
| 68 | + * at a minimum: 200 response if the service is operational, non-200 response code otherwise |
| 69 | + * include app version and commit hash in the response |
| 70 | + * recommended: [readiness and liveness endpoints] |
| 71 | +- `M`: Implements metrics: |
| 72 | + * 4 golden signals: latency/traffic/errors/saturation |
| 73 | + * endpoint (preferably `/metrics`) in Prometheus format on a separate port (e.g. `9090`) |
| 74 | + * availability, authentication status, and latency for all backend services |
| 75 | + * Node.js metrics |
| 76 | + * business metrics as necessary/defined by the service owner |
| 77 | +- `R`: Perform simple load testing of the service and use the results for: |
| 78 | + * sizing the live infrastructure, e.g. cores, RAM, storage size |
| 79 | + * defining alerting thresholds, e.g. for the 4 golden signals: latency, traffic (request count), errors, saturation |
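The logging and request ID requirements in this section can be illustrated with a minimal sketch. It assumes a Node.js service written in TypeScript; the helper names `resolveRequestId`, `formatLogLine` and `log`, and the `requestId` and `time` field names, are hypothetical and not prescribed by this checklist.

```typescript
import { randomUUID } from "node:crypto";

// Levels are strings, never numbers, as the checklist requires.
type Level = "debug" | "info" | "warn" | "error";

type Headers = Record<string, string | string[] | undefined>;

// Hypothetical helper: propagate the incoming x-request-id, or generate a new one.
function resolveRequestId(headers: Headers): string {
  const incoming = headers["x-request-id"];
  const value = Array.isArray(incoming) ? incoming[0] : incoming;
  return value !== undefined && value.trim() !== "" ? value : randomUUID();
}

// Builds one log entry as a single-line JSON string.
// `extra` is spread first so it cannot overwrite the standard root-level fields.
function formatLogLine(
  level: Level,
  msg: string,
  requestId: string,
  extra: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    ...extra,
    level,
    msg,
    requestId, // assumed field name
    time: new Date().toISOString(), // assumed field name
  });
}

// Errors go to stderr, everything else to stdout.
function log(level: Level, msg: string, requestId: string, extra?: Record<string, unknown>): void {
  const line = formatLogLine(level, msg, requestId, extra);
  if (level === "error") {
    console.error(line);
  } else {
    console.log(line);
  }
}
```

A request handler would call `resolveRequestId(req.headers)` once, pass the result to every `log` call made while handling that request, and set the same value as the `x-request-id` header on outgoing HTTP requests.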
| 80 | + |
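The healthcheck requirement from the Operations section could be sketched as follows. The `/healthz` path, the 503 status for the non-operational case, and the names `buildHealthResponse` and `createHealthServer` are assumptions made for illustration; the checklist only requires 200 when operational, a non-200 code otherwise, and the app version and commit hash in the response.

```typescript
import { createServer, Server } from "node:http";

interface HealthResult {
  status: number;
  body: { status: "ok" | "unavailable"; version: string; commit: string };
}

// Pure function: maps the operational state to a status code and response body.
function buildHealthResponse(operational: boolean, version: string, commit: string): HealthResult {
  return {
    status: operational ? 200 : 503, // 503 is an assumed choice of non-200 code
    body: { status: operational ? "ok" : "unavailable", version, commit },
  };
}

// Wires the function above into a plain Node.js HTTP server (not started here).
function createHealthServer(isOperational: () => boolean, version: string, commit: string): Server {
  return createServer((req, res) => {
    if (req.url !== "/healthz") {
      res.writeHead(404);
      res.end();
      return;
    }
    const { status, body } = buildHealthResponse(isOperational(), version, commit);
    res.writeHead(status, { "content-type": "application/json" });
    res.end(JSON.stringify(body));
  });
}
```

Keeping the status decision in a pure function makes it testable without opening a socket; the caller decides what "operational" means, for example whether storage connections are established.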
| 81 | +### Resiliency |
| 82 | + |
| 83 | +- `M`: The service can run in multiple instances simultaneously |
| 84 | +- `M`: Must handle component unavailability gracefully (e.g. unable to connect to storage): |
| 85 | + - all connections must have reasonable timeouts and error handling |
| 86 | + - all HTTP calls must have reasonable timeouts and error handling |
| 87 | + - reconnect with exponential back-off where necessary |
| 88 | +- `M`: [Graceful shutdown] on SIGTERM (15): stop accepting connections, complete in-flight work, exit |
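As an illustration of the timeout and exponential back-off items above, here is a minimal sketch. The function names and the default values (100 ms base delay, 10 s cap, 2 s timeout, 5 attempts) are assumptions chosen for the example, not values mandated by this checklist.

```typescript
// Delay before the next attempt: base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 100, maxMs = 10_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Rejects if `work` does not settle within timeoutMs.
// Note: this stops waiting but does not cancel the underlying operation.
async function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Retries `operation` with a timeout per attempt and exponential back-off between attempts.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  timeoutMs = 2_000,
  baseMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await withTimeout(operation(), timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw lastError;
}
```

A storage reconnect could then be written as `await retryWithBackoff(() => connectToStorage())`, where `connectToStorage` is a hypothetical function of the service. Adding random jitter to the delay is a common refinement when many instances reconnect at the same time.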
| 89 | + |
| 90 | +## References |
| 91 | + |
| 92 | +- https://www.opslevel.com/blog/production-readiness-in-depth#deployment |
| 93 | +- https://gruntwork.io/devops-checklist/ |
| 94 | +- https://aleksei-kornev.medium.com/production-readiness-checklist-for-backend-applications-8d2b0c57ccec |
| 95 | +- https://github.com/mercari/production-readiness-checklist/blob/master/docs/references/pre-production-checklist.md |
| 96 | +- https://blog.last9.io/deployment-readiness-checklists/ |
| 97 | +- https://habr.com/en/post/438186/ |
| 98 | +- https://12factor.net |
| 99 | +- https://cloud.google.com/blog/products/containers-kubernetes/your-guide-kubernetes-best-practices |
| 100 | + |
| 101 | +[readiness and liveness endpoints]: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-setting-up-health-checks-with-readiness-and-liveness-probes |
| 102 | +[Graceful shutdown]: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace |
| 103 | + |
| 104 | +This document is based on a Lokalise Service Release Checklist, prepared by the Lokalise Platform Squad. |