|
| 1 | +# GET Health Check |
| 2 | + |
| 3 | +Design discussion for new Health Check implementation. |
| 4 | + |
| 5 | + |
| 6 | +## Objectives |
| 7 | + |
| 8 | +The goal of this design is to implement a new health check for Mojaloop switch services that provides a greater level of detail. |
| 9 | + |
| 10 | +It features: |
| 11 | +- Clear HTTP Statuses (no need to inspect the response to know there are no issues) |
| 12 | +- ~Backwards compatibility with existing health checks~ - No longer a requirement. See [this discussion](https://github.com/mojaloop/project/issues/796#issuecomment-498350828). |
| 13 | +- Information about the version of the API, and how long it has been running for |
| 14 | +- Information about sub-service (kafka, logging sidecar and mysql) connections |
| 15 | + |
| 16 | +## Request Format |
| 17 | +`/health` |
| 18 | + |
| 19 | +Uses the newly implemented health check. As discussed [here](https://github.com/mojaloop/project/issues/796#issuecomment-498350828), since implementing the health check adds no connection overhead (e.g. pinging a database), there is no need to complicate things with separate simple and detailed versions. |
| 20 | + |
| 21 | +Response Codes: |
| 22 | +- `200` - Success. The API is up and running, and is successfully connected to the necessary services. |
| 23 | +- `502` - Bad Gateway. The API is up and running, but it cannot connect to a necessary service (e.g. `kafka`). |
| 24 | +- `503` - Service Unavailable. This response is not implemented in this design, but will be the default if the API is not up and running. |
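
The mapping from the overall health status to a response code can be sketched as follows. This is a minimal illustration, not the actual implementation: the `OverallStatus` type and the `httpStatusFor` helper are hypothetical names introduced here.

```typescript
// Hypothetical type: the overall status reported by the health check.
type OverallStatus = 'OK' | 'DOWN';

// Hypothetical helper: picks the HTTP status code for a health check response.
// 503 is deliberately absent. Per the list above, the handler never returns it;
// it is what a client sees when the API is not running at all.
function httpStatusFor(status: OverallStatus): number {
  return status === 'OK' ? 200 : 502;
}
```

Because the code alone distinguishes healthy from unhealthy, a caller such as a load balancer can act on the status line without parsing the body.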
| 25 | + |
| 26 | +## Response Format |
| 27 | + |
| 28 | +| Name | Type | Description | Example | |
| 29 | +| --- | --- | --- | --- | |
| 30 | +| `status` | `statusEnum` | The status of the service. Options are `OK` and `DOWN`. _See `statusEnum` below_. | `"OK"` | |
| 31 | +| `uptime` | `number` | How long (in seconds) the service has been alive for. | `123456` | |
| 32 | +| `started` | `string` (ISO formatted date-time) | When the service was started (UTC) | `"2019-05-31T05:09:25.409Z"` | |
| 33 | +| `versionNumber` | `string` (semver) | The current version of the service. | `"5.2.5"` | |
| 34 | +| `services` | `Array<serviceHealth>` | A list of services this service depends on, and their connection status | _see below_ | |
| 35 | + |
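
As a rough sketch of how `uptime`, `started` and `versionNumber` relate, the fields could be derived from the service start time. The `RuntimeInfo` interface and the `runtimeInfo` function are hypothetical names, and passing the version in as a parameter is an assumption about where it comes from.

```typescript
// Hypothetical shape for the time and version fields of the response.
interface RuntimeInfo {
  uptime: number;        // whole seconds since the service started
  started: string;       // ISO 8601 date-time in UTC
  versionNumber: string; // semver string, supplied by the caller
}

// Hypothetical helper: derives the fields from the start time and the current time.
function runtimeInfo(startedAt: Date, now: Date, versionNumber: string): RuntimeInfo {
  return {
    uptime: Math.floor((now.getTime() - startedAt.getTime()) / 1000),
    started: startedAt.toISOString(), // toISOString always formats in UTC
    versionNumber,
  };
}
```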
| 36 | +### serviceHealth |
| 37 | + |
| 38 | +| Name | Type | Description | Example | |
| 39 | +| --- | --- | --- | --- | |
| 40 | +| `name` | `subServiceEnum` | The sub-service name. _See `subServiceEnum` below_. | `"broker"` | |
| 41 | +| `status` | `statusEnum` | The status of the sub-service. Options are `OK` and `DOWN`. _See `statusEnum` below_. | `"OK"` |
| 42 | + |
| 43 | +### subServiceEnum |
| 44 | + |
| 45 | +The `subServiceEnum` enum describes the name of the sub-service. |
| 46 | + |
| 47 | +Options: |
| 48 | +- `datastore` -> The database for this service (typically a MySQL Database). |
| 49 | +- `broker` -> The message broker for this service (typically Kafka). |
| 50 | +- `sidecar` -> The logging sidecar sub-service this service attaches to. |
| 51 | +- `cache` -> The caching sub-service this service attaches to. |
| 52 | + |
| 53 | + |
| 54 | +### statusEnum |
| 55 | + |
| 56 | +The `statusEnum` enum represents the status of the system or a sub-service. |
| 57 | + |
| 58 | +It has two options: |
| 59 | +- `OK` -> The service or sub-service is healthy. |
| 60 | +- `DOWN` -> The service or sub-service is unhealthy. |
| 61 | + |
| 62 | +When a service is `OK`: the API is considered healthy, and all sub-services are also considered healthy. |
| 63 | + |
| 64 | +If __any__ sub-service is `DOWN`, then the entire health check will fail, and the API will be considered `DOWN`. |
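
The aggregation rule above can be sketched as a small function. The type names mirror the enums in this document, while `overallStatus` is a hypothetical helper name.

```typescript
type HealthStatus = 'OK' | 'DOWN';
type SubServiceName = 'datastore' | 'broker' | 'sidecar' | 'cache';

interface ServiceHealth {
  name: SubServiceName;
  status: HealthStatus;
}

// Hypothetical helper: the API is DOWN as soon as any one sub-service is DOWN.
// Assumption: a service with no sub-services reports OK.
function overallStatus(services: ServiceHealth[]): HealthStatus {
  return services.some((service) => service.status === 'DOWN') ? 'DOWN' : 'OK';
}
```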
| 65 | + |
| 66 | +## Defining Sub-Service health |
| 67 | + |
| 68 | +It is not enough to simply ping a sub-service to know whether it is healthy; we want to go one step further. The criteria for health differ for each sub-service. |
| 69 | + |
| 70 | +### `datastore` |
| 71 | + |
| 72 | +For `datastore`, a status of `OK` means: |
| 73 | +- An existing connection to the database |
| 74 | +- The database is not empty (contains more than 1 table) |
| 75 | + |
| 76 | + |
| 77 | +### `broker` |
| 78 | + |
| 79 | +For `broker`, a status of `OK` means: |
| 80 | +- An existing connection to the Kafka broker |
| 81 | +- The necessary topics exist. This will change depending on which service the health check is running for. |
| 82 | + |
| 83 | +For example, for the `central-ledger` service to be considered healthy, the following topics need to be found: |
| 84 | +``` |
| 85 | +topic-admin-transfer |
| 86 | +topic-transfer-prepare |
| 87 | +topic-transfer-position |
| 88 | +topic-transfer-fulfil |
| 89 | +``` |
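
A minimal sketch of the `broker` criteria, assuming the service has already obtained its connection state and the list of topic names from the broker. How those are fetched depends on the Kafka client in use and is not shown; `missingTopics` and `brokerStatus` are hypothetical names.

```typescript
// Hypothetical helper: returns the required topics that the broker does not have.
function missingTopics(required: string[], existing: string[]): string[] {
  const found = new Set(existing);
  return required.filter((topic) => !found.has(topic));
}

// Hypothetical helper: OK only when connected and every required topic exists.
function brokerStatus(connected: boolean, required: string[], existing: string[]): 'OK' | 'DOWN' {
  return connected && missingTopics(required, existing).length === 0 ? 'OK' : 'DOWN';
}

// The topics listed above for the central-ledger service.
const centralLedgerTopics = [
  'topic-admin-transfer',
  'topic-transfer-prepare',
  'topic-transfer-position',
  'topic-transfer-fulfil',
];
```

Returning the missing topic names, rather than only a boolean, leaves room to log which topic caused the check to fail.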
| 90 | + |
| 91 | +### `sidecar` |
| 92 | + |
| 93 | +For `sidecar`, a status of `OK` means: |
| 94 | +- An existing connection to the sidecar |
| 95 | + |
| 96 | + |
| 97 | +### `cache` |
| 98 | + |
| 99 | +For `cache`, a status of `OK` means: |
| 100 | +- An existing connection to the cache |
| 101 | + |
| 102 | + |
| 103 | +## Swagger Definition |
| 104 | + |
| 105 | +>_Note: These will be added to the existing swagger definitions for the following services:_ |
| 106 | +> - `ml-api-adapter` |
| 107 | +> - `central-ledger` |
| 108 | +> - `central-settlement` |
| 109 | +> - `central-event-processor` |
| 110 | +> - `email-notifier` |
| 111 | +
|
| 112 | +```json |
| 113 | +{ |
| 114 | + // . . . |
| 115 | + "/health": { |
| 116 | + "get": { |
| 117 | + "operationId": "getHealth", |
| 118 | + "tags": [ |
| 119 | + "health" |
| 120 | + ], |
| 121 | + "responses": { |
| 122 | + "default": { |
| 123 | + "schema": { |
| 124 | + "$ref": "#/definitions/health" |
| 125 | + }, |
| 126 | + "description": "Successful" |
| 127 | + } |
| 128 | + } |
| 129 | + } |
| 130 | + }, |
| 131 | + // . . . |
| 132 | + "definitions": { |
| 133 | + "health": { |
| 134 | + "type": "object", |
| 135 | + "properties": { |
| 136 | + "status": { |
| 137 | + "type": "string", |
| 138 | + "enum": [ |
| 139 | + "OK", |
| 140 | + "DOWN" |
| 141 | + ] |
| 142 | + }, |
| 143 | + "uptime": { |
| 144 | + "description": "How long (in seconds) the service has been alive for.", |
| 145 | + "type": "number", |
| 146 | + }, |
| 147 | + "started": { |
| 148 | + "description": "When the service was started (UTC)", |
| 149 | + "type": "string", |
| 150 | + "format": "date-time" |
| 151 | + }, |
| 152 | + "versionNumber": { |
| 153 | + "description": "The current version of the service.", |
| 154 | + "type": "string", |
| 155 | + "example": "5.2.3", |
| 156 | + }, |
| 157 | + "services": { |
| 158 | + "description": "A list of services this service depends on, and their connection status", |
| 159 | + "type": "array", |
| 160 | + "items": { |
| 161 | + "$ref": "#/definitions/serviceHealth" |
| 162 | + } |
| 163 | + } |
| 164 | + } |
| 165 | + }, |
| 166 | + "serviceHealth": { |
| 167 | + "type": "object", |
| 168 | + "properties": { |
| 169 | + "name": { |
| 170 | + "description": "The sub-service name.", |
| 171 | + "type": "string", |
| 172 | + "enum": [ |
| 173 | + "datastore", |
| 174 | + "broker", |
| 175 | + "sidecar", |
| 176 | + "cache" |
| 177 | + ] |
| 178 | + }, |
| 179 | + "status": { |
| 180 | + "description": "The connection status with the service.", |
| 181 | + "type": "string", |
| 182 | + "enum": [ |
| 183 | + "OK", |
| 184 | + "DOWN" |
| 185 | + ] |
| 186 | + } |
| 187 | + } |
| 188 | + } |
| 189 | + } |
| 190 | +} |
| 191 | +``` |
| 192 | + |
| 193 | + |
| 194 | +### Example Requests and Responses: |
| 195 | + |
| 196 | +__Successful Legacy Health Check:__ |
| 197 | + |
| 198 | +```bash |
| 199 | +GET /health HTTP/1.1 |
| 200 | +Content-Type: application/json |
| 201 | + |
| 202 | +200 SUCCESS |
| 203 | +{ |
| 204 | + "status": "OK" |
| 205 | +} |
| 206 | +``` |
| 207 | + |
| 208 | + |
| 209 | +__Successful New Health Check:__ |
| 210 | + |
| 211 | +``` |
| 212 | +GET /health?detailed=true HTTP/1.1 |
| 213 | +Content-Type: application/json |
| 214 | +
|
| 215 | +200 SUCCESS |
| 216 | +{ |
| 217 | + "status": "OK", |
| 218 | + "uptime": 0, |
| 219 | + "started": "2019-05-31T05:09:25.409Z", |
| 220 | + "versionNumber": "5.2.3", |
| 221 | + "services": [ |
| 222 | + { |
| 223 | + "name": "broker", |
| 224 | + "status": "OK" |
| 225 | + } |
| 226 | + ] |
| 227 | +} |
| 228 | +``` |
| 229 | + |
| 230 | +__Failed Health Check, but API is up:__ |
| 231 | + |
| 232 | +``` |
| 233 | +GET /health?detailed=true HTTP/1.1 |
| 234 | +Content-Type: application/json |
| 235 | +
|
| 236 | +502 BAD GATEWAY |
| 237 | +{ |
| 238 | + "status": "DOWN", |
| 239 | + "uptime": 0, |
| 240 | + "started": "2019-05-31T05:09:25.409Z", |
| 241 | + "versionNumber": "5.2.3", |
| 242 | + "services": [ |
| 243 | + { |
| 244 | + "name": "broker", |
| 245 | + "status": "DOWN" |
| 246 | + } |
| 247 | + ] |
| 248 | +} |
| 249 | +``` |
| 250 | + |
| 251 | +__Failed Health Check:__ |
| 252 | + |
| 253 | +``` |
| 254 | +GET /health?detailed=true HTTP/1.1 |
| 255 | +Content-Type: application/json |
| 256 | +
|
| 257 | +503 SERVICE UNAVAILABLE |
| 258 | +``` |
| 259 | + |
| 260 | + |
| 261 | +## Sequence Diagram |
| 262 | + |
| 263 | +Sequence design diagram for the GET Health Check. |
| 264 | + |
| 265 | +{% uml src="mojaloop-technical-overview/central-ledger/assets/diagrams/sequence/seq-get-health-1.0.0.plantuml" %} |
| 266 | +{% enduml %} |
| 267 | + |
| 268 | + |