Commit a42d6a0

Lewis Daly authored and kjw000 committed
Feature/796 improved health check design (mojaloop#67)
* 796 add sequence diagram and design for improved health check
* remove unneeded document
* fix latency typo
* Add clarification about statusEnum
* Remove latency and sidear from design. Also clarify what `DOWN` means for subservices
* Better generalize the sub-service enums and update docs accordingly
1 parent bd2fb75 commit a42d6a0

3 files changed, +603 -0 lines changed

@@ -0,0 +1,268 @@
# GET Health Check

Design discussion for new Health Check implementation.


## Objectives

The goal of this design is to implement a new health check for Mojaloop switch services that allows for a greater level of detail.

It features:
- Clear HTTP statuses (no need to inspect the response to know there are no issues)
- ~Backwards compatibility with existing health checks~ - No longer a requirement. See [this discussion](https://github.com/mojaloop/project/issues/796#issuecomment-498350828).
- Information about the version of the API, and how long it has been running
- Information about sub-service (Kafka, logging sidecar and MySQL) connections

## Request Format
`/health`

Uses the newly implemented health check. As discussed [here](https://github.com/mojaloop/project/issues/796#issuecomment-498350828), implementing the health check adds no connection overhead (e.g. pinging a database), so there is no need to complicate things with separate simple and detailed versions.

Response Codes:
- `200` - Success. The API is up and running, and is successfully connected to necessary services.
- `502` - Bad Gateway. The API is up and running, but cannot connect to a necessary service (e.g. `kafka`).
- `503` - Service Unavailable. This response is not implemented in this design, but will be the default if the API is not up and running.
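
The mapping from overall status to response code could be sketched as below. This is a minimal TypeScript sketch; `Status` and `httpStatusFor` are illustrative names, not taken from the existing services. `503` does not appear because, per the list above, it is what a caller sees when the API itself is not running.

```typescript
// Hypothetical names for illustration; not taken from the Mojaloop codebase.
type Status = 'OK' | 'DOWN'

// Map the overall health status to the HTTP response code listed above.
function httpStatusFor (status: Status): 200 | 502 {
  return status === 'OK' ? 200 : 502
}
```

Keeping the mapping in one place means a monitoring tool only needs the status code, which is the first objective of this design.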

## Response Format

| Name | Type | Description | Example |
| --- | --- | --- | --- |
| `status` | `statusEnum` | The status of the service. Options are `OK` and `DOWN`. _See `statusEnum` below_. | `"OK"` |
| `uptime` | `number` | How long (in seconds) the service has been alive for. | `123456` |
| `started` | `string` (ISO formatted date-time) | When the service was started (UTC) | `"2019-05-31T05:09:25.409Z"` |
| `versionNumber` | `string` (semver) | The current version of the service. | `"5.2.5"` |
| `services` | `Array<serviceHealth>` | A list of services this service depends on, and their connection status | _see below_ |

### serviceHealth

| Name | Type | Description | Example |
| --- | --- | --- | --- |
| `name` | `subServiceEnum` | The sub-service name. _See `subServiceEnum` below_. | `"broker"` |
| `status` | `statusEnum` | The status of the sub-service. Options are `OK` and `DOWN`. _See `statusEnum` below_. | `"OK"` |
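
The two tables above translate fairly directly into types. The following TypeScript sketch is illustrative only: the type names and `buildHealthResponse` are assumptions, and rounding `uptime` down to whole seconds is a choice made here, not something the design specifies.

```typescript
// Illustrative types mirroring the tables above; all names are assumptions.
type StatusEnum = 'OK' | 'DOWN'
type SubServiceEnum = 'datastore' | 'broker' | 'sidecar' | 'cache'

interface ServiceHealth {
  name: SubServiceEnum
  status: StatusEnum
}

interface HealthResponse {
  status: StatusEnum
  uptime: number // seconds since the service started
  started: string // ISO formatted date-time, UTC
  versionNumber: string // semver
  services: ServiceHealth[]
}

// Assemble a response body from the start time, the current time,
// the service version, and the already-computed statuses.
function buildHealthResponse (
  startedAt: Date,
  now: Date,
  versionNumber: string,
  status: StatusEnum,
  services: ServiceHealth[]
): HealthResponse {
  return {
    status,
    // Assumption: uptime is reported in whole seconds.
    uptime: Math.floor((now.getTime() - startedAt.getTime()) / 1000),
    started: startedAt.toISOString(),
    versionNumber,
    services
  }
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the sketch deterministic and easy to test.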

### subServiceEnum

The `subServiceEnum` enum names the sub-service.

Options:
- `datastore` -> The database for this service (typically a MySQL Database).
- `broker` -> The message broker for this service (typically Kafka).
- `sidecar` -> The logging sidecar sub-service this service attaches to.
- `cache` -> The caching sub-service this service attaches to.


### statusEnum

The `statusEnum` enum represents the status of the service or a sub-service.

It has two options:
- `OK` -> The service or sub-service is healthy.
- `DOWN` -> The service or sub-service is unhealthy.

When a service is `OK`, the API is considered healthy, and all sub-services are also considered healthy.

If __any__ sub-service is `DOWN`, then the entire health check will fail, and the API will be considered `DOWN`.
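
The aggregation rule above can be sketched in a few lines. This TypeScript sketch uses hypothetical names (`overallStatus`, `ServiceHealth`); it only illustrates the "any `DOWN` fails the check" rule.

```typescript
// Hypothetical names for illustration.
type StatusEnum = 'OK' | 'DOWN'

interface ServiceHealth {
  name: string
  status: StatusEnum
}

// The service is OK only if every sub-service is OK;
// a single DOWN sub-service makes the whole check DOWN.
// Note: with no sub-services at all, `every` returns true, so the result is OK.
function overallStatus (services: ServiceHealth[]): StatusEnum {
  return services.every(service => service.status === 'OK') ? 'OK' : 'DOWN'
}
```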

## Defining Sub-Service health

Simply pinging a sub-service is not enough to know whether it is healthy; we want to go one step further. The criteria for `OK` differ for each sub-service.

### `datastore`

For `datastore`, a status of `OK` means:
- An existing connection to the database
- The database is not empty (contains more than 1 table)


### `broker`

For `broker`, a status of `OK` means:
- An existing connection to the Kafka broker
- The necessary topics exist. These depend on which service the health check is running for.

For example, for the `central-ledger` service to be considered healthy, the following topics need to be found:
```
topic-admin-transfer
topic-transfer-prepare
topic-transfer-position
topic-transfer-fulfil
```
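
The two `broker` criteria could be checked along these lines. This is a TypeScript sketch under stated assumptions: `brokerStatus` is a hypothetical helper, and `connected` and `availableTopics` stand in for whatever the Kafka client reports; how they are obtained is outside the sketch.

```typescript
type StatusEnum = 'OK' | 'DOWN'

// Required topics from the central-ledger example above.
const CENTRAL_LEDGER_TOPICS: string[] = [
  'topic-admin-transfer',
  'topic-transfer-prepare',
  'topic-transfer-position',
  'topic-transfer-fulfil'
]

// Hypothetical helper: the broker is OK only when there is a connection
// and every required topic appears in the list of available topics.
function brokerStatus (
  connected: boolean,
  availableTopics: string[],
  requiredTopics: string[]
): StatusEnum {
  if (!connected) return 'DOWN'
  const allFound = requiredTopics.every(topic => availableTopics.indexOf(topic) !== -1)
  return allFound ? 'OK' : 'DOWN'
}
```

Because the required topics are passed in, each service can supply its own list without changing the check itself.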

### `sidecar`

For `sidecar`, a status of `OK` means:
- An existing connection to the sidecar


### `cache`

For `cache`, a status of `OK` means:
- An existing connection to the cache


## Swagger Definition

>_Note: These will be added to the existing swagger definitions for the following services:_
> - `ml-api-adapter`
> - `central-ledger`
> - `central-settlement`
> - `central-event-processor`
> - `email-notifier`

```json
{
  // . . .
  "/health": {
    "get": {
      "operationId": "getHealth",
      "tags": [
        "health"
      ],
      "responses": {
        "default": {
          "schema": {
            "$ref": "#/definitions/health"
          },
          "description": "Successful"
        }
      }
    }
  },
  // . . .
  "definitions": {
    "health": {
      "type": "object",
      "properties": {
        "status": {
          "type": "string",
          "enum": [
            "OK",
            "DOWN"
          ]
        },
        "uptime": {
          "description": "How long (in seconds) the service has been alive for.",
          "type": "number"
        },
        "started": {
          "description": "When the service was started (UTC)",
          "type": "string",
          "format": "date-time"
        },
        "versionNumber": {
          "description": "The current version of the service.",
          "type": "string",
          "example": "5.2.3"
        },
        "services": {
          "description": "A list of services this service depends on, and their connection status",
          "type": "array",
          "items": {
            "$ref": "#/definitions/serviceHealth"
          }
        }
      }
    },
    "serviceHealth": {
      "type": "object",
      "properties": {
        "name": {
          "description": "The sub-service name.",
          "type": "string",
          "enum": [
            "datastore",
            "broker",
            "sidecar",
            "cache"
          ]
        },
        "status": {
          "description": "The connection status with the service.",
          "type": "string",
          "enum": [
            "OK",
            "DOWN"
          ]
        }
      }
    }
  }
}
```


### Example Requests and Responses:

__Successful Legacy Health Check:__

```
GET /health HTTP/1.1
Content-Type: application/json

HTTP/1.1 200 OK
{
  "status": "OK"
}
```


__Successful New Health Check:__

```
GET /health?detailed=true HTTP/1.1
Content-Type: application/json

HTTP/1.1 200 OK
{
  "status": "OK",
  "uptime": 0,
  "started": "2019-05-31T05:09:25.409Z",
  "versionNumber": "5.2.3",
  "services": [
    {
      "name": "broker",
      "status": "OK"
    }
  ]
}
```

__Failed Health Check, but API is up:__

```
GET /health?detailed=true HTTP/1.1
Content-Type: application/json

HTTP/1.1 502 Bad Gateway
{
  "status": "DOWN",
  "uptime": 0,
  "started": "2019-05-31T05:09:25.409Z",
  "versionNumber": "5.2.3",
  "services": [
    {
      "name": "broker",
      "status": "DOWN"
    }
  ]
}
```
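
A caller that receives a `502` can read the body to find out which sub-service failed. The following TypeScript sketch assumes the response shape described in this document; `downServices` is a hypothetical helper and does no validation beyond `JSON.parse`.

```typescript
// Only the fields this helper needs; names follow the response format in this document.
interface ServiceHealth {
  name: string
  status: 'OK' | 'DOWN'
}

interface HealthResponse {
  status: 'OK' | 'DOWN'
  services: ServiceHealth[]
}

// Return the names of the sub-services reported as DOWN in a health response body.
// Assumes `body` is valid JSON in the documented shape; JSON.parse throws otherwise.
function downServices (body: string): string[] {
  const health = JSON.parse(body) as HealthResponse
  return health.services
    .filter(service => service.status === 'DOWN')
    .map(service => service.name)
}
```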

__Failed Health Check:__

```
GET /health?detailed=true HTTP/1.1
Content-Type: application/json

HTTP/1.1 503 Service Unavailable
```


## Sequence Diagram

Sequence design diagram for the GET Health Check.

{% uml src="mojaloop-technical-overview/central-ledger/assets/diagrams/sequence/seq-get-health-1.0.0.plantuml" %}
{% enduml %}

![](../assets/diagrams/sequence/seq-get-health-1.0.0.svg)
