Commit 6ccdcd6

feat(mojaloop/#3424): analyse als perf results
feat(mojaloop/#3424): analyse als perf results - mojaloop/project#3424 - Added analysis for scenarios 1-14
feat(mojaloop/#3400): benchmarking performance for als - mojaloop/project#3400 - Updated comments based on PR review.
1 parent a03e42b commit 6ccdcd6

13 files changed (+212 -87 lines)


20230726/s1-1690367402771/README.md (+19 -2)

@@ -1,5 +1,13 @@
 # Scenario 1 - ALS-bypass Baseline with Sims-only
 
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. ADMIN GET /participants request to the Central-Ledger to validate the payerFspId. <-- sync response
+2. ADMIN GET /participants request to the Central-Ledger to validate the payeeFspId. <-- sync response
+3. ORACLE GET /participants request to the Oracle to resolve the FSPID for the payeeId. <-- sync response
+4. FSPIOP GET /parties request to the ALS. <-- async callback response
+5. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
+
 ```conf
 var-testid=1690367402771
 params=&var-testid=1690367402771&from=1690367297867&to=1690368635328
@@ -46,8 +54,17 @@ params=&var-testid=1690367402771&from=1690367297867&to=1690368635328
 
 ## Observations
 
-TBD
+- The `Callback-Handler` Simulator Service is able to handle `400+ Ops/s` End-to-end while sustaining an average duration of just over `2ms`. This is shown by the following dashboards/metrics:
+  - [K6](./images/Official%20k6%20Test%20Result.png)
+    - `Iteration Rate` (Mean) = `461 Ops/s`
+    - `Iteration Duration (avg)` (Mean) = `2.22ms`
+  - [Callback Handler Svc](./images/Supporting%20Services%20-%20Callback%20Hander%20Service.png)
+    - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` graph. Note that the Mean includes the pre/post run.
+    - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Performance Timing Calculations`. The Mean is `1.86ms`.
+    - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` metrics fall in line with the observed `Ops/s`; most of the duration is spent in the `request` because the Callback-Handler sends out the **Async** `FSPIOP PUT /parties` callback response.
+  - [Docker Node Monitoring](./images/docker-prometheus-monitoring.png)
+    - `Callback-Handler` services show no observable resource constraint in either memory or CPU usage.
 
 ## Recommendations
 
-TBD
+- Observe `Scenario #2+` and compare the `Callback-Handler`'s metrics against this **baseline** to determine whether there are any issues with either the Mocked Simulators (i.e. `Callback-Handlers`) or the **Async** `FSPIOP PUT /parties` callback response.

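For readers unfamiliar with the flow described in the Scenario 1 iteration list above, the following is a minimal k6 sketch of what one such iteration could look like. It is not the actual test script: the service URLs, ports, FSP IDs, party identifiers, headers and the `Callback-Handler` WebSocket subscription scheme are all hypothetical placeholders.

```js
// Minimal, hypothetical k6 sketch of one Scenario 1 iteration (not the real test script).
import http from 'k6/http';
import ws from 'k6/ws';
import { check } from 'k6';

const CL_ADMIN = 'http://central-ledger:3001';     // assumed Central-Ledger admin API
const ORACLE = 'http://oracle-sim:8444';           // assumed Oracle simulator
const ALS = 'http://account-lookup-service:4002';  // assumed ALS FSPIOP API
const CBS_WS = 'ws://callback-handler:3002';       // assumed Callback-Handler WS endpoint

export default function () {
  // 1 & 2. ADMIN GET /participants to validate the payer and payee FSP IDs (sync responses).
  check(http.get(`${CL_ADMIN}/participants/perffsp1`), { 'payer ok': (r) => r.status === 200 });
  check(http.get(`${CL_ADMIN}/participants/perffsp2`), { 'payee ok': (r) => r.status === 200 });

  // 3. ORACLE GET /participants to resolve the FSPID for the payee ID (sync response).
  check(http.get(`${ORACLE}/participants/MSISDN/19012345678`), { 'oracle ok': (r) => r.status === 200 });

  // 5. Subscribe to the Callback-Handler for the async callback notification, then
  // 4. send the FSPIOP GET /parties request and wait for the PUT /parties callback.
  ws.connect(`${CBS_WS}/?subscription=19012345678`, {}, (socket) => {
    socket.on('open', () => {
      const res = http.get(`${ALS}/parties/MSISDN/19012345678`, {
        headers: { 'FSPIOP-Source': 'perffsp1' },
      });
      check(res, { 'GET /parties accepted': (r) => r.status === 202 });
    });
    socket.on('message', () => socket.close());     // callback received -> iteration complete
    socket.setTimeout(() => socket.close(), 5000);  // safety timeout
  });
}
```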
20230726/s2-1690376653994/README.md (+29 -3)

@@ -1,4 +1,9 @@
-# Scenario 2 - ALS Baseline with Sims-only
+# Scenario 2 - ALS Baseline with Sims
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 var-testid=1690376653994
@@ -50,8 +55,29 @@ ACCOUNT_LOOKUP_SERVICE_VERSION=v14.2.2
 
 ## Observations
 
-TBD
+- The `Account-Lookup-Service` and the `Callback-Handler` Simulator Service are able to handle `10 Ops/s` End-to-end while sustaining an average duration of just over `100ms`. This is shown by the following dashboards/metrics:
+  - [K6](./images/Official%20k6%20Test%20Result.png)
+    - `Iteration Rate` (Mean) = `10 Ops/s`
+    - `Iteration Duration (avg)` (Mean) = `101ms`
+  - [Callback Handler Svc](./images/Supporting%20Services%20-%20Callback%20Hander%20Service.png)
+    - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` graph. Note that the Mean includes the pre/post run.
+    - `op:fspiop_put_parties_end2end - success:true` - the `E2E, Request, Response Performance Timing Calculations` fall in line with the observed duration.
+    - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` metrics fall in line with the observations.
+    - The ingress `Ops/s` for `op:admin_get_participants_endpoints` is very low due to the ALS's caching mechanism.
+  - [Account-Lookup-Service](./images/dashboard-account-lookup-service.png)
+    - Egress and Ingress metrics for `getPartiesByTypeAndId` and `putPartiesByTypeAndId` are similar for both `duration` and `Op/s`.
+    - Most of the `duration` is observed to be spent on the `validateParticipant` egress @ `30ms`.
+    - Most of the egress `Op/s` is observed on `validateParticipant` @ `30 Op/s`, a three-fold increase over the other egress metrics.
+  - [Docker Node Monitoring](./images/docker-prometheus-monitoring.png)
+    - `Account-Lookup-Service` is showing `100%` CPU usage (equivalent to a single core of the host machine), indicating that it is most likely CPU constrained.
+    - `Callback-Handler` is within the bounds of `Scenario #1`.
 
 ## Recommendations
 
-TBD
+- Investigate the logic behind the `validateParticipant` egress implementation.
+- Consider implementing a **caching** mechanism for the `validateParticipant` egress, as it is called TWICE for each leg of the Request and the Callback Response to validate the Payer and Payee FSP.
+- Investigate the `Account-Lookup-Service`'s high CPU usage by removing configurable factors that may impact CPU usage, i.e.
+  - Logging
+  - Event Audits
+- Increase `UV_THREADPOOL_SIZE` for IO threads.
+- Investigate enabling `HTTP Keep-Alive` for egress HTTP requests, especially `validateParticipant`.

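One way to prototype the caching recommendation above is a small in-memory TTL cache wrapped around the `validateParticipant` egress call. The sketch below is illustrative only: `fetchParticipantFromCentralLedger`, the TTL and the cache shape are assumptions, not the ALS implementation, and a real cache would also need invalidation tied to participant lifecycle changes.

```js
// Hypothetical TTL cache in front of the validateParticipant egress call.
// `fetchParticipantFromCentralLedger` stands in for the real ALS egress request.
const CACHE_TTL_MS = 30_000;  // assumption: 30s of staleness is acceptable for participant data
const cache = new Map();      // fspId -> { value, expiresAt }

async function validateParticipantCached(fspId, fetchParticipantFromCentralLedger) {
  const hit = cache.get(fspId);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value;                                           // cache hit: no egress request
  }
  const value = await fetchParticipantFromCentralLedger(fspId); // cache miss: single egress call
  cache.set(fspId, { value, expiresAt: Date.now() + CACHE_TTL_MS });
  return value;
}

module.exports = { validateParticipantCached };
```

Since the Payer and Payee FSPs are validated twice per End-to-end operation against a small, stable set of FSPs, even a short TTL would be expected to remove most of the `validateParticipant` egress traffic observed above.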
20230726/s6-1690380087112/README.md (+10 -3)

@@ -1,4 +1,9 @@
-# Scenario 6 - ALS Baseline ALS Baseline with Sims-only, HTTP-Keep-Alive enabled
+# Scenario 6 - ALS Baseline with Sims, HTTP-Keep-Alive enabled
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690380087112
@@ -54,8 +59,10 @@ ACCOUNT_LOOKUP_SERVICE_VERSION=local
 
 ## Observations
 
-TBD
+- No observable difference compared to `Scenario #2`.
+- Possibly no observable impact due to the low throughput (i.e. `10 Op/s`).
 
 ## Recommendations
 
-TBD
+- Same as `Scenario #2`.
+- Consider re-running this scenario once an increase in throughput has been achieved.

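As background to the HTTP-Keep-Alive setting exercised in this scenario: enabling keep-alive for egress requests in a Node.js service usually comes down to passing a keep-alive `Agent` to the HTTP client so TCP connections are reused rather than re-established per request. A minimal sketch with Node's core `http` module follows; the URL is a placeholder and the actual ALS egress code path may use a different HTTP client.

```js
// Sketch: reuse TCP connections for egress requests via a shared keep-alive agent.
const http = require('http');

const keepAliveAgent = new http.Agent({ keepAlive: true, maxSockets: 50 });

function getJson(url) {
  return new Promise((resolve, reject) => {
    http.get(url, { agent: keepAliveAgent }, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

// Usage (placeholder URL):
// getJson('http://central-ledger:3001/participants/perffsp1').then(console.log);
```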
20230726/s7-1690407403663/README.md (+13 -2)

@@ -1,4 +1,9 @@
-# Scenario 1 - ALS-bypass Baseline with Sims-only
+# Scenario 7 - ALS Baseline with Sims, UV_THREADS
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```yaml
 testid:
@@ -8,7 +13,7 @@ testid:
 params: &from=1690407309197&to=1690408662339
 ## Added for Test Scenario 7
 UV_THREADPOOL_SIZE:
-- 4 (default)
+- 4 ## default
 - 8
 - 16
 ```
@@ -57,4 +62,10 @@ UV_THREADPOOL_SIZE:
 
 ## Observations
 
+- No observable difference compared to `Scenario #2`.
+- Possibly no observable impact due to the low throughput (i.e. `10 Op/s`).
+
 ## Recommendations
+
+- Same as `Scenario #2`.
+- Consider re-running this scenario once an increase in throughput has been achieved.

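For context on the `UV_THREADPOOL_SIZE` variable swept in this scenario: it only affects work that libuv schedules on its thread pool (file I/O, DNS lookups via `getaddrinfo`, some `crypto` and `zlib` calls); plain network sockets are handled on the event loop, which may be one reason no difference was observed at this throughput. The standalone sketch below, unrelated to the ALS code itself, makes the effect visible with `crypto.pbkdf2`.

```js
// Run twice, e.g.: UV_THREADPOOL_SIZE=4 node threadpool-demo.js
//             and: UV_THREADPOOL_SIZE=16 node threadpool-demo.js
const crypto = require('crypto');

const start = Date.now();
for (let i = 0; i < 16; i++) {
  // pbkdf2 runs on the libuv thread pool, so concurrency is capped by UV_THREADPOOL_SIZE.
  crypto.pbkdf2('password', 'salt', 200_000, 64, 'sha512', () => {
    console.log(`task ${i} finished after ${Date.now() - start} ms`);
  });
}
// With a pool of 4 the tasks complete in waves of 4; with 16 they finish in roughly one wave.
```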
20230727/s10-1690466917636/README.md (+33 -10)

@@ -1,4 +1,9 @@
-# Scenario 8 - ALS Baseline with Sims-only, multiple k6 VUs
+# Scenario 10 - ALS Baseline with Sims, Disabled JSON.stringify in ALS
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690466917636
@@ -10,7 +15,7 @@ EVENT_SDK_ASYNC_OVERRIDE_EVENTS=""
 EVENT_SDK_LOG_FILTER=""
 ## Added for Test Scenario 7
 UV_THREADPOOL_SIZE=16
-ACCOUNT_LOOKUP_SERVICE_VERSION=v14.2.2
+ACCOUNT_LOOKUP_SERVICE_VERSION=v14.2.3
 ## Changes for Test Scenario 9
 Enabled in-memory storage for Mysql ALS with following config in docker-compose file
 ## Changes for Test Scenario 10
@@ -49,16 +54,34 @@ Disabled JSON.stringify in logResponse function of ALS
 
 ## Snapshots
 
-- [Docker]()
-- [K6]()
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
+N/A
 
 ## Observations
 
-- It seems JSON.stringify is a demanding operation and causing bottleneck for HTTP sync responses also.
+- The `Account-Lookup-Service` and the `Callback-Handler` Simulator Service are able to handle `100 Ops/s` End-to-end while sustaining an average duration of around `14ms`. This is shown by the following dashboards/metrics:
+  - [K6](./images/Official%20k6%20Test%20Result.png)
+    - `Iteration Rate` (Mean) = `100 Ops/s`
+    - `Iteration Duration (avg)` (Mean) = `14ms`
+  - [Callback Handler Svc](./images/Supporting%20Services%20-%20Callback%20Hander%20Service.png)
+    - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` graph. Note that the Mean includes the pre/post run.
+    - `op:fspiop_put_parties_end2end - success:true` - the `E2E, Request, Response Performance Timing Calculations` fall in line with the observed duration.
+    - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` metrics fall in line with the observations.
+  - [Account-Lookup-Service](./images/dashboard-account-lookup-service.png)
+    - Egress and Ingress metrics for `getPartiesByTypeAndId` and `putPartiesByTypeAndId` are similar for both `duration` and `Op/s`.
+    - The `validateParticipant` `duration` is @ `1.32ms` (vs `30ms` previously), which is in line with the other egress metrics.
+    - The `validateParticipant` `Op/s` is near `300 Op/s` (vs `30` previously), which is in line with `Scenario #2` and the three-fold increase over the End-to-end `100 Op/s`.
+  - [Docker Node Monitoring](./images/docker-prometheus-monitoring.png)
+    - `Account-Lookup-Service` is showing a reduced CPU usage of `70%` (down from `100%`), indicating that the service is able to do more if the K6 VUs are increased.
+    - `Callback-Handler` is within the bounds of `Scenario #1`.
+
+- `JSON.stringify` is a demanding operation that causes a bottleneck on the NodeJS `Event-loop`, impacting the End-to-end `Op/s` and `duration`.
+  - Comparing the `Event-Loop Lag` on the [NodeJS Application Dashboard for ALS](./images/NodeJS%20Application%20Dashboard%20ALS.png) for this Scenario vs [Scenario #2](../../20230726/s2-1690376653994/images/NodeJS%20Application%20Dashboard-moja_als.png), we can see a large difference in the delay introduced by the `JSON.stringify` operation blocking the `Event-loop`:
+    - `Scenario #2` - Mean `13.3 ms`, Max `24.2 ms`, Min `11.1 ms`
+    - `Scenario #10` - Mean `2.41 ms`, Max `7.67 ms`, Min `1.92 ms`
 
 ## Recommendations
+
+- Run `Scenarios #3 -> #4` (i.e. `#11 -> #14`) to determine if the scalability of the `Account-Lookup-Service` is linear.
+- Run `Scenarios #5 -> #9` to see if there are any observable differences due to the increased End-to-end `Op/s` and reduced `duration`.
+- Consider implementing a **caching** mechanism for the `validateParticipant` egress, as it is called TWICE for each leg of the Request and the Callback Response to validate the Payer and Payee FSP.
+- Monitor the `Event-Loop Lag` for a mean of `3 ms` or more, especially when a process seems to be CPU constrained. In such scenarios, it is recommended to profile the service to see if any obvious `Event-Loop` "blockers" can be identified.

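The `Event-Loop Lag` figures above come from the Grafana NodeJS application dashboards; the same signal can also be sampled in-process with Node's `perf_hooks`, which is one way to act on the "monitor for a mean of `3 ms` or more" recommendation. A minimal sketch, with the threshold and reporting interval as assumed values:

```js
// Sketch: sample event-loop delay and warn when the mean exceeds ~3 ms.
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
histogram.enable();

const THRESHOLD_MS = 3;         // assumed alert threshold, per the recommendation above
const REPORT_INTERVAL_MS = 10_000;

setInterval(() => {
  const meanMs = histogram.mean / 1e6; // histogram values are reported in nanoseconds
  const maxMs = histogram.max / 1e6;
  if (meanMs > THRESHOLD_MS) {
    console.warn(`event-loop lag high: mean=${meanMs.toFixed(2)} ms, max=${maxMs.toFixed(2)} ms`);
  }
  histogram.reset();             // start a fresh measurement window
}, REPORT_INTERVAL_MS).unref();
```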
20230727/s11-1690488145504/README.md (+29 -9)

@@ -1,4 +1,9 @@
-# Scenario 11 - ALS Baseline with Sims-only, Disabled JSON.stringify. ALS v14.2.3 + 4x k6 VUs.
+# Scenario 11 - ALS Baseline with Sims, Disabled JSON.stringify. ALS v14.2.3 + 4x k6 VUs
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690488145504
@@ -51,16 +56,31 @@ Up k6s VUs to 4.
 
 ## Snapshots
 
-- [Docker]()
-- [K6]()
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
+N/A
 
 ## Observations
 
-Ops/s have increased compared to scenario #10.
+- End-to-end `Ops/s` has increased by `20%` compared to `Scenario #10`.
+- End-to-end `duration` has increased by `100%` compared to `Scenario #10`.
+- The `Account-Lookup-Service` and the `Callback-Handler` Simulator Service are able to handle `120 Ops/s` End-to-end while sustaining an average duration of around `30ms`. This is shown by the following dashboards/metrics:
+  - [K6](./images/Official%20k6%20Test%20Result.png)
+    - `Iteration Rate` (Mean) = `113 Ops/s`
+    - `Iteration Duration (avg)` (Mean) = `27.2 ms`
+  - [Callback Handler Svc](./images/Supporting%20Services%20-%20Callback%20Hander%20Service.png)
+    - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` graph. Note that the Mean includes the pre/post run.
+    - `op:fspiop_put_parties_end2end - success:true` - the `E2E, Request, Response Performance Timing Calculations` fall in line with the observed duration.
+    - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` metrics fall in line with the observations.
+  - [Account-Lookup-Service](./images/dashboard-account-lookup-service.png)
+    - Egress and Ingress metrics for `getPartiesByTypeAndId` and `putPartiesByTypeAndId` are similar for both `duration` and `Op/s`.
+    - The `validateParticipant` metrics are in line with the observations.
+  - [Docker Node Monitoring](./images/docker-prometheus-monitoring.png)
+    - `Account-Lookup-Service` is showing an increased CPU usage of just over `100%`, indicating that the service is near its limit.
+    - `Callback-Handler` is within the bounds of `Scenario #1`.
+- Comparing the `Event-Loop Lag` on the [NodeJS Application Dashboard for ALS](./images/NodeJS%20Application%20Dashboard%20ALS.png) against the previous scenario shows no major difference:
+  - `Scenario #10` - Mean `2.41 ms`, Max `7.67 ms`, Min `1.92 ms`
+  - `Scenario #11` - Mean `2.36 ms`, Max `3.82 ms`, Min `1.86 ms`
 
 ## Recommendations
+
+- Additional profiling of the `Account-Lookup-Service` is required to further identify unnecessary or unoptimized blocking operations that may impact the NodeJS `Event Loop`. Removing or optimizing these should increase the overall End-to-end throughput while minimizing the duration.
+- Same as `Scenario #10` (minus this scenario).

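As an illustration of the kind of `Event-Loop` "blocker" that the profiling recommended above would surface, the standalone micro-benchmark below times a single large `JSON.stringify` call; for its entire duration the event loop is blocked and no other request can be served. The payload size and shape are arbitrary assumptions, not ALS data.

```js
// Sketch: measure how long one synchronous JSON.stringify call blocks the event loop.
const payload = {
  parties: Array.from({ length: 50_000 }, (_, i) => ({
    partyIdType: 'MSISDN',
    partyIdentifier: String(19000000000 + i),
    fspId: `perffsp${i % 8}`,
  })),
};

const start = process.hrtime.bigint();
const body = JSON.stringify(payload);  // synchronous: nothing else runs until this returns
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(`stringified ${(body.length / 1e6).toFixed(1)} MB in ${elapsedMs.toFixed(1)} ms`);
```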
20230727/s12-1690491686694/README.md (+10 -9)

@@ -1,4 +1,9 @@
-# Scenario 12 - ALS Baseline with Sims-only, Disabled JSON.stringify. ALS v14.2.3 + ALS Scale 2.
+# Scenario 12 - ALS Baseline with Sims, Disabled JSON.stringify. ALS v14.2.3 + ALS Scale 2
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690491686694
@@ -51,16 +56,12 @@ Up ALS scaling to 2.
 
 ## Snapshots
 
-- [Docker]()
-- [K6]()
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
+N/A
 
 ## Observations
 
-K6s could not saturate 2 replicas of the ALS.
+- Similar observation to `Scenario #10`: 1x K6 VU could not saturate 2 replicas of the ALS.
 
 ## Recommendations
+
+- Re-run the same scenario with more K6 VUs to match the `Account-Lookup-Service`'s scaling factor.

20230727/s13-1690493569083/README.md (+12 -8)

@@ -1,4 +1,9 @@
-# Scenario 13 - ALS Baseline with Sims-only, Disabled JSON.stringify. ALS v14.2.3 + + Scale 4 + 4x k6 VUs.
+# Scenario 13 - ALS Baseline with Sims, Disabled JSON.stringify. ALS v14.2.3 + ALS Scale 4 + 4x k6 VUs
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690493569083
@@ -52,14 +57,13 @@ Up k6s VUs to 4.
 
 ## Snapshots
 
-- [Docker]()
-- [K6]()
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
+N/A
 
 ## Observations
 
+- End-to-end `200 Op/s` with a mean duration of `18.3 ms` achieved.
+- Scalability looks near linear.
+
 ## Recommendations
+
+- Increase the `Account-Lookup-Service` scaling factor to determine if scaling is linear.

20230727/s14-1690504862678/README.md (+6 -8)

@@ -54,15 +54,13 @@ Up k6s VUs to 6.
 
 ## Snapshots
 
-- [Docker]()
-- [K6]()
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
+N/A
 
 ## Observations
-Observed the iteration rate is 242 ops/sec max
+
+- End-to-end max of `242 Op/s` with a mean duration of `23.0 ms` achieved --> scalability is not linear.
 
 ## Recommendations
+
+- Consider implementing a **caching** mechanism for the `validateParticipant` egress, as it is called TWICE for each leg of the Request and the Callback Response to validate the Payer and Payee FSP.
+- Profile the `Account-Lookup-Service` to see if any further `Event-Loop` "blockers" can be identified.

20230727/s5-1690447500991/README.md (+12 -1)

@@ -1,4 +1,9 @@
-# Scenario 1 - ALS-bypass Baseline with Sims-only
+# Scenario 5 - ALS Baseline with Sims, no logs/audit-events
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690447500991
@@ -56,4 +61,10 @@ ACCOUNT_LOOKUP_SERVICE_VERSION=v14.2.2
 
 ## Observations
 
+- No observable difference compared to `Scenario #2`.
+- Possibly no observable impact due to the low throughput (i.e. `10 Op/s`).
+
 ## Recommendations
+
+- Same as `Scenario #2`.
+- Consider re-running this scenario once an increase in throughput has been achieved.

20230727/s8-1690457241591/README.md (+12 -8)

@@ -1,4 +1,9 @@
-# Scenario 8 - ALS Baseline with Sims-only, multiple k6 VUs
+# Scenario 8 - ALS Baseline with Sims, multiple k6 VUs
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690457241591
@@ -47,16 +52,15 @@ Increased target VUs from 1 to 5
 
 ## Snapshots
 
-- [Docker]()
 - [K6](https://snapshots.raintank.io/dashboard/snapshot/yCQaL9Qz7WcFDcH2v4Yik9vWR1WuO55f?orgId=2)
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
 
 ## Observations
 
-- Http response time (Sync response) increased proportional to number VUs which is weird.
+- Minimal observable difference compared to `Scenario #2`; the same observations apply, in addition to:
+  - HTTP response time (sync response) increased proportionally to the number of VUs, which is unexpected.
+  - Possibly no observable impact due to the low throughput (i.e. `10 Op/s`).
 
 ## Recommendations
+
+- Same as `Scenario #2`.
+- Consider re-running this scenario once an increase in throughput has been achieved.
