feat(mojaloop/#3424): analyse als perf results - mojaloop/project#3424
- Added analysis for scenarios 1-14
feat(mojaloop/#3400): benchmarking performance for als - mojaloop/project#3400
- Updated comments based on PR review.
- `Callback-Handler` Simulator Service is able to handle `400+ Ops/s` End-to-end, while sustaining an average duration of just over `2ms`. This is shown by the following dashboards/metrics:
  - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` Graph. Note the Mean includes the pre/post run.
  - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Performance Timing Calculations`. Mean is `1.86ms`.
  - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` fall in line with the observed `Ops/s`, and most of the `request` duration is spent on the Callback-Handler sending out the **Async** `FSPIOP PUT /parties` callback response.
- `Callback-Handler` services show no observable resource constraint in either memory or CPU usage.
## Recommendations
- Observe `Scenario #2+` and compare the `Callback-Handler`'s metrics against this **baseline** to determine if there are any issues with either the Mocked Simulators (i.e. `Callback-Handlers`) or the **Async**`FSPIOP PUT /parties` callback response.
- `Account-Lookup-Service` and the `Callback-Handler` Simulator Service are able to handle `10 Ops/s` End-to-end, while sustaining an average duration of just over `100ms`. This is shown by the following dashboards/metrics:
  - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` Graph. Note the Mean includes the pre/post run.
  - `op:fspiop_put_parties_end2end - success:true` - observe that the `E2E, Request, Response Performance Timing Calculations` fall in line with the observed duration.
  - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` fall in line with the observations.
  - The ingress `Ops/s` for `op:admin_get_participants_endpoints` is very low due to the ALS's caching mechanism.
- `Account-Lookup-Service` is showing `100%` CPU usage (equivalent to a single core of the host machine), indicating that it is most likely CPU constrained.
- `Callback-Handler` is within the bounds of `Scenario #1`.
## Recommendations
- Investigate the logic behind the `validateParticipant` egress implementation.
- Consider implementing a **caching** mechanism for the `validateParticipant` egress, as it is called TWICE for each leg of the Request and the Callback Response to validate the Payer and Payee FSP.
- Investigate `Account-Lookup-Service` high CPU usage by removing configurable factors that may impact CPU usage, i.e.
  - Logging
  - Event Audits
- Increase `UV_THREADPOOL_SIZE` for IO threads.
- Investigate enabling `HTTP Keep-Alive` for egress HTTP requests, especially `validateParticipants`.
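The caching recommendation above can be sketched as a small in-process TTL cache in front of the egress call. This is a minimal illustration only, not the ALS implementation: `fetchParticipant` and the 30s TTL are hypothetical stand-ins.

```javascript
// Minimal TTL cache sketch for participant-validation results.
// `fetchParticipant` stands in for the real egress HTTP call (hypothetical).
class TtlCache {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() - entry.at > this.ttlMs) {
      this.entries.delete(key); // expired, fall through to a fresh fetch
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    this.entries.set(key, { value, at: Date.now() });
  }
}

const participantCache = new TtlCache(30_000); // 30s TTL, tune per deployment

async function validateParticipantCached(fspId, fetchParticipant) {
  const cached = participantCache.get(fspId);
  if (cached !== undefined) return cached; // cache hit: no egress call
  const result = await fetchParticipant(fspId); // real egress happens here
  participantCache.set(fspId, result);
  return result;
}
```

Since `validateParticipant` is invoked twice per leg of the Request and Callback Response, even a short TTL would collapse most of those calls into cache hits.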
Enabled in-memory storage for MySQL ALS with the following config in the docker-compose file.
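The config itself is not reproduced in this diff; a common way to keep MySQL's data in memory is a `tmpfs` mount over the data directory, sketched below. The service name and image tag are assumptions, not the actual compose file.

```yaml
# Hypothetical docker-compose fragment: mount MySQL's data directory on tmpfs
# so disk IO is removed from the benchmark. Data is lost on container restart.
services:
  mysql-als:
    image: mysql:8.0
    tmpfs:
      - /var/lib/mysql
```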
## Changes for Test Scenario 10
Disabled `JSON.stringify` in the `logResponse` function of the ALS.
## Snapshots
N/A
## Observations
- `Account-Lookup-Service` and the `Callback-Handler` Simulator Service are able to handle `100 Ops/s` End-to-end, while sustaining an average duration of around `14ms`. This is shown by the following dashboards/metrics:
  - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` Graph. Note the Mean includes the pre/post run.
  - `op:fspiop_put_parties_end2end - success:true` - observe that the `E2E, Request, Response Performance Timing Calculations` fall in line with the observed duration.
  - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` fall in line with the observations.
  - Egress and Ingress metrics for `getPartiesByTypeAndId` and `putPartiesByTypeAndId` are similar for both `duration` and `Op/s`.
  - The `validateParticipant` `duration` is at `1.32ms` (vs `30ms`), which is in line with other egress operations.
  - The `validateParticipant` `Op/s` is at nearly `300 Op/s` (vs `30`), which is in line with `Scenario #2` and the three-fold increase over the End-to-end `100 Op/s`.
- `Account-Lookup-Service` is showing a reduced CPU usage of `70%` (down `30%` from `100%`), indicating that the service is able to do more if K6 VUs are increased.
- `Callback-Handler` is within the bounds of `Scenario #1`.
- `JSON.stringify` is a demanding operation which was causing a bottleneck on the NodeJS `Event-loop`, impacting the End-to-end `Op/s` and `duration`.
- Comparing the `Event-Loop Lag` on the [NodeJS Application Dashboard for ALS](./images/NodeJS%20Application%20Dashboard%20ALS.png) on this Scenario vs [Scenario #2](../../20230726/s2-1690376653994/images/NodeJS%20Application%20Dashboard-moja_als.png), we can see a huge difference in the delay introduced by the `JSON.stringify` operation blocking the `Event-loop`:
  - `Scenario #2` - Mean `13.3 ms`, Max `24.2 ms`, Min `11.1 ms`
  - `Scenario #10` - Mean `2.41 ms`, Max `7.67 ms`, Min `1.92 ms`
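Rather than removing the serialization outright, the `JSON.stringify` cost in a logging path can also be made conditional on the log level, so the hot path never pays it. A hedged sketch; the `logger` interface shown here is an assumption, not the ALS's actual logger:

```javascript
// Sketch: only pay the JSON.stringify cost when debug logging is enabled.
// `logger` is a hypothetical stand-in for the service's real logger.
function logResponse(logger, response) {
  if (!logger.isDebugEnabled) return; // skip serialization entirely on the hot path
  // The expensive, Event-loop-blocking serialization happens only here.
  logger.debug(`response: ${JSON.stringify(response)}`);
}
```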
## Recommendations
- Run `Scenarios #3 -> #4` (i.e. `#11 -> #14`) to determine if the scalability of the `Account-Lookup-Service` is linear.
- Run `Scenarios #5 -> #9` to see if there are any observable differences due to the increase in End-to-end `Op/s` and reduced `duration`.
- Consider implementing a **caching** mechanism for the `validateParticipant` egress, as it is called TWICE for each leg of the Request and the Callback Response to validate the Payer and Payee FSP.
- Monitor the `Event-Loop Lag` for a mean of `3 ms` or more, especially when a process seems to be CPU constrained. In such scenarios, it is recommended to profile the service to see if any obvious `Event-Loop` "blockers" can be identified.
`20230727/s11-1690488145504/README.md`

# Scenario 11 - ALS Baseline with Sims, Disabled JSON.stringify. ALS v14.2.3 + 4x k6 VUs
The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
1. `FSPIOP GET /parties` request to the ALS <-- async callback response
2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications
```conf
testid=1690488145504
```

Up k6 VUs to 4.
## Snapshots
N/A
## Observations
- End-to-end `Ops/s` has increased by `20%` when compared to `Scenario #10`.
- End-to-end `duration` has increased by `100%` when compared to `Scenario #10`.
- `Account-Lookup-Service` and the `Callback-Handler` Simulator Service are able to handle `120 Ops/s` End-to-end, while sustaining an average duration of around `30ms`. This is shown by the following dashboards/metrics:
  - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` Graph. Note the Mean includes the pre/post run.
  - `op:fspiop_put_parties_end2end - success:true` - observe that the `E2E, Request, Response Performance Timing Calculations` fall in line with the observed duration.
  - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` fall in line with the observations.
- `Account-Lookup-Service` is showing increased CPU usage of just over `100%`, indicating that the service is near its limit.
- `Callback-Handler` is within the bounds of `Scenario #1`.
- Comparing the `Event-Loop Lag` on the [NodeJS Application Dashboard for ALS](./images/NodeJS%20Application%20Dashboard%20ALS.png) between previous scenarios shows no major difference:
  - `Scenario #10` - Mean `2.41 ms`, Max `7.67 ms`, Min `1.92 ms`
  - `Scenario #11` - Mean `2.36 ms`, Max `3.82 ms`, Min `1.86 ms`
## Recommendations
- Additional profiling of the `Account-Lookup-Service` is required to further identify unnecessary or un-optimized blocking operations that may impact the NodeJS `Event-Loop`. Removing or optimizing these should increase the overall End-to-end `throughput` while minimizing the `duration`.
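As a quick sanity check before full profiling, a suspect synchronous call can be timed in isolation; anything that takes more than a few milliseconds per invocation will stall the `Event-Loop` under load. A sketch, using `JSON.stringify` on a large object as the example blocker (payload shape and size are arbitrary):

```javascript
// Sketch: time a synchronous candidate "blocker" in isolation.
// While fn() runs, the event loop cannot service any other work.
function measureBlockingMs(fn) {
  const start = process.hrtime.bigint();
  fn();
  return Number(process.hrtime.bigint() - start) / 1e6; // ns -> ms
}

// Example: stringify a large payload, similar in spirit to the logResponse case.
const bigPayload = { parties: Array.from({ length: 50_000 }, (_, i) => ({ id: i })) };
const ms = measureBlockingMs(() => JSON.stringify(bigPayload));
console.log(`JSON.stringify blocked for ${ms.toFixed(2)} ms`);
```

For production diagnosis, CPU profiles (e.g. `node --cpu-prof`) remain the authoritative tool; this only confirms a suspicion cheaply.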
`20230727/s14-1690504862678/README.md`

Up k6 VUs to 6.
## Snapshots
N/A
## Observations
- End-to-end max of `242 Op/s` with a mean duration of `23.0 ms` achieved, showing that scalability is not linear.
## Recommendations
- Consider implementing a **caching** mechanism for the `validateParticipant` egress, as it is called TWICE for each leg of the Request and the Callback Response to validate the Payer and Payee FSP.
- Profile the `Account-Lookup-Service` to see if any further `Event-Loop` "blockers" can be identified.