Commit 6ccdcd6

feat(mojaloop/#3424): analyse als perf results
feat(mojaloop/#3424): analyse als perf results - mojaloop/project#3424 - Added analysis for scenarios 1-14
feat(mojaloop/#3400): benchmarking performance for als - mojaloop/project#3400 - Updated comments based on PR review.
1 parent a03e42b commit 6ccdcd6

13 files changed (+212 -87 lines)


20230726/s1-1690367402771/README.md (+19 -2)

@@ -1,5 +1,13 @@
 # Scenario 1 - ALS-bypass Baseline with Sims-only
 
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. ADMIN GET /participants request to the Central-Ledger to validate the payerFspId. <-- sync response
+2. ADMIN GET /participants request to the Central-Ledger to validate the payeeFspId. <-- sync response
+3. ORACLE GET /participants request to the Oracle to resolve the FSPID for the payeeId. <-- sync response
+4. FSPIOP GET /parties request to the ALS. <-- async callback response
+5. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
+
 ```conf
 var-testid=1690367402771
 params=&var-testid=1690367402771&from=1690367297867&to=1690368635328
@@ -46,8 +54,17 @@ params=&var-testid=1690367402771&from=1690367297867&to=1690368635328
 
 ## Observations
 
-TBD
+- The `Callback-Handler` Simulator Service is able to handle `400+ Ops/s` End-to-end while sustaining an average duration of just over `2ms`. This is shown by the following dashboards/metrics:
+  - [K6](./images/Official%20k6%20Test%20Result.png)
+    - `Iteration Rate` (Mean) = `461 Ops/s`
+    - `Iteration Duration (avg)` (Mean) = `2.22ms`
+  - [Callback Handler Svc](./images/Supporting%20Services%20-%20Callback%20Hander%20Service.png)
+    - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` graph. Note that the Mean includes the pre/post run.
+    - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Performance Timing Calculations`. The Mean is `1.86ms`.
+    - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` metrics fall in line with the observed `Ops/s`; most of the duration is spent in the `request` because the Callback-Handler sends out the **Async** `FSPIOP PUT /parties` callback response.
+  - [Docker Node Monitoring](./images/docker-prometheus-monitoring.png)
+    - `Callback-Handler` services show no observable resource constraint in either memory or CPU usage.
 
 ## Recommendations
 
-TBD
+- Observe `Scenario #2+` and compare the `Callback-Handler`'s metrics against this **baseline** to determine whether there are any issues with either the Mocked Simulators (i.e. `Callback-Handlers`) or the **Async** `FSPIOP PUT /parties` callback response.

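For readers unfamiliar with the flow described in the Scenario 1 iteration list above, the following is a minimal k6 sketch of what one such iteration could look like. It is not the actual test script: the service URLs, ports, FSP IDs, party identifiers, headers and the `Callback-Handler` WebSocket subscription scheme are all hypothetical placeholders.

```js
// Minimal, hypothetical k6 sketch of one Scenario 1 iteration (not the real test script).
import http from 'k6/http';
import ws from 'k6/ws';
import { check } from 'k6';

const CL_ADMIN = 'http://central-ledger:3001';     // assumed Central-Ledger admin API
const ORACLE = 'http://oracle-sim:8444';           // assumed Oracle simulator
const ALS = 'http://account-lookup-service:4002';  // assumed ALS FSPIOP API
const CBS_WS = 'ws://callback-handler:3002';       // assumed Callback-Handler WS endpoint

export default function () {
  // 1 & 2. ADMIN GET /participants to validate the payer and payee FSP IDs (sync responses).
  check(http.get(`${CL_ADMIN}/participants/perffsp1`), { 'payer ok': (r) => r.status === 200 });
  check(http.get(`${CL_ADMIN}/participants/perffsp2`), { 'payee ok': (r) => r.status === 200 });

  // 3. ORACLE GET /participants to resolve the FSPID for the payee ID (sync response).
  check(http.get(`${ORACLE}/participants/MSISDN/19012345678`), { 'oracle ok': (r) => r.status === 200 });

  // 5. Subscribe to the Callback-Handler for the async callback notification, then
  // 4. send the FSPIOP GET /parties request and wait for the PUT /parties callback.
  ws.connect(`${CBS_WS}/?subscription=19012345678`, {}, (socket) => {
    socket.on('open', () => {
      const res = http.get(`${ALS}/parties/MSISDN/19012345678`, {
        headers: { 'FSPIOP-Source': 'perffsp1' },
      });
      check(res, { 'GET /parties accepted': (r) => r.status === 202 });
    });
    socket.on('message', () => socket.close());     // callback received -> iteration complete
    socket.setTimeout(() => socket.close(), 5000);  // safety timeout
  });
}
```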
20230726/s2-1690376653994/README.md (+29 -3)

@@ -1,4 +1,9 @@
-# Scenario 2 - ALS Baseline with Sims-only
+# Scenario 2 - ALS Baseline with Sims
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 var-testid=1690376653994
@@ -50,8 +55,29 @@ ACCOUNT_LOOKUP_SERVICE_VERSION=v14.2.2
 
 ## Observations
 
-TBD
+- The `Account-Lookup-Service` and the `Callback-Handler` Simulator Service are able to handle `10 Ops/s` End-to-end while sustaining an average duration of just over `100ms`. This is shown by the following dashboards/metrics:
+  - [K6](./images/Official%20k6%20Test%20Result.png)
+    - `Iteration Rate` (Mean) = `10 Ops/s`
+    - `Iteration Duration (avg)` (Mean) = `101ms`
+  - [Callback Handler Svc](./images/Supporting%20Services%20-%20Callback%20Hander%20Service.png)
+    - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` graph. Note that the Mean includes the pre/post run.
+    - `op:fspiop_put_parties_end2end - success:true` - the `E2E, Request, Response Performance Timing Calculations` fall in line with the observed duration.
+    - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` metrics fall in line with the observations.
+    - The ingress `Ops/s` for `op:admin_get_participants_endpoints` is very low due to the ALS's caching mechanism.
+  - [Account-Lookup-Service](./images/dashboard-account-lookup-service.png)
+    - Egress and Ingress metrics for `getPartiesByTypeAndId` and `putPartiesByTypeAndId` are similar for both `duration` and `Op/s`.
+    - Most of the `duration` is observed to be spent on the `validateParticipant` egress @ `30ms`.
+    - Most of the egress `Op/s` is observed on `validateParticipant` @ `30 Op/s`, a three-fold increase over the other egress metrics.
+  - [Docker Node Monitoring](./images/docker-prometheus-monitoring.png)
+    - `Account-Lookup-Service` is showing `100%` CPU usage (equivalent to a single core of the host machine), indicating that it is most likely CPU constrained.
+    - `Callback-Handler` is within the bounds of `Scenario #1`.
 
 ## Recommendations
 
-TBD
+- Investigate the logic behind the `validateParticipant` egress implementation.
+- Consider implementing a **caching** mechanism for the `validateParticipant` egress, as it is called TWICE for each leg of the Request and the Callback Response to validate the Payer and Payee FSP.
+- Investigate the `Account-Lookup-Service`'s high CPU usage by removing configurable factors that may impact CPU usage, i.e.
+  - Logging
+  - Event Audits
+- Increase `UV_THREADPOOL_SIZE` for IO threads.
+- Investigate enabling `HTTP Keep-Alive` for egress HTTP requests, especially `validateParticipant`.

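One way to prototype the caching recommendation above is a small in-memory TTL cache wrapped around the `validateParticipant` egress call. The sketch below is illustrative only: `fetchParticipantFromCentralLedger`, the TTL and the cache shape are assumptions, not the ALS implementation, and a real cache would also need invalidation tied to participant lifecycle changes.

```js
// Hypothetical TTL cache in front of the validateParticipant egress call.
// `fetchParticipantFromCentralLedger` stands in for the real ALS egress request.
const CACHE_TTL_MS = 30_000;  // assumption: 30s of staleness is acceptable for participant data
const cache = new Map();      // fspId -> { value, expiresAt }

async function validateParticipantCached(fspId, fetchParticipantFromCentralLedger) {
  const hit = cache.get(fspId);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value;                                           // cache hit: no egress request
  }
  const value = await fetchParticipantFromCentralLedger(fspId); // cache miss: single egress call
  cache.set(fspId, { value, expiresAt: Date.now() + CACHE_TTL_MS });
  return value;
}

module.exports = { validateParticipantCached };
```

Since the Payer and Payee FSPs are validated twice per End-to-end operation against a small, stable set of FSPs, even a short TTL would be expected to remove most of the `validateParticipant` egress traffic observed above.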
20230726/s6-1690380087112/README.md (+10 -3)

@@ -1,4 +1,9 @@
-# Scenario 6 - ALS Baseline ALS Baseline with Sims-only, HTTP-Keep-Alive enabled
+# Scenario 6 - ALS Baseline with Sims, HTTP-Keep-Alive enabled
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690380087112
@@ -54,8 +59,10 @@ ACCOUNT_LOOKUP_SERVICE_VERSION=local
 
 ## Observations
 
-TBD
+- No observable difference compared to `Scenario #2`.
+- Possibly no observable impact due to the low throughput (i.e. `10 Op/s`).
 
 ## Recommendations
 
-TBD
+- Same as `Scenario #2`.
+- Consider re-running this scenario once an increase in throughput has been achieved.

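As background to the HTTP-Keep-Alive setting exercised in this scenario: enabling keep-alive for egress requests in a Node.js service usually comes down to passing a keep-alive `Agent` to the HTTP client so TCP connections are reused rather than re-established per request. A minimal sketch with Node's core `http` module follows; the URL is a placeholder and the actual ALS egress code path may use a different HTTP client.

```js
// Sketch: reuse TCP connections for egress requests via a shared keep-alive agent.
const http = require('http');

const keepAliveAgent = new http.Agent({ keepAlive: true, maxSockets: 50 });

function getJson(url) {
  return new Promise((resolve, reject) => {
    http.get(url, { agent: keepAliveAgent }, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

// Usage (placeholder URL):
// getJson('http://central-ledger:3001/participants/perffsp1').then(console.log);
```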
20230726/s7-1690407403663/README.md (+13 -2)

@@ -1,4 +1,9 @@
-# Scenario 1 - ALS-bypass Baseline with Sims-only
+# Scenario 7 - ALS Baseline with Sims, UV_THREADS
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```yaml
 testid:
@@ -8,7 +13,7 @@ testid:
 params: &from=1690407309197&to=1690408662339
 ## Added for Test Scenario 7
 UV_THREADPOOL_SIZE:
-- 4 (default)
+- 4 ## default
 - 8
 - 16
 ```
@@ -57,4 +62,10 @@ UV_THREADPOOL_SIZE:
 
 ## Observations
 
+- No observable difference compared to `Scenario #2`.
+- Possibly no observable impact due to the low throughput (i.e. `10 Op/s`).
+
 ## Recommendations
+
+- Same as `Scenario #2`.
+- Consider re-running this scenario once an increase in throughput has been achieved.

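For context on the `UV_THREADPOOL_SIZE` variable swept in this scenario: it only affects work that libuv schedules on its thread pool (file I/O, DNS lookups via `getaddrinfo`, some `crypto` and `zlib` calls); plain network sockets are handled on the event loop, which may be one reason no difference was observed at this throughput. The standalone sketch below, unrelated to the ALS code itself, makes the effect visible with `crypto.pbkdf2`.

```js
// Run twice, e.g.: UV_THREADPOOL_SIZE=4 node threadpool-demo.js
//             and: UV_THREADPOOL_SIZE=16 node threadpool-demo.js
const crypto = require('crypto');

const start = Date.now();
for (let i = 0; i < 16; i++) {
  // pbkdf2 runs on the libuv thread pool, so concurrency is capped by UV_THREADPOOL_SIZE.
  crypto.pbkdf2('password', 'salt', 200_000, 64, 'sha512', () => {
    console.log(`task ${i} finished after ${Date.now() - start} ms`);
  });
}
// With a pool of 4 the tasks complete in waves of 4; with 16 they finish in roughly one wave.
```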
20230727/s10-1690466917636/README.md (+33 -10)

@@ -1,4 +1,9 @@
-# Scenario 8 - ALS Baseline with Sims-only, multiple k6 VUs
+# Scenario 10 - ALS Baseline with Sims, Disabled JSON.stringify in ALS
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690466917636
@@ -10,7 +15,7 @@ EVENT_SDK_ASYNC_OVERRIDE_EVENTS=""
 EVENT_SDK_LOG_FILTER=""
 ## Added for Test Scenario 7
 UV_THREADPOOL_SIZE=16
-ACCOUNT_LOOKUP_SERVICE_VERSION=v14.2.2
+ACCOUNT_LOOKUP_SERVICE_VERSION=v14.2.3
 ## Changes for Test Scenario 9
 Enabled in-memory storage for Mysql ALS with following config in docker-compose file
 ## Changes for Test Scenario 10
@@ -49,16 +54,34 @@ Disabled JSON.stringify in logResponse function of ALS
 
 ## Snapshots
 
-- [Docker]()
-- [K6]()
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
+N/A
 
 ## Observations
 
-- It seems JSON.stringify is a demanding operation and causing bottleneck for HTTP sync responses also.
+- The `Account-Lookup-Service` and the `Callback-Handler` Simulator Service are able to handle `100 Ops/s` End-to-end while sustaining an average duration of around `14ms`. This is shown by the following dashboards/metrics:
+  - [K6](./images/Official%20k6%20Test%20Result.png)
+    - `Iteration Rate` (Mean) = `100 Ops/s`
+    - `Iteration Duration (avg)` (Mean) = `14ms`
+  - [Callback Handler Svc](./images/Supporting%20Services%20-%20Callback%20Hander%20Service.png)
+    - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` graph. Note that the Mean includes the pre/post run.
+    - `op:fspiop_put_parties_end2end - success:true` - the `E2E, Request, Response Performance Timing Calculations` fall in line with the observed duration.
+    - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` metrics fall in line with the observations.
+  - [Account-Lookup-Service](./images/dashboard-account-lookup-service.png)
+    - Egress and Ingress metrics for `getPartiesByTypeAndId` and `putPartiesByTypeAndId` are similar for both `duration` and `Op/s`.
+    - The `validateParticipant` `duration` is @ `1.32ms` (vs `30ms` previously), which is in line with the other egress metrics.
+    - The `validateParticipant` `Op/s` is near `300 Op/s` (vs `30` previously), which is in line with `Scenario #2` and the three-fold increase over the End-to-end `100 Op/s`.
+  - [Docker Node Monitoring](./images/docker-prometheus-monitoring.png)
+    - `Account-Lookup-Service` is showing a reduced CPU usage of `70%` (down from `100%`), indicating that the service is able to do more if the K6 VUs are increased.
+    - `Callback-Handler` is within the bounds of `Scenario #1`.
+
+- `JSON.stringify` is a demanding operation that causes a bottleneck on the NodeJS `Event-loop`, impacting the End-to-end `Op/s` and `duration`.
+  - Comparing the `Event-Loop Lag` on the [NodeJS Application Dashboard for ALS](./images/NodeJS%20Application%20Dashboard%20ALS.png) for this Scenario vs [Scenario #2](../../20230726/s2-1690376653994/images/NodeJS%20Application%20Dashboard-moja_als.png), we can see a large difference in the delay introduced by the `JSON.stringify` operation blocking the `Event-loop`:
+    - `Scenario #2` - Mean `13.3 ms`, Max `24.2 ms`, Min `11.1 ms`
+    - `Scenario #10` - Mean `2.41 ms`, Max `7.67 ms`, Min `1.92 ms`
 
 ## Recommendations
+
+- Run `Scenarios #3 -> #4` (i.e. `#11 -> #14`) to determine if the scalability of the `Account-Lookup-Service` is linear.
+- Run `Scenarios #5 -> #9` to see if there are any observable differences due to the increased End-to-end `Op/s` and reduced `duration`.
+- Consider implementing a **caching** mechanism for the `validateParticipant` egress, as it is called TWICE for each leg of the Request and the Callback Response to validate the Payer and Payee FSP.
+- Monitor the `Event-Loop Lag` for a mean of `3 ms` or more, especially when a process seems to be CPU constrained. In such scenarios, it is recommended to profile the service to see if any obvious `Event-Loop` "blockers" can be identified.

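The `Event-Loop Lag` figures above come from the Grafana NodeJS application dashboards; the same signal can also be sampled in-process with Node's `perf_hooks`, which is one way to act on the "monitor for a mean of `3 ms` or more" recommendation. A minimal sketch, with the threshold and reporting interval as assumed values:

```js
// Sketch: sample event-loop delay and warn when the mean exceeds ~3 ms.
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
histogram.enable();

const THRESHOLD_MS = 3;         // assumed alert threshold, per the recommendation above
const REPORT_INTERVAL_MS = 10_000;

setInterval(() => {
  const meanMs = histogram.mean / 1e6; // histogram values are reported in nanoseconds
  const maxMs = histogram.max / 1e6;
  if (meanMs > THRESHOLD_MS) {
    console.warn(`event-loop lag high: mean=${meanMs.toFixed(2)} ms, max=${maxMs.toFixed(2)} ms`);
  }
  histogram.reset();             // start a fresh measurement window
}, REPORT_INTERVAL_MS).unref();
```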
20230727/s11-1690488145504/README.md (+29 -9)

@@ -1,4 +1,9 @@
-# Scenario 11 - ALS Baseline with Sims-only, Disabled JSON.stringify. ALS v14.2.3 + 4x k6 VUs.
+# Scenario 11 - ALS Baseline with Sims, Disabled JSON.stringify. ALS v14.2.3 + 4x k6 VUs
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690488145504
@@ -51,16 +56,31 @@ Up k6s VUs to 4.
 
 ## Snapshots
 
-- [Docker]()
-- [K6]()
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
+N/A
 
 ## Observations
 
-Ops/s have increased compared to scenario #10.
+- End-to-end `Ops/s` has increased by `20%` compared to `Scenario #10`.
+- End-to-end `duration` has increased by `100%` compared to `Scenario #10`.
+- The `Account-Lookup-Service` and the `Callback-Handler` Simulator Service are able to handle `120 Ops/s` End-to-end while sustaining an average duration of around `30ms`. This is shown by the following dashboards/metrics:
+  - [K6](./images/Official%20k6%20Test%20Result.png)
+    - `Iteration Rate` (Mean) = `113 Ops/s`
+    - `Iteration Duration (avg)` (Mean) = `27.2 ms`
+  - [Callback Handler Svc](./images/Supporting%20Services%20-%20Callback%20Hander%20Service.png)
+    - `op:fspiop_put_parties_end2end - success:true` - observe the `E2E, Request, Response Calculations Processed Per Second` graph. Note that the Mean includes the pre/post run.
+    - `op:fspiop_put_parties_end2end - success:true` - the `E2E, Request, Response Performance Timing Calculations` fall in line with the observed duration.
+    - The `op:fspiop_put_parties_request` and `op:fspiop_put_parties_response` metrics fall in line with the observations.
+  - [Account-Lookup-Service](./images/dashboard-account-lookup-service.png)
+    - Egress and Ingress metrics for `getPartiesByTypeAndId` and `putPartiesByTypeAndId` are similar for both `duration` and `Op/s`.
+    - The `validateParticipant` metrics are in line with the observations.
+  - [Docker Node Monitoring](./images/docker-prometheus-monitoring.png)
+    - `Account-Lookup-Service` is showing an increased CPU usage of just over `100%`, indicating that the service is near its limit.
+    - `Callback-Handler` is within the bounds of `Scenario #1`.
+- Comparing the `Event-Loop Lag` on the [NodeJS Application Dashboard for ALS](./images/NodeJS%20Application%20Dashboard%20ALS.png) against the previous scenario shows no major difference:
+  - `Scenario #10` - Mean `2.41 ms`, Max `7.67 ms`, Min `1.92 ms`
+  - `Scenario #11` - Mean `2.36 ms`, Max `3.82 ms`, Min `1.86 ms`
 
 ## Recommendations
+
+- Additional profiling of the `Account-Lookup-Service` is required to further identify unnecessary or unoptimized blocking operations that may impact the NodeJS `Event Loop`. Removing or optimizing these should increase the overall End-to-end throughput while minimizing the duration.
+- Same as `Scenario #10` (minus this scenario).

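As an illustration of the kind of `Event-Loop` "blocker" that the profiling recommended above would surface, the standalone micro-benchmark below times a single large `JSON.stringify` call; for its entire duration the event loop is blocked and no other request can be served. The payload size and shape are arbitrary assumptions, not ALS data.

```js
// Sketch: measure how long one synchronous JSON.stringify call blocks the event loop.
const payload = {
  parties: Array.from({ length: 50_000 }, (_, i) => ({
    partyIdType: 'MSISDN',
    partyIdentifier: String(19000000000 + i),
    fspId: `perffsp${i % 8}`,
  })),
};

const start = process.hrtime.bigint();
const body = JSON.stringify(payload);  // synchronous: nothing else runs until this returns
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(`stringified ${(body.length / 1e6).toFixed(1)} MB in ${elapsedMs.toFixed(1)} ms`);
```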
20230727/s12-1690491686694/README.md (+10 -9)

@@ -1,4 +1,9 @@
-# Scenario 12 - ALS Baseline with Sims-only, Disabled JSON.stringify. ALS v14.2.3 + ALS Scale 2.
+# Scenario 12 - ALS Baseline with Sims, Disabled JSON.stringify. ALS v14.2.3 + ALS Scale 2
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690491686694
@@ -51,16 +56,12 @@ Up ALS scaling to 2.
 
 ## Snapshots
 
-- [Docker]()
-- [K6]()
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
+N/A
 
 ## Observations
 
-K6s could not saturate 2 replicas of the ALS.
+- Similar observation to `Scenario #10`: 1x K6 VU could not saturate 2 replicas of the ALS.
 
 ## Recommendations
+
+- Re-run the same scenario with more K6 VUs to match the `Account-Lookup-Service`'s scaling factor.

20230727/s13-1690493569083/README.md (+12 -8)

@@ -1,4 +1,9 @@
-# Scenario 13 - ALS Baseline with Sims-only, Disabled JSON.stringify. ALS v14.2.3 + + Scale 4 + 4x k6 VUs.
+# Scenario 13 - ALS Baseline with Sims, Disabled JSON.stringify. ALS v14.2.3 + ALS Scale 4 + 4x k6 VUs
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690493569083
@@ -52,14 +57,13 @@ Up k6s VUs to 4.
 
 ## Snapshots
 
-- [Docker]()
-- [K6]()
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
+N/A
 
 ## Observations
 
+- End-to-end `200 Op/s` with a mean duration of `18.3 ms` achieved.
+- Scalability looks near linear.
+
 ## Recommendations
+
+- Increase the `Account-Lookup-Service` scaling factor to determine if scaling is linear.

20230727/s14-1690504862678/README.md (+6 -8)

@@ -54,15 +54,13 @@ Up k6s VUs to 6.
 
 ## Snapshots
 
-- [Docker]()
-- [K6]()
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
+N/A
 
 ## Observations
-Observed the iteration rate is 242 ops/sec max
+
+- End-to-end max of `242 Op/s` with a mean duration of `23.0 ms` achieved --> scalability is not linear.
 
 ## Recommendations
+
+- Consider implementing a **caching** mechanism for the `validateParticipant` egress, as it is called TWICE for each leg of the Request and the Callback Response to validate the Payer and Payee FSP.
+- Profile the `Account-Lookup-Service` to see if any further `Event-Loop` "blockers" can be identified.

20230727/s5-1690447500991/README.md (+12 -1)

@@ -1,4 +1,9 @@
-# Scenario 1 - ALS-bypass Baseline with Sims-only
+# Scenario 5 - ALS Baseline with Sims, no logs/audit-events
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690447500991
@@ -56,4 +61,10 @@ ACCOUNT_LOOKUP_SERVICE_VERSION=v14.2.2
 
 ## Observations
 
+- No observable difference compared to `Scenario #2`.
+- Possibly no observable impact due to the low throughput (i.e. `10 Op/s`).
+
 ## Recommendations
+
+- Same as `Scenario #2`.
+- Consider re-running this scenario once an increase in throughput has been achieved.

20230727/s8-1690457241591/README.md (+12 -8)

@@ -1,4 +1,9 @@
-# Scenario 8 - ALS Baseline with Sims-only, multiple k6 VUs
+# Scenario 8 - ALS Baseline with Sims, multiple k6 VUs
+
+The End-to-end operation from the K6 test-runner included the following HTTP operations for each *iteration*:
+
+1. FSPIOP GET /parties request to the ALS. <-- async callback response
+2. WS Subscription to the `Callback-Handler` Service for Callback Response notifications.
 
 ```conf
 testid=1690457241591
@@ -47,16 +52,15 @@ Increased target VUs from 1 to 5
 
 ## Snapshots
 
-- [Docker]()
 - [K6](https://snapshots.raintank.io/dashboard/snapshot/yCQaL9Qz7WcFDcH2v4Yik9vWR1WuO55f?orgId=2)
-- [Callback Handler Service]()
-- [Account Lookup Service]()
-- [Nodejs moja_als]()
-- [Nodejs cbs]()
-- [MySQL]()
 
 ## Observations
 
-- Http response time (Sync response) increased proportional to number VUs which is weird.
+- Minimal observable difference compared to `Scenario #2`; the same observations apply, in addition to:
+  - HTTP response time (sync response) increased proportionally to the number of VUs, which is unexpected.
+  - Possibly no observable impact due to the low throughput (i.e. `10 Op/s`).
 
 ## Recommendations
+
+- Same as `Scenario #2`.
+- Consider re-running this scenario once an increase in throughput has been achieved.
