
Add E2E Prometheus metrics to applications #845

Merged
7 commits merged on Nov 4, 2024

Conversation

eero-t
Contributor

@eero-t eero-t commented Nov 1, 2024

Description

This PR makes the following changes:

  • Adds request, first-token, and inter-token latency Prometheus metrics for monitoring application performance (see the sketch after this list)
    • Added to ServiceOrchestrator token processing, which seems to be the only place where this can be done
  • Adds "inprogress" HTTP metrics that can be used for scaling application microservices based on incoming requests
    • Needed because the currently provided count of already processed HTTP requests only tells how many requests the service can currently process, not how many additional ones it should be able to process
  • Fixes BaseStatistics method name typos
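
These are not the exact names or code used by the PR; a minimal prometheus_client sketch of the idea, with hypothetical metric names and a hypothetical wrap_token_stream() helper, could look like this:

```python
import time
from prometheus_client import Gauge, Histogram

# Hypothetical metric names; the PR's actual metrics use a "megaservice_" prefix.
REQUEST_LATENCY = Histogram("request_latency_seconds", "End-to-end request latency")
FIRST_TOKEN_LATENCY = Histogram("first_token_latency_seconds", "Time until the first generated token")
INTER_TOKEN_LATENCY = Histogram("inter_token_latency_seconds", "Time between consecutive generated tokens")
REQUESTS_INPROGRESS = Gauge("requests_inprogress", "Requests currently being processed")

def wrap_token_stream(tokens):
    """Yield tokens from a generator while recording latency and in-progress metrics."""
    REQUESTS_INPROGRESS.inc()
    start = prev = time.monotonic()
    first = True
    try:
        for token in tokens:
            now = time.monotonic()
            if first:
                FIRST_TOKEN_LATENCY.observe(now - start)
                first = False
            else:
                INTER_TOKEN_LATENCY.observe(now - prev)
            prev = now
            yield token
    finally:
        REQUEST_LATENCY.observe(time.monotonic() - start)
        REQUESTS_INPROGRESS.dec()
```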

Issues

opea-project/GenAIExamples#391

Type of change

List the type of change like below. Please delete options that are not relevant.

  • New feature (non-breaking change which adds new functionality)

Dependencies

No new ones.

(The prometheus_fastapi_instrumentator package imported for HttpService already pulls the prometheus_client module into the applications.)

Tests

Verified manually that the produced metrics match the ones from a benchmark that stresses the ChatQnA application.

Potential future changes (other PRs)

  • Add service monitors for all applications (to the GenAIInfra repo)
  • Add dashboards for the new metrics (to the GenAIInfra repo)
  • Add CLI options / env vars for disabling the latency, inprogress and *_created metrics (prometheus_client.disable_created_metrics())?
  • Change the ServiceOrchestrator class, and all applications and tests creating it, to provide a unique name for each orchestrator instance and use it as the metric prefix, instead of all orchestrator instances sharing the same set of megaservice_-prefixed singleton metrics (see the sketch below)
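
For illustration only, a rough sketch of the last two items, assuming a hypothetical OPEA_DISABLE_CREATED_METRICS environment variable and a per-orchestrator prefix scheme (neither is part of this PR):

```python
import os
from prometheus_client import Histogram, disable_created_metrics

# Hypothetical env var; disable_created_metrics() itself is an existing prometheus_client
# call that drops the *_created series emitted alongside counters, histograms and summaries.
if os.environ.get("OPEA_DISABLE_CREATED_METRICS"):
    disable_created_metrics()

def make_request_latency_metric(orchestrator_name: str) -> Histogram:
    """Create a latency histogram whose name is unique per orchestrator instance."""
    # Prometheus metric names only allow [a-zA-Z0-9_:], so sanitize the instance name first.
    prefix = "".join(c if c.isalnum() else "_" for c in orchestrator_name.lower())
    return Histogram(
        f"{prefix}_request_latency_seconds",
        f"End-to-end request latency of the '{orchestrator_name}' orchestrator",
    )
```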

@eero-t
Contributor Author

eero-t commented Nov 1, 2024

Regarding the duplicate inprogress metrics error in the CI tests...

Creating multiple MicroServices creates multiple HTTPServices, which create multiple prometheus-fastapi-instrumentator instances.

While prometheus-fastapi-instrumentator handled that fine for ChatQnA and the normal HTTP metrics, for some reason that was not the case for its inprogress metrics in CI.

=> I think the MicroService constructor's name argument (currently optional) needs to become mandatory, so that it can be used to make the name of the inprogress metric unique for each HTTPService instance. That requires a small change to the Gateway class, but otherwise I think everything else is fine with that.

Note: the MicroService class is not subclassed, so its class name does not help, as it is always the same. I think that part needs to be dropped, also because metric names cannot contain special characters like /.

PS. prometheus-fastapi-instrumentator requires an HTTPService instance specific Starlette instance, so it cannot be made a singleton, like I did for the metrics I'm adding directly (see the sketch below).
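
For illustration, a per-HTTPService instrumentator along these lines would give each instance its own inprogress metric; the naming scheme and helper are assumptions, not the exact code in this PR:

```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

def instrument_http_service(app: FastAPI, service_name: str) -> None:
    """Attach HTTP metrics to one FastAPI/Starlette app, with a per-service inprogress metric."""
    # Sanitize the service name, since metric names cannot contain characters like '/'.
    prefix = "".join(c if c.isalnum() else "_" for c in service_name.lower())
    Instrumentator(
        should_instrument_requests_inprogress=True,
        inprogress_name=f"{prefix}_http_requests_inprogress",
    ).instrument(app).expose(app)
```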

@eero-t
Contributor Author

eero-t commented Nov 1, 2024

Rebased the pre-commit changes into the earlier commits, and pushed the solution described above for the CI issue with enabling inprogress metrics.

I'm currently testing whether I could get a somewhat similar metric (reliably!) also from ServiceOrchestrator::execute(); see the sketch below.

If that works, enabling the "inprogress" metrics for prometheus-fastapi-instrumentator can be dropped, as the changes CI requires for that are a bit intrusive.

EDIT: On further grepping, the tests seem to check for the unwanted /MicroService name suffix...
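
A minimal sketch of what such an orchestrator-level metric could look like with plain prometheus_client, wrapping a hypothetical execute() call (the metric name and wrapper are assumptions, not this PR's code):

```python
from prometheus_client import Gauge

# Hypothetical metric name for requests currently inside the orchestrator.
ORCHESTRATOR_INPROGRESS = Gauge(
    "megaservice_requests_inprogress",
    "Requests currently being processed by the orchestrator",
)

async def execute_with_inprogress(orchestrator, *args, **kwargs):
    """Wrap a hypothetical orchestrator.execute() call with an in-progress gauge."""
    with ORCHESTRATOR_INPROGRESS.track_inprogress():
        return await orchestrator.execute(*args, **kwargs)
```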

Creating multiple MicroService()s creates multiple HTTPService()s,
which create multiple Prometheus fastapi instrumentator instances.

While the latter handled that fine for ChatQnA and normal HTTP metrics,
that was not the case for its "inprogress" metrics in CI.

Therefore MicroService constructor name argument is now mandatory, so
that it can be used to make "inprogress" metrics for HTTPService
instances unique.

PS. instrumentor requires HTTPService instance specific Starlette
instance, so it cannot be made singleton.

Signed-off-by: Eero Tamminen <[email protected]>

codecov bot commented Nov 1, 2024

Codecov Report

Attention: Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
comps/cores/mega/base_statistics.py | 50.00% | 2 Missing ⚠️

Files with missing lines | Coverage Δ
comps/cores/mega/gateway.py | 30.24% <ø> (ø)
comps/cores/mega/http_service.py | 77.46% <100.00%> (+0.65%) ⬆️
comps/cores/mega/micro_service.py | 91.13% <ø> (ø)
comps/cores/mega/orchestrator.py | 92.14% <100.00%> (+1.18%) ⬆️
comps/cores/mega/base_statistics.py | 42.10% <50.00%> (ø)

@Spycsh
Member

Spycsh commented Nov 4, 2024

LGTM

@Spycsh Spycsh merged commit a6998a1 into opea-project:main Nov 4, 2024
13 checks passed
Spycsh added a commit that referenced this pull request Nov 4, 2024
@Spycsh Spycsh mentioned this pull request Nov 4, 2024
@eero-t
Contributor Author

eero-t commented Nov 4, 2024

@Spycsh, @lvliang-intel Any suggestions on where the new metrics should be documented: in the GenAIExamples repo, or the GenAIInfra repo?

Or is it enough to add Prometheus serviceMonitors to the Helm charts for (the rest of) the OPEA applications, and some Grafana dashboards for them?

@Spycsh
Member

Spycsh commented Nov 5, 2024

Hi @eero-t, GenAIEval https://github.com/opea-project/GenAIEval/tree/main/evals/benchmark/grafana does track some Prometheus metrics and provides a naive measurement of first token latency and average token latency, which are measured on the client side instead of through Prometheus. You're welcome to add some documentation there in the future.

@eero-t
Contributor Author

eero-t commented Nov 5, 2024

> Hi @eero-t, GenAIEval https://github.com/opea-project/GenAIEval/tree/main/evals/benchmark/grafana does track some Prometheus metrics and provides a naive measurement of first token latency and average token latency, which are measured on the client side instead of through Prometheus. You're welcome to add some documentation there in the future.

The Eval repo is for evaluating and benchmarking, whereas the metrics provided by the service "frontend" are (also) meant for operational monitoring, i.e. normal, everyday use of the service.

I think the most appropriate place would be the Infra repo, as it already includes monitoring support both in the Helm charts [1] and as separate manifest files plus a couple of Grafana dashboards [2], but that is rather Kubernetes-specific.

[1] https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/monitoring.md
[2] https://github.com/opea-project/GenAIInfra/blob/main/kubernetes-addons/Observability/README.md

@Spycsh
Member

Spycsh commented Nov 6, 2024

> The Eval repo is for evaluating and benchmarking, whereas the metrics provided by the service "frontend" are (also) meant for operational monitoring, i.e. normal, everyday use of the service.
>
> I think the most appropriate place would be the Infra repo, as it already includes monitoring support both in the Helm charts [1] and as separate manifest files plus a couple of Grafana dashboards [2], but that is rather Kubernetes-specific.
>
> [1] https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/monitoring.md
> [2] https://github.com/opea-project/GenAIInfra/blob/main/kubernetes-addons/Observability/README.md

Sure, thanks for pointing that out.
