Description
What is the bug?
Bug: Mimir 2.16.0 did not work well with the nginx-ingress controller. We ran into the issue after upgrading Mimir from 2.14.0 to 2.16.0. It worked for a few hours, then started failing: the nginx controller kept reporting that the Mimir responses were slow, the host was sometimes unavailable, the queue was full, and so on. On the Mimir side we scaled the distributors to almost double the previous count, but the behaviour did not change. We then rolled back to the previous version, and it took roughly 40-50 minutes to settle down before data started flowing in again. Since we run the full LGTM stack in one deployment, this Mimir issue impacted the other components as well: Grafana, Tempo, and Loki were not accessible because the ingress controller was failing.
How to reproduce it?
It is not possible for me to reproduce this, because our telemetry volume is huge at this point. When we tested the same version in a smaller LGTM stack we didn't run into any issues, but it later started failing in the bigger environment.
What did you think would happen?
I suspect there is something in Mimir 2.16.0 that nginx does not handle well, or vice versa.
What was your environment?
We host the LGTM stack on an AWS EKS cluster. It is a big cluster that receives telemetry (all Kubernetes metrics, JVM metrics, and custom application metrics) from 600+ EKS and AKS clusters, so this LGTM deployment is huge.
Has anyone encountered issues with Mimir 2.16.0? We were running 2.14.0 and upgraded to 2.16.0, but ran into an issue: it started breaking the nginx ingress controller pods (version v1.11.3). The nginx controllers started reporting various errors (below) and stopped passing any data through. We tried scaling our distributors, but that didn't help either, as nginx kept complaining that the Mimir backend had a problem. After reverting to the previous version it started working again.
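For reference, scaling the distributors in our setup amounts to bumping the replica count, roughly like this (a minimal sketch assuming the grafana/mimir-distributed Helm chart; the replica counts are illustrative, not our actual numbers):

# Helm values override to roughly double the distributor replicas
# (hypothetical counts, for illustration only)
distributor:
  replicas: 40   # previously ~20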
Describe ingress pod
These are the events and errors from the ingress controller pod:
Normal Scheduled 3m41s default-scheduler Successfully assigned observability/ingress-nginx-controller-557466d75f-mn6tc to ip-100-78-117-207.ec2.internal
Normal RELOAD 3m39s nginx-ingress-controller NGINX reload triggered due to a change in configuration
Normal Killing 2m40s kubelet Container controller failed liveness probe, will be restarted
Warning Unhealthy 40s (x12 over 2m30s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
Normal Pulled 35s (x2 over 3m41s) kubelet Container image "registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7" already present on machine
Normal Created 34s (x2 over 3m41s) kubelet Created container: controller
Normal Started 34s (x2 over 3m40s) kubelet Started container controller
Normal RELOAD 33s nginx-ingress-controller NGINX reload triggered due to a change in configuration
Warning Unhealthy 20s (x6 over 3m20s) kubelet Liveness probe failed: Get "http://100.78.117.155:10254/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 11s (x7 over 3m18s) kubelet Readiness probe failed: Get "http://100.78.117.155:10254/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
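The failing probes above are the controller's own /healthz checks on port 10254, which time out at the chart's default 1-second probe timeout. A minimal sketch of relaxing those timeouts through the ingress-nginx Helm chart values (keys assumed from the upstream chart, not verified against v1.11.3):

controller:
  livenessProbe:
    timeoutSeconds: 5    # chart default is 1s; the probe times out under load
    periodSeconds: 10
    failureThreshold: 5
  readinessProbe:
    timeoutSeconds: 5
    periodSeconds: 10
    failureThreshold: 3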
Buffer warnings and upstream timeout
2025/07/14 09:59:27 [warn] 37#37: *617 a client request body is buffered to a temporary file /tmp/nginx/client-body/0000000615, client: 35.170.215.251, server: mimir-oe-dev-central., request: "POST /api/v1/push HTTP/2.0", host: "mimir-oe-dev-central."
2025/07/14 09:59:27 [warn] 37#37: *617 a client request body is buffered to a temporary file /tmp/nginx/client-body/0000000616, client: 35.170.215.251, server: mimir-oe-dev-central., request: "POST /api/v1/push HTTP/2.0", host: "mimir-oe-dev-central."
2025/07/14 09:59:28 [error] 37#37: *128 upstream timed out (110: Operation timed out) while connecting to upstream, client: 3.85.251.185, server: mimir-oe-dev-central., request: "POST /api/v1/push HTTP/2.0", upstream: "http://100.78.117.175:8080/api/v1/push", host: "mimir-oe-dev-central."
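The client-body buffering and the upstream connect timeouts are controlled by the controller's proxy settings. A minimal sketch of the relevant ingress-nginx ConfigMap options (standard keys; the ConfigMap name/namespace and the values shown are assumptions, not our running configuration):

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # assumed name; match the controller's --configmap flag
  namespace: observability
data:
  client-body-buffer-size: "1m"    # keep remote-write bodies in memory instead of /tmp
  proxy-connect-timeout: "15"      # default is 5s; relates to "upstream timed out while connecting"
  proxy-read-timeout: "120"
  proxy-send-timeout: "120"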
Batch full error
2025/07/14 11:26:12 [warn] 36#36: *3781 [lua] monitor.lua:101: call(): omitting metrics for the request, current batch is full while logging request, client: 135.233.18.9, server: mimir-oe-dev-central., request: "POST /api/v1/push HTTP/2.0", upstream: "http://100.78.117.175:8080/api/v1/push", host: "mimir-oe-dev-central."
135.233.18.9 - mimir_dev_central_user [14/Jul/2025:11:26:12 +0000] "POST /api/v1/push HTTP/2.0" 200 0 "-" "Alloy/v1.8.3 (linux; helm)" 890 12.682 [observability-mimir-nginx-80] [] 100.78.117.175:8080 0 12.682 200 e35b3d6e2315a04a2c3beb62325a13b3
499 responses
54.195.30.160 - mimir_dev_central_user [14/Jul/2025:14:23:10 +0000] "POST /api/v1/push HTTP/2.0" 499 0 "-" "Alloy/v1.8.3 (linux; helm)" 842 26.058 [observability-mimir-nginx-80] [] 100.78.117.116:8080 0 26.058 - 5b23dc3eb896bee4d0c6c5087a86292c
3.120.63.58 - mimir_dev_central_user [14/Jul/2025:14:23:10 +0000] "POST /api/v1/push HTTP/2.0" 499 0 "-" "Alloy/v1.8.3 (linux; helm)" 14 26.058 [observability-mimir-nginx-80] [] 100.78.117.116:8080 0 26.058 - 576299318cdd36bd59b93d398d0aa94c
Any additional context to share?