Thanos-Querier " component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from test-component-metrics-grpc-consumer.test:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=test-component-metrics-grpc-consumer.test:10901" #7971

inboxamitraj opened this issue Dec 9, 2024 · 4 comments


@inboxamitraj

Thanos, Prometheus and Golang version used:
thanos, version 0.33.0 (branch: HEAD, revision: 0f025f1)
build user: root@9fb4d3c2f7ec
build date: 20231217-16:49:45
go version: go1.21.5
platform: linux/amd64
tags: netgo

AWS EKS version: 1.29
Istio version: 1.20.2

Note: the same error occurs with the older Thanos v0.26.0 and the latest Thanos v0.37.0 as well.

Object Storage Provider:

  • --store=test-cluster-metrics-grpc-consumer.test:10901
  • --store=test-component-metrics-grpc-consumer.test:10901
  • --store=thanos-store-test.monitoring.svc.cluster.local:10901

What happened:
We have a Thanos-Querier running on our centralised Cluster A which queries the thanos-store and thanos-sidecar instances running on other EKS clusters.

We recently upgraded Istio from 1.19 to 1.20, and since then Grafana has not shown the latest Prometheus metrics from the other clusters; metrics older than 2 hours are still visible. This means Thanos-Querier can talk to Thanos-Store but cannot connect to the live Thanos-Sidecar, which holds the latest Prometheus metrics.

What you expected to happen:
The centralised Grafana should show the latest Prometheus metrics from the other clusters.

How to reproduce it (as minimally and precisely as possible):
Two EKS 1.29 Clusters
Istio 1.20
Prometheus Operator running on both clusters, with one cluster acting as the centralised cluster.

Full logs to relevant components:

level=warn ts=2024-12-09T08:31:15.205305755Z caller=endpointset.go:517 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from test-component-metrics-grpc-consumer.test:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=test-component-metrics-grpc-consumer.test:10901
level=warn ts=2024-12-09T08:31:20.206042557Z caller=endpointset.go:517 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from test-cluster-metrics-grpc-consumer.test:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=test-cluster-metrics-grpc-consumer.test:10901
level=warn ts=2024-12-09T08:31:20.206102661Z caller=endpointset.go:517 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from test-component-metrics-grpc-consumer.test:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=test-component-metrics-grpc-consumer.test:10901
level=warn ts=2024-12-09T08:31:25.206932058Z caller=endpointset.go:517 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from test-component-metrics-grpc-consumer.test:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=test-component-metrics-grpc-consumer.test:10901
level=warn ts=2024-12-09T08:31:25.206936638Z caller=endpointset.go:517 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from test-cluster-metrics-grpc-consumer.test:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=test-cluster-metrics-grpc-consumer.test:10901

Anything else we need to know:
What we have done so far:

  1. added --query.grpc.client.serverName to the thanos-querier deployment
  2. increased timeouts in thanos-querier (see the sketch after the deployment manifest below)
  3. tried the latest Thanos v0.37.0 image

Thanos Querier Deployment in Cluster A

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
  namespace: monitoring
spec:
  template:
    metadata:
      labels:
        sidecar.istio.io/inject: "true"
        tls: permissive
    spec:
      containers:
      - args:
        - query
        - --http-address=0.0.0.0:9090
        - --store=test-cluster-metrics-grpc-consumer.test:10901
        - --store=test-component-metrics-grpc-consumer.test:10901
        - --store=thanos-store-test.monitoring.svc.cluster.local:10901
        name: thanos-querier
      volumes:
      - name: cache-volume
        emptyDir: {}
      - name: cstack-thanos-s3
        secret:
          defaultMode: 420
          secretName: cstack-thanos-sealed-s3-test
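
For reference, the timeout increase from step 2 of the list above was applied by extending the container args, roughly as sketched here (illustrative values; the exact flags we tried are also listed further down in this thread):

      - args:
        - query
        - --http-address=0.0.0.0:9090
        - --query.timeout=5m              # overall query deadline (illustrative value)
        - --store.response-timeout=1m     # per-store response deadline (illustrative value)
        - --store=test-cluster-metrics-grpc-consumer.test:10901
        - --store=test-component-metrics-grpc-consumer.test:10901
        - --store=thanos-store-test.monitoring.svc.cluster.local:10901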

Service Entry in Cluster A

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: test-cluster-metrics-grpc-consumer-import
  namespace: monitoring
spec:
  endpoints:
  - address: test-e01-shared-mesh.istio-eastwest-internal
    labels:
      security.istio.io/tlsMode: istio
  exportTo:
  - .
  hosts:
  - test-cluster-metrics-grpc-consumer.test
  location: MESH_INTERNAL
  ports:
  - name: grpc-10901
    number: 10901
    protocol: GRPC
  resolution: DNS

Destination Rule in Cluster A

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata: 
  name: test-cluster-metrics-grpc-consumer-import
  namespace: monitoring
spec:
  exportTo:
  - .
  host: test-cluster-metrics-grpc-consumer.test
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
      sni: outbound_.10901_._.cluster-metrics-grpc-provider.monitoring.svc.cluster.local

Attached: Grafana screenshot showing that we cannot see the latest metrics, although metrics older than 2 hours are visible.


dosubot bot commented Dec 9, 2024

Hey, @inboxamitraj! I'm here to help you with any bugs, questions, or contributions. Let's work together to resolve the issue you're facing.

I found a similar open issue that might be relevant to your situation: Thanos-Querier error "rpc error: code = DeadlineExceeded desc = context deadline exceeded" with istio cross-cluster communication. This issue is still open, and no solution has been found in the comments [1].

Regarding your problem, the 'DeadlineExceeded' error typically indicates that the gRPC call is taking longer than the allowed timeout. Here are a few suggestions to address this issue:

  1. Increase Timeout Settings: You've mentioned increasing timeouts in Thanos-Querier, but ensure that both --query.timeout and --store.response-timeout are set to sufficiently high values to accommodate network delays, especially in a cross-cluster setup. For example:

    --query.timeout=5m
    --store.response-timeout=1m
  2. Check Istio Configuration: Ensure that your Istio configuration allows for the necessary gRPC traffic between clusters. The ServiceEntry and DestinationRule configurations seem correct, but verify that the sni value in the DestinationRule matches the expected format and that the tls mode is appropriate for your setup.

  3. Network Policies: Verify that there are no network policies or firewall rules blocking the gRPC traffic between the clusters.

  4. Istio Version Compatibility: Since the issue started after upgrading Istio, consider checking for any known compatibility issues between Thanos and Istio 1.20. Although I couldn't find specific compatibility issues in the repository, reviewing Istio's release notes for any breaking changes might be helpful.

  5. Debugging: Use Istio's tools to debug the traffic flow and ensure that the requests are reaching the intended destinations without being blocked or delayed.

These steps might help in resolving the 'DeadlineExceeded' errors and restoring the expected functionality of your Thanos-Querier setup.
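
For suggestion 2, one quick check is to dump the Envoy cluster configuration from the querier's sidecar and confirm the SNI it actually sends for the remote host (a sketch; replace the pod name placeholder with the real thanos-querier pod in Cluster A):

    istioctl proxy-config cluster <thanos-querier-pod> -n monitoring \
      --fqdn test-cluster-metrics-grpc-consumer.test -o json | grep -i -A 2 '"sni"'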


dosubot bot commented Dec 9, 2024

You're welcome! If you encounter any further issues or need additional assistance, feel free to reach out. Good luck with the troubleshooting!


@inboxamitraj
Author

@dosu
I tried adding the arguments below, but no luck. I still get the same error:

--query.timeout=5m
--store.response-timeout=1m


dosubot bot commented Dec 9, 2024

Since increasing the timeout settings didn't resolve the issue, consider the following additional steps:

  1. gRPC Retry Configuration: Ensure that the gRPC retry settings are properly configured. Thanos uses a retry policy with a maximum of 3 attempts and an initial backoff of 0.1 seconds, which might not be sufficient for your setup. You may need to adjust these settings if possible [1].

  2. Istio mTLS Configuration: Double-check the mTLS settings in Istio. Ensure that the DestinationRule and ServiceEntry configurations are correctly set up for mutual TLS, and that the SNI value is accurate. Any mismatch can cause communication issues.

  3. Envoy Sidecar Logs: Check the logs of the Envoy sidecar proxies in your Istio setup. They might provide more insight into whether the requests are being blocked or delayed.

  4. Network Latency and Bandwidth: Investigate any potential network latency or bandwidth issues between the clusters. High latency or limited bandwidth can cause gRPC calls to exceed their deadlines.

  5. Istio Debugging Tools: Utilize Istio's debugging tools, such as istioctl proxy-config and istioctl analyze, to diagnose and resolve any configuration issues that might be affecting traffic flow.

These steps should help further diagnose and potentially resolve the gRPC communication issues you're experiencing with Thanos-Querier in your multi-cluster setup.
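
For points 3 and 5, the checks could look roughly like this (the pod name is a placeholder; run against the querier's sidecar in Cluster A):

    # Envoy sidecar logs for the querier (point 3)
    kubectl logs <thanos-querier-pod> -n monitoring -c istio-proxy --tail=200

    # proxy config sync status and general configuration analysis (point 5)
    istioctl proxy-status
    istioctl analyze -n monitoring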
