Description
Thanos with Memcached enabled, plus MinIO as long-term storage
Thanos, Prometheus and Golang version used:
Object Storage Provider: S3 (MinIO)
What happened:
I have configured Thanos alongside Memcached, but I cannot get rid of an error that appears whenever I query a range of more than 2 days. I am getting the error below:
receive series from Addr: 10.233.117.207:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
receive series from Addr: 10.233.116.94:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
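For reference, the MinTime and MaxTime values in the error are Unix timestamps in milliseconds. The small Python sketch below is not part of my setup; it is only a helper (the function name `ms_to_utc` is my own) to show which time range the store gateways are advertising:

```python
from datetime import datetime, timezone


def ms_to_utc(ms: int) -> str:
    """Convert a Unix timestamp in milliseconds to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()


# Values copied from the error message above.
min_time = 1727308800000
max_time = 1730368800000

print(ms_to_utc(min_time))  # 2024-09-26T00:00:00+00:00
print(ms_to_utc(max_time))  # 2024-10-31T10:00:00+00:00
```

So both store gateways report blocks covering roughly 35 days, and the failure happens while series are being streamed from them, not while the blocks are being discovered.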
My Thanos Store:
args:
  - store
  - '--log.level=info'
  - '--log.format=logfmt'
  - '--data-dir=/var/thanos/store'
  - '--grpc-address=0.0.0.0:10901'
  - '--http-address=0.0.0.0:10902'
  - '--objstore.config=$(OBJSTORE_CONFIG)'
  - '--ignore-deletion-marks-delay=24h'
  - '--block-sync-concurrency=120'
  - '--sync-block-duration=60m'
  - '--index-cache-size=4096MB'
  - '--chunk-pool-size=4GB'
  - '--store.grpc.series-max-concurrency=300'
  - '--consistency-delay=30m'
  - |-
    --index-cache.config="config":
      "addresses":
      - "thanos-memcached-service.thanos:11211"
      "dns_provider_update_interval": "60s"
      "max_async_buffer_size": 0
      "max_async_concurrency": 1000
      "max_get_multi_batch_size": 0
      "max_get_multi_concurrency": 0
      "max_idle_connections": 400
      "max_item_size": 0
      "timeout": "180s"
    "type": "MEMCACHED"
  - |-
    --store.caching-bucket.config="blocks_iter_ttl": "720h"
    "chunk_object_attrs_ttl": "720h"
    "chunk_subrange_size": 128000
    "chunk_subrange_ttl": "720h"
    "config":
      "addresses":
      - "thanos-memcached-service.thanos:11211"
      "dns_provider_update_interval": "60s"
      "max_async_buffer_size": 0
      "max_async_concurrency": 1000
      "max_get_multi_batch_size": 0
      "max_get_multi_concurrency": 0
      "max_idle_connections": 400
      "max_item_size": 0
      "timeout": "180s"
    "max_chunks_get_range_requests": 3
    "metafile_content_ttl": "720h"
    "metafile_doesnt_exist_ttl": "1h"
    "metafile_exists_ttl": "720h"
    "metafile_max_size": "4MiB"
    "type": "MEMCACHED"
  - |-
    --tracing.config="config":
      "sampler_param": 2
      "sampler_type": "ratelimiting"
      "service_name": "thanos-store"
    "type": "JAEGER"
My Thanos Query Frontend:
args:
  - query-frontend
  - '--enable-auto-gomemlimit'
  - '--log.level=info'
  - '--log.format=logfmt'
  - '--query-frontend.compress-responses'
  - '--http-address=0.0.0.0:9090'
  - >-
    --query-frontend.downstream-url=http://thanos-query.thanos.svc.cluster.local.:9090
  - '--query-range.split-interval=24h'
  - '--labels.split-interval=12h'
  - '--query-range.max-retries-per-request=100'
  - '--labels.max-retries-per-request=25'
  - '--query-frontend.log-queries-longer-than=0'
  - '--query-range.max-query-parallelism=120'
  - '--query-frontend.vertical-shards=0'
  - '--cache-compression-type='
  - '--query-frontend.downstream-tripper-config={"response_header_timeout": "5m", "max_idle_conns_per_host": 100}'
  - |-
    --query-range.response-cache-config="config":
      "addresses":
      - "thanos-memcached-service.thanos:11211"
      "dns_provider_update_interval": "30s"
      "max_async_buffer_size": 0
      "max_async_concurrency": 1000
      "max_get_multi_batch_size": 0
      "max_get_multi_concurrency": 0
      "max_idle_connections": 400
      "timeout": "180s"
      "expiration": "720h"
    "type": "MEMCACHED"
  - |-
    --labels.response-cache-config="config":
      "addresses":
      - "thanos-memcached-service.thanos:11211"
      "dns_provider_update_interval": "30s"
      "max_async_buffer_size": 0
      "max_async_concurrency": 1000
      "max_get_multi_batch_size": 0
      "max_get_multi_concurrency": 0
      "max_idle_connections": 400
      "timeout": "180s"
      "expiration": "720h"
    "type": "MEMCACHED"
  - |-
    --tracing.config="config":
      "sampler_param": 2
      "sampler_type": "ratelimiting"
      "service_name": "thanos-query-frontend"
    "type": "JAEGER"
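To illustrate what I understand `--query-range.split-interval=24h` to do, here is a rough Python sketch of how a range query gets cut into sub-queries. It assumes the split happens at multiples of the interval since the Unix epoch; the real query frontend also takes the query step into account, so treat this as an approximation. The function name `split_range` is hypothetical, not a Thanos API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_range(start: datetime, end: datetime, interval: timedelta):
    """Split [start, end) at interval-aligned boundaries (assumed behaviour)."""
    pieces = []
    cur = start
    while cur < end:
        # Next boundary that is a whole multiple of `interval` since the epoch.
        next_boundary = EPOCH + ((cur - EPOCH) // interval + 1) * interval
        nxt = min(next_boundary, end)
        pieces.append((cur, nxt))
        cur = nxt
    return pieces


# Example: a 7-day query ending at 2024-10-31 10:00 UTC, split every 24h.
end = datetime(2024, 10, 31, 10, 0, tzinfo=timezone.utc)
start = end - timedelta(days=7)
pieces = split_range(start, end, timedelta(hours=24))
print(len(pieces))  # 8 sub-queries: two partial days plus six full days
```

With `--query-range.max-query-parallelism=120`, all of these sub-queries can be sent to the querier at once, and each of them fans out to both store gateways.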
My Prometheus:
containers:
  - args:
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--storage.tsdb.retention.time=12h'
      - '--config.file=/etc/prometheus/config_out/prometheus.env.yaml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
      - '--web.route-prefix=/'
      - '--web.config.file=/etc/prometheus/web_config/web-config.yaml'
      - '--storage.tsdb.max-block-duration=2h'
      - '--storage.tsdb.min-block-duration=2h'
      - '--web.max-connections=8096'
      - '--query.max-concurrency=60'
    image: 'prom/prometheus:v2.49.1'
What you expected to happen:
My Prometheus has 6h of retention. I expected queries over a longer range to be served from object storage through Thanos Store, but if I query more than that I get the error mentioned above.
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components:
receive series from Addr: 10.233.117.207:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
receive series from Addr: 10.233.116.94:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
Anything else we need to know:
ts=2024-08-22T04:15:02.506236929Z caller=memcached_client.go:438 level=warn name=index-cache msg="failed to fetch items from memcached" numKeys=1 firstKey=EP:01J5TQ7GTAK7JFP1SDHAZQABMB:NskVASoO0H1CJRIx74k3hIBPzIM6wCRkKvWOjc9V3Dg:dss err="write tcp 10.233.66.17:47668->10.233.31.160:11211: write: connection timed out"
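Since the warning above is a TCP write timeout towards Memcached, a basic reachability check from inside the store pod may help narrow things down. This is a minimal Python sketch using the Memcached text protocol `version` command; the helper name `memcached_version` is my own, and the host in the commented usage line is taken from my config above.

```python
import socket


def memcached_version(host: str, port: int = 11211, timeout: float = 3.0) -> str:
    """Open a TCP connection, send `version`, and return the server's reply line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"version\r\n")
        data = sock.recv(1024)
    return data.decode("ascii", errors="replace").strip()


# Usage (run from a pod that can resolve the service name):
# print(memcached_version("thanos-memcached-service.thanos"))
```

If this call hangs or raises `socket.timeout`, the problem is likely at the network or Memcached level rather than in the Thanos cache settings.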
Environment:
- OS (e.g. from /etc/os-release): RedHat 8.5
- Kernel (e.g. uname -a): 4.8
- Others: Kubernetes
Could you please help me to understand what I did wrong?