Bug: StoreGateway Series route having high latency even with low server and query timeouts #12180

@rishabhkumar92

What is the bug?

We recently saw high latency from store-gateways on the Series route (route="/gatewaypb.StoreGateway/Series") even though both querier.timeout and -server.http-write-timeout are set to 1m. Latency sometimes reaches 8 minutes, which caused all the slots to be taken up by expensive store-gateway queries.

How to reproduce it?

  1. We are on Mimir 2.16.1
  2. Issue some very expensive long-range queries that are computationally heavy
  3. The p99 latency on the Series route exceeds the configured timeouts

What did you think would happen?

The timeout should kick in if a query takes more than a minute (the configured value).

What was your environment?

Kubernetes

Any additional context to share?

Here is what the config looks like:

target: store-gateway
multitenancy_enabled: true
no_auth_tenant: anonymous
shutdown_delay: 0s
max_separate_metrics_groups_per_user: 1000
enable_go_runtime_metrics: true
api:
    skip_label_name_validation_header_enabled: false
    skip_label_count_validation_header_enabled: false
    alertmanager_http_prefix: /api/prom/alertmanager
    prometheus_http_prefix: /prometheus
server:
    http_listen_network: tcp
    http_listen_address: ""
    http_listen_port: 8000
    http_listen_conn_limit: 0
    grpc_listen_network: tcp
    grpc_listen_address: ""
    grpc_listen_port: 9095
    grpc_listen_conn_limit: 0
    proxy_protocol_enabled: false
    tls_cipher_suites: ""
    tls_min_version: ""
    http_tls_config:
        cert: ""
        key: null
        client_ca: ""
        cert_file: ""
        key_file: ""
        client_auth_type: ""
        client_ca_file: ""
    grpc_tls_config:
        cert: ""
        key: null
        client_ca: ""
        cert_file: ""
        key_file: ""
        client_auth_type: ""
        client_ca_file: ""
    register_instrumentation: true
    report_grpc_codes_in_instrumentation_label_enabled: true
    graceful_shutdown_timeout: 30s
    http_server_read_timeout: 30s
    http_server_read_header_timeout: 0s
    http_server_write_timeout: 1m0s
    http_server_idle_timeout: 2m0s
    http_log_closed_connections_without_response_enabled: false
    grpc_server_max_recv_msg_size: 209715200
    grpc_server_max_send_msg_size: 104857600
    grpc_server_max_concurrent_streams: 100
    grpc_server_max_connection_idle: 2562047h47m16.854775807s
    grpc_server_max_connection_age: 2562047h47m16.854775807s
    grpc_server_max_connection_age_grace: 2562047h47m16.854775807s
    grpc_server_keepalive_time: 2h0m0s
    grpc_server_keepalive_timeout: 20s
    grpc_server_min_time_between_pings: 10s
    grpc_server_ping_without_stream_allowed: true
    grpc_server_num_workers: 100
    grpc_server_stats_tracking_enabled: true
    grpc_server_recv_buffer_pools_enabled: false
    log_format: json
    log_level: info
    l...
querier:
    query_store_after: 12h0m0s
    store_gateway_client:
        tls_enabled: false
        tls_cert_path: ""
        tls_key_path: ""
        tls_ca_path: ""
        tls_server_name: ""
        tls_insecure_skip_verify: false
        tls_cipher_suites: ""
        tls_min_version: ""
        cluster_validation:
            label: ""
    shuffle_sharding_ingesters_enabled: true
    prefer_availability_zone: ""
    streaming_chunks_per_ingester_series_buffer_size: 256
    streaming_chunks_per_store_gateway_series_buffer_size: 256
    minimize_ingester_requests: true
    minimize_ingester_requests_hedging_delay: 3s
    query_engine: prometheus
    enable_query_engine_fallback: true
    filter_queryables_enabled: false
    max_concurrent: 20
    timeout: 1m0s
    max_samples: 50000000
    default_evaluation_interval: 1m0s
    lookback_delta: 5m0s
    mimir_query_engine:
        enable_aggregation_operations: true
        enable_binary_logical_operations: true
        enable_one_to_many_and_many_to_one_binary_operations: true
        enable_scalars: true
        enable_scalar_scalar_binary_comparison_operations: true
        enable_subqueries: true
        enable_vector_scalar_binary_comparison_operations: true
        enable_vector_vector_binary_comparison_operations: true
        disabled_aggregations: ""
        disabled_functions: ""
