Blocks storage unable to ingest samples older than 1h after an outage #116

@grafanabot

TSDB doesn't allow appending samples whose timestamp is older than the last block cut from the head. Since a block is cut from the head up to 50% of the block range behind the max timestamp in the head, and the default block range period is 2h, the blocks storage rejects any sample whose timestamp is more than 1h older than the most recent timestamp in the head.

Let's consider this scenario:

  • Multiple Prometheus servers remote writing to the same Cortex tenant
  • Some Prometheus servers stop remote writing to Cortex (for any reason, e.g. a networking issue) and fall behind by more than 1h
  • When those Prometheus servers come back online, Cortex discards any sample whose timestamp is older than 1h, because the max timestamp in the TSDB head is close to "now" (due to the working Prometheus servers, which never stopped writing series) while the recovering ones are trying to catch up by writing samples older than 1h

We recently had an outage in our staging environment which triggered this condition and we should find a way to solve it.

@bwplotka You may be interested, given I think this issue affects Thanos receive too.

Submitted by: pracucci
Cortex Issue Number: 2366
