Blocks storage unable to ingest samples older than 1h after an outage

TSDB does not allow appending samples whose timestamp is older than the end of the last block cut from the head. A block is cut from the head only up to the max timestamp within the head minus 50% of the block range, and the default block range period is 2h. As a result, the blocks storage rejects any sample whose timestamp is more than 1h older than the most recent timestamp in the head.
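To make the arithmetic concrete, here is a minimal sketch of the acceptance rule described above. It is a simplified model, not the actual TSDB code: the names `min_valid_time` and `can_append` are hypothetical, and timestamps are assumed to be milliseconds.

```python
HOUR_MS = 60 * 60 * 1000
BLOCK_RANGE_MS = 2 * HOUR_MS  # default 2h block range, as stated above


def min_valid_time(head_max_ts_ms: int, block_range_ms: int = BLOCK_RANGE_MS) -> int:
    """Oldest timestamp still accepted: head max timestamp minus half the block range."""
    return head_max_ts_ms - block_range_ms // 2


def can_append(sample_ts_ms: int, head_max_ts_ms: int,
               block_range_ms: int = BLOCK_RANGE_MS) -> bool:
    """True if the sample is not older than the minimum valid time (simplified model)."""
    return sample_ts_ms >= min_valid_time(head_max_ts_ms, block_range_ms)


# Example: the head has seen samples up to t = 10h.
head_max = 10 * HOUR_MS
fresh_enough = can_append(head_max - 30 * 60 * 1000, head_max)  # 30m behind
too_old = can_append(head_max - 90 * 60 * 1000, head_max)       # 90m behind
```

With the default 2h block range, the cutoff sits exactly 1h behind the newest sample in the head, which is where the "older than 1h" limit in this issue comes from.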

Let's consider this scenario:
- Multiple Prometheus servers remote writing to the same Cortex tenant
- Some Prometheus servers stop remote writing to Cortex (for any reason, e.g. a networking issue) and fall more than 1h behind
- When those Prometheus servers come back online, Cortex discards any sample whose timestamp is older than 1h, because the max timestamp in the TSDB head is close to "now" (the healthy Prometheus servers never stopped writing series) while the recovering servers are trying to catch up by writing samples older than 1h
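The scenario above can be reproduced with a toy model. This is only a sketch under stated assumptions: `ToyHead` is a hypothetical stand-in for a per-tenant TSDB head, it ignores series identity entirely, and it applies the "max timestamp minus half the block range" rule on every append.

```python
HOUR_MS = 60 * 60 * 1000
MINUTE_MS = 60 * 1000
BLOCK_RANGE_MS = 2 * HOUR_MS  # default 2h block range


class ToyHead:
    """Hypothetical, simplified stand-in for a per-tenant TSDB head."""

    def __init__(self, block_range_ms: int = BLOCK_RANGE_MS) -> None:
        self.block_range_ms = block_range_ms
        self.max_ts = None  # newest timestamp appended so far

    def append(self, ts_ms: int) -> bool:
        """Accept the sample unless it is older than max_ts - block_range / 2."""
        if self.max_ts is not None and ts_ms < self.max_ts - self.block_range_ms // 2:
            return False
        self.max_ts = ts_ms if self.max_ts is None else max(self.max_ts, ts_ms)
        return True


head = ToyHead()
now = 3 * HOUR_MS

# A healthy Prometheus server writes one sample per minute, up to "now".
for ts in range(0, now + 1, MINUTE_MS):
    head.append(ts)

# A second server stopped writing at t = 1h and now replays its backlog.
backlog = range(1 * HOUR_MS, now + 1, MINUTE_MS)
results = [head.append(ts) for ts in backlog]

accepted = results.count(True)
dropped = results.count(False)
print(f"backlog: accepted={accepted} dropped={dropped}")
```

In this model every backlog sample older than `now - 1h` is dropped, even though the recovering server never wrote it before. The healthy server alone is enough to keep the cutoff close to "now".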

We recently had an outage in our staging environment which triggered this condition and we should find a way to solve it.

_@bwplotka You may be interested, given I think this issue affects Thanos receive too._

Submitted by: pracucci
Cortex Issue Number: 2366

Blocks storage unable to ingest samples older than 1h after an outage #116
