WAL corruption on 2/9 ingesters after upgrading from GEM v1.7.0 -> v2.0.1

### What did you do?

The customer upgraded 9 GEM ingesters from v1.7.0 to v2.0.1. The corresponding Prometheus versions are:

v1.7.0: github.com/grafana/prometheus-private v0.0.0-20211105104652-a882d28d367e
v2.0.1: github.com/grafana/mimir-prometheus v0.0.0-20220210151959-f8e3195f7500

(Note that prometheus-private has been renamed to mimir-prometheus, so both paths refer to the same repo.)

### What did you expect to see?

We expected the new version to replay the WAL successfully.

### What did you see instead? Under which circumstances?

After updating the 9 ingesters, 2 of them encountered a WAL corruption during startup. I checked the diff between the two Prometheus versions listed above, and the only modifications to the WAL that I can see are in [this PR](https://github.com/prometheus/prometheus/pull/9856/), but I don't see how this change would lead to a corruption during replay.

The customer also reported that the WAL replay took much longer than usual on the other 7 ingesters, which came back successfully. I'm not sure whether that is relevant to this issue.

I only have screenshots of the logs.

This is the WAL corruption in the ingester log:

![image](https://user-images.githubusercontent.com/195371/173873089-9a3133b9-4704-4980-a873-891d5c85ea09.png)

This is from the log of one of the ingesters that came up successfully. Note that the reported replay duration is `2h22min`; the customer said that restarting an ingester usually takes `5-10min`, so this replay was roughly 14 to 28 times slower than normal:

![image](https://user-images.githubusercontent.com/195371/173873166-022e16c6-b5ec-4053-b58d-e44255d2df70.png)

Side note: this could also simply indicate a problem with the storage volumes in use, but I still wanted to file this issue in case somebody is aware of a change that could lead to this.

