Description
What did you do?
The customer upgraded 9 GEM ingesters from v1.7.0 to v2.0.1. The corresponding Prometheus versions are:
v1.7.0: github.com/grafana/prometheus-private v0.0.0-20211105104652-a882d28d367e
v2.0.1: github.com/grafana/mimir-prometheus v0.0.0-20220210151959-f8e3195f7500
(Note that prometheus-private has been renamed to mimir-prometheus, so both are the same repo.)
What did you expect to see?
We expected the new version to replay the WAL successfully.
What did you see instead? Under which circumstances?
After updating the 9 ingesters, 2 of them encountered a WAL corruption during startup. I have checked the diff between the two Prometheus versions in use, and the only modifications to the WAL that I can see are in this PR, but I don't know how this change would lead to a corruption during replay.
The customer also reported that the WAL replay took much longer than usual on the other 7 ingesters, which came back successfully. I'm not sure whether that's relevant to this issue.
I only have screenshots of the logs.
This is the WAL corruption in the ingester log:
This is from the log of one of the ingesters that came up successfully. Note that the reported replay duration is 2h22min, whereas the customer said that restarting an ingester usually took 5-10min:
Side-note: This could also simply indicate an issue with the storage volumes in use, but I still wanted to submit this issue to check whether somebody is aware of a change that could lead to this.