Similar to zalando/postgres-operator#2168, there is an issue when restoring with a point-in-time-recovery target in the future on PG13+, due to the implementation of #580.
With the correct settings:
logging_collector: 'on'
log_destination: stderr,csvlog
The log is correctly read by maybe_pg_upgrade.py, BUT it sometimes happens that a line containing `Hot standby mode is disabled.` (emitted for some random connection attempt, from our pg exporter... and from spilo/patroni itself? see the log below) is inserted at the same time and makes the check fail.
Hot standby mode being disabled is a feature: Patroni disables hot_standby during PITR to avoid potential errors: patroni/patroni#1891
The exact cause of why it works on small databases but fails on our bigger databases with the same settings is unknown yet; I think our biggest databases are just... slower, making it a lot more likely that these connections land between the interesting log lines.
The fact that we rely on the last 5 lines of the log to decide whether it is OK seems very dangerous.
During recovery, we had to... live-edit the spilo code to check the last 15 lines instead of 5 to make it work on the second try.
I think a short-term solution would be a better check than "grep in the last 5 lines", and a long-term solution would be to allow a restore without a mandatory time target.
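For illustration only (a minimal sketch, not the actual maybe_pg_upgrade.py logic, and assuming the check needs to spot the `recovery ended before configured recovery target was reached` FATAL from the startup process, as in the extract below): the csvlog rows could be filtered by backend type and severity over a wider window instead of grepping a fixed 5-line tail. Column indexes assume the PG13+ csvlog layout; the window size is arbitrary.

```python
import csv
from collections import deque

# Hypothetical helper, NOT the actual Spilo implementation: decide whether
# recovery stopped because the configured target lies in the future, by
# scanning a wider window of csvlog rows and only considering startup-process
# messages, so interleaved client-backend FATALs cannot push the decisive
# line out of view.
TARGET_NOT_REACHED = 'recovery ended before configured recovery target was reached'


def recovery_target_in_future(csvlog_path, window=200):
    with open(csvlog_path, newline='') as f:
        # Keep only the last `window` rows to bound memory on huge logs.
        rows = deque(csv.reader(f), maxlen=window)

    for row in rows:
        if len(row) < 24:
            continue  # partially written or truncated line
        severity = row[11]      # error_severity
        message = row[13]       # message
        backend_type = row[23]  # backend_type (present in csvlog since PG13)
        if severity == 'FATAL' and backend_type == 'startup' and TARGET_NOT_REACHED in message:
            return True
    return False
```

Because such a check keys on the startup backend and the exact FATAL message, the result no longer depends on how many exporter or Patroni connection attempts happen to be logged during the shutdown.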
Extract from csv logs:
2024-10-22 16:51:06.215 UTC,,,539,,6717d728.21b,1889,,2024-10-22 16:47:36 UTC,,0,LOG,00000,"restored log file ""0000000100002E880000003C"" from archive",,,,,,,,,"","startup",,0
2024-10-22 16:51:06.310 UTC,"postgres","postgres",87962,"[local]",6717d7fa.1579a,1,"",2024-10-22 16:51:06 UTC,,0,FATAL,57P03,"the database system is not accepting connections","Hot standby mode is disabled.",,,,,,,,"","client backend",,0
2024-10-22 16:51:06.479 UTC,,,539,,6717d728.21b,1890,,2024-10-22 16:47:36 UTC,,0,LOG,00000,"redo done at 2E88/3C000060 system usage: CPU: user: 91.75 s, system: 36.75 s, elapsed: 207.81 s",,,,,,,,,"","startup",,0
2024-10-22 16:51:06.479 UTC,,,539,,6717d728.21b,1891,,2024-10-22 16:47:36 UTC,,0,LOG,00000,"last completed transaction was at log time 2024-10-22 15:07:01.344774+00",,,,,,,,,"","startup",,0
2024-10-22 16:51:06.479 UTC,,,539,,6717d728.21b,1892,,2024-10-22 16:47:36 UTC,,0,FATAL,XX000,"recovery ended before configured recovery target was reached",,,,,,,,,"","startup",,0
2024-10-22 16:51:07.317 UTC,"postgres","postgres",87967,"[local]",6717d7fb.1579f,1,"",2024-10-22 16:51:07 UTC,,0,FATAL,57P03,"the database system is not accepting connections","Hot standby mode is disabled.",,,,,,,,"","client backend",,0
2024-10-22 16:51:07.932 UTC,,,536,,6717d727.218,5,,2024-10-22 16:47:35 UTC,,0,LOG,00000,"startup process (PID 539) exited with exit code 1",,,,,,,,,"","postmaster",,0
2024-10-22 16:51:07.932 UTC,,,536,,6717d727.218,6,,2024-10-22 16:47:35 UTC,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,"","postmaster",,0
2024-10-22 16:51:08.027 UTC,"postgres_exporter","wiremind",87968,"127.0.0.1:41510",6717d7fc.157a0,1,"",2024-10-22 16:51:08 UTC,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,"","client backend",,0
2024-10-22 16:51:08.027 UTC,"postgres_exporter","wiremind",87969,"127.0.0.1:41520",6717d7fc.157a1,1,"",2024-10-22 16:51:08 UTC,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,"","client backend",,0
2024-10-22 16:51:08.323 UTC,"postgres","postgres",87971,"[local]",6717d7fc.157a3,1,"",2024-10-22 16:51:08 UTC,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,"","client backend",,0
2024-10-22 16:51:09.331 UTC,"postgres","postgres",87973,"[local]",6717d7fd.157a5,1,"",2024-10-22 16:51:09 UTC,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,"","client backend",,0
2024-10-22 16:51:09.654 UTC,,,536,,6717d727.218,7,,2024-10-22 16:47:35 UTC,,0,LOG,00000,"shutting down due to startup process failure",,,,,,,,,"","postmaster",,0
2024-10-22 16:51:14.214 UTC,,,536,,6717d727.218,8,,2024-10-22 16:47:35 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,"","postmaster",,0
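To make the failure mode concrete: in the extract above, the decisive `recovery ended before configured recovery target was reached` line is followed by nine more lines (client-backend FATALs and postmaster shutdown messages), so a fixed 5-line tail can never see it, while the 15-line workaround does. A minimal reproduction using only the message text of those last ten rows:

```python
# Message text of the last ten rows of the extract above, oldest first.
tail = [
    'recovery ended before configured recovery target was reached',  # the line the check needs
    'the database system is not accepting connections',              # detail: Hot standby mode is disabled.
    'startup process (PID 539) exited with exit code 1',
    'terminating any other active server processes',
    'the database system is in recovery mode',
    'the database system is in recovery mode',
    'the database system is in recovery mode',
    'the database system is in recovery mode',
    'shutting down due to startup process failure',
    'database system is shut down',
]

needle = 'recovery ended before configured recovery target was reached'
print(any(needle in line for line in tail[-5:]))   # False: a 5-line tail misses it
print(any(needle in line for line in tail[-15:]))  # True: the 15-line workaround catches it
```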
For reference, here is our postgres operator Postgresql resource: