Description
Numerous test_physical_replication_config_mismatch_too_many_known_xids failures (mostly on relatively slow aarch64, but the most recent one is from x64):
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10153/12340853499/index.html#testresult/d2d742bbd5eafbc2/
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10150/12324065238/index.html#/testresult/62cbf9b86ee197f6
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10139/12356033335/index.html#/testresult/ba041a552edad115
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9976/12347310946/index.html#/testresult/bec99e17e8048588
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10131/12322904295/index.html#/testresult/110ba3a81c35c6d7
....
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10126/12322979734/index.html#/testresult/e8a5900970131270
with the error:
psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 19647 failed: FATAL: sorry, too many clients already
These failures show that the test cannot survive a slow backend shutdown.
I was able to reproduce this locally, and with verbose logging I see:
PG:2024-12-16 19:24:15.009 GMT [postgres][1826200:17][client backend] [[unknown]] LOG: statement: SHOW neon.max_file_cache_size
PG:2024-12-16 19:24:15.009 GMT [postgres][1826091:219][client backend] [compute_ctl:activity_monitor] DEBUG: [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.010 GMT [postgres][1826091:220][client backend] [compute_ctl:activity_monitor] DEBUG: [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.011 GMT [postgres][1826091:221][client backend] [compute_ctl:activity_monitor] DEBUG: [NEON_SMGR] Swap file cache page
PG:2024-12-16 19:24:15.011 GMT [][1826021:134][postmaster] DEBUG: forked new backend, pid=1826217 socket=9
PG:2024-12-16 19:24:15.011 GMT [postgres][1826091:222][client backend] [compute_ctl:activity_monitor] DEBUG: [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:18][client backend] [[unknown]] DEBUG: shmem_exit(0): 4 before_shmem_exit callbacks to make
PG:2024-12-16 19:24:15.011 GMT [[unknown]][1826217:1][not initialized] [[unknown]] LOG: connection received: host=127.0.0.1 port=49494
PG:2024-12-16 19:24:15.011 GMT [postgres][1826217:2][client backend] [[unknown]] FATAL: sorry, too many clients already
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:19][client backend] [[unknown]] DEBUG: shmem_exit(0): 6 on_shmem_exit callbacks to make
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:20][client backend] [[unknown]] DEBUG: proc_exit(0): 2 callbacks to make
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:21][client backend] [[unknown]] LOG: disconnection: session time: 0:00:00.031 user=cloud_admin database=postgres host=127.0.0.1 port=49488
This test sets max_connections=2, but one connection is permanently occupied by compute_ctl:activity_monitor, and the other can still be occupied by a backend (1826200 in this case) that has just finished a psql query, such as the ones issued in Endpoint.start():
self.log_config_value("shared_buffers")
self.log_config_value("neon.max_file_cache_size")
self.log_config_value("neon.file_cache_size_limit")
The issue can be reproduced easily by adding this sleep to proc_exit(), which keeps the exiting backend's connection slot occupied a little longer:
@@ -107,6 +107,7 @@ proc_exit(int code)
 	if (MyProcPid != (int) getpid())
 		elog(PANIC, "proc_exit() called in child process");
+	pg_usleep(100000);
 	/* Clean up everything that must be cleaned up */
 	proc_exit_prepare(code);