
test_physical_replication_config_mismatch_too_many_known_xids is failing on slow backend exit #10167

Closed
@alexanderlaw

Description

Numerous test_physical_replication_config_mismatch_too_many_known_xids failures (they have occurred mostly on relatively slow aarch64 runners, but the most recent one is from x64):
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10153/12340853499/index.html#testresult/d2d742bbd5eafbc2/
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10150/12324065238/index.html#/testresult/62cbf9b86ee197f6
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10139/12356033335/index.html#/testresult/ba041a552edad115
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9976/12347310946/index.html#/testresult/bec99e17e8048588
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10131/12322904295/index.html#/testresult/110ba3a81c35c6d7
....
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10126/12322979734/index.html#/testresult/e8a5900970131270
all failing with the error:
psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 19647 failed: FATAL: sorry, too many clients already

These failures show that the test cannot survive a slow backend shutdown.
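
To make the failure mode concrete, here is a standalone sketch (my own illustration, not code from the test): with max_connections=2 and one slot pinned, rapid connect/disconnect churn on the remaining slot occasionally races a backend that has not finished exiting. The port and user below are taken from the logs in this report; everything else is assumed.

import psycopg2

# Illustrative connection string; port and user come from the logs above.
CONNSTR = "host=127.0.0.1 port=19647 dbname=postgres user=cloud_admin"

# Pin one of the two slots, as compute_ctl:activity_monitor does in the test.
holder = psycopg2.connect(CONNSTR)

# Churn the remaining slot. Each iteration's backend must finish proc_exit()
# before the next connect, or the server refuses the new connection with
# "FATAL: sorry, too many clients already".
for _ in range(1000):
    conn = psycopg2.connect(CONNSTR)
    with conn.cursor() as cur:
        cur.execute("SHOW shared_buffers")
        cur.fetchone()
    conn.close()

holder.close()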

I could reproduce this locally, and with verbose logging, I see:

PG:2024-12-16 19:24:15.009 GMT [postgres][1826200:17][client backend] [[unknown]] LOG:  statement: SHOW neon.max_file_cache_size
PG:2024-12-16 19:24:15.009 GMT [postgres][1826091:219][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.010 GMT [postgres][1826091:220][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.011 GMT [postgres][1826091:221][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] Swap file cache page
PG:2024-12-16 19:24:15.011 GMT [][1826021:134][postmaster] DEBUG:  forked new backend, pid=1826217 socket=9
PG:2024-12-16 19:24:15.011 GMT [postgres][1826091:222][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:18][client backend] [[unknown]] DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
PG:2024-12-16 19:24:15.011 GMT [[unknown]][1826217:1][not initialized] [[unknown]] LOG:  connection received: host=127.0.0.1 port=49494
PG:2024-12-16 19:24:15.011 GMT [postgres][1826217:2][client backend] [[unknown]] FATAL:  sorry, too many clients already
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:19][client backend] [[unknown]] DEBUG:  shmem_exit(0): 6 on_shmem_exit callbacks to make
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:20][client backend] [[unknown]] DEBUG:  proc_exit(0): 2 callbacks to make
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:21][client backend] [[unknown]] LOG:  disconnection: session time: 0:00:00.031 user=cloud_admin database=postgres host=127.0.0.1 port=49488

This test sets max_connections=2, but one connection is permanently occupied by compute_ctl:activity_monitor, and the other can still be occupied by a backend (1826200 in this case) that has just finished a psql query, such as the queries issued in Endpoint.start():

        self.log_config_value("shared_buffers")
        self.log_config_value("neon.max_file_cache_size")
        self.log_config_value("neon.file_cache_size_limit")
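
For context, here is a minimal sketch of what each log_config_value call presumably amounts to (the psycopg2-based body is my illustrative assumption, not the actual fixture code). The point is that every call opens a short-lived connection, and the backend serving it keeps its max_connections slot until proc_exit() completes:

import logging
import psycopg2

def log_config_value(connstr: str, param: str) -> None:
    # Illustrative sketch: connect, run SHOW, log the value, disconnect.
    conn = psycopg2.connect(connstr)
    try:
        with conn.cursor() as cur:
            cur.execute(f"SHOW {param}")
            logging.info("%s = %s", param, cur.fetchone()[0])
    finally:
        # Closing the socket does not mean the server-side backend has
        # exited; it may still hold its connection slot briefly.
        conn.close()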

The issue can be reproduced easily by adding this sleep to proc_exit() in PostgreSQL:

@@ -107,6 +107,7 @@ proc_exit(int code)
     if (MyProcPid != (int) getpid())
         elog(PANIC, "proc_exit() called in child process");
 
+    pg_usleep(100000);      /* simulate a slow backend exit */
     /* Clean up everything that must be cleaned up */
     proc_exit_prepare(code);
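
A possible test-side mitigation (my sketch, not something proposed in this issue) is to retry the connection briefly when the server reports this transient slot exhaustion:

import time
import psycopg2

def connect_with_retry(connstr: str, attempts: int = 20, delay: float = 0.05):
    # Retry while a previously-exiting backend may still hold its
    # max_connections slot; re-raise any other error, or on the last attempt.
    for attempt in range(attempts):
        try:
            return psycopg2.connect(connstr)
        except psycopg2.OperationalError as err:
            transient = "too many clients already" in str(err)
            if transient and attempt < attempts - 1:
                time.sleep(delay)
                continue
            raise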

Labels

a/test/flaky (Area: related to flaky tests), c/compute (Component: compute, excluding postgres itself), t/bug (Issue Type: Bug)
