Db migrations getting stuck for a while with no logs until failure. What can be wrong? #45446

andrii-korotkov-verkada · 2025-01-07T01:42:35Z


My setup: ArgoCD for deployment; Terraform for configuring secrets, the DB, permissions, etc.; AWS hosting EKS and an Aurora Postgres database; and an Airflow image based on version 2.10.3 with Python 3.10. I'm using PgBouncer, which I've configured to sync before the DB migrations job. The PgBouncer logs show what look like successful connections, probably from the metrics exporter:

2025-01-07 01:21:25.340 UTC [1] DEBUG C-0x7fc222b94930: (nodb)/(nouser)@127.0.0.1:57012 ignoring startup parameter: extra_float_digits=2
2025-01-07 01:21:25.340 UTC [1] DEBUG C-0x7fc222b94930: (nodb)/(nouser)@127.0.0.1:57012 got var: user=postgres
2025-01-07 01:21:25.340 UTC [1] DEBUG C-0x7fc222b94930: (nodb)/(nouser)@127.0.0.1:57012 got var: datestyle=ISO, MDY
2025-01-07 01:21:25.340 UTC [1] DEBUG C-0x7fc222b94930: (nodb)/(nouser)@127.0.0.1:57012 got var: database=pgbouncer
2025-01-07 01:21:25.340 UTC [1] DEBUG C-0x7fc222b94930: (nodb)/(nouser)@127.0.0.1:57012 got var: client_encoding=UTF8
2025-01-07 01:21:25.340 UTC [1] LOG C-0x7fc222b94930: pgbouncer/postgres@127.0.0.1:57012 login attempt: db=pgbouncer user=postgres tls=no
2025-01-07 01:21:25.340 UTC [1] DEBUG C-0x7fc222b94930: pgbouncer/postgres@127.0.0.1:57012 P: got connection: 127.0.0.1:57012 -> 127.0.0.1:6543
2025-01-07 01:21:25.340 UTC [1] DEBUG C-0x7fc222b94930: pgbouncer/postgres@127.0.0.1:57012 C: selected SASL mechanism: SCRAM-SHA-256

My metadata connection string is configured as follows (some variables replaced with values):

postgresql://postgres:<password-placeholder>@airflow-pgbouncer.vairflow:6543/airflow-metadata?sslmode=disable

For the metadata connection, the PgBouncer logs show what look like crashes:

2025-01-07 01:25:30.407 UTC [1] LOG S-0x7fc222b42040: airflow-metadata/postgres@<ip-address-placeholder>:5432 new connection to server (from 10.34.119.135:41096)
2025-01-07 01:25:40.408 UTC [1] LOG S-0x7fc222b42040: airflow-metadata/postgres@<ip-address-placeholder>:5432 closing because: server conn crashed? (age=10s)

Here's the pgbouncer.ini config (some variables replaced with values):

  secret_string = join("\n", [
    "[databases]",
    "airflow-metadata = host=<rds-cluster-endpoint-placeholder> dbname=vairflow port=5432 pool_size=10 ",
    "airflow-result-backend = host=<rds-cluster-endpoint-placeholder> dbname=vairflow port=5432 pool_size=5 ",
    "",
    "[pgbouncer]",
    "pool_mode = transaction",
    "listen_port = 6543",
    "listen_addr = *",
    "auth_type = scram-sha-256",
    "auth_file = /etc/pgbouncer/users.txt",
    "stats_users = postgres",
    "ignore_startup_parameters = extra_float_digits",
    "max_client_conn = 100",
    "verbose = 1",
    "log_disconnections = 1",
    "log_connections = 1",
    "",
    "server_tls_sslmode = disable",
    "server_tls_ciphers = normal",
  ])
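
The database name at the end of the connection string is a PgBouncer alias rather than the Postgres database itself: PgBouncer looks the name up under `[databases]` and connects to the `dbname` listed there, which is why later server-side errors mention `vairflow` instead of `airflow-metadata`. A minimal Python sketch of that lookup, standard library only. The function name `resolve_alias`, the host `db.example.internal`, and the password are illustrative assumptions, and the parsing assumes no quoted values containing spaces:

```python
import configparser
from urllib.parse import urlsplit

# Trimmed copy of the config above; the host is a made-up placeholder.
PGBOUNCER_INI = """\
[databases]
airflow-metadata = host=db.example.internal dbname=vairflow port=5432 pool_size=10
airflow-result-backend = host=db.example.internal dbname=vairflow port=5432 pool_size=5

[pgbouncer]
pool_mode = transaction
listen_port = 6543
"""


def resolve_alias(conn_url: str, ini_text: str) -> dict:
    """Return the key=value settings listed for the URL's database alias."""
    alias = urlsplit(conn_url).path.lstrip("/")
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(ini_text)
    if alias not in parser["databases"]:
        raise KeyError(f"no [databases] entry named {alias!r}")
    # Each entry is a space-separated list of key=value pairs.
    return dict(pair.split("=", 1) for pair in parser["databases"][alias].split())


backend = resolve_alias(
    "postgresql://postgres:secret@airflow-pgbouncer.vairflow:6543/airflow-metadata?sslmode=disable",
    PGBOUNCER_INI,
)
print(backend["dbname"], backend["port"])
```

Note that `sslmode=disable` in the connection string only covers the client-to-PgBouncer hop; the PgBouncer-to-database hop is governed by `server_tls_sslmode` in the `[pgbouncer]` section.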

Here's the users.txt setup (I'm not sure the duplication is needed, but it was there in the Helm templates):

  secret_string = join("\n", [
    "\"postgres\" \"<password-placeholder>\"",
    "\"postgres\" \"<password-placeholder>\""
  ])

I've also found these logs in PgBouncer

2025-01-07 01:31:02.733 UTC [1] LOG C-0x7fc222b94680: airflow-metadata/postgres@<ip-address-placeholder>:51429 closing because: client_login_timeout (server down) (age=60s)
2025-01-07 01:31:02.733 UTC [1] WARNING C-0x7fc222b94680: airflow-metadata/postgres@<ip-address-placeholder>:51429 pooler error: client_login_timeout (server down)

Similar client login timeout eventually appears when the db migrations job pod fails and is recreated/restarted:

....................
ERROR! Maximum number of retries (20) reached.

Last check result:
$ airflow db check
/home/airflow/.local/lib/python3.10/site-packages/airflow/configuration.py:859 FutureWarning: section/key [core/sql_alchemy_conn] has been deprecated, you should use[database/sql_alchemy_conn] instead. Please update your `conf.get*` call to use the new name
/home/airflow/.local/lib/python3.10/site-packages/airflow/metrics/statsd_logger.py:184 RemovedInAirflow3Warning: The basic metric validator will be deprecated in the future in favor of pattern-matching.  You can try this now by setting config option metrics_use_pattern_match to True.
[2025-01-07T00:24:07.482+0000] {cli_action_loggers.py:177} WARNING - Failed to log action (psycopg2.OperationalError) connection to server at "airflow-pgbouncer.vairflow" (<ip-address-placeholder>), port 6543 failed: FATAL:  client_login_timeout (server down)
<stacktrace-placeholder>
psycopg2.OperationalError: connection to server at "airflow-pgbouncer.vairflow" (<ip-address-placeholder>), port 6543 failed: FATAL:  client_login_timeout (server down)

When checking on the DB side, metrics show some activity but 0 DB connections.
I'd be grateful for hints about what could be wrong, in particular whether it's a DB issue, a PgBouncer issue, a migrations issue, or something else. Thanks.
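
One way to narrow down which hop is failing is to test plain TCP reachability from a pod, first to PgBouncer and then directly to the database endpoint. The sketch below is a minimal example using only Python's standard library; `can_connect` is a hypothetical helper name, and it cannot tell a filtered port (timeout) from a closed one (refused), since both return `False`:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling `can_connect("airflow-pgbouncer.vairflow", 6543)` and then the same function against the cluster endpoint on port 5432 from the migrations pod or the PgBouncer pod would show whether the network path exists at all. If PgBouncer is reachable but the database is not, a `client_login_timeout (server down)` after 60s, as in the logs above, is consistent with PgBouncer accepting the client while never obtaining a server connection.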


potiuk · 2025-01-07T07:42:50Z
Collaborator

Have you followed https://airflow.apache.org/docs/helm-chart/stable/index.html#installing-the-chart-with-argo-cd-flux-rancher-or-terraform ?

4 replies

andrii-korotkov-verkada Jan 7, 2025
Author

I've apparently followed it only partially: I disabled Helm hooks and custom env, but didn't run the job as a sync hook. I'll do the latter as well, but I don't believe it's the cause of the issue, since I'd often delete the existing job before trying to sync again; it would create a pod that just looks like it isn't running.

andrii-korotkov-verkada Jan 7, 2025
Author

Yeah, making the job a hook doesn't help. The relevant errors from the PgBouncer logs seem to be:

2025-01-07 16:46:47.700 UTC [1] LOG C-0x7fc2149a5930: airflow-metadata/postgres@<ip-address-placeholder>:39583 closing because: client_login_timeout (server down) (age=60s)
2025-01-07 16:46:47.700 UTC [1] WARNING C-0x7fc2149a5930: airflow-metadata/postgres@<ip-address-placeholder>:39583 pooler error: client_login_timeout (server down)

The job itself is again producing no logs.

andrii-korotkov-verkada Jan 7, 2025
Author

I've made some progress. Apparently I didn't have an ingress security group rule for the DB (it's unclear to me why one is needed, since the pods and the DB use the same security group and the port is the same).

Now there's another issue according to the PgBouncer logs:

2025-01-07 18:05:42.039 UTC [1] DEBUG S-0x7fc214953110: airflow-metadata/postgres@(bad-af):0 dns_callback: inet4: <ip-address-placeholder>:5432
2025-01-07 18:05:42.039 UTC [1] DEBUG S-0x7fc214953110: airflow-metadata/postgres@<ip-address-placeholder>:5432 launching new connection to server
2025-01-07 18:05:42.039 UTC [1] DEBUG launch_new_connection: already progress
2025-01-07 18:05:42.039 UTC [1] DEBUG S-0x7fc214953110: airflow-metadata/postgres@<ip-address-placeholder>:5432 S: connect ok
2025-01-07 18:05:42.039 UTC [1] LOG S-0x7fc214953110: airflow-metadata/postgres@<ip-address-placeholder>:5432 new connection to server (from <ip-address-placeholder>:40000)
2025-01-07 18:05:42.039 UTC [1] DEBUG launch_new_connection: already progress
2025-01-07 18:05:42.044 UTC [1] WARNING server login failed: FATAL no pg_hba.conf entry for host "<ip-address-placeholder>", user "postgres", database "vairflow", no encryption
2025-01-07 18:05:42.044 UTC [1] LOG C-0x7fc2149a5930: airflow-metadata/postgres@<ip-address-placeholder>:58659 closing because: no pg_hba.conf entry for host "<ip-address-placeholder>", user "postgres", database "vairflow", no encryption (age=8s)
2025-01-07 18:05:42.044 UTC [1] WARNING C-0x7fc2149a5930: airflow-metadata/postgres@<ip-address-placeholder>:58659 pooler error: no pg_hba.conf entry for host "<ip-address-placeholder>", user "postgres", database "vairflow", no encryption
2025-01-07 18:05:42.044 UTC [1] LOG S-0x7fc214953110: airflow-metadata/postgres@<ip-address-placeholder>:5432 closing because: login failed (age=0s)
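
The three PgBouncer messages seen in this thread point at different layers, and the trailing `no encryption` in the `pg_hba.conf` error means the rejected login arrived over an unencrypted connection. As a rough aid for scanning logs, the Python sketch below maps each message to a likely area to check. The patterns are copied from the log lines above; the hints are heuristics drawn from how this thread played out, not an official or complete list, and `classify` and the sample IP are hypothetical:

```python
import re
from typing import Optional

# (pattern, hint) pairs; order matters only if a line could match several.
HINTS = [
    (re.compile(r"client_login_timeout \(server down\)"),
     "client waited but PgBouncer had no server connection; check the PgBouncer -> DB network path"),
    (re.compile(r"server conn crashed\?"),
     "server connection dropped unexpectedly; check network rules and database health"),
    (re.compile(r"no pg_hba\.conf entry .*no encryption"),
     "database rejected an unencrypted login; check TLS settings for the PgBouncer -> DB hop"),
]


def classify(line: str) -> Optional[str]:
    """Return the hint for the first matching pattern, or None if nothing matches."""
    for pattern, hint in HINTS:
        if pattern.search(line):
            return hint
    return None


# Sample line modeled on the log above; the IP address is a made-up placeholder.
sample = ('WARNING server login failed: FATAL no pg_hba.conf entry for host '
          '"10.0.0.1", user "postgres", database "vairflow", no encryption')
print(classify(sample))
```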

Answer selected by andrii-korotkov-verkada

andrii-korotkov-verkada Jan 7, 2025
Author

I've resolved that too by requiring SSL on the PgBouncer -> DB connection.
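
The post does not say which setting was changed. In PgBouncer, TLS toward the backend is controlled by `server_tls_sslmode`, which the configuration earlier in the thread sets to `disable`, so a plausible reading is that it was switched to `require`. The Python sketch below applies that change to a list of config lines shaped like the Terraform `join` input above; `set_option` is a hypothetical helper, and the choice of `require` is an assumption rather than something the author confirmed:

```python
def set_option(lines: list, key: str, value: str) -> list:
    """Return a copy of `lines` with `key = value` replacing any existing setting for `key`."""
    out, replaced = [], False
    for line in lines:
        # The option name is whatever precedes the first "=" on the line.
        name = line.split("=", 1)[0].strip()
        if name == key:
            out.append(f"{key} = {value}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        # Appending assumes the last section in `lines` is [pgbouncer].
        out.append(f"{key} = {value}")
    return out


pgbouncer_section = [
    "[pgbouncer]",
    "pool_mode = transaction",
    "server_tls_sslmode = disable",
    "server_tls_ciphers = normal",
]
updated = set_option(pgbouncer_section, "server_tls_sslmode", "require")
print("\n".join(updated))
```

With `require`, PgBouncer refuses to fall back to a plaintext server connection, which matches the "no encryption" rejection going away once the database sees an encrypted login.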
