Description
Elixir version
1.17.3
Database and Version
PostgreSQL 17.4
Postgrex Version
0.19.3
Current behavior
Hi team,
We think that Postgrex.ReplicationConnection.start_link with sync_connect: true sometimes gets stuck.
Sadly I'm not entirely sure how to reproduce this issue.
Here is the current_stacktrace of the DynamicSupervisor we use to start Postgrex.ReplicationConnection. Line 124 in the trace below is this one:
https://github.com/supabase/realtime/blob/b128eb09ea6e891c59f8a5d6481af484122d5115/lib/realtime/tenants/replication_connection.ex#L124 (case Postgrex.ReplicationConnection.start_link(__MODULE__, attrs, connection_opts) do)
iex> Process.info(pid, :current_stacktrace)
{:current_stacktrace,
[
{:proc_lib, :sync_start, 2, [file: ~c"proc_lib.erl", line: 434]},
{Realtime.Tenants.ReplicationConnection, :start_link, 1,
[file: ~c"lib/realtime/tenants/replication_connection.ex", line: 124]},
{DynamicSupervisor, :start_child, 3,
[file: ~c"lib/dynamic_supervisor.ex", line: 795]},
{DynamicSupervisor, :handle_start_child, 2,
[file: ~c"lib/dynamic_supervisor.ex", line: 781]},
{:gen_server, :try_handle_call, 4, [file: ~c"gen_server.erl", line: 2381]},
{:gen_server, :handle_msg, 6, [file: ~c"gen_server.erl", line: 2410]},
{:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 329]}
]}
This supervisor was stuck waiting in that call for many hours.
I can see that Postgrex.ReplicationConnection calls the module's init, but our init is very lightweight.
But there is also handle_connect etc., which can add to the time it takes until the connection finally succeeds?
Expected behavior
The expected behaviour is to respect the timeout set to establish the connection.
While I don't know why the connection was not set up, it would be good to be able to set the gen_statem init timeout regardless of whether this is an issue with the Postgrex connection/handshake setup or not. Having the option to pass a gen_statem timeout would be great, and even better if the default value was not :infinity, so that there is ultimately a maximum amount of time it will wait to start.
This would also protect people from having their Module.init/1 take a long time and block the process from starting, which is outside Postgrex.ReplicationConnection's control.
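To illustrate what I mean, here is a minimal sketch of how the :timeout start option of :gen_statem.start_link/3 bounds a slow init/1. SlowInit is a made-up module for this example only and does not reflect Postgrex internals; the delay and timeout values are arbitrary.

```elixir
defmodule SlowInit do
  # Hypothetical gen_statem whose init/1 sleeps to simulate a slow connect.
  @behaviour :gen_statem

  @impl true
  def callback_mode, do: :handle_event_function

  @impl true
  def init(delay_ms) do
    # Stand-in for a handshake or user callback that takes a long time.
    Process.sleep(delay_ms)
    {:ok, :idle, nil}
  end

  @impl true
  def handle_event(_type, _event, _state, _data), do: :keep_state_and_data
end

# init/1 needs 500 ms but only 50 ms are allowed: the process is killed
# and the caller gets an error tuple instead of blocking.
{:error, :timeout} = :gen_statem.start_link(SlowInit, 500, timeout: 50)

# With enough time the start succeeds as usual.
{:ok, pid} = :gen_statem.start_link(SlowInit, 0, timeout: 1_000)
true = Process.alive?(pid)
```

Without a timeout (the default is :infinity) the first call would block for as long as init/1 takes, which looks consistent with the stacktrace above.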
Hope this makes sense.
Thanks in advance!
I'm happy to send a PR with this change, given some guidance on how we would like to expose this. If we go this route, would we want a generic way to pass gen_statem start_link options, or simply the :timeout (:init_timeout) option?
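To make the two shapes concrete, here is a rough sketch of how the options could be split before calling :gen_statem.start_link. StartOptionsSketch and both function names are hypothetical, and :init_timeout is the proposed option, not one that exists today; nothing here is meant to describe how Postgrex currently handles its options.

```elixir
defmodule StartOptionsSketch do
  # Hypothetical helper, not part of Postgrex.

  # gen_statem start options that a generic pass-through could forward.
  @start_keys [:timeout, :spawn_opt, :debug, :hibernate_after]

  # Shape 1: a dedicated :init_timeout key, translated to gen_statem's :timeout.
  def from_init_timeout(opts) do
    {init_timeout, conn_opts} = Keyword.pop(opts, :init_timeout, :infinity)
    {[timeout: init_timeout], conn_opts}
  end

  # Shape 2: pull every known gen_statem start option out of the keyword list.
  def from_generic(opts), do: Keyword.split(opts, @start_keys)
end

opts = [hostname: "localhost", sync_connect: true, init_timeout: 5_000]

{start_opts, conn_opts} = StartOptionsSketch.from_init_timeout(opts)
# start_opts would be passed as the third argument of :gen_statem.start_link/3,
# conn_opts would stay with the connection.
IO.inspect(start_opts, label: "start_opts")
IO.inspect(conn_opts, label: "conn_opts")
```

One thing to consider with the generic shape: if a bare :timeout key already has a meaning among the connection options, forwarding it to gen_statem would be ambiguous, which is why a separate :init_timeout name might be the safer choice.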