ReplicationConnection.start_link with sync_connect: true might block forever #746

Open
@edgurgel

Elixir version

1.17.3

Database and Version

PostgreSQL 17.4

Postgrex Version

0.19.3

Current behavior

Hi team,

We believe Postgrex.ReplicationConnection.start_link with sync_connect: true sometimes blocks indefinitely.
Unfortunately, I don't yet know how to reproduce it.

Here is the current_stacktrace of the DynamicSupervisor we use to start Postgrex.ReplicationConnection. Line 124 in the trace below is this call:

https://github.com/supabase/realtime/blob/b128eb09ea6e891c59f8a5d6481af484122d5115/lib/realtime/tenants/replication_connection.ex#L124 (case Postgrex.ReplicationConnection.start_link(__MODULE__, attrs, connection_opts) do)

iex> Process.info(pid, :current_stacktrace)
{:current_stacktrace,
 [
   {:proc_lib, :sync_start, 2, [file: ~c"proc_lib.erl", line: 434]},
   {Realtime.Tenants.ReplicationConnection, :start_link, 1,
    [file: ~c"lib/realtime/tenants/replication_connection.ex", line: 124]},
   {DynamicSupervisor, :start_child, 3,
    [file: ~c"lib/dynamic_supervisor.ex", line: 795]},
   {DynamicSupervisor, :handle_start_child, 2,
    [file: ~c"lib/dynamic_supervisor.ex", line: 781]},
   {:gen_server, :try_handle_call, 4, [file: ~c"gen_server.erl", line: 2381]},
   {:gen_server, :handle_msg, 6, [file: ~c"gen_server.erl", line: 2410]},
   {:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 329]}
 ]}

The supervisor was stuck waiting in this state for many hours.

I can see that Postgrex.ReplicationConnection calls the callback module's init/1, but our init/1 is very lightweight:

https://github.com/supabase/realtime/blob/v2.36.17/lib/realtime/tenants/replication_connection.ex#L133-L147

There are also callbacks such as handle_connect. Could those add to the time spent before the connection finally succeeds?

Expected behavior

start_link should respect the timeout configured for establishing the connection.

While I don't know why the connection was never established, it would be good to be able to set the gen_statem init timeout regardless of whether the cause lies in Postgrex's connection/handshake setup. An option to pass a gen_statem timeout would be great, and even better if its default were not :infinity, so that there is always an upper bound on how long start_link waits.

This would also protect callers whose own Module.init/1 takes a long time and blocks the process from starting, which is outside Postgrex.ReplicationConnection's control.
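To illustrate the mechanism being asked for, here is a minimal sketch of how the standard gen_statem `:timeout` start option bounds a slow init/1. It does not use Postgrex at all: `SlowInit` is a hypothetical stand-in module, and the sleep merely simulates a slow handshake or a slow user callback.

```elixir
defmodule SlowInit do
  # Hypothetical stand-in for a gen_statem-based process; not Postgrex code.
  @behaviour :gen_statem

  # The third argument is the gen_statem start options list. `:timeout` is
  # the number of milliseconds the process may spend initializing.
  def start_link(init_sleep_ms, init_timeout_ms) do
    :gen_statem.start_link(__MODULE__, init_sleep_ms, timeout: init_timeout_ms)
  end

  @impl true
  def callback_mode, do: :handle_event_function

  @impl true
  def init(init_sleep_ms) do
    # Simulates a slow connect/handshake or a slow user init/1.
    Process.sleep(init_sleep_ms)
    {:ok, :ready, nil}
  end

  @impl true
  def handle_event(_type, _event, _state, _data), do: :keep_state_and_data
end

# init/1 sleeps far longer than the allowed 100 ms, so the start is aborted
# and the caller gets an error back instead of blocking.
{:error, :timeout} = SlowInit.start_link(5_000, 100)

# A fast init/1 starts normally within the same kind of bound.
{:ok, pid} = SlowInit.start_link(0, 1_000)
true = is_pid(pid)
```

With a bound like this, a DynamicSupervisor calling start_link would get `{:error, :timeout}` back rather than waiting for hours.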

Hope this makes sense.

Thanks in advance!

I'm happy to send a PR for this change, given some guidance on how we would like to expose it. If we go this route, would we want a generic way to pass gen_statem start_link options, or simply a :timeout (:init_timeout) option?
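To make the two shapes concrete, here is a rough sketch of how each could separate start options from connection options. Everything in it is hypothetical: `StartOptsSketch`, the `:init_timeout` key, and the 5_000 ms default are illustrations only, not existing Postgrex API.

```elixir
defmodule StartOptsSketch do
  # Hypothetical default; the point is only that it is not :infinity.
  @default_init_timeout 5_000

  # Start option keys accepted by :gen_statem.start_link/3.
  @gen_statem_keys [:timeout, :debug, :spawn_opt, :hibernate_after]

  # Shape A: a single dedicated key (hypothetical :init_timeout), mapped to
  # gen_statem's :timeout. Returns {start_opts, remaining_opts}.
  def from_init_timeout(opts) do
    {init_timeout, rest} = Keyword.pop(opts, :init_timeout, @default_init_timeout)
    {[timeout: init_timeout], rest}
  end

  # Shape B: generic pass-through of any gen_statem start option.
  # Caveat (assumption): a bare :timeout key could be ambiguous if the
  # connection options use the same name, which is one argument for Shape A.
  def from_generic(opts) do
    Keyword.split(opts, @gen_statem_keys)
  end
end

{[timeout: 1_000], [sync_connect: true]} =
  StartOptsSketch.from_init_timeout(sync_connect: true, init_timeout: 1_000)

{[hibernate_after: 500], [sync_connect: true]} =
  StartOptsSketch.from_generic(sync_connect: true, hibernate_after: 500)
```

Shape A keeps the public surface small, while Shape B avoids adding a new key for every start option someone may need later.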
