Allow supervisor to recover after crash #519

Open
wants to merge 2 commits into main

Conversation

darinwilson

This is a first pass at a fix for #512: allowing Solid Queue to recover if the database goes offline (or if it fails for any other reason).

In the case of the database going away, there are a few possible scenarios that can cause the supervisor to fail, but the most common is:

  • the worker tries and fails to poll the database
  • the worker then tries to deregister (via after_shutdown in Registrable) and fails
  • Process#deregister re-raises any exceptions that come up during deregistration, so the worker crashes
  • the supervisor detects the crashed process and tries to replace it, but fails
  • the supervisor also inherits from Process, so when it fails it calls deregister just like the worker did
  • nothing else is around to catch the exception raised in deregister, so Solid Queue terminates completely (sketched below)
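
In pseudocode, the shape of the problem looks roughly like this (an illustrative sketch only, not the actual Solid Queue source; the method names and bodies are stand-ins):

```ruby
# Rough pseudocode of the cascade above -- not the real Solid Queue code.
# Every registered process (workers and the supervisor alike) deregisters
# itself on shutdown, and that step needs a working database connection.
def deregister
  delete_process_record   # stand-in for the DB write; raises when the database is unreachable
end

def shutdown
  stop_work               # stand-in for the normal shutdown steps
  deregister              # any exception here propagates to whoever called shutdown;
                          # for the supervisor there is nothing left to rescue it,
                          # so the whole Solid Queue process exits
end
```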

After a restart, the maintenance tasks performed by the supervisor do a good job of cleaning up the loose ends left behind, so it seemed like the cleanest approach was just to let the supervisor crash, then spin up a new instance. This is handled by a new Launcher class that wraps Supervisor#start in a retry block with exponential backoff.
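
To make the retry/backoff behavior concrete, here's a minimal sketch of the idea (names, defaults, and signatures are illustrative rather than the actual patch):

```ruby
# Illustrative sketch only -- not the actual patch.
module SolidQueue
  class Launcher
    def initialize(max_restart_attempts: 0, base_delay: 1, max_delay: 60)
      @max_restart_attempts = max_restart_attempts  # nil => retry forever, 0 => never retry
      @base_delay = base_delay                      # seconds before the first restart
      @max_delay = max_delay                        # cap on the backoff delay
    end

    def start
      attempts = 0
      begin
        Supervisor.start
      rescue StandardError
        attempts += 1
        raise if @max_restart_attempts && attempts > @max_restart_attempts

        # exponential backoff, capped so a long outage doesn't grow the wait unbounded
        sleep [@base_delay * (2**(attempts - 1)), @max_delay].min
        retry
      end
    end
  end
end
```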

This also adds a new config parameter max_restart_attempts that allows the user to limit the number of restart attempts. If nil, it will retry forever, and if 0, it won't try at all. (I made 0 the default since that's the current behavior.)
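
For example, assuming the option ends up exposed through the Rails config namespace like Solid Queue's other settings (the exact name and location are up to the maintainers, so treat this as hypothetical):

```ruby
# Hypothetical usage -- the exact config surface may differ in the final patch.
# config/environments/production.rb
Rails.application.configure do
  # nil => retry forever, 0 => never retry (current behavior, proposed default),
  # n  => give up after n consecutive failed restart attempts
  config.solid_queue.max_restart_attempts = 5
end
```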

I tested with Postgres and MySQL, but didn't really know how to test SQLite or if it even made sense to. Again, this is just a first attempt - happy to try a different approach if this doesn't seem quite right.

Thanks!

@darinwilson
Author

Looking into the test failures

@darinwilson
Author

All checks but 1 are passing, and the failure was due to an error from Docker:

/usr/bin/docker create --name 067820fef0e2421091a40c8747b66a14_mysql8031_27b123 --label 42a304 --network github_network_4a20af9528964765a5e3250c78009207 --network-alias mysql -p 33060:3306 --health-cmd "mysql -h localhost -e \"select now()\"" --health-interval 1s --health-timeout 5s --health-retries 30 -e "MYSQL_ALLOW_EMPTY_PASSWORD=yes" -e GITHUB_ACTIONS=true -e CI=true mysql:8.0.31
  b6e254956276fbbbdaa878d38f3c826cbef2afd738c7a383b6e6380450611489
  /usr/bin/docker start b6e254956276fbbbdaa878d38f3c826cbef2afd738c7a383b6e6380450611489
  Error response from daemon: driver failed programming external connectivity on endpoint 067820fef0e2421091a40c8747b66a14_mysql8031_27b123 (6238b6123635c6941643dbf6ac903990818d03ff85b58fd928b89f0bb53a9856): Error starting userland proxy: listen tcp4 0.0.0.0:33060: bind: address already in use
  Error: failed to start containers: b6e254956276fbbbdaa878d38f3c826cbef2afd738c7a383b6e6380450611489
  Error: Docker start fail with exit code 1

I tried to see if there was a way to re-run the job, but couldn't find anything. Am I missing something obvious? 😅
