Hello BullMQ 👋
We've been using BullMQ Pro (currently at 7.34.0) in production since December, and BullMQ (currently at 5.52.0) for over a year prior to that, with repeatable jobs as a core part of our implementation the entire time.
Starting a few weeks ago, workers would sometimes stop processing jobs. We have two applications, each with three environments (integration, staging, and production). The workers stopped, seemingly at random, on any of the environments. Production runs two worker instances, which should make it much less likely for workers on both instances to stop at once, yet all workers did stop once on one of the production applications.
We scoured the BullMQ documentation and added all the logging we could find (worker events, queue events, uncaughtException, and unhandledRejection), and set NODE_DEBUG=bull; however, the logs never indicated anything wrong, and we never saw any new debug output after activating NODE_DEBUG=bull.
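For reference, the diagnostics we attached look roughly like this (queue name, processor, and connection details are placeholders, not our actual code):

```typescript
import { Worker } from 'bullmq';

// Illustrative sketch of the logging described above.
const worker = new Worker(
  'jobs',
  async (job) => {
    // ...job processing...
  },
  { connection: { host: 'localhost', port: 6379 } },
);

// Worker-level events
worker.on('error', (err) => console.error('[worker error]', err));
worker.on('failed', (job, err) => console.error('[job failed]', job?.id, err));
worker.on('stalled', (jobId) => console.warn('[job stalled]', jobId));

// Process-level safety nets
process.on('uncaughtException', (err) => console.error('[uncaughtException]', err));
process.on('unhandledRejection', (reason) => console.error('[unhandledRejection]', reason));
```

None of these handlers ever fired before the workers stopped.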
On staging, the only activity when the workers stopped working was triggered by Repeatable Jobs, and the last job to run was a simple one that runs every five minutes to send an event to Datadog (so we can be alerted if background jobs aren't running).
Switching from Repeatable Jobs to Job Schedulers had been on our to-do list for some time, and since we had nothing else to go on to solve the issue, we prioritized that and deployed on Saturday.
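Concretely, the migration amounted to replacing repeat-option adds with upsertJobScheduler calls, roughly like this (queue name, job name, and connection details are illustrative):

```typescript
import { Queue } from 'bullmq';

const queue = new Queue('jobs', { connection: { host: 'localhost', port: 6379 } });

// Before: legacy Repeatable Job, defined via the repeat option on add().
await queue.add(
  'datadog-heartbeat',
  {},
  { repeat: { every: 5 * 60 * 1000 } }, // every five minutes
);

// After: Job Scheduler, keyed by a stable scheduler id, so redeploys
// upsert the schedule instead of accumulating repeat definitions.
await queue.upsertJobScheduler(
  'datadog-heartbeat',
  { every: 5 * 60 * 1000 },
  { name: 'datadog-heartbeat', data: {} },
);
```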
Since then, the workers have been running fine, whereas previously they were stopping on at least one environment within a day of the latest deployment (usually within 12 hours). Granted, it has only been a few days, so it is perhaps too soon to say conclusively, but this is already much longer than we were getting last week. I wanted to raise this here to ask whether it makes sense that switching from Repeatable Jobs to Job Schedulers would solve our issue, or whether it might shed some light on what is going on.