[ENT Batches] - Problem with Batches getting stuck and never completing #486

@Tylerian

Checklist

  • Which Faktory package and version?
    Faktory v1.9.0
  • Which Faktory worker package and version?
    faktory_workers_go v1.9.0
  • Please include any relevant worker configuration
    Workers = 3
    Concurrency = 2
  • Please include any relevant error messages or stacktraces
    Client:
    Unable to report JID Qtjl6r_Ifxh32kUQ result to Faktory: read tcp 172.16.114.109:56062->10.100.183.42:7419: i/o timeout

    Server:
    Unable to process timed job: cannot retry reservation: Job not found Qtjl6r_Ifxh32kUQ
    No such job to acknowledge Qtjl6r_Ifxh32kUQ

Are you using an old version?
No

Have you checked the changelogs to see if your issue has been fixed in a later version?
Yes

Context

We're running a bulk process with Faktory which triggers millions of individual Jobs wrapped in Batches to split the work into manageable chunks.
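
A minimal sketch of the chunking idea described above, assuming the work items are plain integer IDs. The `chunkIDs` helper and the chunk size are hypothetical names chosen for illustration, not the production code, and the Faktory Batch calls themselves are deliberately omitted:

```go
package main

import "fmt"

// chunkIDs splits ids into consecutive slices of at most size elements.
// In this sketch each returned slice stands for the Jobs of one Batch.
// (Hypothetical helper; not part of Faktory or faktory_worker_go.)
func chunkIDs(ids []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	var chunks [][]int
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[start:end])
	}
	return chunks
}

func main() {
	// Ten work items split into chunks of four.
	ids := make([]int, 10)
	for i := range ids {
		ids[i] = i + 1
	}
	for i, c := range chunkIDs(ids, 4) {
		fmt.Printf("batch %d: %d jobs\n", i, len(c))
	}
}
```

Each chunk would then be pushed as the Jobs of a single Batch, so one stuck Job blocks only the callbacks of its own Batch.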

Problem

Sometimes the Batches UI page shows 1 pending Job that is neither running nor waiting in the queue to be processed. This leaves the Batch stuck, and it never completes. The Batch's success/complete callbacks are never called either.

[Image: Screenshot 2024-09-02 at 14 37 21]

Searching the logs turns up very little. The most I've managed to find are networking errors like the following:

  • Worker logs:
    Unable to report JID Qtjl6r_Ifxh32kUQ result to Faktory: read tcp 172.16.114.109:56062->10.100.183.42:7419: i/o timeout

  • Server logs:
    Unable to process timed job: cannot retry reservation: Job not found Qtjl6r_Ifxh32kUQ
    No such job to acknowledge Qtjl6r_Ifxh32kUQ
