Skip to content

Handle dropped webhooks for scaling up new runners #271

@ZainRizvi

Description

@ZainRizvi

There’s a issue with our runner scale up code that if the initial webhook for starting the job gets dropped then the system will never try to provision a fresh runner for that job.

In the Meta fleet that hasn’t been a noticeable issue since we will have enough runners of every instance type running or in standby that it still leaves a few machines available to service the dropped requests. However, the LF fleet is has fewer jobs requested of it right now (so fewer runners that might be just finishing up a job and be ready to service a request) and it also has a no idle fleet (to reduce costs), which results in this behavior.

To fix this: we should start regularly checking GitHub for all queued jobs and ensure we are provisioning enough runners to handle all those requests

Example of jobs that hit this issue: https://github.com/pytorch/pytorch/actions/runs/10779508314/job/29935842319

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions