Handle dropped webhooks for scaling up new runners #271

There’s an issue with our runner scale-up code: if the initial webhook for starting a job gets dropped, the system will never try to provision a fresh runner for that job.

In the Meta fleet this hasn’t been a noticeable issue, since there are enough runners of every instance type running or on standby that a few machines are still available to service the dropped requests. However, the LF fleet has fewer jobs requested of it right now (so fewer runners that might be just finishing up a job and ready to service a request), and it also has no idle fleet (to reduce costs), which results in this behavior.

To fix this, we should start regularly checking GitHub for all queued jobs and ensure we are provisioning enough runners to handle all of those requests.
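As a rough illustration of the reconciliation step described above, here is a minimal sketch of how the periodic check could work out how many runners are missing. It assumes the queued jobs have already been fetched from GitHub and reduced to one runner-type label per job. The function name `runner_deficit`, the input shapes, and the label names are hypothetical and are not taken from the existing scale-up code.

```python
from collections import Counter


def runner_deficit(queued_job_labels, available_runners):
    """Return how many extra runners to provision, keyed by runner type.

    queued_job_labels: one runner-type label per queued job (assumed shape).
    available_runners: runner type -> count of runners that are idle or
        already starting up (assumed shape).
    """
    demand = Counter(queued_job_labels)
    deficit = {}
    for runner_type, wanted in demand.items():
        missing = wanted - available_runners.get(runner_type, 0)
        if missing > 0:
            deficit[runner_type] = missing
    return deficit


# Hypothetical labels, for illustration only.
queued = ["linux.2xlarge", "linux.2xlarge", "linux.gpu"]
available = {"linux.2xlarge": 1}
print(runner_deficit(queued, available))
```

Because the result depends only on the current queue and the current runner counts, a job whose webhook was dropped would still be counted on the next check, and running the check repeatedly should not over-provision as long as runners that are still starting up are included in `available_runners`.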

Example of a job that hit this issue: https://github.com/pytorch/pytorch/actions/runs/10779508314/job/29935842319
