@dejanzele
Member

What type of PR is this?

Enhancement / Bug fix

What this PR does / why we need it

Moves gang scheduling metadata from string annotations to typed protobuf fields, improving type safety and clarifying the contract between the scheduler and the executor.

New types

Added three new types to replace annotation-based gang scheduling:

  • GangInfo represents the user's gang configuration (gang ID, cardinality, and node uniformity label name). Users submit this in the new SubmitJob.gang field.
  • GangPlacement extends GangInfo with the scheduler's placement decision. After the scheduler decides where to place the gang, it adds the node_uniformity_label_value field. This complete placement info gets sent to executors.
  • SchedulingMetadata is a container message in JobRunLease that holds the GangPlacement. This gives us room to add other scheduling decisions in the future.
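The relationship between the three types can be sketched in Go as below. This is only an illustration of the shape described above, not the generated protobuf code: the field names are inferred from the description, modelling "extends" as struct embedding is an assumption, and `withPlacement` is a hypothetical helper.

```go
package main

import "fmt"

// GangInfo is the user-supplied gang configuration.
type GangInfo struct {
	GangID                  string
	Cardinality             uint32
	NodeUniformityLabelName string
}

// GangPlacement is GangInfo plus the scheduler's placement decision.
// Embedding is an assumption; the real message may repeat the fields instead.
type GangPlacement struct {
	GangInfo
	NodeUniformityLabelValue string
}

// SchedulingMetadata is the container carried in the job run lease.
type SchedulingMetadata struct {
	GangPlacement *GangPlacement
}

// withPlacement (hypothetical) combines the user's config with the label
// value chosen by the scheduler.
func withPlacement(info GangInfo, labelValue string) *SchedulingMetadata {
	return &SchedulingMetadata{
		GangPlacement: &GangPlacement{
			GangInfo:                 info,
			NodeUniformityLabelValue: labelValue,
		},
	}
}

func main() {
	info := GangInfo{
		GangID:                  "my-distributed-training",
		Cardinality:             3,
		NodeUniformityLabelName: "kubernetes.io/hostname",
	}
	meta := withPlacement(info, "node-a")
	fmt.Printf("%+v\n", *meta.GangPlacement)
}
```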

Server

The server now handles gang metadata in two phases:

  • During job submission, it validates the new gang field or converts legacy gang annotations into SchedulingMetadata. The buildSchedulingMetadata() function handles both cases, so old and new clients both work.
  • During scheduling, the scheduler populates the node_uniformity_label_value based on where it decides to place the gang. It then sends the complete SchedulingMetadata to the executor via JobRunLease.
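A rough sketch of the two phases, under stated assumptions: the annotation keys, the function signature, the validation rules, and the rule that the typed field takes precedence over annotations are all guesses for illustration. `buildSchedulingMetadataSketch` and `setPlacement` are hypothetical names, not the functions in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Assumed legacy annotation keys; check the Armada source for the exact names.
const (
	gangIDAnnotation          = "armadaproject.io/gangId"
	gangCardinalityAnnotation = "armadaproject.io/gangCardinality"
	gangLabelAnnotation       = "armadaproject.io/gangNodeUniformityLabel"
)

type GangInfo struct {
	GangID                  string
	Cardinality             uint32
	NodeUniformityLabelName string
}

type GangPlacement struct {
	GangInfo
	NodeUniformityLabelValue string
}

type SchedulingMetadata struct {
	GangPlacement *GangPlacement
}

// buildSchedulingMetadataSketch covers phase one: use the typed field if set,
// otherwise fall back to legacy annotations. It returns nil metadata for a
// job with no gang configuration at all.
func buildSchedulingMetadataSketch(gang *GangInfo, annotations map[string]string) (*SchedulingMetadata, error) {
	info := gang
	if info == nil {
		id, ok := annotations[gangIDAnnotation]
		if !ok {
			return nil, nil // not a gang job
		}
		n, err := strconv.ParseUint(annotations[gangCardinalityAnnotation], 10, 32)
		if err != nil {
			return nil, fmt.Errorf("invalid gang cardinality annotation: %w", err)
		}
		info = &GangInfo{
			GangID:                  id,
			Cardinality:             uint32(n),
			NodeUniformityLabelName: annotations[gangLabelAnnotation],
		}
	}
	if info.GangID == "" {
		return nil, errors.New("gang ID must not be empty")
	}
	if info.Cardinality == 0 {
		return nil, errors.New("gang cardinality must be at least 1")
	}
	return &SchedulingMetadata{GangPlacement: &GangPlacement{GangInfo: *info}}, nil
}

// setPlacement covers phase two: the scheduler records the label value of
// the nodes it picked before the metadata is sent to the executor.
func setPlacement(meta *SchedulingMetadata, labelValue string) {
	if meta != nil && meta.GangPlacement != nil {
		meta.GangPlacement.NodeUniformityLabelValue = labelValue
	}
}

func main() {
	legacy := map[string]string{
		gangIDAnnotation:          "my-distributed-training",
		gangCardinalityAnnotation: "3",
		gangLabelAnnotation:       "kubernetes.io/hostname",
	}
	meta, err := buildSchedulingMetadataSketch(nil, legacy)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	setPlacement(meta, "node-a")
	fmt.Printf("%+v\n", *meta.GangPlacement)
}
```

Both entry paths converge on the same struct, which is what lets old and new clients share the rest of the pipeline.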

Executor

The executor now receives fully populated SchedulingMetadata from the scheduler and uses it directly to build environment variables for Armada job pods.
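As a sketch of what this could look like on the executor side: the environment variable names below are assumptions and should be checked against the executor code, `EnvVar` is a local stand-in for the Kubernetes type, and `gangEnvVars` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
)

// EnvVar is a local stand-in for the Kubernetes core/v1 EnvVar type.
type EnvVar struct {
	Name  string
	Value string
}

type GangInfo struct {
	GangID                  string
	Cardinality             uint32
	NodeUniformityLabelName string
}

type GangPlacement struct {
	GangInfo
	NodeUniformityLabelValue string
}

type SchedulingMetadata struct {
	GangPlacement *GangPlacement
}

// gangEnvVars (hypothetical) maps the typed placement to pod environment
// variables. No annotation parsing is needed because every value arrives
// already populated from the scheduler.
func gangEnvVars(meta *SchedulingMetadata) []EnvVar {
	if meta == nil || meta.GangPlacement == nil {
		return nil
	}
	p := meta.GangPlacement
	return []EnvVar{
		{Name: "ARMADA_GANG_ID", Value: p.GangID},
		{Name: "ARMADA_GANG_CARDINALITY", Value: strconv.FormatUint(uint64(p.Cardinality), 10)},
		{Name: "ARMADA_GANG_NODE_UNIFORMITY_LABEL_NAME", Value: p.NodeUniformityLabelName},
		{Name: "ARMADA_GANG_NODE_UNIFORMITY_LABEL_VALUE", Value: p.NodeUniformityLabelValue},
	}
}

func main() {
	meta := &SchedulingMetadata{GangPlacement: &GangPlacement{
		GangInfo: GangInfo{
			GangID:                  "my-distributed-training",
			Cardinality:             3,
			NodeUniformityLabelName: "kubernetes.io/hostname",
		},
		NodeUniformityLabelValue: "node-a",
	}}
	for _, e := range gangEnvVars(meta) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```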

Example

queue: test-queue
jobSetId: gang-example
jobs:
  - priority: 0
    namespace: default
    gang:
      gangId: my-distributed-training
      cardinality: 3
      nodeUniformityLabelName: kubernetes.io/hostname
    podSpec:
      containers:
      - name: main
        image: alpine
        command: ["sleep", "5"]
        resources:
          requests:
            cpu: 1
            memory: 1Gi
          limits:
            cpu: 1
            memory: 1Gi

@dejanzele dejanzele force-pushed the feat/refactor-scheduling-metadata branch 9 times, most recently from 4a4a113 to 67fe139 Compare November 26, 2025 18:53
@dejanzele dejanzele force-pushed the feat/refactor-scheduling-metadata branch from 67fe139 to e4c0963 Compare November 26, 2025 19:07