Skip to content

Conversation

@vittoriasalim
Copy link
Contributor

@vittoriasalim vittoriasalim commented Mar 24, 2025

This pull request introduces job scheduling latency benchmark using ClusterLoader2 (CL2) and KWOK, as a performance evaluation for job controller in Kubernetes.

  • Added JobLifecycleLatency measurement in [modules/python/clusterloader2/utils.py]
  • Added performance evaluation pipeline for job scheduling.
  • Added cl2 config file for job scheduling performance test
  • Added cl2 job template for job execution process.
  • Added job scheduling evaluation scenarios. To provide environment-specific inputs for job-scheduling tests on AWS and Azure.
  • Introduced mock-node KWOK topology for simulating job workloads
  • Define CL2 engine steps to support job scheduling configuration.

Dashboard on Current Job-Scheduling Latency Benchmark:
Kusto Dashboard Link

  • As suggested by Robbie, set up cron job to monitor regression every 2 days, this will be monitored for at least 2 weeks before merging at job-scheduling.yml
  • Added unit test at test_default_cli,py as suggest by @sumanthreddy29

Pipeline
https://dev.azure.com/akstelescope/telescope/_build/results?buildId=19154&view=results

@vittoriasalim
Copy link
Contributor Author

@microsoft-github-policy-service agree company="Microsoft"

@vittoriasalim vittoriasalim self-assigned this Mar 24, 2025
@vittoriasalim vittoriasalim requested a review from wonderyl March 24, 2025 02:57
@vittoriasalim vittoriasalim changed the title Add CCP Job Scheduling Latency Benchmark CCP Job Scheduling Latency Benchmark Mar 24, 2025
@vittoriasalim vittoriasalim added the enhancement New feature or request label Mar 24, 2025
Copy link
Collaborator

@wonderyl wonderyl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posting what I got so far, will continuing review a bit later.

Copy link
Collaborator

@wonderyl wonderyl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done my first round of review.

@vittoriasalim vittoriasalim force-pushed the vitto/kwok-cl2 branch 2 times, most recently from d4c5c05 to 1c9310a Compare April 2, 2025 02:35
@vittoriasalim
Copy link
Contributor Author

@wonderyl
I have made the suggested corrections:

  • Renamed test_cli.py to test_default_cli.py.
  • Redesigned test_default_cli.py based on the suggestion to check the behavior of the function and avoid using useless assertions.
  • Renamed uniform1qps to uniform{$qps}qps, since it is no longer 1qps.
  • Renamed job scheduler to job scheduling.

@vittoriasalim
Copy link
Contributor Author

@wonderyl Converted mktemp() to mkstemp()

wonderyl
wonderyl previously approved these changes May 1, 2025
ignore: |
modules/python/clusterloader2/**/*.yaml
modules/python/clusterloader2/**/*.yml
scenarios/perf-eval/job-scheduling/**/*.yaml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those files should be moved to clusterloader2 instead of adding ignore here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the thought process. Previously cl2 configs are under modules folder. The problem with that is, to write a new config for a new pipeline, we need to copy/past a whole folder of code. This not only include the config files, but also the python files as well. This is very poor in terms of usability. So we propose a new solution, moving config files under scenarios, and created a default python module, that can be reused. Then creating a new pipeline doesn't require to copy the python files anymore.


from kubernetes_client import KubernetesClient
from utils import get_measurement, parse_xml_to_json, run_cl2_command

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessarily true because different test requires only a specific set of arguments and configs, so combining this all together like this make it hard for new person to come and modify code if they only want to use a subset of arguments and configs. It's better to have base class like Sumanth suggests so other test can extend, which is not something this script is doing. So for this test, I'd suggest just add whatever argument you need for your test, and in another PR, create a base class for ClusterLoader2 and refactor

@@ -0,0 +1,62 @@
name: job-scheduling
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be moved to clusterloader2 folder for re-use

@@ -0,0 +1,29 @@
apiVersion: batch/v1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be moved to clusterloader2 folder for re-use

@@ -0,0 +1,6 @@
steps:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this file needed? Are you using karpenter or create new nodepool and nodeclaim?

default: "v0.6.1"

steps:
- script: |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be done in python script?

branches:
include:
- main
- vitto/kwok-cl2 # to be removed after the PR is merged
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private branch scheduled should be removed before the merge. If not, you have to create new PR again to remove this branch.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest leaving this comment unresolved till you resolve all other comments and remove this change after you get an approval for this PR.

@sumanthreddy29 sumanthreddy29 requested a review from Copilot May 1, 2025 16:29
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a job scheduling latency benchmark using ClusterLoader2 (CL2) and KWOK, along with the necessary configuration templates, pipelines, and updated CLI functionalities to support performance evaluation for Kubernetes job controllers.

  • Adds new CL2 step templates for executing, collecting, validating, and cleaning up benchmark runs
  • Introduces job scheduling configurations and a job template for workload simulation
  • Updates CLI tools, unit tests, and Kubernetes client to support the new benchmarking workflow

Reviewed Changes

Copilot reviewed 15 out of 22 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
steps/topology/kwok/execute-clusterloader2.yml Added execution step parameters and template for CL2 jobs
steps/topology/kwok/collect-clusterloader2.yml Added collection step parameters and template for CL2 jobs
steps/engine/clusterloader2/default/validate.yml Introduced node count validation step
steps/engine/clusterloader2/default/execute.yml Added benchmark execution commands with cloud-specific env handling
steps/engine/clusterloader2/default/collect.yml Added collection script to gather CL2 results
steps/engine/clusterloader2/default/cleanup.yml Added cleanup steps for NodeClaims and NodePools
scenarios/perf-eval/job-scheduling/config/job_template.yaml New job template for scheduling performance tests
scenarios/perf-eval/job-scheduling/config/config.yaml New configuration file incorporating job-scheduling parameters and measurement settings
pipelines/perf-eval/Controller/job-scheduling.yml Added pipeline scheduling and variable settings for performance tests
modules/python/tests/test_default_cli.py Added unit tests for newly introduced CLI commands and validations
modules/python/clusterloader2/utils.py Added JOB_LIFECYCLE_LATENCY_PREFIX mapping for measurement retrieval
modules/python/clusterloader2/default/cli.py Extended CLI tool to support both pod and job workloads
modules/python/clients/kubernetes_client.py Updated Kubernetes client with additional taint key for KWOK topology
Files not reviewed (7)
  • .yamllint: Language not supported
  • modules/python/tests/mock_data/default/report/JobLifecycleLatency_JobLifecycleLatency_job-scheduling: Language not supported
  • modules/python/tests/mock_data/default/report/junit.xml: Language not supported
  • scenarios/perf-eval/job-scheduling/terraform-inputs/aws.tfvars: Language not supported
  • scenarios/perf-eval/job-scheduling/terraform-inputs/azure.tfvars: Language not supported
  • scenarios/perf-eval/job-scheduling/terraform-test-inputs/aws.json: Language not supported
  • scenarios/perf-eval/job-scheduling/terraform-test-inputs/azure.json: Language not supported
Comments suppressed due to low confidence (1)

modules/python/tests/test_default_cli.py:83

  • The error message uses 'nodes' in the singular case; consider updating it to use 'node' when 1 is ready for consistency.
self.assertIn("Only 1 nodes are ready, expected 2 nodes!", str(context.exception))

@vittoriasalim vittoriasalim deleted the vitto/kwok-cl2 branch June 19, 2025 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants