-
Notifications
You must be signed in to change notification settings - Fork 18
CCP Job Scheduling Latency Benchmark #575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@microsoft-github-policy-service agree company="Microsoft" |
wonderyl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Posting what I got so far, will continuing review a bit later.
wonderyl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done my first round of review.
d4c5c05 to
1c9310a
Compare
|
@wonderyl
|
|
@wonderyl Converted mktemp() to mkstemp() |
| ignore: | | ||
| modules/python/clusterloader2/**/*.yaml | ||
| modules/python/clusterloader2/**/*.yml | ||
| scenarios/perf-eval/job-scheduling/**/*.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those files should be moved to clusterloader2 instead of adding ignore here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the thought process. Previously cl2 configs are under modules folder. The problem with that is, to write a new config for a new pipeline, we need to copy/past a whole folder of code. This not only include the config files, but also the python files as well. This is very poor in terms of usability. So we propose a new solution, moving config files under scenarios, and created a default python module, that can be reused. Then creating a new pipeline doesn't require to copy the python files anymore.
|
|
||
| from kubernetes_client import KubernetesClient | ||
| from utils import get_measurement, parse_xml_to_json, run_cl2_command | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not necessarily true because different test requires only a specific set of arguments and configs, so combining this all together like this make it hard for new person to come and modify code if they only want to use a subset of arguments and configs. It's better to have base class like Sumanth suggests so other test can extend, which is not something this script is doing. So for this test, I'd suggest just add whatever argument you need for your test, and in another PR, create a base class for ClusterLoader2 and refactor
| @@ -0,0 +1,62 @@ | |||
| name: job-scheduling | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be moved to clusterloader2 folder for re-use
| @@ -0,0 +1,29 @@ | |||
| apiVersion: batch/v1 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be moved to clusterloader2 folder for re-use
| @@ -0,0 +1,6 @@ | |||
| steps: | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this file needed? Are you using karpenter or create new nodepool and nodeclaim?
| default: "v0.6.1" | ||
|
|
||
| steps: | ||
| - script: | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be done in python script?
| branches: | ||
| include: | ||
| - main | ||
| - vitto/kwok-cl2 # to be removed after the PR is merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private branch scheduled should be removed before the merge. If not, you have to create new PR again to remove this branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest leaving this comment unresolved till you resolve all other comments and remove this change after you get an approval for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a job scheduling latency benchmark using ClusterLoader2 (CL2) and KWOK, along with the necessary configuration templates, pipelines, and updated CLI functionalities to support performance evaluation for Kubernetes job controllers.
- Adds new CL2 step templates for executing, collecting, validating, and cleaning up benchmark runs
- Introduces job scheduling configurations and a job template for workload simulation
- Updates CLI tools, unit tests, and Kubernetes client to support the new benchmarking workflow
Reviewed Changes
Copilot reviewed 15 out of 22 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| steps/topology/kwok/execute-clusterloader2.yml | Added execution step parameters and template for CL2 jobs |
| steps/topology/kwok/collect-clusterloader2.yml | Added collection step parameters and template for CL2 jobs |
| steps/engine/clusterloader2/default/validate.yml | Introduced node count validation step |
| steps/engine/clusterloader2/default/execute.yml | Added benchmark execution commands with cloud-specific env handling |
| steps/engine/clusterloader2/default/collect.yml | Added collection script to gather CL2 results |
| steps/engine/clusterloader2/default/cleanup.yml | Added cleanup steps for NodeClaims and NodePools |
| scenarios/perf-eval/job-scheduling/config/job_template.yaml | New job template for scheduling performance tests |
| scenarios/perf-eval/job-scheduling/config/config.yaml | New configuration file incorporating job-scheduling parameters and measurement settings |
| pipelines/perf-eval/Controller/job-scheduling.yml | Added pipeline scheduling and variable settings for performance tests |
| modules/python/tests/test_default_cli.py | Added unit tests for newly introduced CLI commands and validations |
| modules/python/clusterloader2/utils.py | Added JOB_LIFECYCLE_LATENCY_PREFIX mapping for measurement retrieval |
| modules/python/clusterloader2/default/cli.py | Extended CLI tool to support both pod and job workloads |
| modules/python/clients/kubernetes_client.py | Updated Kubernetes client with additional taint key for KWOK topology |
Files not reviewed (7)
- .yamllint: Language not supported
- modules/python/tests/mock_data/default/report/JobLifecycleLatency_JobLifecycleLatency_job-scheduling: Language not supported
- modules/python/tests/mock_data/default/report/junit.xml: Language not supported
- scenarios/perf-eval/job-scheduling/terraform-inputs/aws.tfvars: Language not supported
- scenarios/perf-eval/job-scheduling/terraform-inputs/azure.tfvars: Language not supported
- scenarios/perf-eval/job-scheduling/terraform-test-inputs/aws.json: Language not supported
- scenarios/perf-eval/job-scheduling/terraform-test-inputs/azure.json: Language not supported
Comments suppressed due to low confidence (1)
modules/python/tests/test_default_cli.py:83
- The error message uses 'nodes' in the singular case; consider updating it to use 'node' when 1 is ready for consistency.
self.assertIn("Only 1 nodes are ready, expected 2 nodes!", str(context.exception))
Co-authored-by: Sumanth Reddy CH <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Sumanth Reddy CH <[email protected]>
This pull request introduces job scheduling latency benchmark using ClusterLoader2 (CL2) and KWOK, as a performance evaluation for job controller in Kubernetes.
JobLifecycleLatencymeasurement in [modules/python/clusterloader2/utils.py]Dashboard on Current Job-Scheduling Latency Benchmark:
Kusto Dashboard Link
Pipeline
https://dev.azure.com/akstelescope/telescope/_build/results?buildId=19154&view=results