[Jobs] Limit number of concurrent jobs & launches. #4248

Open · wants to merge 1 commit into master
Conversation

@cblmemo (Collaborator) commented Nov 2, 2024

Fixes #4243.

This PR adds a memory-based limit on the number of concurrently running managed jobs, and a CPU-based limit on the number of concurrent `sky launch` calls made by the jobs controller.

I followed SkyServe's implementation in applying only a CPU limit to concurrent launches, since IIRC `sky.launch` consumes more compute than memory; conversely, only a memory limit applies to the number of concurrent jobs, since ray jobs consume more memory. A sketch of the idea follows.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Comment on lines +490 to +494
if (len(managed_job_state.get_nonterminal_job_ids_by_name(None)) >
        managed_job_utils.NUM_JOBS_THRESHOLD):
    raise exceptions.ManagedJobUserCancelledError(
        'Too many concurrent managed jobs are running. '
        'Please try again later or cancel some jobs.')
Collaborator

Hmm, this does not make sense: when there are not enough resources on the controller, users should expect newly submitted managed jobs to be queued on the controller rather than erroring out. Should we instead change the task's CPU requirement based on the controller's memory?

Successfully merging this pull request may close these issues.

[Jobs] Managed job controller process taking too much memory during peak time