Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add target_job_queue to SubmitAnyscaleJob operator #52

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

RCdeWit
Copy link

@RCdeWit RCdeWit commented Dec 19, 2024

As described here: #44
Adds two optional parameters to submit to existing job queues. Does not implement creation of new job queues.

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @RCdeWit, thanks a lot for contributing these changes.

Please, add tests and update the docs with this feature, explaining the current limitation that the queues should be previously created.

Also, should we update the RolloutAnyscaleService operator so that we allow the creation of these queues and users don't have to manually create those? Does this version of the Anyscale SDK support this?

@RCdeWit
Copy link
Author

RCdeWit commented Dec 20, 2024

Hi @tatiana, I'm happy to update the docs.

I'm struggling somewhat with how I should implement the test; my knowledge of PyTest is very surface level. Please advice on whether my current attempt makes sense.

Also, should we update the RolloutAnyscaleService operator so that we allow the creation of these queues and users don't have to manually create those? Does this version of the Anyscale SDK support this?

It would have to be the SubmitAnyscaleJob operator, not the service one. Currently, the only way to create a new job queue is to create a job and attach it to a new queue. I.e., if the JobQueueConfig contains a JobQueueSpec, it'll create a new job queue.

Would it make sense to incorporate this into the SubmitAnyscaleJob operator as well? It might be a bit overburdened.

@tatiana tatiana self-assigned this Dec 30, 2024
@marwan116 marwan116 self-requested a review January 5, 2025 16:07
@marwan116
Copy link
Collaborator

marwan116 commented Jan 5, 2025

@RCdeWit I gave this some thought and I think it makes sense to extend the SubmitAnyscaleJob operator to accept a JobQueueConfig. i.e. exposing the same interface that our SDK does for submitting jobs.

Placing ourselves in the position of an end-user, they would need to submit a job via the SDK to create a new job queue but this means:

  • they won't have the control that the operator provides in terms of polling the job for its status
  • it will be quite wasteful to submit a job just to create a job queue especially for large clusters

i.e. If we had an SDK that looks like anyscale.create_job_queue, then I would agree, that extending the SubmitAnyscaleJob responsibility boundary wouldn't be a good practice but we don't.

I'm struggling somewhat with how I should implement the test; my knowledge of PyTest is very surface level. Please advice on whether my current attempt makes sense.

In terms of testing, here are general best practices

  • Tests should provide a good coverage of the codebase.
  • When performing a change, it's good to add tests that cover the new functionality.

Now, in the case of this PR, we are simply extending the SubmitAnyscaleJob operator to accept a JobQueueConfig.

So, there are two places where we can add tests:

  • The SubmitAnyscaleJob operator unit tests
  • The integration test that runs the SubmitAnyscaleJob operator

To find the unit tests, look under:

  • tests/operators/ for operator unit tests
  • tests/hooks/ for hook unit tests
  • tests/triggers/ for trigger unit tests

Given the hook behavior of fetching the job status doesn't change, no need to add tests for the hook. Same applies to the trigger.

Looking at the SubmitAnyscaleJob operator unit tests: all we are testing is that it correctly delegating to the underlying hook/trigger. So, while we can extend the unit test to run subtests with different JobQueueConfig inputs, it's not necessary.

To find the integration test, look under:

  • tests/dags/test_dag_example.py

You will see that the integration test is running by executing the dags under example_dags.

There are two dags under example_dags:

  • anyscale_service.py
  • anyscale_job.py

The anyscale_service.py dag is a simple example of deploying a service, waiting for it to be ready and terminating it. This tests the service operator and is not relevant to this PR.

The anyscale_job.py is an example of submitting a job. I propose you extend this dag to test the new functionality by:

  • submitting a job with a JobQueueConfig
  • submitting a job with the same JobQueueConfig and checking that it re-uses the existing job queue

The nice thing is the example dag will also serve as documentation for the new functionality (see how docs/index.rst refers to the script via .. literalinclude:: ../example_dags/anyscale_job.py).

I have created a branch that implements all the above changes, feel free to either:

  • copy the changes from there if you agree with them, or
  • close out this PR and I can open a PR based on my branch

let me know which is most convenient for you

@RCdeWit
Copy link
Author

RCdeWit commented Jan 6, 2025

Thanks, @marwan116, much appreciated!

I have copied over the changes from your separate branch. TBH, commits directly to this branch would've been more convenient. But it works now. 😄

@asorrentino-agreena
Copy link

Hi guys, thanks for the effort of putting this together! We would greatly benefits from having this feature available as we are encountering this exact scenario. Especially with the suggestions from @marwan116 everything would be much easier, do you have updates on when this feature will be available? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants