Add e2e test for train API #2199
Signed-off-by: helenxie-bit <[email protected]>
Pull Request Test Coverage Report for Build 12449362407 (Coveralls)
@andreyvelich I've separated the e2e test for the train API, and it works now. Please review when you have time.
Thank you for this. Overall LGTM, just a small comment.
/assign @deepanker13 @kubeflow/wg-training-leads @Electronic-Waste
/lgtm
I've updated the Kubernetes version to |
Basically LGTM. I left some comments for you @helenxie-bit
strategy:
  fail-fast: false
  matrix:
    kubernetes-version: ["v1.31.4"]
Shall we change the Kubernetes version to align with the other CI tests? Like:
Just to save compute resources, I think for now we can run this test on a single k8s version, since we run the rest of the E2E tests on all versions.
WDYT @Electronic-Waste ?
Yeah, I agree. Maybe we can select one k8s version from this list:)
OK, let me change the version to v1.30.6. We can update it in the future if needed.
Given that we support 1.28-1.31, I would suggest that we run our integration tests on 1.29, 1.30, and 1.31. We can update that in a following PR.
For the train API tests, I think running them on 1.31 should be sufficient.
Sounds good. So I think we will still keep the v1.31.4 version.
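For readers following the thread, here is a minimal sketch of how the agreed single-version matrix could sit inside a GitHub Actions workflow. Only the `strategy` block comes from the reviewed diff; the workflow name, the job id `e2e-train-api`, the trigger, and the single echo step are assumptions added for illustration and are not taken from the PR.

```yaml
# Hypothetical workflow sketch. Names, trigger, and steps are assumptions;
# only the strategy block mirrors the snippet under review.
name: E2E Test with train API
on:
  - pull_request
jobs:
  e2e-train-api:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        # A single version keeps CI compute low; the other E2E tests
        # already run against the remaining supported versions.
        kubernetes-version: ["v1.31.4"]
    steps:
      - name: Show the selected Kubernetes version
        run: echo "Testing on ${{ matrix.kubernetes-version }}"
```

Because `kubernetes-version` is a matrix list, adding more entries later would fan the job out to one run per version without changing the steps.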
Thank you for this effort @helenxie-bit!
/lgtm
/approve
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, tenzen-y
* add e2e test for train API
* fix peft import error
* update settings of the job
* fix format
* fix format
* fix error detection
* resolve conflict
* resolve conflict
* resolve conflict
* fix format
* fix NoneType error
* fix format
* test bug
* find bug
* find bug
* find bug
* add storage_config
* fix format
* reduce pvc size
* set storage_config
* set storage_config
* set storage_config
* set storage_config
* use gpu
* use gpu
* use gpu
* fix 'set_device' error
* add timeout error
* fix format
* fix format
* fix format
* fix typo
* update e2e test for train api
* add num_labels
* update pip install
* check disk space
* change sequence of e2e tests
* add clean-up after each e2e test of pytorchjob
* update cleanup function
* update cleanup function
* update cleanup function-add check disk
* check docker volumes
* update cleanup function
* update cleanup function
* check docker directory
* update pip install and 'num_workers'
* update pip install and 'num_workers'
* update pip install
* change the value of 'clean_pod_policy'
* change the value of 'update cleanup function
* update cleanup function
* update cleanup function
* check docker volumes
* check docker volumes
* stop the controller and restart it again to clean up
* update cleanup function
* update cleanup function
* update cleanup function
* separate e2e test for train api
* fix format
* fix parameter of namespace
* fix format
* reduce resources
* separate e2e test for train API
* remove go setup
* adjust the version of k8s
* move test file to new place
* fix typos
* rerun tests
* update install packages
* build and verify images of storage-intializer and trainer
* fix image build error
* fix image build error
* check disk space
* make 'setup-storage-initializer-and-trainer' executable
* separate step of loading images
* check disk space after loading image
* clean up and check disk space
* prune docker build cache
* prune docker build cache
* adjust sequence of building and loading images
* move working directory
* delete moving working directory
* fix format
* use 'docker system prune'
* make the format of the commands to be consistent
* update base image
* update base image
* update base image
* delete unnecessary space clear and check code
* merge e2e test for train api into integration tests
* check for timeout error
* fix name of trainer image
* fix env of building storage initializer image
* clean format
* skip e2e test for train API when use scheduling
* Update name of fileholder (Co-authored-by: Andrey Velichkevich <[email protected]>; Signed-off-by: Hezhi (Helen) Xie <[email protected]>)
* fix format
* separate e2e test for train API
* fix format
* move test script
* update path to test script
* update path to test script
* rerun tests
* rerun tests
* rerun tests
* update kubernetes version
* update kubernetes version
* rerun tests
* rerun tests
* adjust kubernetes version to 1.30.6
* adjust kubernetes version to 1.31.4
---------
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: Hezhi (Helen) Xie <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
* add e2e test for train API Signed-off-by: helenxie-bit <[email protected]> * fix peft import error Signed-off-by: helenxie-bit <[email protected]> * update settings of the job Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix error detection Signed-off-by: helenxie-bit <[email protected]> * resolve conflict Signed-off-by: helenxie-bit <[email protected]> * resolve conflict Signed-off-by: helenxie-bit <[email protected]> * resolve conflict Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix NoneType error Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * test bug Signed-off-by: helenxie-bit <[email protected]> * find bug Signed-off-by: helenxie-bit <[email protected]> * find bug Signed-off-by: helenxie-bit <[email protected]> * find bug Signed-off-by: helenxie-bit <[email protected]> * add storage_config Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * reduce pvc size Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * use gpu Signed-off-by: helenxie-bit <[email protected]> * use gpu Signed-off-by: helenxie-bit <[email protected]> * use gpu Signed-off-by: helenxie-bit <[email protected]> * fix 'set_device' error Signed-off-by: helenxie-bit <[email protected]> * add timeout error Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit 
<[email protected]> * fix typo Signed-off-by: helenxie-bit <[email protected]> * update e2e test for train api Signed-off-by: helenxie-bit <[email protected]> * add num_labels Signed-off-by: helenxie-bit <[email protected]> * update pip install Signed-off-by: helenxie-bit <[email protected]> * check disk space Signed-off-by: helenxie-bit <[email protected]> * change sequence of e2e tests Signed-off-by: helenxie-bit <[email protected]> * add clean-up after each e2e test of pytorchjob Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function-add check disk Signed-off-by: helenxie-bit <[email protected]> * check docker volumes Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * check docker directory Signed-off-by: helenxie-bit <[email protected]> * update pip install and 'num_workers' Signed-off-by: helenxie-bit <[email protected]> * update pip install and 'num_workers' Signed-off-by: helenxie-bit <[email protected]> * update pip install Signed-off-by: helenxie-bit <[email protected]> * change the value of 'clean_pod_policy' Signed-off-by: helenxie-bit <[email protected]> * change the value of 'update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * check docker volumes Signed-off-by: helenxie-bit <[email protected]> * check docker volumes Signed-off-by: helenxie-bit <[email protected]> * stop the controller and restart it again to clean up Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: 
helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * separate e2e test for train api Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix parameter of namespace Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * reduce resources Signed-off-by: helenxie-bit <[email protected]> * separate e2e test for train API Signed-off-by: helenxie-bit <[email protected]> * remove go setup Signed-off-by: helenxie-bit <[email protected]> * adjust the version of k8s Signed-off-by: helenxie-bit <[email protected]> * move test file to new place Signed-off-by: helenxie-bit <[email protected]> * fix typos Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * update install packages Signed-off-by: helenxie-bit <[email protected]> * build and verify images of storage-intializer and trainer Signed-off-by: helenxie-bit <[email protected]> * fix image build error Signed-off-by: helenxie-bit <[email protected]> * fix image build error Signed-off-by: helenxie-bit <[email protected]> * check disk space Signed-off-by: helenxie-bit <[email protected]> * make 'setup-storage-initializer-and-trainer' executable Signed-off-by: helenxie-bit <[email protected]> * separate step of loading images Signed-off-by: helenxie-bit <[email protected]> * check disk space after loading image Signed-off-by: helenxie-bit <[email protected]> * clean up and check disk space Signed-off-by: helenxie-bit <[email protected]> * prune docker build cache Signed-off-by: helenxie-bit <[email protected]> * prune docker build cache Signed-off-by: helenxie-bit <[email protected]> * adjust sequence of building and loading images Signed-off-by: helenxie-bit <[email protected]> * move working directory Signed-off-by: helenxie-bit <[email protected]> * delete moving working directory Signed-off-by: 
helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * use 'docker system prune' Signed-off-by: helenxie-bit <[email protected]> * make the format of the commands to be consistent Signed-off-by: helenxie-bit <[email protected]> * update base image Signed-off-by: helenxie-bit <[email protected]> * update base image Signed-off-by: helenxie-bit <[email protected]> * update base image Signed-off-by: helenxie-bit <[email protected]> * delete unnecessary space clear and check code Signed-off-by: helenxie-bit <[email protected]> * merge e2e test for train api into integration tests Signed-off-by: helenxie-bit <[email protected]> * check for timeout error Signed-off-by: helenxie-bit <[email protected]> * fix name of trainer image Signed-off-by: helenxie-bit <[email protected]> * fix env of building storage initializer image Signed-off-by: helenxie-bit <[email protected]> * clean format Signed-off-by: helenxie-bit <[email protected]> * skip e2e test for train API when use scheduling Signed-off-by: helenxie-bit <[email protected]> * Update name of fileholder Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Hezhi (Helen) Xie <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * separate e2e test for train API Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * move test script Signed-off-by: helenxie-bit <[email protected]> * update path to test script Signed-off-by: helenxie-bit <[email protected]> * update path to test script Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * update kubernetes version Signed-off-by: helenxie-bit <[email protected]> * update kubernetes version Signed-off-by: helenxie-bit <[email protected]> * rerun tests 
Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * adjust kubernetes version to 1.30.6 Signed-off-by: helenxie-bit <[email protected]> * adjust kubernetes version to 1.31.4 Signed-off-by: helenxie-bit <[email protected]> --------- Signed-off-by: helenxie-bit <[email protected]> Signed-off-by: Hezhi (Helen) Xie <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]>
* add e2e test for train API Signed-off-by: helenxie-bit <[email protected]> * fix peft import error Signed-off-by: helenxie-bit <[email protected]> * update settings of the job Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix error detection Signed-off-by: helenxie-bit <[email protected]> * resolve conflict Signed-off-by: helenxie-bit <[email protected]> * resolve conflict Signed-off-by: helenxie-bit <[email protected]> * resolve conflict Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix NoneType error Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * test bug Signed-off-by: helenxie-bit <[email protected]> * find bug Signed-off-by: helenxie-bit <[email protected]> * find bug Signed-off-by: helenxie-bit <[email protected]> * find bug Signed-off-by: helenxie-bit <[email protected]> * add storage_config Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * reduce pvc size Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * set storage_config Signed-off-by: helenxie-bit <[email protected]> * use gpu Signed-off-by: helenxie-bit <[email protected]> * use gpu Signed-off-by: helenxie-bit <[email protected]> * use gpu Signed-off-by: helenxie-bit <[email protected]> * fix 'set_device' error Signed-off-by: helenxie-bit <[email protected]> * add timeout error Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit 
<[email protected]> * fix typo Signed-off-by: helenxie-bit <[email protected]> * update e2e test for train api Signed-off-by: helenxie-bit <[email protected]> * add num_labels Signed-off-by: helenxie-bit <[email protected]> * update pip install Signed-off-by: helenxie-bit <[email protected]> * check disk space Signed-off-by: helenxie-bit <[email protected]> * change sequence of e2e tests Signed-off-by: helenxie-bit <[email protected]> * add clean-up after each e2e test of pytorchjob Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function-add check disk Signed-off-by: helenxie-bit <[email protected]> * check docker volumes Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * check docker directory Signed-off-by: helenxie-bit <[email protected]> * update pip install and 'num_workers' Signed-off-by: helenxie-bit <[email protected]> * update pip install and 'num_workers' Signed-off-by: helenxie-bit <[email protected]> * update pip install Signed-off-by: helenxie-bit <[email protected]> * change the value of 'clean_pod_policy' Signed-off-by: helenxie-bit <[email protected]> * change the value of 'update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * check docker volumes Signed-off-by: helenxie-bit <[email protected]> * check docker volumes Signed-off-by: helenxie-bit <[email protected]> * stop the controller and restart it again to clean up Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * update cleanup function Signed-off-by: 
helenxie-bit <[email protected]> * update cleanup function Signed-off-by: helenxie-bit <[email protected]> * separate e2e test for train api Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * fix parameter of namespace Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * reduce resources Signed-off-by: helenxie-bit <[email protected]> * separate e2e test for train API Signed-off-by: helenxie-bit <[email protected]> * remove go setup Signed-off-by: helenxie-bit <[email protected]> * adjust the version of k8s Signed-off-by: helenxie-bit <[email protected]> * move test file to new place Signed-off-by: helenxie-bit <[email protected]> * fix typos Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * update install packages Signed-off-by: helenxie-bit <[email protected]> * build and verify images of storage-intializer and trainer Signed-off-by: helenxie-bit <[email protected]> * fix image build error Signed-off-by: helenxie-bit <[email protected]> * fix image build error Signed-off-by: helenxie-bit <[email protected]> * check disk space Signed-off-by: helenxie-bit <[email protected]> * make 'setup-storage-initializer-and-trainer' executable Signed-off-by: helenxie-bit <[email protected]> * separate step of loading images Signed-off-by: helenxie-bit <[email protected]> * check disk space after loading image Signed-off-by: helenxie-bit <[email protected]> * clean up and check disk space Signed-off-by: helenxie-bit <[email protected]> * prune docker build cache Signed-off-by: helenxie-bit <[email protected]> * prune docker build cache Signed-off-by: helenxie-bit <[email protected]> * adjust sequence of building and loading images Signed-off-by: helenxie-bit <[email protected]> * move working directory Signed-off-by: helenxie-bit <[email protected]> * delete moving working directory Signed-off-by: 
helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * use 'docker system prune' Signed-off-by: helenxie-bit <[email protected]> * make the format of the commands to be consistent Signed-off-by: helenxie-bit <[email protected]> * update base image Signed-off-by: helenxie-bit <[email protected]> * update base image Signed-off-by: helenxie-bit <[email protected]> * update base image Signed-off-by: helenxie-bit <[email protected]> * delete unnecessary space clear and check code Signed-off-by: helenxie-bit <[email protected]> * merge e2e test for train api into integration tests Signed-off-by: helenxie-bit <[email protected]> * check for timeout error Signed-off-by: helenxie-bit <[email protected]> * fix name of trainer image Signed-off-by: helenxie-bit <[email protected]> * fix env of building storage initializer image Signed-off-by: helenxie-bit <[email protected]> * clean format Signed-off-by: helenxie-bit <[email protected]> * skip e2e test for train API when use scheduling Signed-off-by: helenxie-bit <[email protected]> * Update name of fileholder Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Hezhi (Helen) Xie <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * separate e2e test for train API Signed-off-by: helenxie-bit <[email protected]> * fix format Signed-off-by: helenxie-bit <[email protected]> * move test script Signed-off-by: helenxie-bit <[email protected]> * update path to test script Signed-off-by: helenxie-bit <[email protected]> * update path to test script Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * update kubernetes version Signed-off-by: helenxie-bit <[email protected]> * update kubernetes version Signed-off-by: helenxie-bit <[email protected]> * rerun tests 
Signed-off-by: helenxie-bit <[email protected]> * rerun tests Signed-off-by: helenxie-bit <[email protected]> * adjust kubernetes version to 1.30.6 Signed-off-by: helenxie-bit <[email protected]> * adjust kubernetes version to 1.31.4 Signed-off-by: helenxie-bit <[email protected]> --------- Signed-off-by: helenxie-bit <[email protected]> Signed-off-by: Hezhi (Helen) Xie <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]>
What this PR does / why we need it:
Add an e2e test in test_e2e_train_api.py for the train API.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #
Checklist: