-
Notifications
You must be signed in to change notification settings - Fork 91
feat(truss): support training checkpoints in truss-transfer #2048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| S3, | ||
| #[serde(rename = "azure")] | ||
| Azure, | ||
| #[serde(rename = "baseten_training")] | ||
| BasetenTraining, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parse baseten_training repos from Python bindings
A new ResolutionType::BasetenTraining variant was introduced, but the Python TryFrom<PyModelRepo> matcher in bindings.rs still only accepts http, gcs, s3, or azure. Passing kind="baseten_training" to create_basetenpointer_from_models will therefore still raise Unknown kind and prevent creating manifests for training checkpoints from Python callers. Please extend the binding match to include the new variant.
Useful? React with 👍 / 👎.
| &destination, | ||
| pointer.size, | ||
| &pointer.runtime_secret_name, | ||
| false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Training checkpoints never read training secret path on download
The S3 download path now takes a use_training_secrets flag, but download_file_with_cache always passes false (lines 114–118), so download_s3_to_path will read credentials from /secrets even for manifests generated from training buckets whose AWS keys live under /aws-secrets. Downloading a Baseten training checkpoint will therefore fail with "Failed to read AWS credentials from not existing file" unless the secrets are duplicated in /secrets. The download branch needs to propagate the training flag when the pointer originates from a training source.
Useful? React with 👍 / 👎.
|
@rcano-baseten just on mobile, but a bit confused. Isn’t all you want a Union of secret \ access key? |
|
@rcano-baseten I think its complicated, and does not even stop there. Wiring tough training_secret_bool: true/false seems to complicated, and creates weird side effects, like you need to have s3 to use training, and access key is allowed. I opened a separate PR that just generally allows Additionally, I noticed s3 bucket always requires a region, otherwise it will default to us-east. I opened a small pr here: |
| # This file is automatically @generated by Cargo. | ||
| # It is not intended for manual editing. | ||
| version = 4 | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not be updated!
🚀 What
Supports training checkpoints in truss transfer as a source
💻 How
🔬 Testing