#206 - Sync nvlink feature branch with main #472
Merged
ecolternv merged 35 commits into feature/nvlink-support from Feb 23, 2026
* Add subagents to help debug CI
* Pin digests in Github actions
* Add safe Bazel and workspace cleanup to ci-internal
  Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk.
  Changes:
  - Add Bazel cache cleanup to prevent unbounded local cache growth
  - Add workspace cleanup for pytest and temporary files
  - Keep concurrency control per PR
  - Rely on Testcontainers for Docker resource cleanup
  Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
* Add 30-minute timeout to ci-internal job
  Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner.
  Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
* Use bazel clean --expunge to prevent unbounded cache growth
  Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes.
  Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
* Add resource limits to Docker-in-Docker service
  Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources.
  Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
* Tune Bazel CI config
* Testcontainers resource limit
* Clean up testcontainers networkedcontainer list
* Shutdown bazel at the end of job
* Close docker client in test utils
* SandboxedWorker shutdown in tests
* Add docker clean up
* Add clean up
* Add node dep
* Add docker deps
* Use the right image
* Tune bazel in CI
* Remove golang.org/x/crypto from root module
* Pylint suppress
* Fix redis closure in tests
* Fix jinja_sandbox test
* Clean up in jinja_sandbox test
* Fix jinja_sandbox test and lint
* Enhance cleanup
* Fix pr-checks yaml
---------
Co-authored-by: Claude Sonnet 4.5 <[email protected]>
* Handle non-aws s3-compatible data-auth
* Remove OSMO_SKIP_DATA_AUTH from quickstart
* Method rename
* Fix tests

* add envoy lua filter for id_token refresh
* add filter to service and router
* add validate token
* only remove auth cookies
* allow user configuration

* Improve Concurrent Log Upload
* Fix lint
* Remove fsync

* Compress Backend Job when Service sends to Backend Worker
* Fix lint
* Fix lint
* address comments

* Service Config History and Editor
* Add yup and react-hook-forms

* Physical AI Workflow Series: Nut Pouring
* Update copyright and whitespaces
* Remove internal
* Update mimic generation workflow with Swift storage and file injection
  - Add nutpour_gr1t2_base_env_cfg.py for GR1T2 nut pouring task
  - Configure Swift input/output storage for datasets
  - Add MimicGen dataset generation command with 1000 trials
  - Inject custom environment config via file injection
* Minor cleanup
* Add starting dataset to instructions
* Update top level readme
* Add copyright
* add copyright
---------
Co-authored-by: Saurav Nanda <[email protected]>
The CronJob volume was referencing the ConfigMap as
'{{backend_name}}-{{test_name}}-config' but the actual ConfigMap is
created with the name '{{configmap_name}}' (which is '{test_name}-config').
This mismatch would cause the CronJob to fail to mount the ConfigMap.
Use the configmap_name template variable that is already passed to the
Jinja context to ensure consistency.
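The naming mismatch described in this commit message can be shown with a small sketch. It is not the project's template code: plain f-strings stand in for Jinja rendering, and the helper names and sample values (`old_volume_ref`, `backend-a`, `smoke`) are hypothetical.

```python
# Hypothetical sketch: f-strings stand in for the Jinja templates.

def configmap_name(test_name: str) -> str:
    """Name the ConfigMap is created with: '<test_name>-config'."""
    return f"{test_name}-config"


def old_volume_ref(backend_name: str, test_name: str) -> str:
    """Name the CronJob volume referenced before the fix."""
    return f"{backend_name}-{test_name}-config"


def new_volume_ref(context: dict) -> str:
    """After the fix: reuse the configmap_name already present in the context."""
    return context["configmap_name"]


# Sample values are made up for illustration.
context = {
    "backend_name": "backend-a",
    "test_name": "smoke",
    "configmap_name": configmap_name("smoke"),
}

created = configmap_name(context["test_name"])
before = old_volume_ref(context["backend_name"], context["test_name"])
after = new_volume_ref(context)

print(before == created)  # the old reference does not match the created name
print(after == created)   # the new reference matches it
```

Deriving the volume reference from the same variable that names the ConfigMap means the two cannot drift apart if the naming scheme changes later.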
* Add override_url to data credentials and remove cache_config
* Update local run and quickstart
* Make sure local dev works
* Revert unnecessary changes

* Update from workflows/ to cookbook/
* Update README
* Update links from workflows/ to cookbook/
* Update paths
* Update README
* Revert to pass link test

* Update links to cookbook instead of workflows in Github
* Update more links
* Sync main into feature/PROJ-148-user-mapping (#375) * fix: update extraArgs to render string properly (#362) * Remove auth router in agent service (#371) * Various fixes to stabilize GitHub actions on self-hosted nodes (#366) * Add subagents to help debug CI * Pin digests in Github actions * Add safe Bazel and workspace cleanup to ci-internal Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk. Changes: - Add Bazel cache cleanup to prevent unbounded local cache growth - Add workspace cleanup for pytest and temporary files - Keep concurrency control per PR - Rely on Testcontainers for Docker resource cleanup Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Add 30-minute timeout to ci-internal job Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Use bazel clean --expunge to prevent unbounded cache growth Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Add resource limits to Docker-in-Docker service Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources. 
Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Tune Bazel CI config * Testcontainers resource limit * Clean up testcontainers networkedcontainer list * Shutdown bazel at the end of job * Close docker client in test utils * SandboxedWorker shutdown in tests * Add docker clean up * Add clean up * Add node dep * Add docker deps * Use the right image * Tune bazel in CI * Remove golang.org/x/crypto from root module * Pylint suppress * Fix redis closure in tests * Fix jinja_sandbox test * Clean up in jinja_sandbox test * Fix jinja_sandbox test and lint * Enhance cleanup * Fix pr-checks yaml --------- Co-authored-by: Claude Sonnet 4.5 <[email protected]> --------- Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: RyaliNvidia <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> * #294 #295 - Add user table and user-role mapping (#373) --------- Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> * #339 - Add documentation for creating PAT and service accounts (#395) * Rename pat (#428) * #403 - Set optional default admin during service creation + fix PAT wording (#404) * use base64 for access tokens (#437) * Merge main into feature (#441) * Support Non-AWS S3 storage without environment variables (#421) * Add override_url to data credentials and remove cache_config * Update local run and quickstart * Make sure local dev works * Revert unnecessary changes * Update from workflows/ to cookbook/ (#419) * Update from workflows/ to cookbook/ * Update README * Update links from workflows/ to cookbook/ * Update paths * Update README * Revert to pass link test * Update links to cookbook instead of workflows in Github (#432) * Update links to cookbook instead of workflows in Github * Update more links --------- Co-authored-by: Fernando L <[email protected]> Co-authored-by: ethany-nv <[email 
protected]> * Modify profile list api to show role info (#442) --------- Co-authored-by: OSMO CI Bot <[email protected]> Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> Co-authored-by: ethany-nv <[email protected]>
* design for oauth2 proxy
* format
* update design after POC
* update format
* update design per envoy / UI changes

- Add group templates as a new type of config
- Allow group templates to be assigned to pools
- When a workflow is submitted, group templates will be instantiated per group, and when it is completed, they will be destroyed
- Cleanup logic for creating kubernetes objects in the backend by using kubernetes dynamic client.
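The submit/complete lifecycle in the third bullet can be sketched roughly as follows. Everything here is an assumption made for illustration: `FakeCluster` is an in-memory stand-in rather than the Kubernetes dynamic client, and the object naming scheme and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FakeCluster:
    """In-memory stand-in for a cluster API (hypothetical, not a real client)."""
    objects: Set[str] = field(default_factory=set)

    def create(self, name: str) -> None:
        self.objects.add(name)

    def delete(self, name: str) -> None:
        self.objects.discard(name)


def on_workflow_submitted(cluster: FakeCluster, workflow_id: str,
                          groups: List[str], templates: List[str]) -> List[str]:
    """Instantiate every group template once per group; return created names."""
    names = [f"{workflow_id}-{group}-{template}"  # naming scheme is assumed
             for group in groups for template in templates]
    for name in names:
        cluster.create(name)
    return names


def on_workflow_completed(cluster: FakeCluster, names: List[str]) -> None:
    """Destroy the objects that were created for this workflow."""
    for name in names:
        cluster.delete(name)


cluster = FakeCluster()
created = on_workflow_submitted(cluster, "wf-1", ["g1", "g2"], ["tpl-a"])
count_while_running = len(cluster.objects)
on_workflow_completed(cluster, created)
count_after_completion = len(cluster.objects)
```

Returning the created names from the submit step keeps teardown limited to exactly what this workflow created.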
* Sync main into feature/PROJ-148-auth-rework (#258) * allow flexible squid proxy replicas (#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress writing after upload * Add dependency in Bazel BUILD * Add type to mypy requirements * Update mypy requirements * Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor count * #244 - Use last login url if url is not specified (#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (#249) --------- Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: ethany-nv <[email protected]> Co-authored-by: RyaliNvidia <[email protected]> Co-authored-by: patclarknvidia <[email protected]> * * Add authz sidecar service with Go implementation This commit adds the authorization sidecar service including: - Go-based authz server implementing Envoy External Authorization - PostgreSQL client for role/policy storage - Role caching for performance optimization - Action registry for path-to-action mapping - Comprehensive test suite - Python test service for integration testing - Documentation and quickstart guide * * Begin resource action model * Server validates both legacy and new * Update logic for action registry * Sync main into feature/PROJ-148-auth-rework (#298) * allow flexible squid proxy replicas (#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress writing after upload * Add dependency in Bazel BUILD * Add type to mypy requirements * Update mypy requirements 
* Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor count * #244 - Use last login url if url is not specified (#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (#249) * sync-feature-branches: fix no conflict case, allow single branch to be synced (#252) * Fix sync-feature-branches with no merge conflicts * Allow a single branch to be specified for sync-feature-branches * Perform operations as OSMO CI Bot * Add external label when the PR is created * extract issue number * add test cases (#247) * Allow PR checks to run on release branches (#264) * Database Pooling in Postgres Singleton Across Services (#251) * Initial commit for database pooling * Update set_session * Fix lint * Update PostgresConnector to have semaphor to control connections * Lint fix * Fix number of maxconn for test * Address comments * Add Go Postgres utils (#272) * #148 - Auth Project Design Documents (#165) * add args to postgres (#282) * #267 - cloud deployment scripts (#268) * script to create azure resources and deploy * Remove auto-generated values files from tracking - Added .gitignore to ignore values/, *.env files - Removed values/*.yaml files from git (auto-generated during deployment) * add aws script * add aws script * add copyright * update copyright * Support for Azure workload identity in AKS and Arc clusters (#141) * feat(src): add Azure service account and extra pod labels configuration - implement service account creation with customizable name and annotations - enhance service templates to support extra pod labels for various services - update Azure backend to utilize DefaultAzureCredential for authentication - add tests for Azure credential extraction and client creation * feat(src): extract account key from connection string for Azure Blob Storage - 
add function to extract AccountKey from connection string - update AzureBlobStorageClient to handle different credential types * feat(test): add tests for account key extraction from Azure connection strings * chore: clean up linting issues for tests * refactor(src): update data credential types in PostgresConnector and TaskGroup - change StaticDataCredential to DataCredential in get_all_data_creds method - update fetch_creds function signature to use DataCredential * feat(src): update Azure client creation to include storage account and account URL - remove deprecated storage account extraction function - modify create_client to accept storage_account and account_url parameters - update AzureBlobStorageClientFactory to use new parameters - adjust tests to reflect changes in client creation 🔒 - Generated by Copilot * refactor(src): mark storage_account parameter as unused in create_client function 🔧 - Generated by Copilot * refactor(src): remove unused storage_account parameter from client creation 🔧 - Generated by Copilot * Fix conflicts --------- Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: ethany-nv <[email protected]> Co-authored-by: RyaliNvidia <[email protected]> Co-authored-by: patclarknvidia <[email protected]> Co-authored-by: Ethan Look-Potts <[email protected]> Co-authored-by: xutongNV <[email protected]> Co-authored-by: Allen Greaves <[email protected]> * Remove action permissions from pool config (#307) * Sync main into feature/PROJ-148-auth-rework (#322) * allow flexible squid proxy replicas (#241) * allow flexible squid proxy replicas * fix * Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167) * Improving Performance for Uploading Workflow Artifacts in Worker Jobs * Cleanup * Add progress writing after upload * Add dependency in Bazel BUILD * Add type to mypy requirements * Update mypy requirements * Add to mypy_cli BUILD * Fix lint * Comment * Use constant to define semaphor and storage client executor 
count * #244 - Use last login url if url is not specified (#245) * Use last login url if url is not specified * print message * Cannot select any text inside modals or slideouts (#248) * Video html element not changin when selecting different video files in the UI for OSMO dataset (#249) * sync-feature-branches: fix no conflict case, allow single branch to be synced (#252) * Fix sync-feature-branches with no merge conflicts * Allow a single branch to be specified for sync-feature-branches * Perform operations as OSMO CI Bot * Add external label when the PR is created * extract issue number * add test cases (#247) * Allow PR checks to run on release branches (#264) * Database Pooling in Postgres Singleton Across Services (#251) * Initial commit for database pooling * Update set_session * Fix lint * Update PostgresConnector to have semaphor to control connections * Lint fix * Fix number of maxconn for test * Address comments * Add Go Postgres utils (#272) * #148 - Auth Project Design Documents (#165) * add args to postgres (#282) * #267 - cloud deployment scripts (#268) * script to create azure resources and deploy * Remove auto-generated values files from tracking - Added .gitignore to ignore values/, *.env files - Removed values/*.yaml files from git (auto-generated during deployment) * add aws script * add aws script * add copyright * update copyright * Support for Azure workload identity in AKS and Arc clusters (#141) * feat(src): add Azure service account and extra pod labels configuration - implement service account creation with customizable name and annotations - enhance service templates to support extra pod labels for various services - update Azure backend to utilize DefaultAzureCredential for authentication - add tests for Azure credential extraction and client creation * feat(src): extract account key from connection string for Azure Blob Storage - add function to extract AccountKey from connection string - update AzureBlobStorageClient to handle 
different credential types * feat(test): add tests for account key extraction from Azure connection strings * chore: clean up linting issues for tests * refactor(src): update data credential types in PostgresConnector and TaskGroup - change StaticDataCredential to DataCredential in get_all_data_creds method - update fetch_creds function signature to use DataCredential * feat(src): update Azure client creation to include storage account and account URL - remove deprecated storage account extraction function - modify create_client to accept storage_account and account_url parameters - update AzureBlobStorageClientFactory to use new parameters - adjust tests to reflect changes in client creation 🔒 - Generated by Copilot * refactor(src): mark storage_account parameter as unused in create_client function 🔧 - Generated by Copilot * refactor(src): remove unused storage_account parameter from client creation 🔧 - Generated by Copilot * Add new project proposal to describe nvlink + topology aware scheduling (#211) * Add new project proposal to describe nvlink + topology aware scheduling * Split design into two docs * Finish docs and add some updates from feedback * Add some open items * OSMO-6044: Application error when closing Task Details after switching Events view from Task to Workflow (#315) * add redis utlis, update postgres utils (#313) * add redis utlis, update postgres utils * add deps * Fix missing seperator in the test runner roles (#320) * fix * remove * fix --------- Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: ethany-nv <[email protected]> Co-authored-by: RyaliNvidia <[email protected]> Co-authored-by: patclarknvidia <[email protected]> Co-authored-by: Ethan Look-Potts <[email protected]> Co-authored-by: xutongNV <[email protected]> Co-authored-by: Allen Greaves <[email protected]> Co-authored-by: ecolternv <[email protected]> Co-authored-by: tdewanNvidia <[email protected]> * Connect envoy with authz sidecar (#319) * Connect the authz sidecar 
to envoy * update sidecar * fix typo * add extra env * uncomment * #290 - Add attribute fetching for workflow pool matching (#338) * update * fix * fix * Merge main into feature branch (#452) * fix: pass node_condition_prefix to backend-worker deployment (#448) * #148 - Add user mapping into OSMO (#418) * Sync main into feature/PROJ-148-user-mapping (#375) * fix: update extraArgs to render string properly (#362) * Remove auth router in agent service (#371) * Various fixes to stabilize GitHub actions on self-hosted nodes (#366) * Add subagents to help debug CI * Pin digests in Github actions * Add safe Bazel and workspace cleanup to ci-internal Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk. Changes: - Add Bazel cache cleanup to prevent unbounded local cache growth - Add workspace cleanup for pytest and temporary files - Keep concurrency control per PR - Rely on Testcontainers for Docker resource cleanup Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Add 30-minute timeout to ci-internal job Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Use bazel clean --expunge to prevent unbounded cache growth Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Add resource limits to Docker-in-Docker service Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. 
These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Tune Bazel CI config * Testcontainers resource limit * Clean up testcontainers networkedcontainer list * Shutdown bazel at the end of job * Close docker client in test utils * SandboxedWorker shutdown in tests * Add docker clean up * Add clean up * Add node dep * Add docker deps * Use the right image * Tune bazel in CI * Remove golang.org/x/crypto from root module * Pylint suppress * Fix redis closure in tests * Fix jinja_sandbox test * Clean up in jinja_sandbox test * Fix jinja_sandbox test and lint * Enhance cleanup * Fix pr-checks yaml --------- Co-authored-by: Claude Sonnet 4.5 <[email protected]> --------- Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: RyaliNvidia <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> * #294 #295 - Add user table and user-role mapping (#373) --------- Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> * #339 - Add documentation for creating PAT and service accounts (#395) * Rename pat (#428) * #403 - Set optional default admin during service creation + fix PAT wording (#404) * use base64 for access tokens (#437) * Merge main into feature (#441) * Support Non-AWS S3 storage without environment variables (#421) * Add override_url to data credentials and remove cache_config * Update local run and quickstart * Make sure local dev works * Revert unnecessary changes * Update from workflows/ to cookbook/ (#419) * Update from workflows/ to cookbook/ * Update README * Update links from workflows/ to cookbook/ * Update paths * Update README * Revert to pass link test * Update links to cookbook instead of workflows in Github (#432) * 
Update links to cookbook instead of workflows in Github * Update more links --------- Co-authored-by: Fernando L <[email protected]> Co-authored-by: ethany-nv <[email protected]> * Modify profile list api to show role info (#442) --------- Co-authored-by: OSMO CI Bot <[email protected]> Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> Co-authored-by: ethany-nv <[email protected]> * Revert "Remove auth router in agent service (#371)" (#390) This reverts commit fac90f2. * Merge main into feature --------- Co-authored-by: tdewanNvidia <[email protected]> Co-authored-by: OSMO CI Bot <[email protected]> Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> Co-authored-by: ethany-nv <[email protected]> * #407 - Authz sidecar sends accessible pools to service (#455) * update cache specification * lint * authz sidecar sends info to service * Add logging to the go code * comments * Merge main into feature (#463) * fix: pass node_condition_prefix to backend-worker deployment (#448) * #148 - Add user mapping into OSMO (#418) * Sync main into feature/PROJ-148-user-mapping (#375) * fix: update extraArgs to render string properly (#362) * Remove auth router in agent service (#371) * Various fixes to stabilize GitHub actions on self-hosted nodes (#366) * Add subagents to help debug CI * Pin digests in Github actions * Add safe Bazel and workspace cleanup to ci-internal Add filesystem cleanup steps that don't interfere with concurrent jobs. Testcontainers handles Docker resource cleanup automatically via ryuk. 
Changes: - Add Bazel cache cleanup to prevent unbounded local cache growth - Add workspace cleanup for pytest and temporary files - Keep concurrency control per PR - Rely on Testcontainers for Docker resource cleanup Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Add 30-minute timeout to ci-internal job Prevent jobs from running indefinitely if tests hang. 30 minutes provides sufficient time for normal test execution while ensuring hung jobs don't block the runner. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Use bazel clean --expunge to prevent unbounded cache growth Change from 'bazel clean' to 'bazel clean --expunge' to remove repository cache in addition to build outputs. This prevents unbounded growth of external dependencies on self-hosted runners. Uses synchronous --expunge (not --expunge_async) since we're in a Docker container that will terminate after the job completes. Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Add resource limits to Docker-in-Docker service Limit DinD service to 4GB memory and 2 CPUs to prevent runaway resource consumption and OOM conditions on self-hosted runners. These limits provide sufficient resources for Testcontainers while leaving headroom for the main job container and preventing memory leaks from exhausting runner resources. 
Co-Authored-By: Claude Sonnet 4.5 <[email protected]> * Tune Bazel CI config * Testcontainers resource limit * Clean up testcontainers networkedcontainer list * Shutdown bazel at the end of job * Close docker client in test utils * SandboxedWorker shutdown in tests * Add docker clean up * Add clean up * Add node dep * Add docker deps * Use the right image * Tune bazel in CI * Remove golang.org/x/crypto from root module * Pylint suppress * Fix redis closure in tests * Fix jinja_sandbox test * Clean up in jinja_sandbox test * Fix jinja_sandbox test and lint * Enhance cleanup * Fix pr-checks yaml --------- Co-authored-by: Claude Sonnet 4.5 <[email protected]> --------- Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: RyaliNvidia <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> * #294 #295 - Add user table and user-role mapping (#373) --------- Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> * #339 - Add documentation for creating PAT and service accounts (#395) * Rename pat (#428) * #403 - Set optional default admin during service creation + fix PAT wording (#404) * use base64 for access tokens (#437) * Merge main into feature (#441) * Support Non-AWS S3 storage without environment variables (#421) * Add override_url to data credentials and remove cache_config * Update local run and quickstart * Make sure local dev works * Revert unnecessary changes * Update from workflows/ to cookbook/ (#419) * Update from workflows/ to cookbook/ * Update README * Update links from workflows/ to cookbook/ * Update paths * Update README * Revert to pass link test * Update links to cookbook instead of workflows in Github (#432) * Update links to cookbook instead of workflows in Github * Update more links --------- Co-authored-by: Fernando L <[email protected]> Co-authored-by: ethany-nv <[email 
protected]> * Modify profile list api to show role info (#442) --------- Co-authored-by: OSMO CI Bot <[email protected]> Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> Co-authored-by: ethany-nv <[email protected]> * Add override_url when forwarding default_credential to StaticDataCredential (#444) * Oauth2 proxy design (#436) * design for oauth2 proxy * format * update design after POC * update format * update design per envoy / UI changes * #356 - Group Template Implementation (#454) - Add group templates as a new type of config - Allow group templates to be assigned to pools - When a workflow is submitted, group templates will be instantiated per group, and when it is completed, they will be destroyed - Cleanup logic for creating kubernetes objects in the backend by using kubernetes dynamic client. * Client install location is determined by service (#447) * Revert "Remove auth router in agent service (#371)" (#390) This reverts commit fac90f2. * dupe * remove --------- Co-authored-by: tdewanNvidia <[email protected]> Co-authored-by: OSMO CI Bot <[email protected]> Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]> Co-authored-by: ethany-nv <[email protected]> Co-authored-by: ecolternv <[email protected]> * Revert "Remove auth router in agent service (#371)" (#390) This reverts commit fac90f2. * Add new fields in the envoy logs (#466) * Revert "#356 - Group Template Implementation (#454)" (#459) This reverts commit a483a88. 
* fix --------- Co-authored-by: OSMO CI Bot <[email protected]> Co-authored-by: Vivian Pan <[email protected]> Co-authored-by: ethany-nv <[email protected]> Co-authored-by: patclarknvidia <[email protected]> Co-authored-by: Ethan Look-Potts <[email protected]> Co-authored-by: xutongNV <[email protected]> Co-authored-by: Allen Greaves <[email protected]> Co-authored-by: ecolternv <[email protected]> Co-authored-by: tdewanNvidia <[email protected]> Co-authored-by: Fernando L <[email protected]> Co-authored-by: Claude Sonnet 4.5 <[email protected]>
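One of the squashed commits above adds a function to extract the AccountKey from an Azure Blob Storage connection string. A minimal sketch of that kind of parsing is shown below; the function name and the sample string are hypothetical, and the project's actual implementation may differ.

```python
from typing import Optional


def extract_account_key(connection_string: str) -> Optional[str]:
    """Return the AccountKey value from a 'Name=Value;Name=Value' string, or None."""
    for part in connection_string.split(";"):
        # Split on the first '=' only: base64-encoded keys can end in '=' padding.
        name, separator, value = part.partition("=")
        if separator and name.strip() == "AccountKey":
            return value
    return None


# Made-up sample; the key is a placeholder, not a real credential.
sample = (
    "DefaultEndpointsProtocol=https;AccountName=example;"
    "AccountKey=ZmFrZS1rZXk=;EndpointSuffix=core.windows.net"
)
key = extract_account_key(sample)
missing = extract_account_key("AccountName=example")
```

Using `partition` instead of `split("=")` matters here because a naive split would cut off the trailing padding of the key.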
* chore: set appVersion to 6.1 across all helm charts
* chore: bump helm chart versions to 1.1.0
cypres approved these changes on Feb 23, 2026
elookpotts-nvidia approved these changes on Feb 23, 2026
Description
Sync nvlink feature branch with the latest from main
Issue #206
Checklist