feat(backend): Replace MLMD with KFP Server APIs #12430
base: mlmd-removal
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
48b441e to bac691f
Upgrade Test failures are expected until we add migration logic (to follow this PR). Note also that UI changes are not included here; those too will follow this PR.
First off, this is amazing! Not sure where you find the time 😂 A couple of questions, because this overlaps with an area of interest. My understanding is that this PR reports / updates the status of tasks (components) directly from the launcher, such as here. So to check my understanding, this means that we are moving completely away from the persistence agent, correct? I have been running into issues with the persistence agent at scale and with short-lived workflows, so I am excited about new approaches. Secondly, I see the added RPCs to update task state. Are these the counterpart to the ones used by the V1 persistence agent to populate …
Insanely impressive, @HumairAK! I look forward to going through it in-depth. Please let us know if there are any specific areas you want us to sequence first / prioritize with our reviews.
^ This will be critical for existing workloads.
For your first point, the PA is still required to report the overall status of the Run. It monitors the Argo Workflow resource, and we still need this to report failures not encountered during driver/launcher runs (e.g. pod scheduling failures). So we still require external monitoring of a run. I will also be moving the status propagation logic into the API server in this PR after some offline discussions with Matt/Nelesh. For your second point, the tasks table in v1 is being removed; it is only used for caching today and is not utilized by any other APIs. It is a bit abused, and part of an incomplete implementation of a different approach intended by previous maintainers. As such, this change will be part of the next KFP major version bump (3.0). All the data required for KFP runs in the tasks table is persisted in MLMD, and we can use this for migration (namely just the cache fingerprints).
@droctothorpe as per our discussion today, I would suggest you review the higher-level changes first, e.g. proto files, GORM models, authorization, and related changes, with consideration for things like migration, etc.
HumairAK
left a comment
```proto
message InputOutputs {
  message IOParameter {
    google.protobuf.Value value = 1;
```
Confirm whether MLMD parameter storage had any size restrictions; if so, we should continue to validate that restriction.
After conferring with a few LLMs, the only restriction seems to be on the DB schemas.
MLMD stores custom properties as a row per key/value pair. Here is an example schema for execution:
```sql
CREATE TABLE `ExecutionProperty` (
`execution_id` int NOT NULL,
`name` varchar(255) NOT NULL,
`is_custom_property` tinyint(1) NOT NULL,
`int_value` int DEFAULT NULL,
`double_value` double DEFAULT NULL,
`string_value` mediumtext,
`byte_value` mediumblob,
`proto_value` mediumblob,
`bool_value` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`execution_id`,`name`,`is_custom_property`),
KEY `idx_execution_property_int` (`name`,`is_custom_property`,`int_value`),
KEY `idx_execution_property_double` (`name`,`is_custom_property`,`double_value`),
KEY `idx_execution_property_string` (`name`,`is_custom_property`,`string_value`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
```

We use a `type:Json` column for storing all custom properties, which doesn't have any size restrictions. The keys in MLMD have a 255-character limit; we could enforce that, but I don't think it's necessary, though I don't have a strong opinion here.
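For illustration only, a minimal sketch of what a single JSON properties column could look like on the Go side. The type and method names here are made up for the example (not the actual models in this PR), and the optional key check mirrors MLMD's varchar(255) limit only if we decide to keep it:

```go
package model

import (
	"encoding/json"
	"fmt"
)

// TaskProperties is an illustrative stand-in for a single JSON column that
// replaces MLMD's row-per-property ExecutionProperty storage.
type TaskProperties map[string]any

// Validate optionally enforces MLMD's old 255-character key limit; values are
// unrestricted because the whole map is stored as JSON.
func (p TaskProperties) Validate(enforceKeyLimit bool) error {
	if !enforceKeyLimit {
		return nil
	}
	for k := range p {
		if len(k) > 255 {
			return fmt.Errorf("property key %q exceeds 255 characters", k)
		}
	}
	return nil
}

// ToJSON serializes the properties for storage in a single column.
func (p TaskProperties) ToJSON() ([]byte, error) {
	return json.Marshal(p)
}
```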
```proto
rpc UpdateTasksBulk(UpdateTasksBulkRequest) returns (UpdateTasksBulkResponse) {
  option (google.api.http) = {
    post: "/apis/v2beta1/tasks:batchUpdate"
    body: "*"
  };
  option (grpc.gateway.protoc_gen_openapiv2.options.openapiv2_operation) = {
    operation_id: "batch_update_tasks"
    summary: "Updates multiple tasks in bulk."
    tags: "RunService"
  };
}
```
Get rid of bulk operations, make individual calls to update status from the launcher/driver, and move status/artifact propagation into the API server. There is concern around race conditions, so we will need to update tasks in this order for an update-task request:
- Update the task
- Fetch the run
- Propagate statuses up the DAG
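A rough sketch of that ordering as it could look inside run_server.go; the resource-manager UpdateTask signature and the propagation helper are placeholders, not the actual implementation in this PR:

```go
// updateTaskAndPropagate illustrates the ordering above: update the task,
// fetch the parent run, then propagate state changes up the DAG. The
// resourceManager.UpdateTask signature and the propagateStatusesUpDAG helper
// are placeholders, not the real implementation.
func (s *RunServer) updateTaskAndPropagate(ctx context.Context, request *apiv2beta1.UpdateTaskRequest) error {
	// 1. Update the task record itself.
	task, err := s.resourceManager.UpdateTask(request)
	if err != nil {
		return util.Wrap(err, "Failed to update task")
	}
	// 2. Fetch the parent run so we know which DAG to walk.
	run, err := s.resourceManager.GetRun(task.RunUUID)
	if err != nil {
		return util.Wrap(err, "Failed to get run for status propagation")
	}
	// 3. Propagate statuses up the DAG, ending with the run itself.
	return s.propagateStatusesUpDAG(ctx, run, task)
}
```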
So is UpdateTasksBulk to be removed?
Instead of aggregating to default roles, create a new SA for the driver/launcher to use when making calls to the API server. Have sync.py ensure this SA and the required RBAC are created in Kubeflow profiles.
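As a rough illustration of the driver/launcher side of this, a sketch that reads the pod's projected service-account token and attaches it as a bearer token on API-server calls; the transport wiring and header usage are assumptions for the example, not the apiclient implementation in this PR:

```go
package apiclient

import (
	"fmt"
	"net/http"
	"os"
	"strings"
)

// saTokenPath is the standard projected service-account token location in a pod.
const saTokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// bearerTransport injects the dedicated driver/launcher SA token into every
// request sent to the KFP API server. The header choice is an assumption.
type bearerTransport struct {
	base  http.RoundTripper
	token string
}

func (t *bearerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req.Header.Set("Authorization", "Bearer "+t.token)
	return t.base.RoundTrip(req)
}

// newAuthenticatedClient builds an HTTP client that authenticates as the
// dedicated service account proposed above.
func newAuthenticatedClient() (*http.Client, error) {
	raw, err := os.ReadFile(saTokenPath)
	if err != nil {
		return nil, fmt.Errorf("read SA token: %w", err)
	}
	return &http.Client{
		Transport: &bearerTransport{base: http.DefaultTransport, token: strings.TrimSpace(string(raw))},
	}, nil
}
```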
Thanks for the response @HumairAK!
Interesting, good to know!
The other place I have seen it used previously was in the …
That part went over my head, lol. So I am mostly concerned with the ability to get run / component information (status / runtime) primarily through the SDK. At the moment this depends on the PA (only partially for V2), which is why I am asking about these components. As mentioned, I have noticed some instability when handling many workflows. Since you expect the PA to exist in V3 too, I want to make sure we are able to scale it properly. Since I do not know the timeline for V3, maybe it is worthwhile implementing something in V2 to help us with this in the meantime, potentially building in some metrics and suggested scaling behavior for the PA deployment or similar. Any suggestions where I should continue discussion on this? Any existing similar issues / threads you are familiar with?
bac691f to 9763470
Remove ML Metadata (MLMD) service dependency and implement artifact and task tracking directly in the KFP database via the API server. This architectural change eliminates the external MLMD service (metadata-grpc, metadata-writer) and consolidates all metadata operations through the KFP API.

Major changes:
- Add v2beta1 artifact service API with storage layer implementation
- Extend run service with task CRUD endpoints and ViewMode
- Extend run response object with detailed task information
- Refactor driver/launcher to use KFP API client instead of MLMD client
- Remove all MLMD-related deployments and manifests
- Remove object store session info storage in metadata layer
- Add comprehensive test coverage for new storage and API layers

This simplifies deployment, reduces operational complexity, and provides better control over metadata storage performance and schema.

Signed-off-by: Humair Khan <[email protected]>

# Conflicts:
#	backend/api/v2beta1/go_client/run.pb.go
#	backend/api/v2beta1/go_client/run.pb.gw.go
#	backend/api/v2beta1/go_client/run_grpc.pb.go
#	backend/api/v2beta1/swagger/kfp_api_single_file.swagger.json
#	backend/metadata_writer/src/metadata_helpers.py
#	backend/src/apiserver/resource/resource_manager.go
#	backend/src/v2/cmd/driver/main.go
#	backend/src/v2/compiler/argocompiler/container.go
#	backend/src/v2/compiler/argocompiler/importer.go
#	backend/src/v2/driver/driver.go
#	backend/src/v2/driver/driver_test.go
#	backend/src/v2/metadata/env.go
#	manifests/kustomize/env/cert-manager/base-tls-certs/kfp-api-cert.yaml
#	manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/metadata-writer-deployment.yaml
#	test_data/compiled-workflows/components_with_optional_artifacts.yaml
#	test_data/compiled-workflows/modelcar.yaml
#	test_data/compiled-workflows/pipeline_with_dynamic_importer_metadata.yaml
#	test_data/compiled-workflows/pipeline_with_google_artifact_type.yaml
#	test_data/compiled-workflows/pipeline_with_importer.yaml
#	test_data/compiled-workflows/pipeline_with_importer_and_gcpc_types.yaml
#	test_data/compiled-workflows/pipeline_with_string_machine_fields_task_output.yaml
#	test_data/compiled-workflows/pythonic_artifact_with_single_return.yaml
#	test_data/compiled-workflows/ray_job_integration_compiled.yaml
9763470 to cb02722
It's used to populate the run details field of the runs object, but it's mostly just a copy of the run's associated Argo Workflow status field (node statuses). We will likely drop this field in the next major version upgrade.
Our current intent is to get rid of the PA, as we see it as unnecessary overhead for merely reporting run status. Either we consolidate this logic into the KFP server or into a separate dedicated controller that uses controller-runtime; either way, we'll certainly keep scalability in mind.
Signed-off-by: Humair Khan <[email protected]>
…s for test files Signed-off-by: Humair Khan <[email protected]>
Signed-off-by: Humair Khan <[email protected]>
Signed-off-by: Humair Khan <[email protected]>
Signed-off-by: Humair Khan <[email protected]>
Signed-off-by: Humair Khan <[email protected]>
GPT5.1 Codex review:

### Review Findings
1. **Artifact-task records always marked as plain outputs**
`CreateArtifact` and `CreateArtifactsBulk` ignore the `request.type` field and hardcode every `ArtifactTask` as `IOType_OUTPUT`, even when the caller explicitly sets `IOType_ITERATOR_OUTPUT` for loop iterations or other specialized output modes. This drops iterator semantics, so parent DAGs can no longer distinguish per-iteration outputs and downstream resolvers will treat every propagated artifact as a flat output.
```87:95:backend/src/apiserver/server/artifact_server.go
artifactTask := &apiv2beta1.ArtifactTask{
ArtifactId: artifact.UUID,
TaskId: task.UUID,
RunId: request.GetRunId(),
Type: apiv2beta1.IOType_OUTPUT,
Producer: producer,
Key: request.GetProducerKey(),
}
```

The same hardcoding occurs in the bulk path (…).
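A minimal sketch of one way to fix this: honor the requested type and only fall back to `IOType_OUTPUT` when the field is unset. The `GetType()` accessor is assumed from the `request.type` field mentioned above; the real fix should use whatever accessor the generated code exposes.

```go
// resolveIOType sketches honoring the caller-supplied type instead of
// hardcoding OUTPUT. The zero enum value is treated as "unset"; the GetType()
// accessor referenced below is an assumption and may differ in the generated code.
func resolveIOType(requested apiv2beta1.IOType) apiv2beta1.IOType {
	if requested == apiv2beta1.IOType(0) {
		return apiv2beta1.IOType_OUTPUT
	}
	return requested
}

// In CreateArtifact (and the bulk path), the hardcoded field would become:
//   Type: resolveIOType(request.GetType()),
```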
```go
resourceAttributes := &authorizationv1.ResourceAttributes{
Namespace: namespace,
Verb: common.RbacResourceVerbGet,
Group: common.RbacPipelinesGroup,
Version: common.RbacPipelinesVersion,
Resource: common.RbacResourceTypeRuns,
}
err := s.resourceManager.IsAuthorized(ctx, resourceAttributes)
```

Please change …

```yaml
- apiGroups:
  - pipelines.kubeflow.org
  resources:
  - runs
  verbs:
  - get
  - list
  - readArtifact
```

Please extend the aggregated roles (both “view” and “edit” flavors) with …

Open Questions / Follow-ups
Suggested Next Steps
Claude 4.5 review:

PR #12430: MLMD Removal - Action Items

PR: #12430

🚨 Critical Issues (Must Fix Before Merge)

1. Terminal State Enforcement Missing

Priority: 🔴 CRITICAL

Problem

The design requires preventing task updates when the parent run is in a terminal state (SUCCEEDED, FAILED, or CANCELED). This check is not implemented.

Impact

Launchers could update tasks after a run completes, leading to inconsistent state.

Required Fix

File:
Location:

Add this code before the authorization check:

```go
func (s *RunServer) UpdateTask(ctx context.Context, request *apiv2beta1.UpdateTaskRequest) (*apiv2beta1.PipelineTaskDetail, error) {
taskID := request.GetTaskId()
// Get existing task
existingTask, err := s.resourceManager.GetTask(taskID)
if err != nil {
return nil, util.Wrap(err, "Failed to get existing task for authorization")
}
// ✅ ADD THIS: Check if run is in terminal state
run, err := s.resourceManager.GetRun(existingTask.RunUUID)
if err != nil {
return nil, util.Wrap(err, "Failed to get run to check terminal state")
}
terminalStates := []model.RuntimeState{
model.RuntimeStateSucceeded,
model.RuntimeStateFailed,
model.RuntimeStateCanceled,
}
for _, terminalState := range terminalStates {
if run.State == terminalState {
return nil, util.NewInvalidInputError(
"Cannot update task %s: parent run %s is in terminal state %s",
taskID, existingTask.RunUUID, terminalState,
)
}
}
// Continue with existing authorization and update logic...
}
```

Also apply to:

Required Test

File:

```go
func TestUpdateTask_TerminalState_Rejected(t *testing.T) {
// Setup
clientManager, resourceManager := setupTestEnv()
runSrv := NewRunServer(resourceManager, nil)
// Create run and task
run := createTestRun(t, resourceManager, "test-run")
task := createTestTask(t, runSrv, run.UUID, "test-task")
// Mark run as SUCCEEDED (terminal state)
resourceManager.UpdateRun(run.UUID, &model.Run{State: model.RuntimeStateSucceeded})
// Attempt to update task - should fail
_, err := runSrv.UpdateTask(context.Background(), &apiv2beta1.UpdateTaskRequest{
TaskId: task.GetTaskId(),
Task: &apiv2beta1.PipelineTaskDetail{
TaskId: task.GetTaskId(),
State: apiv2beta1.PipelineTaskDetail_FAILED,
},
})
// Assert: Update should be rejected
assert.Error(t, err)
assert.Contains(t, err.Error(), "terminal state")
}
```
| Issue | Priority | Effort | Files to Modify | Tests Required |
|---|---|---|---|---|
| 1. Terminal State | 🔴 Critical | 4h | run_server.go | run_server_tasks_test.go |
| 2. Cache Fingerprint | 🟡 Medium | 2h | launcher_v2.go | launcher_v2_test.go |
| 3. Exit Handler | 🟡 Medium | 3h | dag.go | dag_test.go |
| 4. Documentation | 🟢 Low | 1h | design-details.md | N/A |
| Total | | 10h | 4 files | 3 test files |
✅ Merge Recommendations
For mlmd-removal Branch
Status:
Requirements:
- ✅ Must fix: Issue 1 (Terminal State Enforcement)
Timeline: 1 day
For master Branch
Status: 🚫 Not Ready
Requirements:
- ✅ Must fix: Issue 1 (Terminal State Enforcement)
- ⚠️ Should fix: Issue 2 (Cache Fingerprint)
- ⚠️ Should fix: Issue 3 (Exit Handler)
- 📝 Should update: Issue 4 (Documentation)
Timeline: 2-3 days
🎯 Next Steps
- Immediate (before merging to mlmd-removal):
  - Implement terminal state enforcement
  - Add terminal state tests
  - Test manually with concurrent runs
- Before merging to master:
  - Clear cache fingerprint on failure
  - Add exit handler detection
  - Update design documentation
  - Run full integration test suite
  - Verify all new tests pass
- Post-merge (follow-up PRs as planned):
  - Migration implementation
  - Frontend changes
📞 Contact
For questions or clarifications about these action items, refer to the detailed review in BACKEND_VERIFICATION_CHECKLIST.md.
Reviewer: AI Assistant
Date: 2025-11-20
Signed-off-by: Humair Khan <[email protected]>
Signed-off-by: Humair Khan <[email protected]>
zazulam
left a comment
initial light pass
```go
// limitations under the License.

// Package common provides common utilities for the KFP v2 driver.
package common
```
filename typo
```go
func getSubTasks(
	currentTask *apiv2beta1.PipelineTaskDetail,
	allRuntasks []*apiv2beta1.PipelineTaskDetail,
	flattenedTasks map[string]*apiv2beta1.PipelineTaskDetail,
) (map[string]*apiv2beta1.PipelineTaskDetail, error) {
```
Can we move all functions that are reused for parameter resolution to the utils file?
Description of your changes:
This PR removes MLMD as per the KEP here
Resolves: #11760
Overview
Core Change: Replaced MLMD (ML Metadata) service with direct database storage via KFP API server.
This is a major architectural shift that eliminates the external ML Metadata service dependency and consolidates all artifact and task metadata operations directly into the KFP API server with MySQL/database backend.
Components Removed
MLMD Service Infrastructure
- backend/metadata_writer/
- backend/src/v2/metadata/

Deployment Changes
Components Added
New API Layer
Artifact Service API (backend/api/v2beta1/artifact.proto)

CRUD Operations:
- CreateArtifact - Create single artifact
- GetArtifact - Retrieve artifact by ID
- ListArtifacts - Query artifacts with filtering
- BatchCreateArtifacts - Bulk artifact creation

Artifact Task Operations:
- CreateArtifactTask - Track artifact usage in tasks
- ListArtifactTasks - Query artifact-task relationships
- BatchCreateArtifactTasks - Bulk task-artifact linking

Generated Clients:
Extended Run Service API (backend/api/v2beta1/run.proto)

New Task Endpoints:
- CreateTask - Create pipeline task execution record
- GetTask - Retrieve task details
- ListTasks - Query tasks with filtering
- UpdateTask - Update task status/metadata
- BatchUpdateTasks - Efficient bulk task updates
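For illustration, a hedged sketch of how a launcher might report a failed task through the new UpdateTask endpoint, using only the request shape shown in the review snippets above; the RunServiceClient name follows standard protoc-gen-go-grpc conventions and is an assumption:

```go
// reportTaskFailure is an illustrative launcher-side call to the new
// UpdateTask endpoint; retries and richer status handling are elided.
func reportTaskFailure(ctx context.Context, client apiv2beta1.RunServiceClient, taskID string) error {
	_, err := client.UpdateTask(ctx, &apiv2beta1.UpdateTaskRequest{
		TaskId: taskID,
		Task: &apiv2beta1.PipelineTaskDetail{
			TaskId: taskID,
			State:  apiv2beta1.PipelineTaskDetail_FAILED,
		},
	})
	return err
}
```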
ViewMode Feature:
- BASIC - Minimal response (IDs, status, timestamps)
- RUNTIME_ONLY - Include runtime details without full spec
- FULL - Complete task/run details with spec

Storage Layer
- Artifact Storage (backend/src/apiserver/storage/artifact_store.go)
- Artifact Task Store (backend/src/apiserver/storage/artifact_task_store.go)
- Enhanced Task Store (backend/src/apiserver/storage/task_store.go)

API Server Implementation
- Artifact Server (backend/src/apiserver/server/artifact_server.go)
- Extended Run Server (backend/src/apiserver/server/run_server.go)

Client Infrastructure
- KFP API Client (backend/src/v2/apiclient/)

Driver/Launcher Refactoring
Parameter/Artifact Resolution (backend/src/v2/driver/resolver/)
- resolve.go (~1,100 lines removed)
- parameters.go - Parameter resolution (~560 lines)
- artifacts.go - Artifact resolution (~314 lines)
- resolve.go - Orchestration (~90 lines)

Driver Changes (backend/src/v2/driver/)

Launcher Changes (backend/src/v2/cmd/launcher-v2/)

Batch Updater (backend/src/v2/component/batch_updater.go)

Testing Infrastructure
Test Data Pipelines (backend/src/v2/driver/test_data/)
- cache_test.yaml - Cache hit/miss scenarios
- componentInput.yaml - Input parameter testing
- k8s_parameters.yaml - Kubernetes-specific features
- oneof_simple.yaml - Conditional execution
- nested_naming_conflicts.yaml - Name resolution edge cases

Test Coverage
Utility Additions
- Scope Path (backend/src/common/util/scope_path.go)
- Proto Helpers (backend/src/common/util/proto_helpers.go)
- YAML Parser (backend/src/common/util/yaml_parser.go)

Key Behavioral Changes
Artifact Tracking
Task State Management
Performance Optimizations
API Response Size
- ListRuns with VIEW_MODE=DEFAULT: ~80% smaller payloads

Migration Considerations
Database Schema
- artifacts, artifact_tasks
- tasks table with new columns
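As a rough picture of the new link table, a sketch of a GORM-style model whose columns mirror the ArtifactTask proto fields quoted earlier in this thread; the actual model, tags, and indexes in this PR may differ:

```go
// ArtifactTask sketches a GORM-style model for the new artifact_tasks link
// table. Column names mirror the ArtifactTask proto message; types, tags,
// and indexes here are illustrative only.
type ArtifactTask struct {
	UUID       string `gorm:"column:UUID; not null; primary_key"`
	ArtifactID string `gorm:"column:ArtifactId; not null"`
	TaskID     string `gorm:"column:TaskId; not null"`
	RunID      string `gorm:"column:RunId; not null"`
	Type       string `gorm:"column:Type; not null"` // e.g. OUTPUT, ITERATOR_OUTPUT
	Producer   string `gorm:"column:Producer"`
	Key        string `gorm:"column:Key"`
}
```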
Backwards Compatibility

Deployment
Testing Strategy
Unit Tests
Integration Tests
Golden File Updates
Files Changed Summary
Breakdown
Risks & Considerations
Testing
Performance
Operational
Recommended Follow-up
Conclusion
This is an architectural improvement that:
Checklist: