Kubeflow SDK

Overview

Kubeflow SDK is a unified Python SDK that streamlines how AI practitioners interact with Kubeflow projects. It provides simple, consistent APIs across the Kubeflow ecosystem, so users can focus on building ML applications rather than managing complex infrastructure.

Kubeflow SDK Benefits

Unified Experience: Single SDK to interact with multiple Kubeflow projects through consistent Python APIs
Simplified AI Workflows: Abstract away Kubernetes complexity, allowing AI practitioners to work in familiar Python environments
Seamless Integration: Designed to work together with all Kubeflow projects for end-to-end ML pipelines
Local Development: First-class support for local development requiring only pip installation

Get Started

Install Kubeflow SDK

pip install -U kubeflow

Run your first PyTorch distributed job

from kubeflow.trainer import TrainerClient, CustomTrainer, TrainJobTemplate

def get_torch_dist(learning_rate: str, num_epochs: str):
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")
    print("PyTorch Distributed Environment")
    print(f"WORLD_SIZE: {dist.get_world_size()}")
    print(f"RANK: {dist.get_rank()}")
    print(f"LOCAL_RANK: {os.environ['LOCAL_RANK']}")

    lr = float(learning_rate)
    epochs = int(num_epochs)
    loss = 1.0 - (lr * 2) - (epochs * 0.01)

    if dist.get_rank() == 0:
        print(f"loss={loss}")

# Create the TrainJob template
template = TrainJobTemplate(
    runtime=TrainerClient().get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=get_torch_dist,
        func_args={"learning_rate": "0.01", "num_epochs": "5"},
        num_nodes=3,
        resources_per_node={"cpu": 2},
    ),
)

# Create the TrainJob
job_id = TrainerClient().train(**template)

# Wait for TrainJob to complete
TrainerClient().wait_for_job_status(job_id)

# Print TrainJob logs
print("\n".join(TrainerClient().get_job_logs(name=job_id)))
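
The training function above receives its arguments as strings and converts them before computing a placeholder loss. As a minimal sketch of that arithmetic, runnable without a cluster or PyTorch, the snippet below repeats the same formula; `synthetic_loss` is a hypothetical helper name used only for illustration and is not part of the SDK.

```python
def synthetic_loss(learning_rate: str, num_epochs: str) -> float:
    """Mirror the placeholder loss from get_torch_dist (no real training)."""
    lr = float(learning_rate)
    epochs = int(num_epochs)
    return 1.0 - (lr * 2) - (epochs * 0.01)


# Same string-valued arguments as func_args in the template above.
loss = synthetic_loss(learning_rate="0.01", num_epochs="5")
print(f"loss={loss:.2f}")  # prints loss=0.93
```

With a learning rate of 0.01 and 5 epochs, the formula gives 1.0 - 0.02 - 0.05, which is roughly 0.93.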

Optimize hyperparameters for your training

from kubeflow.optimizer import OptimizerClient, Search, TrialConfig

# Create OptimizationJob with the same template
optimization_id = OptimizerClient().optimize(
    trial_template=template,
    trial_config=TrialConfig(num_trials=10, parallel_trials=2),
    search_space={
        "learning_rate": Search.loguniform(0.001, 0.1),
        "num_epochs": Search.choice([5, 10, 15]),
    },
)

print(f"OptimizationJob created: {optimization_id}")
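
To make the search space above concrete, here is a rough local sketch of what drawing from a log-uniform range and a discrete choice can look like, using only the Python standard library. It uses plain random search as a stand-in; the source does not say which algorithm the optimizer applies, and `sample_loguniform` and `synthetic_loss` are hypothetical helpers, not SDK functions.

```python
import math
import random


def sample_loguniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def synthetic_loss(lr: float, epochs: int) -> float:
    """Placeholder objective copied from the training function above."""
    return 1.0 - (lr * 2) - (epochs * 0.01)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = []
for _ in range(10):  # mirrors num_trials=10
    params = {
        "learning_rate": sample_loguniform(rng, 0.001, 0.1),
        "num_epochs": rng.choice([5, 10, 15]),
    }
    loss = synthetic_loss(params["learning_rate"], params["num_epochs"])
    trials.append((loss, params))

# Keep the trial with the lowest placeholder loss.
best_loss, best_params = min(trials, key=lambda t: t[0])
print(f"best loss={best_loss:.4f} with {best_params}")
```

Sampling on a log scale spreads trials evenly across orders of magnitude, which is why it is a common choice for learning rates.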

Local Development

The Kubeflow Trainer client supports local development without a Kubernetes cluster.

Available Backends

KubernetesBackend (default) - Production training on Kubernetes
ContainerBackend - Local development with Docker/Podman isolation
LocalProcessBackend - Quick prototyping with Python subprocesses
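
As a rough illustration of the idea behind subprocess-based prototyping, the sketch below starts one Python subprocess per simulated node and passes rank information through environment variables. This is not how LocalProcessBackend is implemented; `SCRIPT` and `run_local` are hypothetical names, and the processes run one after another rather than in parallel.

```python
import os
import subprocess
import sys

# Stand-in training script: reports the rank information it was given.
SCRIPT = 'import os; print("rank", os.environ["RANK"], "of", os.environ["WORLD_SIZE"])'


def run_local(world_size: int) -> list[str]:
    """Run one Python subprocess per simulated node and collect its output."""
    outputs = []
    for rank in range(world_size):
        # Copy the current environment and add the per-process rank settings.
        env = {**os.environ, "RANK": str(rank), "WORLD_SIZE": str(world_size)}
        result = subprocess.run(
            [sys.executable, "-c", SCRIPT],
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        outputs.append(result.stdout.strip())
    return outputs


lines = run_local(world_size=3)
print("\n".join(lines))
```

Each subprocess sees only its own `RANK`, which is the same convention the PyTorch example above reads through `torch.distributed`.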

Quick Start: Install container support with pip install "kubeflow[docker]" or pip install "kubeflow[podman]" (the quotes keep shells such as zsh from interpreting the brackets)

from kubeflow.trainer import TrainerClient, ContainerBackendConfig, CustomTrainer

# Switch to local container execution
client = TrainerClient(backend_config=ContainerBackendConfig())

# Your training runs locally in isolated containers.
# train_fn is your own training function, defined elsewhere.
job_id = client.train(trainer=CustomTrainer(func=train_fn))

Supported Kubeflow Projects

Project	Status	Version Support	Description
Kubeflow Trainer	✅ Available	v2.0.0+	Train and fine-tune AI models with various frameworks
Kubeflow Katib	✅ Available	v0.19.0+	Hyperparameter optimization
Kubeflow Pipelines	🚧 Planned	TBD	Build, run, and track AI workflows
Kubeflow Model Registry	🚧 Planned	TBD	Manage model artifacts, versions and ML artifacts metadata
Kubeflow Spark Operator	🚧 Planned	TBD	Manage Spark applications for data processing and feature engineering

Community

Getting Involved

Slack: Join our #kubeflow-ml-experience Slack channel
Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly meetings
GitHub: Discussions, issues and contributions at kubeflow/sdk

Contributing

Kubeflow SDK is a community project and is still under active development. We welcome contributions! Please see our CONTRIBUTING Guide for details.

Documentation

Design Document: Kubeflow SDK design proposal
Component Guides: Individual component documentation
DeepWiki: AI-powered repository documentation
