This repository provides a powerful integration between ZenML and OpenPipe, combining ZenML's production-grade MLOps orchestration with OpenPipe's specialized LLM fine-tuning capabilities.
Perfect for teams who need to:
- Create reproducible LLM fine-tuning pipelines
- Track all datasets, models, and experiments
- Deploy fine-tuned models to production with confidence
- Apply MLOps best practices to LLM workflows
```bash
# Clone the repository
git clone https://github.com/zenml-io/zenml-openpipe.git
cd zenml-openpipe

# Install dependencies
pip install -r requirements.txt
```
- OpenPipe Account: Sign up for OpenPipe to get your API key
- ZenML: You can use ZenML in two ways:
  - Open Source: `pip install "zenml[server]"` and follow the self-hosting instructions
  - ZenML Pro (optional): Sign up for a managed experience with additional features
```bash
# Set your OpenPipe API key
export OPENPIPE_API_KEY=opk-your-api-key

# Run the pipeline with the toy dataset
python run.py
```
Once the pipeline completes, OpenPipe automatically deploys your fine-tuned model and makes it available through their API. You can immediately use your model with a simple API call:
```bash
curl https://api.openpipe.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer opk-your-api-key" \
  -d '{
    "model": "customer_service_assistant",
    "messages": [
      {"role": "system", "content": "You are a helpful customer service assistant."},
      {"role": "user", "content": "How do I reset my password?"}
    ]
  }'
```
For Python applications, you can use the OpenPipe Python SDK:
```python
# pip install openpipe
from openpipe import OpenAI

client = OpenAI(
    openpipe={"api_key": "opk-your-api-key"}
)

completion = client.chat.completions.create(
    model="openpipe:customer_service_assistant",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful customer service assistant for Ultra electronics products."
        },
        {
            "role": "user",
            "content": "Can I trade in my old device for a new UltraPhone X?"
        }
    ],
    temperature=0,
    openpipe={
        "tags": {
            "prompt_id": "counting",
            "any_key": "any_value"
        }
    },
)

print(completion.choices[0].message)
```
When you need to update your model with new data, simply run the pipeline again, and OpenPipe will automatically retrain and redeploy the updated model.
```python
@pipeline
def openpipe_finetuning(
    # Data parameters
    data_source: str = "toy",
    system_prompt: str = "You are a helpful assistant",
    # OpenPipe parameters
    model_name: str = "zenml_finetuned_model",
    base_model: str = "meta-llama/Meta-Llama-3.1-8B-Instruct",
    # Training parameters
    enable_sft: bool = True,
    num_epochs: int = 3,
    # ...and more
):
    # Load and prepare your data
    data = data_loader(...)
    jsonl_path = openpipe_data_converter(...)

    # Create OpenPipe dataset and start fine-tuning
    dataset_id = openpipe_dataset_creator(...)
    finetuning_result = openpipe_finetuning_starter(...)

    return finetuning_result
```
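Because `openpipe_finetuning` is a standard ZenML pipeline, you can also invoke it directly from Python instead of going through `run.py`. A minimal sketch (the parameter values below are illustrative, not the script's defaults):

```python
# Calling the @pipeline-decorated function triggers a ZenML pipeline run.
# Parameter values are illustrative; see run.py for the supported CLI flags.
openpipe_finetuning(
    data_source="toy",
    model_name="customer_service_assistant",
    num_epochs=3,
)
```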
Every run of your fine-tuning pipeline tracks:
- Input data and processing
- Training configuration and hyperparameters
- Model performance and results
- Use toy datasets or bring your own data (CSV)
- Select from a variety of base models
- Customize supervised fine-tuning parameters
- Set up continuous training processes
You can deploy this integration on any infrastructure stack supported by ZenML to enable automated, scheduled fine-tuning workflows.
ZenML supports various orchestrators (Airflow, Kubernetes, Vertex AI, etc.) and cloud environments, allowing you to:
- Run fine-tuning jobs on a recurring schedule
- Trigger pipelines based on new data arrivals
- Scale resources based on workload requirements
- Integrate with your existing ML infrastructure
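For example, on a schedule-capable orchestrator you can attach a cron trigger with ZenML's `Schedule`. A minimal sketch, assuming a recent ZenML version (the cron expression is illustrative):

```python
from zenml.config.schedule import Schedule

# Retrain every Monday at 02:00 on an orchestrator that supports
# schedules (e.g., Kubernetes or Vertex AI). Cron is illustrative.
scheduled = openpipe_finetuning.with_options(
    schedule=Schedule(cron_expression="0 2 * * 1")
)
scheduled()
```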
For more details on deployment options, check the ZenML documentation.
This integration leverages ZenML's metadata tracking capabilities to capture extensive information throughout the fine-tuning process:
- Data preparation metrics: Shape of datasets, split ratios, sample distributions
- Fine-tuning parameters: Model configurations, hyperparameters, training durations
- Runtime statistics: Status transitions, completion times, resource utilization
- Model information: URLs to access models, deployment timestamps, version tracking
All metadata is accessible in the ZenML dashboard, enabling:
- Experiment comparison across multiple runs
- Performance analysis and debugging
- Easy reproduction of successful training jobs
- Audit trails for model governance
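For programmatic access, recent ZenML versions expose run and step metadata through the `Client` API. A minimal sketch (the metadata keys you get back depend on what each step logs):

```python
from zenml.client import Client

# Fetch the most recent run of the fine-tuning pipeline.
run = Client().get_pipeline("openpipe_finetuning").last_run

# Inspect the metadata logged by the fine-tuning step.
step = run.steps["openpipe_finetuning_starter"]
print(step.run_metadata)
```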
A key advantage of this integration is that OpenPipe automatically deploys your fine-tuned model as soon as training completes. Your model is immediately available via API without any additional deployment steps.
*The OpenPipe console showing a successfully deployed fine-tuned model*
When you run the pipeline again with new data, OpenPipe automatically retrains and redeploys your model, ensuring your production model always reflects your latest data. This makes it easy to implement a continuous improvement cycle:
- Fine-tune initial model
- Collect feedback and new examples
- Rerun the pipeline to update the model
- Repeat to continuously improve performance
```bash
# Use your own CSV dataset
python run.py --openpipe-api-key=opk-your-api-key --data-source=path/to/data.csv

# Fine-tune Llama-3-70B instead of the default
python run.py --openpipe-api-key=opk-your-api-key --model-name=my-model --base-model=meta-llama/Meta-Llama-3-70B-Instruct
```
The integration supports using your own custom datasets for fine-tuning. Here's how to prepare and use your data:
Your CSV file should include at minimum these two columns:
- A column with user messages/questions (default: `question`)
- A column with assistant responses/answers (default: `answer`)
Example CSV structure:
```csv
question,answer,product
"How do I turn on my Ultra TV?","Press the power button on the remote or on the bottom right of the TV.",television
"Is my Ultra SmartWatch waterproof?","Yes, the Ultra SmartWatch is water-resistant up to 50 meters.",smartwatch
```
When you provide your CSV file, the pipeline automatically:
- Reads your CSV data
- Applies the system prompt to all examples
- Converts the data to OpenPipe's required JSONL format
- Splits the data into training and testing sets
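Conceptually, the conversion works like the following simplified sketch; this is an illustration of the transformation, not the actual `openpipe_data_converter` implementation (column names and the split logic mirror the pipeline's defaults):

```python
import csv
import json
import random

def csv_to_openpipe_jsonl(csv_path, jsonl_path, system_prompt,
                          user_column="question", assistant_column="answer",
                          metadata_columns=(), split_ratio=0.9):
    """Simplified sketch of the CSV-to-JSONL conversion."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    random.shuffle(rows)
    n_train = int(len(rows) * split_ratio)

    with open(jsonl_path, "w") as out:
        for i, row in enumerate(rows):
            record = {
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": row[user_column]},
                    {"role": "assistant", "content": row[assistant_column]},
                ],
                "split": "TRAIN" if i < n_train else "TEST",
                "metadata": {col: row[col] for col in metadata_columns},
            }
            out.write(json.dumps(record) + "\n")
```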
The final JSONL format looks like this (from the generated `openpipe_data.jsonl`):
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful customer service assistant for Ultra electronics products."},
    {"role": "user", "content": "What is the price of the UltraPhone X?"},
    {"role": "assistant", "content": "The UltraPhone X is available for $999. Would you like to know about our financing options?"}
  ],
  "split": "TRAIN",
  "metadata": {"product": "UltraPhone X"}
}
```
1. Prepare your CSV file with at least these columns:
   - A question/user message column (named `question` by default)
   - An answer/assistant response column (named `answer` by default)
   - Any additional metadata columns you want to include (optional)
2. Run the pipeline with your data file:
   ```bash
   python run.py --data-source=path/to/your/data.csv
   ```
3. Check the results in the ZenML dashboard or logs
Here's a complete example with all possible customizations:
```bash
python run.py \
  --data-source=my_customer_support_data.csv \
  --user-column=customer_query \
  --assistant-column=agent_response \
  --system-prompt="You are a helpful customer service assistant for Acme Corp." \
  --metadata-columns=product_category \
  --metadata-columns=customer_segment \
  --split-ratio=0.85
```
If your CSV uses different column names than the defaults, specify them with command-line arguments:
```bash
python run.py \
  --data-source=path/to/your/data.csv \
  --user-column=prompt \
  --assistant-column=completion
```
For example, if your CSV looks like this:
```csv
prompt,completion,category
"What's your return policy?","We offer a 30-day no-questions-asked return policy.",returns
"Do you ship internationally?","Yes, we ship to over 50 countries worldwide.",shipping
```
You can include additional metadata columns in your CSV to enhance fine-tuning:
- Add the columns to your CSV
- Specify them when running the pipeline:
```bash
python run.py --data-source=path/to/your/data.csv --metadata-columns=category --metadata-columns=difficulty
```
Metadata can help OpenPipe better understand the context of your training examples and can be useful for:
- Filtering and analyzing results
- Creating specialized versions of your model
- Understanding performance across different data categories
By default, the pipeline splits your data into training and evaluation sets using a 90/10 split. You can adjust this:
```bash
python run.py --data-source=path/to/your/data.csv --split-ratio=0.8
```
You can set a custom system prompt that will be applied to all examples:
```bash
python run.py --data-source=path/to/your/data.csv --system-prompt="You are a customer service assistant for Ultra products."
```
```bash
# Get detailed information about an existing model
python run.py --openpipe-api-key=opk-your-api-key --model-name=my-model --fetch-details-only
```
The integration leverages:
- ZenML's Pipeline Orchestration: Handles workflow DAGs, artifact tracking, and reproducibility
- OpenPipe's LLM Fine-Tuning: Provides state-of-the-art techniques for adapting foundation models
Contributions are welcome! Please check out our contribution guidelines for details on how to get started.
- Join the ZenML Slack community
- Ask questions in the #general channel
- Open an issue on GitHub
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.