ZenML ❤️ OpenPipe: Fine-Tune LLMs with MLOps Best Practices

ZenML + OpenPipe brings production-grade MLOps to your LLM fine-tuning workflows




🌟 What is This Repository?

This repository provides a powerful integration between ZenML and OpenPipe, combining ZenML's production-grade MLOps orchestration with OpenPipe's specialized LLM fine-tuning capabilities.

Perfect for teams who need to:

  • Create reproducible LLM fine-tuning pipelines
  • Track all datasets, models, and experiments
  • Deploy fine-tuned models to production with confidence
  • Apply MLOps best practices to LLM workflows

🚀 Quickstart

Installation

# Clone the repository
git clone https://github.com/zenml-io/zenml-openpipe.git
cd zenml-openpipe

# Install dependencies
pip install -r requirements.txt

Set Up Your Environment

  1. OpenPipe Account: Sign up for OpenPipe to get your API key
  2. ZenML: You can use ZenML in two ways: run everything locally on the default local stack (no extra setup required), or connect to a deployed ZenML server (including ZenML Pro) for team collaboration and a hosted dashboard

Run Your First Pipeline

# Set your OpenPipe API key
export OPENPIPE_API_KEY=opk-your-api-key

# Run the pipeline with the toy dataset
python run.py

Once the pipeline completes, OpenPipe automatically deploys your fine-tuned model and makes it available through their API. You can immediately use your model with a simple API call:

curl https://api.openpipe.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer opk-your-api-key" \
  -d '{
    "model": "customer_service_assistant",
    "messages": [
      {"role": "system", "content": "You are a helpful customer service assistant."},
      {"role": "user", "content": "How do I reset my password?"}
    ]
  }'

For Python applications, you can use the OpenPipe Python SDK:

# pip install openpipe

from openpipe import OpenAI

client = OpenAI(
  openpipe={"api_key": "opk-your-api-key"}
)

completion = client.chat.completions.create(
    model="openpipe:customer_service_assistant",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful customer service assistant for Ultra electronics products."
        },
        {
            "role": "user",
            "content": "Can I trade in my old device for a new UltraPhone X?"
        }
    ],
    temperature=0,
    openpipe={
        "tags": {
            "prompt_id": "counting",
            "any_key": "any_value"
        }
    },
)

print(completion.choices[0].message)

When you need to update your model with new data, simply run the pipeline again, and OpenPipe will automatically retrain and redeploy the updated model.

✨ Key Features

📊 End-to-End Fine-Tuning Pipeline

@pipeline
def openpipe_finetuning(
    # Data parameters
    data_source: str = "toy",
    system_prompt: str = "You are a helpful assistant",
    
    # OpenPipe parameters
    model_name: str = "zenml_finetuned_model",
    base_model: str = "meta-llama/Meta-Llama-3.1-8B-Instruct",
    
    # Training parameters
    enable_sft: bool = True,
    num_epochs: int = 3,
    # ...and more
):
    # Load and prepare your data
    data = data_loader(...)
    jsonl_path = openpipe_data_converter(...)
    
    # Create OpenPipe dataset and start fine-tuning
    dataset_id = openpipe_dataset_creator(...)
    finetuning_result = openpipe_finetuning_starter(...)
    
    return finetuning_result

πŸ” Complete Traceability

Every run of your fine-tuning pipeline tracks:

  • Input data and processing
  • Training configuration and hyperparameters
  • Model performance and results

(Screenshot: pipeline lineage view in the ZenML dashboard)
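
For example, you can pull up a run and its step outputs programmatically with the ZenML client (a minimal sketch; the step name matches the pipeline definition above):

from zenml.client import Client

# Fetch the most recent run of the fine-tuning pipeline
run = Client().get_pipeline("openpipe_finetuning").last_run
print(run.status)

# Inspect the outputs of an individual step, e.g. the fine-tuning starter
step = run.steps["openpipe_finetuning_starter"]
print(step.outputs)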

πŸ› οΈ Flexible Customization

  • Use toy datasets or bring your own data (CSV)
  • Select from a variety of base models
  • Customize supervised fine-tuning parameters
  • Set up continuous training processes

🔄 Production Deployment & Scheduling

You can deploy this integration on any infrastructure stack supported by ZenML to enable automated, scheduled fine-tuning workflows.

ZenML supports various orchestrators (Airflow, Kubernetes, Vertex AI, etc.) and cloud environments, allowing you to:

  • Run fine-tuning jobs on a recurring schedule
  • Trigger pipelines based on new data arrivals
  • Scale resources based on workload requirements
  • Integrate with your existing ML infrastructure

For more details on deployment options, check the ZenML documentation.
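
For example, attaching a cron schedule could look like this (a minimal sketch; it assumes your orchestrator supports scheduling, and the import path is hypothetical, so adjust it to your copy of the repository):

from zenml.config.schedule import Schedule

# Hypothetical import path; adjust to where openpipe_finetuning lives in your copy
from pipelines import openpipe_finetuning

# Retrain every Monday at 02:00 on a schedule-capable orchestrator
scheduled = openpipe_finetuning.with_options(
    schedule=Schedule(cron_expression="0 2 * * 1")
)
scheduled()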

📊 Comprehensive Metadata Tracking

This integration leverages ZenML's metadata tracking capabilities to capture extensive information throughout the fine-tuning process:

  • Data preparation metrics: Shape of datasets, split ratios, sample distributions
  • Fine-tuning parameters: Model configurations, hyperparameters, training durations
  • Runtime statistics: Status transitions, completion times, resource utilization
  • Model information: URLs to access models, deployment timestamps, version tracking

All metadata is accessible in the ZenML dashboard, enabling:

  • Experiment comparison across multiple runs
  • Performance analysis and debugging
  • Easy reproduction of successful training jobs
  • Audit trails for model governance
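
As a sketch, the same information can also be pulled programmatically (which metadata keys exist depends on what your steps actually log):

from zenml.client import Client

# List recent runs of the fine-tuning pipeline side by side
for run in Client().get_pipeline("openpipe_finetuning").runs:
    print(run.name, run.status, run.created)
    # Key/value metadata recorded during the run (keys depend on your steps)
    print(dict(run.run_metadata))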

🚀 Automatic Deployment and Redeployment

A key advantage of this integration is that OpenPipe automatically deploys your fine-tuned model as soon as training completes. Your model is immediately available via API without any additional deployment steps.

(Screenshot: the OpenPipe console showing a successfully deployed fine-tuned model)

When you run the pipeline again with new data, OpenPipe automatically retrains and redeploys your model, ensuring your production model always reflects your latest data. This makes it easy to implement a continuous improvement cycle:

  1. Fine-tune initial model
  2. Collect feedback and new examples
  3. Rerun the pipeline to update the model
  4. Repeat to continuously improve performance

📚 Advanced Usage

Custom Data Source

# Use your own CSV dataset
python run.py --openpipe-api-key=opk-your-api-key --data-source=path/to/data.csv

Model Selection

# Fine-tune Llama-3-70B instead of the default
python run.py --openpipe-api-key=opk-your-api-key --model-name=my-model --base-model=meta-llama/Meta-Llama-3-70B-Instruct

πŸ—‚οΈ Bringing Your Own Data

The integration supports using your own custom datasets for fine-tuning. Here's how to prepare and use your data:

Data Format Requirements

Your CSV file should include at minimum these two columns:

  • A column with user messages/questions (default: question)
  • A column with assistant responses/answers (default: answer)

Example CSV structure:

question,answer,product
"How do I turn on my Ultra TV?","Press the power button on the remote or on the bottom right of the TV.",television
"Is my Ultra SmartWatch waterproof?","Yes, the Ultra SmartWatch is water-resistant up to 50 meters.",smartwatch

Understanding the Data Transformation Process

When you provide your CSV file, the pipeline automatically:

  1. Reads your CSV data
  2. Applies the system prompt to all examples
  3. Converts the data to OpenPipe's required JSONL format
  4. Splits the data into training and testing sets

The final JSONL format looks like this (from the generated openpipe_data.jsonl):

{
  "messages": [
    {"role": "system", "content": "You are a helpful customer service assistant for Ultra electronics products."},
    {"role": "user", "content": "What is the price of the UltraPhone X?"},
    {"role": "assistant", "content": "The UltraPhone X is available for $999. Would you like to know about our financing options?"}
  ],
  "split": "TRAIN",
  "metadata": {"product": "UltraPhone X"}
}
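
Conceptually, the conversion works like the sketch below (a simplified illustration, not the actual openpipe_data_converter implementation):

import json
import pandas as pd

def to_openpipe_jsonl(df, system_prompt, out_path,
                      user_col="question", assistant_col="answer",
                      metadata_cols=(), split_ratio=0.9):
    """Write rows to OpenPipe's JSONL chat format with a train/test split."""
    n_train = int(len(df) * split_ratio)
    with open(out_path, "w") as f:
        for i, row in df.reset_index(drop=True).iterrows():
            record = {
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": row[user_col]},
                    {"role": "assistant", "content": row[assistant_col]},
                ],
                "split": "TRAIN" if i < n_train else "TEST",
                "metadata": {c: str(row[c]) for c in metadata_cols},
            }
            f.write(json.dumps(record) + "\n")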

Step-by-Step Guide to Using Your Data

  1. Prepare your CSV file with at least these columns:

    • A question/user message column (named question by default)
    • An answer/assistant response column (named answer by default)
    • Any additional metadata columns you want to include (optional)
  2. Run the pipeline with your data file:

    python run.py --data-source=path/to/your/data.csv
  3. Check the results in the ZenML dashboard or logs

Here's a complete example with all possible customizations:

python run.py \
  --data-source=my_customer_support_data.csv \
  --user-column=customer_query \
  --assistant-column=agent_response \
  --system-prompt="You are a helpful customer service assistant for Acme Corp." \
  --metadata-columns=product_category \
  --metadata-columns=customer_segment \
  --split-ratio=0.85

Customizing Column Names

If your CSV uses different column names than the defaults, specify them with command-line arguments:

python run.py \
  --data-source=path/to/your/data.csv \
  --user-column=prompt \
  --assistant-column=completion

For example, the flags above would match a CSV like this:

prompt,completion,category
"What's your return policy?","We offer a 30-day no-questions-asked return policy.",returns
"Do you ship internationally?","Yes, we ship to over 50 countries worldwide.",shipping

Adding Metadata

You can include additional metadata columns in your CSV to enhance fine-tuning:

  1. Add the columns to your CSV
  2. Specify them when running the pipeline:
    python run.py --data-source=path/to/your/data.csv --metadata-columns=category --metadata-columns=difficulty

Metadata can help OpenPipe better understand the context of your training examples and can be useful for:

  • Filtering and analyzing results
  • Creating specialized versions of your model
  • Understanding performance across different data categories
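
For instance, you can check how your examples are distributed across a metadata field in the generated file (a small sketch using the category column from the example above):

import json
from collections import Counter

# Count examples per metadata category in the generated JSONL
counts = Counter()
with open("openpipe_data.jsonl") as f:
    for line in f:
        record = json.loads(line)
        counts[record.get("metadata", {}).get("category", "<none>")] += 1

print(counts.most_common())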

Data Splitting

By default, the pipeline splits your data into training and evaluation sets using a 90/10 split. You can adjust this:

python run.py --data-source=path/to/your/data.csv --split-ratio=0.8

System Prompt

You can set a custom system prompt that will be applied to all examples:

python run.py --data-source=path/to/your/data.csv --system-prompt="You are a customer service assistant for Ultra products."

Inspecting Model Details

# Get detailed information about an existing model
python run.py --openpipe-api-key=opk-your-api-key --model-name=my-model --fetch-details-only

🧩 How It Works

The integration leverages:

  1. ZenML's Pipeline Orchestration: Handles workflow DAGs, artifact tracking, and reproducibility
  2. OpenPipe's LLM Fine-Tuning: Provides state-of-the-art techniques for adapting foundation models

📖 Learn More

For more details on each platform, see the ZenML documentation and the OpenPipe documentation.

🤝 Contributing

Contributions are welcome! Please check out our contribution guidelines for details on how to get started.

🆘 Getting Help

Questions or issues? Join the ZenML Slack community or open an issue on this repository.

📜 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
