A composable component orchestrator for Reinforcement Learning from Verifiable Rewards (RLVR) training of Large Language Models on reasoning tasks.
- RLVR Specialization: Built for reasoning tasks with verifiable outcomes
- Zero-Code Configuration: Train models by modifying YAML configs only
- GRPO Implementation: Group Relative Policy Optimization
- Verifiable Rewards: Mathematical correctness, format compliance
- Modular Architecture: Swappable components for rapid experimentation
```bash
# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Install dependencies
poetry install
```
```bash
# Train on GSM8K with the default GRPO configuration
poetry run python run_pipeline.py

# Custom configuration
poetry run python run_pipeline.py \
    model.model_name_or_path="unsloth/gemma-3-1b-it" \
    train.max_steps=500 \
    train.learning_rate=1e-5
```
```bash
# Switch datasets
python run_pipeline.py data_component=finqa

# Adjust reward weights
python run_pipeline.py \
    'reward.reward_functions[0].params.correct_reward=5.0'

# Memory-efficient training
python run_pipeline.py \
    model.load_in_4bit=true \
    train.per_device_train_batch_size=8
```
The pipeline consists of six composable components:
- Data Component: Dataset loading and preprocessing (GSM8K, FinQA)
- Model Component: Model initialization, LoRA adapters, quantization
- Training Loop: GRPO algorithm implementation (see the sketch after this list)
- Reward Component: Verifiable reward functions for reasoning
- Evaluation Component: In-training and post-training evaluation
- Observer Component: Experiment tracking (W&B integration)
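To illustrate the GRPO idea referenced above, here is a minimal sketch (not this repository's actual implementation): GRPO samples a group of completions per prompt, scores each one with the verifiable reward functions, and normalizes rewards within the group to obtain advantages, avoiding a separate learned value model.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Illustrative only -- function and variable names are hypothetical,
# not taken from this repository.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within a group of completions for one prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four completions for the same prompt, scored by verifiable rewards
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```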
Supported datasets:
- GSM8K: Grade school math word problems
- FinQA: Financial reasoning and calculation
Built-in verifiable reward functions:
- Answer Matching: Numerical correctness verification
- Format Checking: Structured reasoning compliance
- Custom Rewards: Extensible reward function system (a minimal sketch follows below)
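To give a flavor of what a verifiable reward looks like, here is a minimal, hypothetical sketch; the names and signature are illustrative, not the repository's actual extension interface (see COMPONENT_GUIDE.md for that). The reward is computed deterministically by extracting the model's final numeric answer and comparing it to the ground truth.

```python
# Hypothetical answer-matching reward; names and signature are illustrative.
import re

def numeric_answer_reward(completion: str, ground_truth: str,
                          correct_reward: float = 1.0) -> float:
    """Return correct_reward if the last number in the completion
    matches the ground-truth answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    try:
        return correct_reward if float(numbers[-1]) == float(ground_truth) else 0.0
    except ValueError:
        return 0.0

print(numeric_answer_reward("So the answer is 42.", "42"))  # 1.0
```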
All behavior is controlled through Hydra YAML configurations:
```yaml
# conf/config.yaml
defaults:
  - data_component@data: default
  - model_component@model: default
  - training_loop_component@train: default
  - reward_component@reward: default
  - evaluation_component@eval: default
  - prompts@prompts: math_reasoning
```
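Hydra composes these config groups into a single configuration object at runtime. The following is a minimal sketch of how such a composed config is typically consumed; the actual entry point in run_pipeline.py may be structured differently.

```python
# Sketch of a Hydra entry point; illustrative, not the repository's actual code.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Each config group (data, model, train, reward, eval, prompts)
    # is available as an attribute of the composed config.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```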
See CONFIGURATION_GUIDE.md for detailed configuration options.
- ARCHITECTURE.md: System design and component details
- COMPONENT_GUIDE.md: Component development guide
- CONFIGURATION_GUIDE.md: Configuration reference
This pipeline is optimized for RLVR (Reinforcement Learning from Verifiable Rewards) rather than traditional RLHF. Key differences:
- Verifiable Outcomes: Rewards are computed from objective correctness rather than a learned preference model
- Reasoning Tasks: Focus on structured mathematical and logical reasoning
- Format Compliance: Structured thinking patterns are rewarded directly
- Deterministic Evaluation: Reward calculation is reproducible
- Fork the repository
- Create a feature branch
- Follow the component development patterns in COMPONENT_GUIDE.md
- Add tests for new components
- Submit a pull request
MIT License - see LICENSE file for details.
Built for reasoning. Configured for research. Optimized for results.