This project demonstrates parameter-efficient fine-tuning of transformer models using two methods, Low-Rank Adaptation (LoRA) and Low-Rank Structured Reparameterization (LSR), with Hydra configuration management for experiment tracking.
LoRA is a parameter-efficient fine-tuning technique that:
- Freezes the pre-trained model weights entirely
- Injects trainable rank decomposition matrices into each layer of the Transformer architecture
- Significantly reduces the number of parameters needed for fine-tuning (often by >99%)
For example, fine-tuning RoBERTa-base (125M parameters) with LoRA might only require training ~0.1-0.5M parameters while achieving comparable performance to full fine-tuning.
LSR extends LoRA by using Kronecker products to achieve even greater parameter efficiency:
- Assumes weight matrices can be factored (e.g., 768 = 32 × 24)
- Uses multiple Kronecker product terms to represent the adaptation
- Reduces parameter count substantially compared to standard LoRA
For example, where LoRA might use 24K parameters, LSR might use just 3.5K parameters (85% reduction) while maintaining comparable performance.
```
project_root/
│
├── src/                        # Source code
│   ├── models/                 # Model implementations
│   │   ├── __init__.py
│   │   ├── lora.py             # LoRA architecture implementation
│   │   └── lsr.py              # LSR architecture implementation
│   ├── data/                   # Data processing
│   │   ├── __init__.py
│   │   └── dataset.py          # Dataset loading and preprocessing
│   ├── training/               # Training utilities
│   │   ├── __init__.py
│   │   └── trainer.py          # Training configuration
│   └── utils/                  # Utility functions
│       ├── __init__.py
│       └── metrics.py          # Evaluation metrics
│
├── conf/                       # Hydra configuration
│   ├── config.yaml             # Main config
│   ├── model/                  # Model configurations
│   │   ├── roberta_lora.yaml   # LoRA configuration
│   │   └── roberta_lsr.yaml    # LSR configuration
│   ├── data/                   # Dataset configurations
│   │   └── mrpc.yaml           # MRPC dataset config
│   └── training/               # Training configurations
│       └── lora_training.yaml  # Training parameters
│
├── main.py                     # Main entry point
├── requirements.txt            # Dependencies
└── README.md                   # Documentation
```
The project follows a modular design with clear separation of concerns:
- Models: Implementations of parameter-efficient fine-tuning methods (LoRA and LSR)
- Data: Dataset loading, preprocessing, and tokenization
- Training: Training loop and optimization configuration
- Utils: Evaluation metrics and helper functions
- Conf: Hydra configuration files for all components
This modular structure makes it easy to:
- Add new parameter-efficient methods (like we did with LSR)
- Support different datasets beyond MRPC
- Experiment with various training configurations
- Track and compare results across different approaches
The project implements two different approaches to parameter-efficient fine-tuning, housed in a modular architecture that makes it easy to compare methods and extend to new techniques.
Both LoRA and LSR use a common adapter pattern where:
- Original pre-trained weights are frozen
- Small trainable modules are injected at strategic points
- A scaling factor balances the influence of the adapters
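A minimal sketch of this adapter pattern in PyTorch (illustrative only; the actual `LoRALinear` in `src/models/lora.py` may differ in details such as initialization):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative sketch: frozen base layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d, r))         # up-projection, zero-init
        self.scaling = alpha / r                         # adapter scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank path: W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```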
The implementations share a consistent interface:
- `load_lora_model()` and `load_lsr_model()` functions with parallel parameters
- `replace_linear_with_lora()` and `replace_linear_with_lsr()` for transformer modification
- Consistent parameter naming and scaling approaches
The system dynamically selects the fine-tuning method based on configuration:
```python
if cfg.model.adaptation_type == "lsr":
    model, tokenizer = load_lsr_model(...)
else:  # Default to LoRA
    model, tokenizer = load_lora_model(...)
```
Hydra configurations control all aspects of the methods:
- Which layers to adapt (query, key, value projections)
- Hyperparameters like rank, scaling factor, and number of terms
- Model selection and dataset choices
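As an illustration, a model config such as `conf/model/roberta_lora.yaml` might look roughly like the sketch below; the key names are assumptions inferred from the override examples in this README, not a copy of the actual file:

```yaml
# conf/model/roberta_lora.yaml (hypothetical sketch)
name: roberta-base
adaptation_type: lora
target_modules: [query, key, value]  # which projections to adapt
lora:
  r: 8        # low-rank dimension
  alpha: 16   # scaling factor
```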
To run with the default configuration (LoRA):

```bash
python main.py
```

To run with LSR instead:

```bash
python main.py model=roberta_lsr
```
This will:
- Load RoBERTa-base and apply the selected adaptation method
- Fine-tune on the MRPC dataset for 20 epochs
- Save the best model based on validation accuracy
- Log metrics and results to the output directory
With Hydra, you can easily override any configuration parameter:
```bash
# Change LoRA rank to 16
python main.py model.lora.r=16

# Try LSR with different parameters
python main.py model=roberta_lsr model.lsr.num_terms=8 model.lsr.r=4

# Use a different dataset
python main.py data=sst2

# Change multiple parameters
python main.py model.lora.r=16 training.num_train_epochs=10 training.learning_rate=1e-4
```
One of the key strengths of this project structure is the ability to systematically compare different parameter-efficient fine-tuning methods. Below are some example experiments that leverage Hydra's multirun capability.
Compare LoRA and LSR approaches with their default settings:
```bash
python main.py --multirun model=roberta_lora,roberta_lsr
```
This will run both methods sequentially with their default hyperparameters.
Compare different rank settings for both approaches:
```bash
# LoRA with different ranks
python main.py --multirun model=roberta_lora model.lora.r=4,8,16,32

# LSR with different ranks
python main.py --multirun model=roberta_lsr model.lsr.r=1,2,4,8

# LSR with different numbers of terms
python main.py --multirun model=roberta_lsr model.lsr.num_terms=4,8,16,32
```
You can even perform grid searches across different methods and parameters:
```bash
python main.py --multirun model=roberta_lora,roberta_lsr \
    training.learning_rate=1e-4,5e-4,1e-3
```
For a thorough comparison, we recommend:
- Start with comparable parameter budgets (e.g., LSR with r=2 vs. LoRA with r=8)
- Analyze not just final performance but also training dynamics
- Compare inference speed (which should be similar for both methods)
- Examine performance across different datasets and tasks
Understanding the mathematical foundations of these approaches helps explain their efficiency advantages.
In LoRA, for a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the adapted weight is

$$W = W_0 + \Delta W = W_0 + \frac{\alpha}{r} B A$$

Where:

- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable low-rank factors
- $r \ll \min(d, k)$ is the low-rank dimension (typically 4-32)
- $\alpha$ is the scaling factor that balances the adapter's influence

The total parameter count for the adaptation is $r \times (d + k)$.
LSR extends LoRA by using Kronecker products. For a weight matrix whose dimensions factor as $d = d_1 \times d_2$ and $k = k_1 \times k_2$ (e.g., $768 = 32 \times 24$), the adaptation is

$$\Delta W = \sum_{i=1}^{t} \left( B_{1i} A_{1i} \right) \otimes \left( B_{2i} A_{2i} \right)$$

Where:

- $A_{1i} \in \mathbb{R}^{r \times d_1}$, $A_{2i} \in \mathbb{R}^{r \times d_2}$
- $B_{1i} \in \mathbb{R}^{k_1 \times r}$, $B_{2i} \in \mathbb{R}^{k_2 \times r}$
- $\otimes$ denotes the Kronecker product
- $t$ is the number of terms in the sum (typically 4-32)

The parameter count becomes $t \times r \times (d_1 + d_2 + k_1 + k_2)$.
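To make the formulas concrete, here is a quick sanity check in plain Python reproducing the counts in the table below:

```python
d = k = 768
d1, d2 = 32, 24  # 768 factored as 32 x 24
k1, k2 = 32, 24

def lora_params(r):
    return r * (d + k)

def lsr_params(r, t):
    return t * r * (d1 + d2 + k1 + k2)

print(d * k)              # 589824 (full fine-tuning)
print(lora_params(8))     # 12288
print(lora_params(16))    # 24576
print(lsr_params(2, 16))  # 3584
print(lsr_params(4, 8))   # 3584
```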
For a 768×768 linear layer:
| Method | Configuration | Parameter Count | % of Original |
|---|---|---|---|
| Full Fine-tuning | - | 589,824 | 100% |
| LoRA | r=8 | 12,288 | 2.08% |
| LoRA | r=16 | 24,576 | 4.17% |
| LSR | r=2, t=16, factors=32×24 | 3,584 | 0.61% |
| LSR | r=4, t=8, factors=32×24 | 3,584 | 0.61% |
The vectorized implementation in this codebase ensures that, despite the mathematical complexity, LSR remains computationally efficient during both training and inference.
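For intuition, a sum of Kronecker products can be formed with one batched `einsum` instead of a Python loop over terms. The sketch below assumes the tensor shapes from the formulas above; it is not the codebase's actual function:

```python
import torch

def lsr_delta(B1, A1, B2, A2):
    """Compute delta_W = sum_i (B1[i] @ A1[i]) kron (B2[i] @ A2[i]).

    Assumed shapes (t terms, rank r):
      B1: (t, k1, r), A1: (t, r, d1)
      B2: (t, k2, r), A2: (t, r, d2)
    Returns a (k1*k2, d1*d2) matrix.
    """
    M1 = B1 @ A1  # (t, k1, d1): left Kronecker operand of each term
    M2 = B2 @ A2  # (t, k2, d2): right Kronecker operand of each term
    t, k1, d1 = M1.shape
    _, k2, d2 = M2.shape
    # einsum sums over the term index t while laying out the Kronecker blocks
    delta = torch.einsum('tab,tcd->acbd', M1, M2)
    return delta.reshape(k1 * k2, d1 * d2)

# Equivalent (slower) loop form, for reference:
# sum(torch.kron(B1[i] @ A1[i], B2[i] @ A2[i]) for i in range(t))
```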
Empirical results show the effectiveness of both parameter-efficient fine-tuning methods:
When fine-tuning RoBERTa-base on MRPC with LoRA (r=8), you should expect:
- Accuracy: ~85-88%
- F1 Score: ~89-92%
- Training Parameters: ~0.2% of full model parameters
- Training Time: Significantly faster than full fine-tuning
When fine-tuning RoBERTa-base on MRPC with LSR (r=2, num_terms=16), you should expect:
- Accuracy: ~84-87%
- F1 Score: ~88-91%
- Training Parameters: ~0.03% of full model parameters (approximately 7x fewer than LoRA)
- Training Time: Similar to LoRA, both much faster than full fine-tuning
A key insight is that LSR achieves nearly the same performance as LoRA while using a fraction of the parameters. This is particularly valuable for:
- Memory-constrained environments: When fine-tuning very large models on limited hardware
- Deployment scenarios: Where model size directly impacts inference costs
- Multitask adaptation: When maintaining multiple task-specific adaptations simultaneously
The optimal choice between LoRA and LSR depends on your specific constraints and requirements.
The modular design makes it easy to extend this framework in several directions:
To implement a new parameter-efficient method:
- Create a new implementation file in `src/models/` (e.g., `src/models/new_method.py`)
- Implement the core adapter class (following the pattern of `LoRALinear` or `LSRLinear`)
- Add functions to apply the method to a model (like `replace_linear_with_lora`)
- Create a model loader function (similar to `load_lora_model`)
- Add a configuration file in `conf/model/` (e.g., `roberta_new_method.yaml`)
- Update `main.py` to recognize and initialize the new method
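For example, a hypothetical `src/models/new_method.py` following this checklist might start like the skeleton below; the names and signatures are illustrative, mirroring the LoRA/LSR pattern rather than copying it:

```python
import torch.nn as nn

class NewMethodLinear(nn.Module):
    """Hypothetical adapter following the LoRALinear / LSRLinear pattern."""

    def __init__(self, base: nn.Linear, **method_kwargs):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pre-trained weights frozen
        # TODO: initialize the method's trainable parameters here

    def forward(self, x):
        # TODO: add the scaled adapter output to the frozen path
        return self.base(x)

def replace_linear_with_new_method(model: nn.Module, target_names, **kwargs):
    """Recursively swap matching nn.Linear modules for NewMethodLinear."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear) and name in target_names:
            setattr(model, name, NewMethodLinear(module, **kwargs))
        else:
            replace_linear_with_new_method(module, target_names, **kwargs)
```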
To support a new base model architecture:
- Create a new configuration file in `conf/model/`
- Implement any necessary model-specific code in `src/models/`
- Ensure the adaptation methods can properly target the right layers in the new architecture
To add support for new datasets:
- Create a new configuration file in `conf/data/`
- Ensure the dataset preprocessing in `src/data/dataset.py` handles the new dataset
- Consider adding dataset-specific metrics in `src/utils/metrics.py` if needed
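For instance, a new `conf/data/sst2.yaml` (to match the `data=sst2` override shown earlier) might look like the sketch below; the key names are assumptions modeled on what `mrpc.yaml` presumably contains:

```yaml
# conf/data/sst2.yaml (hypothetical sketch)
dataset_name: glue
dataset_config: sst2
text_fields: [sentence]  # SST-2 has a single text field; MRPC uses two
num_labels: 2
```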
To modify the training process:
- Update or create a new configuration in `conf/training/`
- Extend the trainer implementation in `src/training/trainer.py` if needed
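Finally, as a rough illustration, `conf/training/lora_training.yaml` might contain fields like the following; the key names are inferred from the `training.num_train_epochs` and `training.learning_rate` overrides above, and values are the defaults described in this README where available:

```yaml
# conf/training/lora_training.yaml (hypothetical sketch)
num_train_epochs: 20             # default run fine-tunes for 20 epochs
learning_rate: 5e-4              # illustrative value, not confirmed
metric_for_best_model: accuracy  # best checkpoint chosen by validation accuracy
```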