The student model is trained with the loss for the objective at hand (masked language modeling (MLM) for generic distillation, the downstream task loss for task-specific distillation), while a distillation loss pushes its output distribution to match the teacher's soft predictions.
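As an illustration (not the exact DistilBERT implementation), a minimal PyTorch sketch of such a combined objective, assuming `student_logits`, `teacher_logits`, and `labels` are already computed; the temperature `T` and the weights `alpha`/`beta` are hypothetical hyperparameters:

```python
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, labels,
                           T=2.0, alpha=0.5, beta=0.5):
    """Combine the hard-label loss (e.g. MLM cross-entropy) with a
    soft-target loss that matches the teacher's output distribution."""
    vocab_size = student_logits.size(-1)

    # Hard-label loss on the objective at hand (MLM or downstream task);
    # positions labeled -100 are ignored (Hugging Face convention, assumed here).
    hard_loss = F.cross_entropy(student_logits.reshape(-1, vocab_size),
                                labels.reshape(-1), ignore_index=-100)

    # Soft-target loss: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 as in Hinton et al.'s original formulation.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    return alpha * hard_loss + beta * soft_loss
```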
DistilBERT (HuggingFace, 2019)
- Student architecture: Better to reduce the number of layers than the hidden dimension
- Student initialization: Initialize the student from the teacher by taking one out of every two layers (see the sketch after this list)
- Distillation process: Distill on very large batches (up to 4K examples per batch), leveraging gradient accumulation and dynamic masking
- Data and compute: English Wikipedia and Toronto Book Corpus; trained on 8 16GB V100 GPUs for ~90 hours
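A minimal sketch of this initialization strategy, assuming BERT-style teacher/student modules exposing `.embeddings` and `.encoder.layer` (the attribute names and the even-layer selection are assumptions; DistilBERT's released code picks a slightly different subset of layers):

```python
def init_student_from_teacher(teacher, student):
    """Copy the embeddings and every other encoder layer of the teacher
    (layers 0, 2, 4, ...) into a student with half the depth."""
    student.embeddings.load_state_dict(teacher.embeddings.state_dict())
    for student_idx, teacher_idx in enumerate(range(0, len(teacher.encoder.layer), 2)):
        student.encoder.layer[student_idx].load_state_dict(
            teacher.encoder.layer[teacher_idx].state_dict()
        )
    return student
```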
Multilingual Distilled BERT (HuggingFace, 2020)
- 6 layers, 768 dimension, 12 heads, 134M parameters (vs. 177M for mBERT-base)
- Trained on concatenation of Wikipedia in 104 languages
Advanced Distillation - Patient Knowledge Distillation (Sun et al., 2019)
- Learn from intermediate layers, not only the final output layer
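A minimal sketch of the patient (intermediate-layer) loss: MSE between L2-normalized student and teacher representations, with a hypothetical layer mapping. The normalized-MSE form follows the PKD paper, but the details here are simplified:

```python
import torch.nn.functional as F

def patient_loss(student_hidden, teacher_hidden, layer_map):
    """MSE between L2-normalized intermediate representations.

    student_hidden / teacher_hidden: lists of [batch, hidden] tensors
    (e.g. the [CLS] vector of each layer); layer_map pairs each student
    layer with the teacher layer it should imitate.
    """
    loss = 0.0
    for s_idx, t_idx in layer_map:
        s = F.normalize(student_hidden[s_idx], dim=-1)
        t = F.normalize(teacher_hidden[t_idx], dim=-1)
        loss = loss + F.mse_loss(s, t)
    return loss / len(layer_map)

# Hypothetical "PKD-skip" mapping: a 6-layer student imitates
# teacher layers 2, 4, 6, 8, 10 with its first 5 layers.
layer_map = [(0, 2), (1, 4), (2, 6), (3, 8), (4, 10)]
```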
Distilling Monolingual Models from mBERT (Singh et al., 2022)
- First work distilling monolingual student models from the multilingual teacher (mBERT)
- Three losses: Distillation loss (NLL on the teacher's soft targets), cosine loss (directional similarity between student and teacher hidden states), MLM loss (standard cross-entropy on masked tokens)
- Reduce the vocabulary of the student model post-distillation (see the sketch after this list)
- Initialization from the teacher model improves performance
- Fine-tuned on downstream tasks (sentiment, topic classification, POS, NER)
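One way to realize the post-distillation vocabulary reduction is to keep only the embedding rows whose tokens occur in a target-language corpus. A minimal sketch under that assumption (how `kept_token_ids` is collected, the tokenizer surgery, and re-tying the output layer are left out):

```python
import torch
import torch.nn as nn

def shrink_embeddings(embedding: nn.Embedding, kept_token_ids: list[int]) -> nn.Embedding:
    """Build a smaller embedding matrix containing only the kept tokens.

    kept_token_ids: vocabulary ids observed in the target-language corpus
    (plus special tokens); their order defines the new token ids.
    """
    index = torch.tensor(kept_token_ids, dtype=torch.long)
    new_emb = nn.Embedding(len(kept_token_ids), embedding.embedding_dim)
    with torch.no_grad():
        new_emb.weight.copy_(embedding.weight[index])
    return new_emb
```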
Distilling Efficient Language-Specific Models for Cross-Lingual Transfer (Ansell et al., 2023)
- Bilingual distillation: Only source and target language
- Two-phase training:
- General phase - align hidden representations
- Task-specific phase - fine-tune student with task-adapted teacher
- Lottery-Ticket Sparse Fine-Tuning (LT-SFT) for efficient multi-task training
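A minimal sketch of the LT-SFT idea (not Ansell et al.'s implementation): fine-tune once, keep only the parameters that moved the most, then rewind and fine-tune again with everything else frozen via binary masks. Selecting per tensor and the 5% budget are simplifying assumptions; the original method selects the top-K parameters globally:

```python
import torch

def lt_sft_masks(pretrained_state, finetuned_state, keep_fraction=0.05):
    """Select the top-|delta| parameters per tensor as the sparse 'ticket'.

    Returns a dict of binary masks; during the second fine-tuning run,
    gradients are multiplied by these masks so that only the selected
    parameters are updated (a sparse difference vector w.r.t. the base model).
    """
    masks = {}
    for name, pre in pretrained_state.items():
        delta = (finetuned_state[name] - pre).abs().flatten()
        k = max(1, int(keep_fraction * delta.numel()))
        threshold = torch.topk(delta, k).values.min()
        masks[name] = (delta >= threshold).view(pre.shape).float()
    return masks
```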
The Privileged Students: On the Value of Initialization in Multilingual Knowledge Distillation (Wibowo et al., 2024)
- Initialization from fine-tuned teacher contributes the most
- MSE instead of KL Divergence → Faster convergence and higher performance
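A minimal sketch contrasting the two logit-matching losses (shapes and the temperature value are assumptions); the reported finding is that the plain MSE variant converges faster and performs better in this multilingual setting:

```python
import torch.nn.functional as F

def kd_loss_kl(student_logits, teacher_logits, T=2.0):
    # Classic soft-target loss: KL between temperature-scaled distributions.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

def kd_loss_mse(student_logits, teacher_logits):
    # Direct regression of the student's logits onto the teacher's.
    return F.mse_loss(student_logits, teacher_logits)
```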