This is the repository for the NAACL 2024 paper "MT-Patcher: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation" (https://arxiv.org/abs/2403.09522).
It contains the code of our latest work on efficiently distilling knowledge from LLMs into MT models. In this work,
- We identify two problems with traditional KD in the LLM era:
  - Non-selective: the student model has already mastered a reasonable amount of knowledge, yet traditional KD still repeatedly finetunes it on knowledge it already knows
  - Non-extendable: traditional KD cannot transfer knowledge that occurs only rarely in the monolingual data
- We propose to leverage the LLM's language ability to transfer knowledge more efficiently:
  - we convert the LLM into a translation Feedbacker, providing feedback on the student's translations, and keep only the erroneous ones for training the student model (a minimal sketch of this selective filtering follows this list)
  - we convert the LLM into a Synthesis Model, synthesizing multiple diverse contexts for the errors the student fails on
  - we convert the LLM into an Analogy Model, anticipating potential errors from the errors the student fails on
- Experiments show that we can achieve comparable translation quality by distilling only 10% of the data, and that enlarging the dataset with the Synthesis Model and Analogy Model further improves performance.
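To make the "selective" part concrete, below is a minimal, self-contained sketch of keeping only the examples where the Feedbacker reported errors. This is an illustration, not the repo's implementation: the record fields (src_text, tgt_text, feedback) and the file name are hypothetical placeholders.

import json

def select_error_cases(records):
    """Keep only examples where the Feedbacker reported at least one error."""
    # Each record is assumed (hypothetically) to carry:
    #   src_text  - source sentence
    #   tgt_text  - student translation
    #   feedback  - list of error descriptions from the Feedbacker
    return [r for r in records if len(r.get("feedback", [])) > 0]

if __name__ == "__main__":
    # Hypothetical file of student translations annotated with Feedbacker output.
    with open("student_translations_with_feedback.jsonl") as f:
        records = [json.loads(line) for line in f]
    patches = select_error_cases(records)
    print(f"kept {len(patches)} of {len(records)} examples for finetuning")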
Our code is built on huggingface transformers.
- python>=3.8.0
- pytorch>=2.1.0
- vllm (optional)
We recommend using vLLM to accelerate inference.
To build an MT-Patcher, we first need to collect the student model's translations of a monolingual corpus. An example script is:
bash sh_scripts/inference.sh \
--model wxjiao/ParroT-7b \
--tokenizer wxjiao/ParroT-7b \
--infile $monolingual_path \
--savefile $TRANSLATION_PATH \
--num_gpus 8 \
--task translation \
--output_key tgt_text \
--use_vllm 1 \
--srclang en --tgtlang de
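As a quick sanity check, you can peek at the generated translations. The snippet below is a sketch that assumes the save file is in JSON-lines format with the translation stored under the tgt_text key (the value passed as --output_key above); the exact schema is defined by the repo's inference script.

import json

# Hypothetical path; use the value you passed as --savefile above.
with open("student_translations.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record["tgt_text"])  # translation field named by --output_key
        if i >= 4:  # only peek at the first few examples
            break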
Next, we need to collect GPT-4's demonstration data for executing the MT-Patcher pipeline. To do this, use the prompts provided in the prompts/ directory to collect the data from GPT-4.
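For illustration only, here is a minimal sketch of querying GPT-4 with one of the prompt templates, assuming the OpenAI Python SDK (openai>=1.0). The template file name and its {src}/{hyp} placeholders are hypothetical; the actual prompts live in prompts/.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical template name and placeholders; adapt to the files in prompts/.
template = open("prompts/feedback_prompt.txt").read()
prompt = template.format(src="An example source sentence.",
                         hyp="Ein Beispielsatz als Studentenübersetzung.")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)  # GPT-4 demonstration to save as training data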
After collecting the demonstration data from GPT-4, we can train our MT-Patcher using the following command:
bash sh_scripts/deepspeed_run.sh \
--train_file $COMBINED_GPT4_DATA \
--model_name_or_path Llama2-13b \
--batch_size 4 \
--update_freq 8 \
--output_dir $PATCHER_MODEL \
--devices 0,1,2,3,4,5,6,7
We can then run the full MT-Patcher pipeline on the student's translations:
bash sh_scripts/pipeline.sh \
--data_file $TRANSLATION_PATH \
--savefile $PATCH_DATA_PATH \
--num_gpus 8 \
--srclang en --tgtlang de \
--patcher_model $PATCHER_MODEL \
--generate_feedback 1 --generate_case 1 --generate_analogy 1
We can then gather all the patch data generated by the pipeline and finetune the student model on it.
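Assuming the pipeline writes separate JSON-lines files for feedback, synthesized cases, and analogies (the file names below are hypothetical placeholders), merging them into a single training file could look like:

import json

# Hypothetical output file names; use the files actually produced under $PATCH_DATA_PATH.
parts = ["feedback.jsonl", "synthesized_cases.jsonl", "analogies.jsonl"]

with open("combined_patch_data.jsonl", "w") as out:
    for path in parts:
        with open(path) as f:
            for line in f:
                out.write(json.dumps(json.loads(line)) + "\n")  # validate and re-serialize each record

The combined file corresponds to $COMBINED_PATCH_DATA_PATH in the command below.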
bash sh_scripts/deepspeed_run.sh \
--train_file $COMBINED_PATCH_DATA_PATH \
--model_name_or_path wxjiao/ParroT-7b \
--batch_size 4 \
--update_freq 8 \
--output_dir $FINETUNED_STUDENT_MODEL \
--devices 0,1,2,3,4,5,6,7
If you find this repo useful, please feel free to leave a star and cite our paper:
@misc{li2024mtpatcher,
title={MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation},
author={Jiahuan Li and Shanbo Cheng and Shujian Huang and Jiajun Chen},
year={2024},
eprint={2403.09522},
archivePrefix={arXiv},
primaryClass={cs.CL}
}