# Google-MT5-For-Everything

Train and use Google MT5 Transformers models of any size for any input-output NLP task, using PyTorch Lightning, Hugging Face, CUDA and Streamlit.
## Table of Contents

- [Installation](#installation)
- [Structure](#structure)
- [Usage](#usage)
- [Contact](#contact)

## Installation
- Clone the repo
```shell
git clone https://github.com/DvdNss/Google-MT5-For-Everything
```

- Install requirements
```shell
pip install -r requirements.txt
```

## Structure

- `data/`: folder containing train/valid files.
  - `train_example.tsv`: training file example.
  - `valid_example.tsv`: validation file example.
- `model/`: folder containing models.
- `notebook/`: folder containing jupyter notebooks.
  - `Google-MT5-For-Everything.ipynb`: jupyter notebook.
- `resource/`: folder containing the repo's images.
- `source/`: folder containing source files.
  - `datamodule.py`: data module script.
  - `inference.py`: inference script.
  - `mt5.py`: model script.
  - `train.py`: training script.
- `tokenizer/`: folder containing the tokenizer.
- `app.py`: Streamlit web app script.
- `LICENSE`
- `README.md`
- `requirements.txt`
## Usage

- Build `train.tsv` and `valid.tsv` files, each with 2 columns (one for inputs and one for outputs) separated by `\t`. Inputs must be in the format `task: input`. See the examples below, followed by a short sketch of one way to write such a file.
| Inputs | Outputs |
|---|---|
| translate: What is your name? | Quel est ton nom? |
| paraphrase: I hate spiders | I dislike spiders |
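Not part of the repo, just a minimal sketch of how such a file can be produced with Python's standard `csv` module; the `inputs`/`outputs` column names are chosen to match the ones passed to `DataModule` below.

```python
import csv

# Task-prefixed input/output pairs, as in the table above
rows = [
    ('translate: What is your name?', 'Quel est ton nom?'),
    ('paraphrase: I hate spiders', 'I dislike spiders'),
]

with open('data/train.tsv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['inputs', 'outputs'])  # header row with the column names
    writer.writerows(rows)
```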
- Train a model (see the `train.py` script). You can also use the command line `python source/train.py` with your own arguments.
```python
from pytorch_lightning import seed_everything, Trainer

from source.datamodule import DataModule
from source.mt5 import MT5

# Seed everything for reproducibility
seed_everything(42)

# Build the data module from the train/valid TSV files
data = DataModule(
    train_file='data/train.tsv',
    valid_file='data/valid.tsv',
    inputs_col='inputs',
    outputs_col='outputs',
    input_max_length=512,
    output_max_length=128,
    batch_size=12,
    added_tokens=['<hl>', '<sep>']
)

# Load the base model and train it
model = MT5(model_name_or_path='google/mt5-small', learning_rate=5e-4)
Trainer(max_epochs=10, gpus=1, default_root_dir='model/').fit(model=model, datamodule=data)
# Models will be saved in model/lightning_logs every epoch.
```
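PyTorch Lightning writes its checkpoints under `default_root_dir/lightning_logs/version_*/checkpoints/`. A minimal sketch (not part of the repo) to grab the most recent `.ckpt` for the inference step below:

```python
from pathlib import Path

# Most recently written checkpoint under model/lightning_logs
ckpts = sorted(Path('model/lightning_logs').rglob('*.ckpt'), key=lambda p: p.stat().st_mtime)
path_to_checkpoint = str(ckpts[-1])  # newest .ckpt, e.g. under model/lightning_logs/version_0/checkpoints/
```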
- Use a model for inference (see the `inference.py` script).
```python
from transformers import AutoTokenizer

from source.mt5 import MT5

# Load the trained model and its tokenizer
model = MT5.load_from_checkpoint('path_to_checkpoint.ckpt').eval().cuda()
model.tokenizer = AutoTokenizer.from_pretrained('tokenizer', use_fast=True)

inputs = ['question: Who is the French president? context: Emmanuel Macron is the French president.']
# Prediction
prediction = model.predict(inputs=inputs)
print(prediction)  # --> Emmanuel Macron
```
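Since `predict` takes a list of prompts, one trained model can serve several task prefixes at once. A small sketch, assuming the checkpoint was trained on both tasks from the table above:

```python
# One model, several task prefixes (assumes both tasks were in the training data)
inputs = [
    'translate: What is your name?',
    'paraphrase: I hate spiders',
]
print(model.predict(inputs=inputs))
```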
- Use the Streamlit web app after updating `path_to_checkpoint` in `app.py` with your model path (a hypothetical sketch of what `app.py` might look like is shown below), then launch it with the command that follows.
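The actual `app.py` ships with the repo; purely as an illustration, a minimal Streamlit front-end over `MT5.predict` could look like this (everything here beyond `MT5`, `predict` and the `tokenizer/` folder is an assumption):

```python
import streamlit as st
from transformers import AutoTokenizer

from source.mt5 import MT5

path_to_checkpoint = 'path_to_checkpoint.ckpt'  # update with your model path


@st.cache_resource  # load the model once instead of on every rerun
def load_model():
    model = MT5.load_from_checkpoint(path_to_checkpoint).eval().cuda()
    model.tokenizer = AutoTokenizer.from_pretrained('tokenizer', use_fast=True)
    return model


model = load_model()

st.title('Google-MT5-For-Everything')
text = st.text_input('Input (format: task: input)', 'translate: What is your name?')
if text:
    st.write(model.predict(inputs=[text]))
```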
```shell
streamlit run app.py
```

## Contact

David NAISSE - @LinkedIn - [email protected]