This repository supports data augmentation using GPT-4o-mini for NVIDIA's neural text normalization models. Generated sample data is available in the `sample_data` directory.
Note: NVIDIA provides a pretrained model trained on the Google Text Normalization Dataset, so check that first.
```bash
pip install -r requirements.txt
```
Fill in `api_key.txt` with your OpenAI API key.
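For reference, this is roughly how a script consumes such a key file; the snippet below is a minimal sketch, not the repository's actual loading code:

```python
from openai import OpenAI

# Read the key from api_key.txt (file name from the setup step above;
# the repository's own loading logic may differ).
with open("api_key.txt") as f:
    api_key = f.read().strip()

client = OpenAI(api_key=api_key)
```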
This repository supports two modes of data augmentation:

```bash
# Augment from your own text file, which contains one sentence to normalize per line.
python augment_data.py --input_path <YOUR_TXT_PATH>

# Let GPT-4o-mini generate challenging sentences from scratch.
python augment_data.py --augment_from_scratch --sentence_num_from_scratch <TOTAL SENTENCES TO GENERATE>
```
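For the first mode, the input file is plain text with one unnormalized sentence per line; the contents below are purely illustrative:

```
The meeting is on 12/05/2024 at 3:30 PM.
She paid $1,250.99 for 2 kg of saffron.
Call 1-800-555-0199 before Jan. 3rd.
```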
In addition, if your OpenAI API rate limit is high enough, parallel augmentation is supported:

```bash
python augment_data.py --workers <NUMBER OF PARALLEL THREADS>
```
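The flags can be combined; for example (the path and worker count are illustrative, assuming `--workers` composes with the other options):

```bash
python augment_data.py --input_path my_sentences.txt --workers 8
```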
For further details, check the help messages via `python augment_data.py --help`.
Basically, follow the instructions in the training script. Since our pipeline generates Google-style `.tsv` data, it can be used directly.
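For reference, the Google-style format tags each token with its semiotic class in tab-separated columns (shown here with spaces for readability); `<self>` means the token is left unchanged, `sil` marks silence for punctuation, and an `<eos>` line separates sentences. The sample below is illustrative:

```
PLAIN   the          <self>
PLAIN   meeting      <self>
PLAIN   is           <self>
PLAIN   on           <self>
DATE    12/05/2024   december fifth twenty twenty four
PUNCT   .            sil
<eos>   <eos>
```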
We recommend following the NeMo installation instructions:
```bash
conda create --name nemo python==3.10.12
conda activate nemo
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install nemo_toolkit['nlp']
```
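A quick import check can confirm that the installation succeeded (a minimal sanity check, not part of the official instructions):

```bash
python -c "import nemo; import nemo.collections.nlp; print(nemo.__version__)"
```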
Due to NeMo's notorious version incompatibility, it is highly likely that you will run into package version errors. We recommend downgrading `huggingface_hub` and `transformers` if any problems occur.
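For example (the version pins below are illustrative, not tested recommendations; pick versions compatible with your NeMo release):

```bash
pip install "huggingface_hub==0.20.3" "transformers==4.36.2"
```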
Furthermore, if you run into a `val_loss` logging error, consider adding `self.log("val_loss", val_loss)` at line 163 of `~/.conda/envs/nemo/lib/python3.10/site-packages/nemo/collections/nlp/models/duplex_text_normalization/duplex_decoder.py`.
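The added line belongs in the decoder's validation step; the snippet below is only a sketch (the method body and helper name are hypothetical, and the exact context varies across NeMo versions):

```python
# Sketch of the patched region in duplex_decoder.py -- not the real file.
def validation_step(self, batch, batch_idx, dataloader_idx=0):
    val_loss = self._compute_validation_loss(batch)  # hypothetical helper
    self.log("val_loss", val_loss)  # <-- the line the fix adds
    return val_loss
```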
Configure your training settings in `~/NeMo/examples/nlp/duplex_text_normalization/conf/duplex_tn_config.yaml`.
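The fields you will most likely need to edit are the data paths; the snippet below is a sketch (key names follow the NeMo example config, but verify them against the YAML shipped with your NeMo version):

```yaml
mode: joint   # train the tagger and decoder together
lang: en
data:
  train_ds:
    data_path: /path/to/train.tsv   # point at the augmented Google-style .tsv
  validation_ds:
    data_path: /path/to/dev.tsv
```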
Then simply run:

```bash
python ~/NeMo/examples/nlp/duplex_text_normalization/duplex_text_normalization_train.py
```