This repository supports data augmentation using GPT-4o-mini for NVIDIA's neural text normalization models. Generated sample data is available in the `sample_data` directory.
Note: NVIDIA provides a pretrained model trained on the Google Text Normalization Dataset, so check that first.
```bash
pip install -r requirements.txt
```
Fill in `api_key.txt` with your OpenAI API key.
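For reference, this is roughly how a script consumes such a key file; the snippet below is a minimal sketch, not the repository's actual loading code:

```python
from openai import OpenAI

# Read the key from api_key.txt (file name from the setup step above;
# the repository's own loading logic may differ).
with open("api_key.txt") as f:
    api_key = f.read().strip()

client = OpenAI(api_key=api_key)
```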
This repository supports two modes of data augmentation:

```bash
# Augment from your own text file, which contains one sentence to normalize per line.
python augment_data.py --input_path <YOUR_TXT_PATH>

# Let GPT-4o-mini generate challenging sentences from scratch.
python augment_data.py --augment_from_scratch --sentence_num_from_scratch <TOTAL SENTENCES TO GENERATE>
```
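For the first mode, the input file is plain text with one unnormalized sentence per line; the contents below are purely illustrative:

```
The meeting is on 12/05/2024 at 3:30 PM.
She paid $1,250.99 for 2 kg of saffron.
Call 1-800-555-0199 before Jan. 3rd.
```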
In addition, if your OpenAI API rate limit is high enough, parallel augmentation is supported:

```bash
python augment_data.py --workers <NUMBER OF PARALLEL THREADS>
```
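The flags can be combined; for example (the path and worker count are illustrative, assuming `--workers` composes with the other options):

```bash
python augment_data.py --input_path my_sentences.txt --workers 8
```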
For further details, check the help messages via `python augment_data.py --help`.
Basically, follow the instructions in the training script. Since our pipeline generates Google-style `.tsv` data, it can be used directly.
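For reference, the Google-style format tags each token with its semiotic class in tab-separated columns (shown here with spaces for readability); `<self>` means the token is left unchanged, `sil` marks silence for punctuation, and an `<eos>` line separates sentences. The sample below is illustrative:

```
PLAIN   the          <self>
PLAIN   meeting      <self>
PLAIN   is           <self>
PLAIN   on           <self>
DATE    12/05/2024   december fifth twenty twenty four
PUNCT   .            sil
<eos>   <eos>
```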
We recommend following the NeMo installation instructions:
```bash
conda create --name nemo python==3.10.12
conda activate nemo
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install nemo_toolkit['nlp']
```
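A quick import check can confirm that the installation succeeded (a minimal sanity check, not part of the official instructions):

```bash
python -c "import nemo; import nemo.collections.nlp; print(nemo.__version__)"
```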
Due to NeMo's notorious version incompatibility, it is highly likely that you will run into package version errors. We recommend downgrading `huggingface_hub` and `transformers` if any problems occur.
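For example (the version pins below are illustrative, not tested recommendations; pick versions compatible with your NeMo release):

```bash
pip install "huggingface_hub==0.20.3" "transformers==4.36.2"
```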
Furthermore, if you run into a `val_loss` logging error, consider adding `self.log("val_loss", val_loss)` at line 163 of `~/.conda/envs/nemo/lib/python3.10/site-packages/nemo/collections/nlp/models/duplex_text_normalization/duplex_decoder.py`.
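The added line belongs in the decoder's validation step; the snippet below is only a sketch (the method body and helper name are hypothetical, and the exact context varies across NeMo versions):

```python
# Sketch of the patched region in duplex_decoder.py -- not the real file.
def validation_step(self, batch, batch_idx, dataloader_idx=0):
    val_loss = self._compute_validation_loss(batch)  # hypothetical helper
    self.log("val_loss", val_loss)  # <-- the line the fix adds
    return val_loss
```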
Configure your training settings in `~/NeMo/examples/nlp/duplex_text_normalization/conf/duplex_tn_config.yaml`.
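The fields you will most likely need to edit are the data paths; the snippet below is a sketch (key names follow the NeMo example config, but verify them against the YAML shipped with your NeMo version):

```yaml
mode: joint   # train the tagger and decoder together
lang: en
data:
  train_ds:
    data_path: /path/to/train.tsv   # point at the augmented Google-style .tsv
  validation_ds:
    data_path: /path/to/dev.tsv
```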
Then simply run:

```bash
python ~/NeMo/examples/nlp/duplex_text_normalization/duplex_text_normalization_train.py
```