
Add DAPT tutorial #11664

Status: Open. Wants to merge 11 commits into `main`.
3 changes: 2 additions & 1 deletion .github/workflows/code-formatting.yml
```diff
@@ -139,11 +139,12 @@ jobs:
         echo "Will run on these files:
         ${FILTERED[@]}"
 
-        set +xe
+        set +e
         LOG=$(pylint ${FILTERED[@]})
         EXIT_CODE=$?
         set -e
 
+        set +x
         echo "OUTPUT<<EOF" >> $GITHUB_ENV
         echo "$LOG" >> $GITHUB_ENV
         echo "EOF" >> $GITHUB_ENV
```
3 changes: 3 additions & 0 deletions tutorials/llm/llama-3/README.rst
```diff
@@ -23,3 +23,6 @@ This repository contains Jupyter Notebook tutorials using the NeMo Framework for
    * - `Llama3 LoRA Fine-Tuning and Supervised Fine-Tuning using NeMo2 <./nemo2-sft-peft>`_
      - `SQuAD <https://arxiv.org/abs/1606.05250>`_ for LoRA and `Databricks-dolly-15k <https://huggingface.co/datasets/databricks/databricks-dolly-15k>`_ for SFT
      - Perform LoRA PEFT and SFT on Llama 3 8B using NeMo 2.0
+   * - `Llama3 Domain Adaptive Pre-Training <./dapt>`_
+     - `Domain-Specific Data <https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation>`_
+     - Perform Domain Adaptive Pre-Training on Llama 3 8B using NeMo 2.0
```
56 changes: 56 additions & 0 deletions tutorials/llm/llama-3/dapt/README.md
# Training Code for DAPT (Domain Adaptive Pre-Training)

[ChipNeMo](https://arxiv.org/pdf/2311.00176) is a chip-design domain-adapted LLM. Instead of directly deploying off-the-shelf commercial or open-source LLMs, the paper adopts the following domain-adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pretraining, model alignment with domain-specific instructions, and domain-adapted retrieval models.

Here we share a tutorial with best practices for DAPT (domain-adaptive pre-training).

If the data is not ready, use [Step0_Dummy_Data.ipynb](./Step0_Dummy_Data.ipynb) to create dummy data.
To use real data, refer to the “Prepare Real Dataset” section.
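For illustration, a pre-training sample is commonly one JSON object per line with a single `text` field. A minimal dummy-data generator might look like the following sketch (the filename, vocabulary, and schema are assumptions for illustration, not taken from the notebook):

```python
import json
import random

# Illustrative only: schema and filename are assumptions. One JSON object
# per line with a "text" field is a common pre-training corpus format.
random.seed(0)
vocab = ["chip", "design", "verilog", "layout", "timing", "synthesis"]

with open("dummy_train.jsonl", "w") as f:
    for _ in range(100):
        sentence = " ".join(random.choices(vocab, k=20))
        f.write(json.dumps({"text": sentence}) + "\n")
```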

Then, sequentially proceed with [Step1_DAP.ipynb](./Step1_DAP.ipynb), [Step2_Alignment.ipynb](./Step2_Alignment.ipynb), and [Step3_Domain_Retrieval.ipynb](./Step3_Domain_Retrieval.ipynb).

## Prepare Real Dataset
For quick testing you can instead use dummy data; see [Step0_Dummy_Data.ipynb](./Step0_Dummy_Data.ipynb).

### Domain-Specific Data

```bash
# System dependencies for PDF extraction and OCR used by the curation pipeline
apt-get update
apt-get install -y poppler-utils tesseract-ocr

# Python dependencies
pip install opencv-python==4.5.5.64
pip install nltk==3.8.1
```

For the data curation pipeline itself, see the [NeMo-Curator DAPT tutorial](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation).

### General Purpose Data: Wiki

```bash
# Download the latest English Wikipedia dump (large; tens of GB compressed)
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# Extract article text as JSON lines, then concatenate into one file
pip install wikiextractor
python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --json
find text -name 'wiki_*' -exec cat {} \; > train_data.jsonl
```
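As a quick sanity check on the concatenated output, the first few records can be inspected with a short sketch like this (WikiExtractor's `--json` mode emits one object per line with fields such as `id`, `title`, and `text`; the helper name is ours):

```python
import json
from pathlib import Path

def sample_docs(path, n=3):
    """Return (title, character count of text) for the first n JSON-lines records."""
    out = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            doc = json.loads(line)
            out.append((doc.get("title", ""), len(doc.get("text", ""))))
    return out

# Only inspect if the extraction step above has been run
if Path("train_data.jsonl").exists():
    print(sample_docs("train_data.jsonl"))
```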

### Model Alignment: OASST & HelpSteer Data

```bash
# Convert the OpenAssistant (OASST) and HelpSteer datasets to SteerLM format
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=/work/Data/oasst
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer_data.py --output_directory=/work/Data/helpsteer

# Merge the two datasets; the awk step prints each training line four times (oversampling)
cat /work/Data/oasst/train.jsonl /work/Data/helpsteer/train.jsonl | awk '{for(i=1;i<=4;i++) print}' > /work/Data/merge_steerlm_train.jsonl
cat /work/Data/oasst/val.jsonl /work/Data/helpsteer/val.jsonl > /work/Data/merge_steerlm_val.jsonl
rm -rf /work/Data/oasst /work/Data/helpsteer

# Convert the merged files to regression-model format
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
    --input-file=/work/Data/merge_steerlm_train.jsonl \
    --output-file=/work/Data/merge_steerlm_train_reg.jsonl

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
    --input-file=/work/Data/merge_steerlm_val.jsonl \
    --output-file=/work/Data/merge_steerlm_val_reg.jsonl
```
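The `awk '{for(i=1;i<=4;i++) print}'` step above duplicates every training line four times, a simple form of oversampling applied to the merged training set. As a sketch, the same transform in Python (the function name is ours):

```python
def upsample_lines(lines, factor=4):
    """Repeat each line `factor` times, mirroring the awk oversampling step."""
    out = []
    for line in lines:
        out.extend([line] * factor)
    return out
```

Applied to the concatenated OASST and HelpSteer training lines, this yields the same content as the merged training file before the regression-format conversion.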