
Add DAPT tutorial #11664

Status: Open. Wants to merge 11 commits into `main`.
3 changes: 2 additions & 1 deletion .github/workflows/code-formatting.yml
```diff
@@ -139,11 +139,12 @@ jobs:
         echo "Will run on these files:
         ${FILTERED[@]}"
 
-        set +xe
+        set +e
         LOG=$(pylint ${FILTERED[@]})
         EXIT_CODE=$?
         set -e
 
+        set +x
         echo "OUTPUT<<EOF" >> $GITHUB_ENV
         echo "$LOG" >> $GITHUB_ENV
         echo "EOF" >> $GITHUB_ENV
```
3 changes: 3 additions & 0 deletions tutorials/llm/llama-3/README.rst
```diff
@@ -23,3 +23,6 @@ This repository contains Jupyter Notebook tutorials using the NeMo Framework for
    * - `Llama3 LoRA Fine-Tuning and Supervised Fine-Tuning using NeMo2 <./nemo2-sft-peft>`_
      - `SQuAD <https://arxiv.org/abs/1606.05250>`_ for LoRA and `Databricks-dolly-15k <https://huggingface.co/datasets/databricks/databricks-dolly-15k>`_ for SFT
      - Perform LoRA PEFT and SFT on Llama 3 8B using NeMo 2.0
+   * - `Llama3 Domain Adaptive Pre-Training <./dapt>`_
+     - `Domain-Specific Data <https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation>`_
+     - Perform Domain Adaptive Pre-Training on Llama 3 8B using NeMo 2.0
```
56 changes: 56 additions & 0 deletions tutorials/llm/llama-3/dapt/README.md
# Training Code for DAPT (Domain Adaptive Pre-Training)

[ChipNeMo](https://arxiv.org/pdf/2311.00176) is a chip-design domain-adapted LLM. Instead of directly deploying off-the-shelf commercial or open-source LLMs, the paper adopts the following domain-adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pretraining, model alignment with domain-specific instructions, and domain-adapted retrieval models.

Here we share a tutorial with best practices for DAPT (domain-adaptive pre-training).

If the data is not ready, use [Step0_Dummy_Data.ipynb](./Step0_Dummy_Data.ipynb) to create dummy data.
To use real data, refer to the “Prepare Real Dataset” section.
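For illustration, a pre-training sample is commonly one JSON object per line with a single `text` field. A minimal dummy-data generator might look like the following sketch (the filename, vocabulary, and schema are assumptions for illustration, not taken from the notebook):

```python
import json
import random

# Illustrative only: schema and filename are assumptions. One JSON object
# per line with a "text" field is a common pre-training corpus format.
random.seed(0)
vocab = ["chip", "design", "verilog", "layout", "timing", "synthesis"]

with open("dummy_train.jsonl", "w") as f:
    for _ in range(100):
        sentence = " ".join(random.choices(vocab, k=20))
        f.write(json.dumps({"text": sentence}) + "\n")
```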

Then, sequentially proceed with [Step1_DAP.ipynb](./Step1_DAP.ipynb), [Step2_Alignment.ipynb](./Step2_Alignment.ipynb), and [Step3_Domain_Retrieval.ipynb](./Step3_Domain_Retrieval.ipynb).

## Prepare Real Dataset
For quick testing you can instead use dummy data; see [Step0_Dummy_Data.ipynb](./Step0_Dummy_Data.ipynb).

### Domain-Specific Data

```bash
# System dependencies for PDF extraction and OCR used by the curation pipeline
apt-get update
apt-get install -y poppler-utils tesseract-ocr

# Python dependencies
pip install opencv-python==4.5.5.64
pip install nltk==3.8.1
```

For the data curation pipeline itself, see the [NeMo-Curator DAPT tutorial](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation).

### General Purpose Data: Wiki

```bash
# Download the latest English Wikipedia dump (large; tens of GB compressed)
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# Extract article text as JSON lines, then concatenate into one file
pip install wikiextractor
python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --json
find text -name 'wiki_*' -exec cat {} \; > train_data.jsonl
```
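As a quick sanity check on the concatenated output, the first few records can be inspected with a short sketch like this (WikiExtractor's `--json` mode emits one object per line with fields such as `id`, `title`, and `text`; the helper name is ours):

```python
import json
from pathlib import Path

def sample_docs(path, n=3):
    """Return (title, character count of text) for the first n JSON-lines records."""
    out = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            doc = json.loads(line)
            out.append((doc.get("title", ""), len(doc.get("text", ""))))
    return out

# Only inspect if the extraction step above has been run
if Path("train_data.jsonl").exists():
    print(sample_docs("train_data.jsonl"))
```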

### Model Alignment: OASST & HelpSteer Data

```bash
# Convert the OpenAssistant (OASST) and HelpSteer datasets to SteerLM format
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=/work/Data/oasst
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer_data.py --output_directory=/work/Data/helpsteer

# Merge the two datasets; the awk step prints each training line four times (oversampling)
cat /work/Data/oasst/train.jsonl /work/Data/helpsteer/train.jsonl | awk '{for(i=1;i<=4;i++) print}' > /work/Data/merge_steerlm_train.jsonl
cat /work/Data/oasst/val.jsonl /work/Data/helpsteer/val.jsonl > /work/Data/merge_steerlm_val.jsonl
rm -rf /work/Data/oasst /work/Data/helpsteer

# Convert the merged files to regression-model format
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
    --input-file=/work/Data/merge_steerlm_train.jsonl \
    --output-file=/work/Data/merge_steerlm_train_reg.jsonl

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
    --input-file=/work/Data/merge_steerlm_val.jsonl \
    --output-file=/work/Data/merge_steerlm_val_reg.jsonl
```
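The `awk '{for(i=1;i<=4;i++) print}'` step above duplicates every training line four times, a simple form of oversampling applied to the merged training set. As a sketch, the same transform in Python (the function name is ours):

```python
def upsample_lines(lines, factor=4):
    """Repeat each line `factor` times, mirroring the awk oversampling step."""
    out = []
    for line in lines:
        out.extend([line] * factor)
    return out
```

Applied to the concatenated OASST and HelpSteer training lines, this yields the same content as the merged training file before the regression-format conversion.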