Description
Describe the bug
In the NeMo 25.02 container, converting a StarCoder2 (SC2) checkpoint that was trained with sequence parallelism using the NeMo checkpoint converter script `NeMo/scripts/checkpoint_converters/convert_starcoder2_nemo_to_hf.py` crashes with a `Can not use sequence paralllelism without tensor parallelism` error.
This happens because the script hardcodes `model_config.tensor_model_parallel_size = 1`, while sequence parallelism requires TP > 1; for certain GPU/model-size configurations one needs a higher TP size.
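A possible fix would be to expose the TP size as a command-line flag instead of hardcoding it. Below is a minimal sketch of that idea; the flag name, the `apply_parallel_config` helper, and the config fields are hypothetical illustrations, not the converter script's actual API:

```python
from types import SimpleNamespace
import argparse


def parse_args(argv=None):
    # Hypothetical `--tensor_model_parallel_size` flag; the stock script
    # fixes this value to 1 instead of reading it from the user.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--tensor_model_parallel_size", type=int, default=1,
        help="TP size the .nemo checkpoint was trained with",
    )
    return parser.parse_args(argv)


def apply_parallel_config(model_config, args):
    # Honor the checkpoint's TP size rather than the fixed
    # `model_config.tensor_model_parallel_size = 1`, so that sequence
    # parallelism (which requires TP > 1) does not trip the check.
    model_config.tensor_model_parallel_size = args.tensor_model_parallel_size
    if getattr(model_config, "sequence_parallel", False) \
            and model_config.tensor_model_parallel_size == 1:
        raise ValueError(
            "Can not use sequence paralllelism without tensor parallelism"
        )
    return model_config


# Example: a checkpoint trained with sequence parallelism and TP=2.
args = parse_args(["--tensor_model_parallel_size", "2"])
cfg = SimpleNamespace(sequence_parallel=True, tensor_model_parallel_size=1)
cfg = apply_parallel_config(cfg, args)
```

With a flag like this, the reproduction command below would only need one extra argument to match the TP size used at training time.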
Steps/Code to reproduce bug
- Launch a NeMo 25.02 container
- Train a StarCoder2 model using sequence parallelism and TP=2
- Run the `NeMo/scripts/checkpoint_converters/convert_starcoder2_nemo_to_hf.py` script to convert the checkpoint to HF format:

```
python /opt/NeMo/scripts/checkpoint_converters/convert_starcoder2_nemo_to_hf.py --input_name_or_path mymodel.nemo --output_path mymodel_hf/ --hf-model-name /pretrained/huggingface/starcoder2-7b/ --precision bf16
```
Expected behavior
The script should convert the checkpoint to HF format without errors.
Environment overview (please complete the following information)
- Environment location: Docker
- Method of install: Docker (NeMo 25.02 container)