Starcoder 2 NeMo to HF Checkpoint Converter Is Crashing For Models Trained With Sequence Parallelism #14302

@evellasques

Description

Describe the bug
For the NeMo 2502 container, when one tries to convert a StarCoder2 (SC2) checkpoint trained with sequence parallelism, the NeMo checkpoint converter script NeMo/scripts/checkpoint_converters/convert_starcoder2_nemo_to_hf.py crashes with a `Can not use sequence paralllelism without tensor parallelism` error.

This happens because the script hard-codes model_config.tensor_model_parallel_size = 1, and with sequence parallelism still enabled in the restored checkpoint config, TP=1 trips Megatron's consistency check that sequence parallelism requires tensor parallelism. Moreover, for certain GPU/model-size configurations, one might need a higher TP size anyway.
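
A minimal sketch of one possible workaround: since sequence parallelism is a training-time optimization that shards activations but does not change the saved weights, the converter could switch it off for the single-rank conversion pass. The sequence_parallel attribute name below is an assumption based on the standard Megatron/NeMo config field, not the script's confirmed API:

```python
# Hypothetical patch inside convert_starcoder2_nemo_to_hf.py (sketch only).
model_config.tensor_model_parallel_size = 1  # existing hard-coded value
# Assumed field name: disabling sequence parallelism avoids the
# "Can not use sequence paralllelism without tensor parallelism" check,
# and SP does not affect the exported weights.
model_config.sequence_parallel = False
```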

Steps/Code to reproduce bug

  • Launch a NeMo 2502 container
  • Train a StarCoder2 model using sequence parallelism and TP=2
  • Run the NeMo/scripts/checkpoint_converters/convert_starcoder2_nemo_to_hf.py script to convert it to HF:
python /opt/NeMo/scripts/checkpoint_converters/convert_starcoder2_nemo_to_hf.py --input_name_or_path mymodel.nemo --output_path mymodel_hf/ --hf-model-name /pretrained/huggingface/starcoder2-7b/ --precision bf16
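
To confirm that sequence parallelism and TP=2 are actually recorded in the checkpoint, one can inspect the config stored inside the .nemo archive. A small sketch, assuming the archive contains a model_config.yaml (as .nemo checkpoints typically do); the field names are the standard NeMo/Megatron keys:

```python
import tarfile
import yaml

# A .nemo checkpoint is a tar archive; one member is the model config YAML.
with tarfile.open("mymodel.nemo") as tar:
    member = next(m for m in tar.getmembers()
                  if m.name.endswith("model_config.yaml"))
    cfg = yaml.safe_load(tar.extractfile(member))

# Assumed keys, matching the standard NeMo/Megatron config names.
print("sequence_parallel:", cfg.get("sequence_parallel"))
print("tensor_model_parallel_size:", cfg.get("tensor_model_parallel_size"))
```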

Expected behavior

The script should convert the model to HF format.
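
Once conversion succeeds, a quick sanity check is to load the exported checkpoint with transformers; a sketch assuming the mymodel_hf/ output directory from the command above:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the converted checkpoint to verify the export is well-formed.
model = AutoModelForCausalLM.from_pretrained("mymodel_hf/",
                                             torch_dtype=torch.bfloat16)
print(model.config.model_type)  # expected: "starcoder2"
```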

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of install: NeMo 2502 container


Metadata

Labels: bug (Something isn't working)