
Loss is nan, stopping training #37

@missTL

Description


[rank0:INFO|finetune.py:327] 2025-01-07 18:36:21,918 >> Trainable parameter count : 7013158912 (local rank), 7013158912 (all).
Frozen parameter count : 0 (local rank), 0 (all).
/home/zengshuang.zs/anaconda3/envs/mgpt/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py:444: UserWarning: FSDP is switching to use NO_SHARD instead of ShardingStrategy.FULL_SHARD since the world size is 1.
warnings.warn(
apply gradient checkpointing
[rank0:INFO|finetune.py:360] 2025-01-07 18:36:32,541 >> Wrapped model:
FullyShardedDataParallel(
  (_fsdp_wrapped_module): ChameleonXLLMXForConditionalGeneration(
    (model): ChameleonModel(
      (embed_tokens): FullyShardedDataParallel(
        (_fsdp_wrapped_module): Embedding(65536, 4096)
      )
      (layers): ModuleList(
        (0-31): 32 x FullyShardedDataParallel(
          (_fsdp_wrapped_module): CheckpointWrapper(
            (_checkpoint_wrapped_module): ChameleonDecoderLayer(
              (self_attn): ChameleonSdpaAttention(
                (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
                (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
                (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
                (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
                (q_norm): ChameleonLayerNorm((128,), eps=1e-05, elementwise_affine=True)
                (k_norm): ChameleonLayerNorm((128,), eps=1e-05, elementwise_affine=True)
                (rotary_emb): ChameleonRotaryEmbedding()
              )
              (mlp): ChameleonMLP(
                (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
                (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
                (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
                (act_fn): SiLU()
              )
              (input_layernorm): ChameleonRMSNorm((4096,), eps=1e-05)
              (post_attention_layernorm): ChameleonRMSNorm((4096,), eps=1e-05)
              (dropout): Dropout(p=0.05, inplace=False)
            )
          )
        )
      )
      (norm): ChameleonRMSNorm((4096,), eps=1e-05)
    )
    (lm_head): FullyShardedDataParallel(
      (_fsdp_wrapped_module): Linear(in_features=4096, out_features=65536, bias=False)
    )
  )
)
[rank0:INFO|finetune.py:421] 2025-01-07 18:36:32,544 >> effective batch size: 1
[rank0:INFO|dataset.py:27] 2025-01-07 18:36:32,544 >> read dataset config from configs/data/sample.yaml
[rank0:INFO|dataset.py:30] 2025-01-07 18:36:32,545 >> DATASET CONFIG:
[rank0:INFO|dataset.py:31] 2025-01-07 18:36:32,545 >> {'META': [{'path': '/home/zengshuang.zs/Lumina-mGPT/lumina_mgpt/data/single_image/record.json'}]}
/home/zengshuang.zs/Lumina-mGPT/xllmx/data/dataset.py:101: UserWarning: Use existing h5 data cache: xllmx_data_cache/configs-data-sample-yaml
Note: if the actual data defined by the data config has changed since your last run, please delete the cache manually and re-run this experiment, or the data actually used will not be updated
warnings.warn(
[rank0:INFO|finetune.py:423] 2025-01-07 18:36:32,551 >> <xllmx.data.dataset.FinetuneConversationDataset object at 0x7fd853992020>
[rank0:INFO|finetune.py:510] 2025-01-07 18:36:32,703 >> Start training for 2 epochs
/home/zengshuang.zs/Lumina-mGPT/xllmx/solvers/finetune/finetune.py:588: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
"bf16": torch.cuda.amp.autocast(dtype=torch.bfloat16),
/home/zengshuang.zs/Lumina-mGPT/xllmx/solvers/finetune/finetune.py:589: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
"fp16": torch.cuda.amp.autocast(dtype=torch.float16),
[rank0:ERROR|finetune.py:600] 2025-01-07 18:36:35,360 >> Loss is nan, stopping training
[rank0]:[W107 18:36:35.138718050 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
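A note on the first warning in the log: with a world size of 1 there is nothing to shard across, so FSDP falls back from FULL_SHARD to NO_SHARD regardless of the requested strategy. A minimal sketch of the setup that triggers it (illustrative only; this is not the repo's actual finetune.py code):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Hypothetical single-process setup, mirroring the run above.
dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

model = torch.nn.Linear(4096, 4096).cuda()

# With world_size == 1 FSDP emits the UserWarning seen in the log and
# silently uses NO_SHARD, which behaves like an unsharded single-rank model.
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
```

So that warning is expected on a single GPU and should not by itself cause the NaN.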
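The two FutureWarnings at finetune.py:588-589 are cosmetic but easy to silence: `torch.cuda.amp.autocast(...)` is deprecated in favor of `torch.amp.autocast('cuda', ...)`, exactly as the warning text says. A sketch of the replacement (the dict shape is taken from the traceback; the variable name is an assumption):

```python
import torch

# Deprecated spelling (what the traceback shows finetune.py using):
#   "bf16": torch.cuda.amp.autocast(dtype=torch.bfloat16),
#   "fp16": torch.cuda.amp.autocast(dtype=torch.float16),

# Non-deprecated equivalent suggested by the warning itself:
autocast_ctx = {
    "bf16": torch.amp.autocast("cuda", dtype=torch.bfloat16),
    "fp16": torch.amp.autocast("cuda", dtype=torch.float16),
}
```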
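For reference, the error at finetune.py:600 is a guard on the loss value. A hedged sketch of how such a check is typically written (this is not the repo's actual code), together with PyTorch's standard anomaly-detection switch, which helps locate the op where the NaN first appears:

```python
import math
import torch

def check_loss(loss: torch.Tensor) -> None:
    # Abort early rather than keep optimizing on non-finite gradients.
    if not math.isfinite(loss.item()):
        raise RuntimeError("Loss is nan, stopping training")

# Optional while debugging: make the backward pass raise at the operation
# that produced the first NaN/Inf (slow; disable for real training runs).
torch.autograd.set_detect_anomaly(True)
```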
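The final NCCL warning is unrelated to the NaN: it only means the script exited without tearing down the process group. The fix the warning itself suggests is a one-line cleanup at the end of the training script:

```python
import torch.distributed as dist

# Call once after all collectives have finished, so ProcessGroupNCCL is
# destroyed cleanly before interpreter shutdown (avoids the W107 warning).
if dist.is_initialized():
    dist.destroy_process_group()
```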
