[torch_xla] Fix NaN training loss #34


Merged

merged 1 commit into main on Jan 14, 2025

Conversation

@tengyifei (Collaborator) commented on Jan 14, 2025

Fixes #9

Fixed by switching to the Adafactor optimizer, which is what we have been using for model optimization over the past several months.
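
For reference, a minimal sketch of the optimizer swap, assuming the Hugging Face transformers Adafactor implementation; the helper name is illustrative, and the real change lives in torchprime/torch_xla_models/train.py:

  import torch
  from transformers import Adafactor

  def build_optimizer(model: torch.nn.Module, lr: float = 1e-3):
      # Previously torch.optim.AdamW, which produced NaN training losses
      # in this setup (see #9). Adafactor with a fixed learning rate is
      # what we have been training with.
      return Adafactor(
          model.parameters(),
          lr=lr,
          scale_parameter=False,  # use the fixed lr rather than Adafactor's own schedule
          relative_step=False,
          warmup_init=False,
      )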

Tested:

  XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 python3 \
      torchprime/torch_xla_models/train.py \
      torchprime/torch_xla_models/configs/run.json

  NUM_SLICES=1 TPU_TYPE=v6e-256 launcher/run_xpk.sh \
     torchprime/torch_xla_models/train.py \
     --dataset_name wikitext \
     --dataset_config_name 'wikitext-2-raw-v1' \
     --output_dir /tmp \
     --cache_dir /tmp \
     --global_batch_size 256 \
     --logging_steps 10 \
     --max_steps 15 \
     --profile_step 5 \
     --model_id 'meta-llama/Meta-Llama-3-8B' \
     --tokenizer_name 'meta-llama/Meta-Llama-3-8B' \
     --block_size 8192 \
     --fsdp full_shard \
     --fsdp_config torchprime/torch_xla_models/configs/fsdp_config.json

@tengyifei tengyifei requested a review from bhavya01 January 14, 2025 17:53
@bhavya01 (Collaborator) left a comment

I think the AdamW optimizer should also work well with these models; it's what MaxText uses.

Okay to submit this for now, but can we open a separate issue tracking the NaN training loss with the AdamW optimizer?
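
A minimal sketch of what that follow-up repro could look like, with placeholder model/dataloader names and assumed hyperparameters rather than torchprime's actual training loop:

  import torch

  def train_with_adamw(model: torch.nn.Module, dataloader, steps: int = 15):
      # Run the same short training job with AdamW and fail fast on the
      # first NaN loss, so the new issue has a concrete repro point.
      optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
      for step, batch in zip(range(steps), dataloader):
          optimizer.zero_grad()
          loss = model(**batch).loss
          if torch.isnan(loss):
              raise RuntimeError(f"NaN training loss at step {step} with AdamW")
          loss.backward()
          optimizer.step()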

@tengyifei (Collaborator, Author)

SGTM

@tengyifei tengyifei merged commit 3aa04b2 into main Jan 14, 2025
6 checks passed
@tengyifei tengyifei deleted the yifeit/fix-llama-nan branch January 26, 2025 07:10
Development

Successfully merging this pull request may close these issues.

Fix the NaN training loss in Llama 3
2 participants