NaN gradients and zero loss when LoRA fine-tuning Qwen3-32B-AWQ with LLaMA-Factory, and a fix #9125
gysabc started this conversation in Show and tell
- Data: an internal dataset
- Symptom: partway through training, one log entry suddenly reports `grad_norm` as `nan`; every subsequent entry shows `loss` of 0 and `grad_norm` of `nan`
- Training parameters:
- Analysis: an activation overflows the float16 representable range (about ±65504) when cast with `.half()`; the resulting `inf` propagates through the backward pass and turns the gradients into `nan`
- Fix: clamp the tensor to the float16 range before casting it down
Replace the plain cast with a clamped cast:

```python
# before: a direct cast overflows float16 when |x| > 65504, producing inf
x = x.half()

# after: clamp to the float16 representable range first, so the cast stays finite
x = torch.clamp(x, min=torch.finfo(torch.float16).min, max=torch.finfo(torch.float16).max).half()
```
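A minimal sketch of why the clamp works, using NumPy float16 as a stand-in for the torch cast (the helper names `to_fp16_unsafe` and `to_fp16_clamped` are illustrative, not from the original post): values beyond the float16 range overflow to `inf` on a direct cast, while clamping first keeps them finite.

```python
import numpy as np

def to_fp16_unsafe(x):
    # Direct cast: values outside the float16 range (about +/-65504) overflow to inf.
    return x.astype(np.float16)

def to_fp16_clamped(x):
    # Clamp to the float16 representable range first, mirroring the torch.clamp fix.
    info = np.finfo(np.float16)
    return np.clip(x, info.min, info.max).astype(np.float16)

x = np.array([1.0, 70000.0, -70000.0], dtype=np.float32)

print(to_fp16_unsafe(x))   # the out-of-range entries become inf / -inf
print(to_fp16_clamped(x))  # the same entries saturate at +/-65504 and stay finite
```

Once any activation becomes `inf`, downstream operations such as `inf - inf` yield `nan`, which is why the loss collapses to 0 and `grad_norm` reports `nan` for the rest of training.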