How should epochs and steps be set when fine-tuning on a multimodal dataset in streaming mode? #8273
Unanswered · Juvenilecris asked this question in Q&A
Replies: 2 comments
-
Same question here. Normally, when a streaming dataset is exhausted it raises StopIteration and training moves on to the next epoch. But does it instead restart automatically here and stay in the first epoch forever? Has anyone actually tested this?
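To make the distinction concrete, here is a minimal sketch (using plain `datasets` streaming rather than LLaMA-Factory's trainer; the data file is a placeholder) of what "entering the next epoch" means in a hand-rolled loop: each `for` pass absorbs the StopIteration that ends the stream, and the next pass re-opens it.

```python
from datasets import load_dataset

# Placeholder data file -- any streaming-capable source behaves the same way.
ds = load_dataset("json", data_files="train.jsonl", split="train", streaming=True)

for epoch in range(2):
    # Each pass builds a fresh iterator over the stream; the for-loop
    # absorbs the StopIteration raised when the stream runs dry.
    for example in ds:
        pass  # one training step per example/batch would go here
    print(f"epoch {epoch}: stream exhausted, starting the next pass")
```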
-
See the official HuggingFace documentation on max_steps: https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Seq2SeqTrainer. When max_steps is set, it overrides the epoch setting, and if the dataset is exhausted, training restarts from the beginning of the data within the same epoch. In other words, training stays in epoch 0 until max_steps is reached.
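A minimal sketch of that interaction (the argument values here are illustrative, not taken from the config below): when both fields are set, max_steps wins, and the Trainer keeps drawing from the restarted stream until the step budget is spent.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    max_steps=1000,          # overrides num_train_epochs
    num_train_epochs=2.0,    # ignored once max_steps > 0
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
)
# Training is bounded by optimizer steps, not by passes over the data.
print(args.max_steps)  # 1000
```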
-
Reminder
System Info
```yaml
model_name_or_path: /fs-computility/llm_code_collab/liujiaheng/wangnn/models/Qwen/Qwen2.5-VL-7B-Instruct
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: examples/deepspeed/ds_z3_config.json
dataset: shartgptvideo_train_300k
buffer_size: 128
preprocessing_batch_size: 128
streaming: true
accelerator_config:
  dispatch_batches: false
template: qwen2_vl
cutoff_len: 32768
overwrite_cache: false
preprocessing_num_workers: 32
dataloader_num_workers: 16
max_steps: 1000000
output_dir: saves/qwen2_5vl-7b/sft/621441
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none # choices: [none, wandb, tensorboard, swanlab, mlflow]
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
```
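For scale, a quick back-of-the-envelope check of what `max_steps: 1000000` means for this config. The GPU count below is an assumption (it is not part of the config), and the dataset size is only inferred from the name `shartgptvideo_train_300k`:

```python
# Samples consumed per optimizer step for the config above.
# world_size is an assumption (not in the config); adjust to your setup.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
world_size = 8  # assumed number of GPUs

samples_per_step = per_device_train_batch_size * gradient_accumulation_steps * world_size
total_samples = 1_000_000 * samples_per_step  # max_steps from the config

print(samples_per_step)  # 16
print(total_samples)     # 16000000, i.e. roughly 53 passes over a 300k-sample set
```

If the dataset name does reflect its size, this max_steps value implies dozens of passes over the data, all logged as epoch 0 per the comment above, and `num_train_epochs: 2.0` has no effect once max_steps is set.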
Reproduction
Others
No response