Open
Labels: bug (Something isn't working)
Description
Is there an existing issue for this bug?
- I have searched the existing issues
The bug has not been fixed in the latest main branch
- I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
Steps
git clone https://github.com/hpcaitech/ColossalAI
cd ColossalAI
pip install .
cd examples/language/deepseek
colossalai run --nproc_per_node 16 benchmark.py -c 7b -g -b 16 --tp 1 --pp 4 --num_steps 50 --pp_style interleaved --n_chunks 2
The last command fails with the following error:
[rank12]: Traceback (most recent call last):
[rank12]: File "/root/ColossalAI/examples/language/deepseek/benchmark.py", line 295, in <module>
[rank12]: main()
[rank12]: File "/root/ColossalAI/examples/language/deepseek/benchmark.py", line 258, in main
[rank12]: outputs = booster.execute_pipeline(
[rank12]: File "/usr/local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 221, in execute_pipeline
[rank12]: return self.plugin.execute_pipeline(batch, model, criterion, optimizer, return_loss, return_outputs)
[rank12]: File "/usr/local/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1407, in execute_pipeline
[rank12]: outputs = self.scheduler.forward_backward_step(
[rank12]: File "/usr/local/lib/python3.10/site-packages/colossalai/pipeline/schedule/interleaved_pp.py", line 607, in forward_backward_step
[rank12]: result = self.run_forward_backward(
[rank12]: File "/usr/local/lib/python3.10/site-packages/colossalai/pipeline/schedule/interleaved_pp.py", line 462, in run_forward_backward
[rank12]: output_obj = self.forward_step(model_chunk, model_chunk_id, input_obj, criterion, accum_loss, outputs)
[rank12]: File "/usr/local/lib/python3.10/site-packages/colossalai/pipeline/schedule/interleaved_pp.py", line 309, in forward_step
[rank12]: internal_inputs["stage_index"] = self.stage_manager.stage_indices[model_chunk_id]
[rank12]: AttributeError: 'PipelineStageManager' object has no attribute 'stage_indices'
Analysis
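stage_indices is never set on the PipelineStageManager in policies/deepseek.py#L334, whereas policies/deepseek_v3.py#L120 does set it. The interleaved schedule reads self.stage_manager.stage_indices[model_chunk_id] in forward_step, so the lookup fails for the DeepSeek policy.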
Solution
Assign stage_manager.stage_indices after policies/deepseek.py#L334, as policies/deepseek_v3.py#L120 already does.
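The idea can be illustrated with a small self-contained sketch. This is not ColossalAI code: FakeStageManager, layer_ranges, set_pipeline_forward, and the layer layout (chunk c of stage s treated as virtual stage c * num_stages + s) are all hypothetical stand-ins assumed for illustration, and the real assignment should be copied from policies/deepseek_v3.py#L120. It only shows that the scheduler's lookup fails until the policy assigns stage_indices.

```python
from typing import List, Tuple


class FakeStageManager:
    """Hypothetical stand-in for PipelineStageManager (not the real class)."""

    def __init__(self, num_stages: int, num_model_chunks: int, stage: int) -> None:
        self.num_stages = num_stages
        self.num_model_chunks = num_model_chunks
        self.stage = stage
        # stage_indices is deliberately NOT set here, mirroring the reported bug.

    def layer_ranges(self, num_layers: int) -> List[Tuple[int, int]]:
        """Return one (start, end) layer range per model chunk held by this stage.

        Assumed layout: chunk c of stage s is virtual stage c * num_stages + s.
        """
        total = self.num_stages * self.num_model_chunks
        base, extra = divmod(num_layers, total)
        sizes = [base + (1 if i < extra else 0) for i in range(total)]
        starts = [sum(sizes[:i]) for i in range(total)]
        ranges = []
        for chunk in range(self.num_model_chunks):
            v = chunk * self.num_stages + self.stage
            ranges.append((starts[v], starts[v] + sizes[v]))
        return ranges


def set_pipeline_forward(stage_manager: FakeStageManager, num_layers: int) -> None:
    """Hypothetical policy hook; the proposed fix is the single assignment below."""
    stage_manager.stage_indices = stage_manager.layer_ranges(num_layers)


def forward_step(stage_manager: FakeStageManager, model_chunk_id: int) -> Tuple[int, int]:
    """Mimics the lookup that fails in the traceback."""
    return stage_manager.stage_indices[model_chunk_id]
```

Calling forward_step before set_pipeline_forward raises the same kind of AttributeError as in the traceback. Once the policy has assigned stage_indices, the lookup returns the layer range for the requested model chunk.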
Environment
No response