[WIP] add deepseek-v3 #35926
Conversation
Hi @bzantium, this looks great so far! We'll need tests added for the model plus a green CI; then feel free to ping me to assign a reviewer, or if you have any problems with the port.
Ultra kudos! It's super nice.
Mostly missing tests; here you can use a similar approach to the gemma2 tests, which use inheritance!
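For reference, a minimal sketch of what that inheritance-based test setup could look like; the class and attribute names below are illustrative assumptions, not necessarily the ones this PR ends up using:

```python
# Hypothetical sketch: reuse an existing model's test suite by subclassing it,
# the way the gemma2 tests extend the gemma ones. Names are illustrative.
from transformers import DeepseekV3Config, DeepseekV3ForCausalLM, DeepseekV3Model
from transformers.testing_utils import require_torch

from ..gemma.test_modeling_gemma import GemmaModelTest, GemmaModelTester


class DeepseekV3ModelTester(GemmaModelTester):
    # Only swap in the DeepSeek-V3 classes; the tiny-config construction and
    # shared checks are inherited from the gemma tester.
    config_class = DeepseekV3Config
    model_class = DeepseekV3Model
    for_causal_lm_class = DeepseekV3ForCausalLM


@require_torch
class DeepseekV3ModelTest(GemmaModelTest):
    all_model_classes = (DeepseekV3Model, DeepseekV3ForCausalLM)

    def setUp(self):
        self.model_tester = DeepseekV3ModelTester(self)
```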
@bzantium Thanks for the amazing work! I was wondering if you were able to train V3 with FSDP? If so, how many GPUs did you need? Thanks!
One big thing would be
Okay, our Hub has small issues, but these are the only tests left to pass:
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_can_use_safetensors - AssertionError: DeepseekV3ForCausalLM: Tensor model.layers.4.mlp.gate.weight: Tensor-likes are not close!
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_feed_forward_chunking - AssertionError: Tensor-likes are not close!
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_from_pretrained_no_checkpoint - AssertionError: False is not true
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_initialization - AssertionError: -0.0012460040161386132 not found in [0.0, 1.0] : Parameter layers.2.mlp.gate.weight of model <class 'transformers.models.deepseek_v3.modeling_deepseek_v3.DeepseekV3Model'> seems not properly initialized
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_left_padding_compatibility - AssertionError: Tensor-likes are not close!
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_load_save_without_tied_weights - AssertionError: DeepseekV3Model: Tensor layers.4.mlp.gate.weight: Tensor-likes are not close!
[DeepseekV3ForCausalLM] SUBFAIL tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_training_gradient_checkpointing - AssertionError: False is not true : model.layers.2.mlp.experts.4.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!
[DeepseekV3ForCausalLM] SUBFAIL tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_training_gradient_checkpointing_use_reentrant - AssertionError: False is not true : model.layers.2.mlp.experts.2.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!
[DeepseekV3ForCausalLM] SUBFAIL tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_training_gradient_checkpointing_use_reentrant_false - AssertionError: False is not true : model.layers.2.mlp.experts.2.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!
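For context on the "has no gradient!" sub-failures: with a tiny test batch, some MoE experts may receive no tokens at all, so their parameters never contribute to the loss and end up without gradients. The toy sketch below illustrates that failure mode and the workaround later adopted in this PR (route every token to every expert when testing); it is not the PR's actual router code.

```python
# Toy illustration (not the PR's code): experts that receive no tokens get no
# gradient. Setting top_k == num_experts in tests routes every token to every
# expert, so all expert parameters participate in the backward pass.
import torch
import torch.nn as nn


class ToyTopkRouter(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, hidden_states):
        scores = self.gate(hidden_states).softmax(dim=-1)  # (tokens, experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        return topk_scores, topk_idx


# With top_k == num_experts (4 == 4), even a 2-token batch touches every
# expert, so none of them would trigger a "has no gradient" assertion.
router = ToyTopkRouter(hidden_size=8, num_experts=4, top_k=4)
scores, idx = router(torch.randn(2, 8))
```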
All tests pass locally! Will merge; as I said, super happy to have contributions to improve!
Sorry for not updating the code from my side for a long time, and thanks for your contributions! I was quite busy taking care of my baby 😢. Happy to see all tests pass!
Suggestion for a typo!
final_hidden_states.index_add_(0, token_indices, weighted_output)
# in original deepseek, the output of the experts are gathered once we leave this module
# thus the moe module is itelsf an IsolatedParallel module
# thus the moe module is itelsf an IsolatedParallel module
# thus the moe module is itself an IsolatedParallel module
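For readers skimming the thread, an illustrative (not verbatim) sketch of the pattern the excerpt above comes from: each expert runs only on the tokens routed to it, and index_add_ scatters the weighted expert outputs back into their token positions.

```python
# Illustrative sketch of the scatter-back pattern; shapes and routing are dummy.
import torch
import torch.nn as nn

num_tokens, hidden = 6, 4
hidden_states = torch.randn(num_tokens, hidden)
final_hidden_states = torch.zeros_like(hidden_states)

experts = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(2)])
# pretend routing: expert 0 gets tokens 0/2/4, expert 1 gets tokens 1/3/5
routing = {0: torch.tensor([0, 2, 4]), 1: torch.tensor([1, 3, 5])}
routing_weight = 0.5  # dummy per-token routing weight

for expert_idx, token_indices in routing.items():
    expert_output = experts[expert_idx](hidden_states[token_indices])
    weighted_output = expert_output * routing_weight
    # accumulate each expert's contribution at the right token rows
    final_hidden_states.index_add_(0, token_indices, weighted_output)
```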
Oops, sorry!
It's totally okay, great work!
Hi everyone, thanks a lot for the contributions! Does that mean we have support in HF to load the checkpoints in FP8 directly, or do we need to convert them to bf16 first using DeepSeek's script?
It loads FP8 directly!
Conversion is needed to run bf16 though, or a PR to support on-the-fly decompression of the FP8!
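For anyone landing here later, a minimal loading sketch; the checkpoint id is the official Hub repo, but whether the FP8 weights can be loaded directly or need prior bf16 conversion depends on your transformers version and hardware, as discussed above.

```python
# Minimal sketch, not a guaranteed recipe: load DeepSeek-V3 from the Hub.
# Direct FP8 loading vs. prior bf16 conversion depends on your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V3"  # official checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # requested compute dtype for non-quantized parts
    device_map="auto",
)
```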
@mseeger sorry for the late merge; you can now create an issue and open a PR for efficient inference!
* init commit
* style
* take comments into account
* add deepseekv3 modeling
* remove redundant code
* apply make style
* apply fix-copies
* make format
* add init files
* rename deepseekv3 into deepseek_v3 based on its model_type
* rename deepseekv3 into deepseek_v3 based on its model_type
* deepseek-v3 not deepseek_v3
* set model_type as deepseek_v3
* use default docs
* apply make
* fill type and docstring
* add rope_config_validation
* use custom DeepseekV3MLP
* hold code only for checkpoints congifuration; remove redundant
* revise rope yarn for DeepSeek variation
* rename DeepSeek-V3
* some refactoring
* revise load_hook to work properly; make moe func trainable; use llama instead of mixtral
* fix attention forward
* use -1 for not-changing dim when to use exapnd
* refactor DeepseekV3TopkRouter
* use reshape_for_rope instead of load_hook; revise attention forward for TP; rename q_head_dim with qk_head_dim
* register pre_hook and hook both
* make style
* use n_shared_experts
* Update src/transformers/models/deepseek_v3/configuration_deepseek_v3.py (Co-authored-by: Arthur <[email protected]>)
* add test file
* update modeling_file according to modular file
* make style
* add mapping for DeepseekV3ForSequenceClassification
* remove aux_loss_alpha
* add deepseek_v3 for perf
* add deepseek_v3
* rename test as deepseekv3
* use tiny-deepseek-v3
* remove DeepseekV3ForSequenceClassification
* cache before padding
* remote output_router_logits
* Revert "remote output_router_logits" (This reverts commit f264f80.)
* remove output_router_logits
* make e_score_correction_bias as buffer
* skip tests not compatible
* make style
* make e_score_correction_bias as buffer
* use rope_interleave instead of load_hook
* skip tests not compatible with MLA
* add doc for rope_interleave
* fix typo
* remove torch.no_grad for selecting topk
* fix post merge issue
* mrege with main and simplify
* nits
* final
* small fixes
* fix
* support TP better
* stash
* changes currently requires
* remove synch
* more fixes for TP
* temp fix for TP : some attention layers's FP8 scales are too small + shared is local colwise and anything is local if FP8 because weights are used
* updates to have generation work!
* push most of the changes
* reorder functions + call for contributions!
* update readme
* nits
* update
* ruff was updated on main
* merge with main and fix copies
* revert unrelated changes
* route all tokens to all experts when testing to avoid no gradient iddues
* finish fixing all tests
* fixup
* nit
* clean config
* last readme changes
* nit
* do cnit
* typo
* last nit
* one more one more

---------

Co-authored-by: Arthur Zucker <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: [email protected] <[email protected]>
What does this PR do?
This PR adds the code for DeepSeek-V3.
The code relies heavily on the original remote code.
Resolves: #35425
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case: DeepSeek V3 Support #35425
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
to: @ArthurZucker