[WIP] add deepseek-v3 #35926

Merged · 101 commits · Mar 28, 2025
Conversation

@bzantium (Contributor) commented on Jan 28, 2025:

What does this PR do?

This PR adds modeling code for DeepSeek-V3. The code relies heavily on the original remote code.

Resolves #35425
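For context, once merged the model is reachable through the standard auto classes. A minimal usage sketch (the hub id below is an assumption, and the full model is far too large for a single device, so treat this as illustrative):

```python
# Minimal sketch: load DeepSeek-V3 through the auto classes and generate.
# The checkpoint id is an assumption; substitute the actual hub repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V3"  # assumed hub id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # FP8 checkpoints: see the discussion below
    device_map="auto",
)

inputs = tokenizer("Hello, DeepSeek-V3!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```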

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

to: @ArthurZucker

@Rocketknight1 (Member) commented:

Hi @bzantium, this looks great so far! We'll need tests added for the model plus a green CI; then feel free to ping me to assign a reviewer, or if you have any problems with the port.

@bzantium changed the title from "[WIP] add deepseekv3" to "[WIP] add deepseek-v3" on Jan 29, 2025
@ArthurZucker (Collaborator) left a comment:

Ultra kudos! It's super nice.
Mostly missing tests; here you can use a similar approach to the gemma2 tests, which use inheritance!
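For reference, the inheritance pattern in question looks roughly like the sketch below: the new test file subclasses an existing model tester and overrides only what differs. Class names mirror the transformers test conventions but are illustrative, not the exact merged test code.

```python
# Sketch of an inheritance-based test, mirroring the gemma2 test layout.
# Names are illustrative; the real file lives in
# tests/models/deepseek_v3/test_modeling_deepseek_v3.py.
import unittest

from transformers import DeepseekV3Config
from transformers.testing_utils import require_torch

from ..gemma.test_modeling_gemma import GemmaModelTest, GemmaModelTester


class DeepseekV3ModelTester(GemmaModelTester):
    config_class = DeepseekV3Config  # everything else is inherited


@require_torch
class DeepseekV3ModelTest(GemmaModelTest, unittest.TestCase):
    def setUp(self):
        self.model_tester = DeepseekV3ModelTester(self)
```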

@cuichenx commented:

@bzantium Thanks for the amazing work! I was wondering if you were able to train V3 with FSDP? If so, how many GPUs did you need? Thanks!

@ArthurZucker (Collaborator) commented on Jan 29, 2025:

One big thing would be TP support: the base_tp_plan would probably need to be updated to make sure each MLP's gate/up/down projections have the correct order, unless direct usage of dist removes this need.
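For context on what that plan looks like: transformers reads a base_model_tp_plan mapping from module-name patterns to sharding styles. A hedged sketch with assumed module paths (the MLA attention and MoE layers in the final code may be sharded differently):

```python
# Sketch of a tensor-parallel plan in the transformers style: glob patterns
# over module names mapped to sharding strategies. Module paths here are
# assumptions for illustration, not the merged plan.
base_model_tp_plan = {
    # gate/up are column-wise so each rank holds a slice of the expert's
    # intermediate dimension; down is row-wise so outputs get all-reduced.
    "layers.*.mlp.experts.*.gate_proj": "colwise",
    "layers.*.mlp.experts.*.up_proj": "colwise",
    "layers.*.mlp.experts.*.down_proj": "rowwise",
    "layers.*.mlp.shared_experts.gate_proj": "colwise",
    "layers.*.mlp.shared_experts.up_proj": "colwise",
    "layers.*.mlp.shared_experts.down_proj": "rowwise",
}
```

The gate/up/down ordering matters because column-wise and row-wise shards must alternate for the intermediate activations to stay local to each rank.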

@ArthurZucker (Collaborator) commented:

Okay, our Hub has small issues, but these are the only tests left to pass:

FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_can_use_safetensors - AssertionError: DeepseekV3ForCausalLM: Tensor model.layers.4.mlp.gate.weight: Tensor-likes are not close!
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_feed_forward_chunking - AssertionError: Tensor-likes are not close!
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_from_pretrained_no_checkpoint - AssertionError: False is not true
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_initialization - AssertionError: -0.0012460040161386132 not found in [0.0, 1.0] : Parameter layers.2.mlp.gate.weight of model <class 'transformers.models.deepseek_v3.modeling_deepseek_v3.DeepseekV3Model'> seems not properly initialized
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_left_padding_compatibility - AssertionError: Tensor-likes are not close!
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_load_save_without_tied_weights - AssertionError: DeepseekV3Model: Tensor layers.4.mlp.gate.weight: Tensor-likes are not close!
[DeepseekV3ForCausalLM] SUBFAIL tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_training_gradient_checkpointing - AssertionError: False is not true : model.layers.2.mlp.experts.4.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!
[DeepseekV3ForCausalLM] SUBFAIL tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_training_gradient_checkpointing_use_reentrant - AssertionError: False is not true : model.layers.2.mlp.experts.2.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!
[DeepseekV3ForCausalLM] SUBFAIL tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_training_gradient_checkpointing_use_reentrant_false - AssertionError: False is not true : model.layers.2.mlp.experts.2.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!
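The three gradient-checkpointing subfailures are a classic MoE pitfall: with a tiny test batch, the router can leave some experts with no tokens at all, so those experts' weights never receive a gradient (the commit history below resolves this with "route all tokens to all experts when testing"). A hedged sketch of that idea, with DeepSeek-style config field names assumed:

```python
# Hedged sketch: in tests, make the router select every expert for every
# token so all expert weights participate in the backward pass.
# Field names (n_routed_experts, num_experts_per_tok) follow the DeepSeek
# convention but are assumptions here.
def densify_moe_for_tests(config):
    # top-k == number of routed experts => no expert is left without tokens,
    # hence no "has no gradient!" assertion failures.
    config.num_experts_per_tok = config.n_routed_experts
    return config
```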

@ArthurZucker (Collaborator) commented:

All tests pass locally! Will merge; as I said, super happy to have contributions to improve this!

@bzantium (Contributor, Author) commented:

Sorry for not updating the code from my side for a long time, and thanks for your contributions! I was quite busy taking care of my baby 😢. Happy to see all the tests passing!

@ArthurZucker (Collaborator) commented:

Thanks a lot for your help @bzantium! 🤗 Kudos to you and @mseeger!

@bzantium (Contributor, Author) left a review comment:

Suggestion for a typo!

final_hidden_states.index_add_(0, token_indices, weighted_output)

# in original deepseek, the output of the experts are gathered once we leave this module
# thus the moe module is itelsf an IsolatedParallel module

Suggested change:

- # thus the moe module is itelsf an IsolatedParallel module
+ # thus the moe module is itself an IsolatedParallel module
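For readers skimming the snippet above: index_add_ is the scatter step of the MoE forward pass, accumulating each expert's weighted output back into the rows of the tokens routed to it. A self-contained illustration with made-up shapes:

```python
# Illustration of the index_add_ scatter used in the MoE module above.
# Shapes and values are made up for demonstration.
import torch

num_tokens, hidden = 6, 4
final_hidden_states = torch.zeros(num_tokens, hidden)

# Suppose the router sent tokens 1, 3, and 4 to one expert.
token_indices = torch.tensor([1, 3, 4])
weighted_output = 0.5 * torch.ones(3, hidden)  # expert output * routing weight

# Accumulate the expert's contribution into the matching token rows;
# index_add_ sums correctly when a token is picked by several experts.
final_hidden_states.index_add_(0, token_indices, weighted_output)
print(final_hidden_states)
```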

@ArthurZucker merged commit eca74d1 into huggingface:main on Mar 28, 2025 · 18 checks passed
@ArthurZucker (Collaborator) commented:

Oops, sorry!

@bzantium (Contributor, Author) commented:

It's totally okay, great work!

@Neo9061 commented on Mar 28, 2025:

Hi everyone, thanks a lot for the contributions! Does this mean HF now supports loading the checkpoints in FP8 directly, or do we need to convert them to bf16 first using DeepSeek's script?

@ArthurZucker (Collaborator) commented:

It loads FP8 directly!

@ArthurZucker (Collaborator) commented:

Conversion is needed to run in bf16, though, or a PR to support on-the-fly decompression of the FP8!
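Putting the two answers together, the two loading paths look roughly like this (the hub id and local path are assumptions; the conversion script referenced is the one shipped in DeepSeek's own repository):

```python
# Hedged sketch of the two paths discussed above; ids/paths are assumptions.
import torch
from transformers import AutoModelForCausalLM

# Path 1: load the FP8 checkpoint directly.
model_fp8 = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3",
    device_map="auto",
)

# Path 2: run in bf16 by first converting the FP8 weights with DeepSeek's
# conversion script, then loading the converted checkpoint as usual.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    "path/to/converted-bf16-checkpoint",  # output of DeepSeek's script
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```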

@bzantium (Contributor, Author) commented on Apr 2, 2025:

@mseeger Sorry for the late merge; now you can create an issue and open a PR for efficient inference!

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request on May 14, 2025:
* init commit

* style

* take comments into account

* add deepseekv3 modeling

* remove redundant code

* apply make style

* apply fix-copies

* make format

* add init files

* rename deepseekv3 into deepseek_v3 based on its model_type

* rename deepseekv3 into deepseek_v3 based on its model_type

* deepseek-v3 not deepseek_v3

* set model_type as deepseek_v3

* use default docs

* apply make

* fill type and docstring

* add rope_config_validation

* use custom DeepseekV3MLP

* hold code only for checkpoints configuration; remove redundant

* revise rope yarn for DeepSeek variation

* rename DeepSeek-V3

* some refactoring

* revise load_hook to work properly; make moe func trainable; use llama instead of mixtral

* fix attention forward

* use -1 for not-changing dim when using expand

* refactor DeepseekV3TopkRouter

* use reshape_for_rope instead of load_hook; revise attention forward for TP; rename q_head_dim with qk_head_dim

* register pre_hook and hook both

* make style

* use n_shared_experts

* Update src/transformers/models/deepseek_v3/configuration_deepseek_v3.py

Co-authored-by: Arthur <[email protected]>

* add test file

* update modeling_file according to modular file

* make style

* add mapping for DeepseekV3ForSequenceClassification

* remove aux_loss_alpha

* add deepseek_v3 for perf

* add deepseek_v3

* rename test as deepseekv3

* use tiny-deepseek-v3

* remove DeepseekV3ForSequenceClassification

* cache before padding

* remote output_router_logits

* Revert "remote output_router_logits"

This reverts commit f264f80.

* remove output_router_logits

* make e_score_correction_bias as buffer

* skip tests not compatible

* make style

* make e_score_correction_bias as buffer

* use rope_interleave instead of load_hook

* skip tests not compatible with MLA

* add doc for rope_interleave

* fix typo

* remove torch.no_grad for selecting topk

* fix post merge issue

* merge with main and simplify

* nits

* final

* small fixes

* fix

* support TP better

* stash

* changes currently requires

* remove synch

* more fixes for TP

* temp fix for TP : some attention layers's FP8 scales are too small + shared is local colwise and anything is local if FP8 because weights are used

* updates to have generation work!

* push most of the changes

* reorder functions + call for contributions!

* update readme

* nits

* update

* ruff was updated on main

* merge with main and fix copies

* revert unrelated changes

* route all tokens to all experts when testing to avoid no gradient issues

* finish fixing all tests

* fixup

* nit

* clean config

* last readme changes

* nit

* do cnit

* typo

* last nit

* one more one more

---------

Co-authored-by: Arthur Zucker <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Successfully merging this pull request may close these issues:

DeepSeek V3 Support (#35425)