[WIP] add deepseek-v3 #35926

Merged · 101 commits · Mar 28, 2025
Conversation

@bzantium (Contributor) commented on Jan 28, 2025:

What does this PR do?

This PR adds modeling code for DeepSeek-V3. The code relies heavily on the original remote code.

Resolves #35425
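For context, once merged the model is reachable through the standard auto classes. A minimal usage sketch (the hub id below is an assumption, and the full model is far too large for a single device, so treat this as illustrative):

```python
# Minimal sketch: load DeepSeek-V3 through the auto classes and generate.
# The checkpoint id is an assumption; substitute the actual hub repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V3"  # assumed hub id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # FP8 checkpoints: see the discussion below
    device_map="auto",
)

inputs = tokenizer("Hello, DeepSeek-V3!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```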

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

to: @ArthurZucker

@Rocketknight1 (Member) commented:

Hi @bzantium, this looks great so far! We'll need tests added for the model plus a green CI; then feel free to ping me to assign a reviewer, or if you have any problems with the port.

@bzantium changed the title from "[WIP] add deepseekv3" to "[WIP] add deepseek-v3" on Jan 29, 2025
@ArthurZucker (Collaborator) left a comment:

Ultra kudos! It's super nice.
Mostly missing tests; here you can use a similar approach to the gemma2 tests, which use inheritance!
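For reference, the inheritance pattern in question looks roughly like the sketch below: the new test file subclasses an existing model tester and overrides only what differs. Class names mirror the transformers test conventions but are illustrative, not the exact merged test code.

```python
# Sketch of an inheritance-based test, mirroring the gemma2 test layout.
# Names are illustrative; the real file lives in
# tests/models/deepseek_v3/test_modeling_deepseek_v3.py.
import unittest

from transformers import DeepseekV3Config
from transformers.testing_utils import require_torch

from ..gemma.test_modeling_gemma import GemmaModelTest, GemmaModelTester


class DeepseekV3ModelTester(GemmaModelTester):
    config_class = DeepseekV3Config  # everything else is inherited


@require_torch
class DeepseekV3ModelTest(GemmaModelTest, unittest.TestCase):
    def setUp(self):
        self.model_tester = DeepseekV3ModelTester(self)
```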

@cuichenx commented:

@bzantium Thanks for the amazing work! I was wondering if you were able to train V3 with FSDP? If so, how many GPUs did you need? Thanks!

@ArthurZucker (Collaborator) commented on Jan 29, 2025:

One big thing would be TP support: the base_tp_plan would probably need to be updated to make sure each MLP's gate/up/down projections have the correct order, unless direct usage of dist removes this need.
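For context on what that plan looks like: transformers reads a base_model_tp_plan mapping from module-name patterns to sharding styles. A hedged sketch with assumed module paths (the MLA attention and MoE layers in the final code may be sharded differently):

```python
# Sketch of a tensor-parallel plan in the transformers style: glob patterns
# over module names mapped to sharding strategies. Module paths here are
# assumptions for illustration, not the merged plan.
base_model_tp_plan = {
    # gate/up are column-wise so each rank holds a slice of the expert's
    # intermediate dimension; down is row-wise so outputs get all-reduced.
    "layers.*.mlp.experts.*.gate_proj": "colwise",
    "layers.*.mlp.experts.*.up_proj": "colwise",
    "layers.*.mlp.experts.*.down_proj": "rowwise",
    "layers.*.mlp.shared_experts.gate_proj": "colwise",
    "layers.*.mlp.shared_experts.up_proj": "colwise",
    "layers.*.mlp.shared_experts.down_proj": "rowwise",
}
```

The gate/up/down ordering matters because column-wise and row-wise shards must alternate for the intermediate activations to stay local to each rank.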

@ArthurZucker (Collaborator) commented:

Okay, our Hub has small issues, but these are the only tests left to pass:

FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_can_use_safetensors - AssertionError: DeepseekV3ForCausalLM: Tensor model.layers.4.mlp.gate.weight: Tensor-likes are not close!
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_feed_forward_chunking - AssertionError: Tensor-likes are not close!
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_from_pretrained_no_checkpoint - AssertionError: False is not true
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_initialization - AssertionError: -0.0012460040161386132 not found in [0.0, 1.0] : Parameter layers.2.mlp.gate.weight of model <class 'transformers.models.deepseek_v3.modeling_deepseek_v3.DeepseekV3Model'> seems not properly initialized
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_left_padding_compatibility - AssertionError: Tensor-likes are not close!
FAILED tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_load_save_without_tied_weights - AssertionError: DeepseekV3Model: Tensor layers.4.mlp.gate.weight: Tensor-likes are not close!
[DeepseekV3ForCausalLM] SUBFAIL tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_training_gradient_checkpointing - AssertionError: False is not true : model.layers.2.mlp.experts.4.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!
[DeepseekV3ForCausalLM] SUBFAIL tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_training_gradient_checkpointing_use_reentrant - AssertionError: False is not true : model.layers.2.mlp.experts.2.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!
[DeepseekV3ForCausalLM] SUBFAIL tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_training_gradient_checkpointing_use_reentrant_false - AssertionError: False is not true : model.layers.2.mlp.experts.2.gate_proj.weight in DeepseekV3ForCausalLM has no gradient!
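The three gradient-checkpointing subfailures are a classic MoE pitfall: with a tiny test batch, the router can leave some experts with no tokens at all, so those experts' weights never receive a gradient (the commit history below resolves this with "route all tokens to all experts when testing"). A hedged sketch of that idea, with DeepSeek-style config field names assumed:

```python
# Hedged sketch: in tests, make the router select every expert for every
# token so all expert weights participate in the backward pass.
# Field names (n_routed_experts, num_experts_per_tok) follow the DeepSeek
# convention but are assumptions here.
def densify_moe_for_tests(config):
    # top-k == number of routed experts => no expert is left without tokens,
    # hence no "has no gradient!" assertion failures.
    config.num_experts_per_tok = config.n_routed_experts
    return config
```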

@ArthurZucker (Collaborator) commented:

All tests pass locally! Will merge; as I said, super happy to have contributions to improve this!

@bzantium (Contributor, Author) commented:

Sorry for not updating the code from my side for a long time, and thanks for your contributions! I was quite busy taking care of my baby 😢. Happy to see all the tests passing!

@ArthurZucker (Collaborator) commented:

Thanks a lot for your help @bzantium! 🤗 Kudos to you and @mseeger!

@bzantium (Contributor, Author) left a review comment:

Suggestion for a typo!

final_hidden_states.index_add_(0, token_indices, weighted_output)

# in original deepseek, the output of the experts are gathered once we leave this module
# thus the moe module is itelsf an IsolatedParallel module

Suggested change:

- # thus the moe module is itelsf an IsolatedParallel module
+ # thus the moe module is itself an IsolatedParallel module
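For readers skimming the snippet above: index_add_ is the scatter step of the MoE forward pass, accumulating each expert's weighted output back into the rows of the tokens routed to it. A self-contained illustration with made-up shapes:

```python
# Illustration of the index_add_ scatter used in the MoE module above.
# Shapes and values are made up for demonstration.
import torch

num_tokens, hidden = 6, 4
final_hidden_states = torch.zeros(num_tokens, hidden)

# Suppose the router sent tokens 1, 3, and 4 to one expert.
token_indices = torch.tensor([1, 3, 4])
weighted_output = 0.5 * torch.ones(3, hidden)  # expert output * routing weight

# Accumulate the expert's contribution into the matching token rows;
# index_add_ sums correctly when a token is picked by several experts.
final_hidden_states.index_add_(0, token_indices, weighted_output)
print(final_hidden_states)
```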

@ArthurZucker merged commit eca74d1 into huggingface:main on Mar 28, 2025 · 18 checks passed
@ArthurZucker (Collaborator) commented:

Oops, sorry!

@bzantium (Contributor, Author) commented:

It's totally okay, great work!

@Neo9061 commented on Mar 28, 2025:

Hi everyone, thanks a lot for the contributions! Does this mean HF now supports loading the checkpoints in FP8 directly, or do we need to convert them to bf16 first using DeepSeek's script?

@ArthurZucker (Collaborator) commented:

It loads FP8 directly!

@ArthurZucker (Collaborator) commented:

Conversion is needed to run in bf16, though, or a PR to support on-the-fly decompression of the FP8!
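Putting the two answers together, the two loading paths look roughly like this (the hub id and local path are assumptions; the conversion script referenced is the one shipped in DeepSeek's own repository):

```python
# Hedged sketch of the two paths discussed above; ids/paths are assumptions.
import torch
from transformers import AutoModelForCausalLM

# Path 1: load the FP8 checkpoint directly.
model_fp8 = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3",
    device_map="auto",
)

# Path 2: run in bf16 by first converting the FP8 weights with DeepSeek's
# conversion script, then loading the converted checkpoint as usual.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    "path/to/converted-bf16-checkpoint",  # output of DeepSeek's script
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```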

@bzantium (Contributor, Author) commented on Apr 2, 2025:

@mseeger Sorry for the late merge; now you can create an issue and open a PR for efficient inference!

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request on May 14, 2025:
* init commit

* style

* take comments into account

* add deepseekv3 modeling

* remove redundant code

* apply make style

* apply fix-copies

* make format

* add init files

* rename deepseekv3 into deepseek_v3 based on its model_type

* rename deepseekv3 into deepseek_v3 based on its model_type

* deepseek-v3 not deepseek_v3

* set model_type as deepseek_v3

* use default docs

* apply make

* fill type and docstring

* add rope_config_validation

* use custom DeepseekV3MLP

* hold code only for checkpoints configuration; remove redundant

* revise rope yarn for DeepSeek variation

* rename DeepSeek-V3

* some refactoring

* revise load_hook to work properly; make moe func trainable; use llama instead of mixtral

* fix attention forward

* use -1 for not-changing dim when using expand

* refactor DeepseekV3TopkRouter

* use reshape_for_rope instead of load_hook; revise attention forward for TP; rename q_head_dim with qk_head_dim

* register pre_hook and hook both

* make style

* use n_shared_experts

* Update src/transformers/models/deepseek_v3/configuration_deepseek_v3.py

Co-authored-by: Arthur <[email protected]>

* add test file

* update modeling_file according to modular file

* make style

* add mapping for DeepseekV3ForSequenceClassification

* remove aux_loss_alpha

* add deepseek_v3 for perf

* add deepseek_v3

* rename test as deepseekv3

* use tiny-deepseek-v3

* remove DeepseekV3ForSequenceClassification

* cache before padding

* remote output_router_logits

* Revert "remote output_router_logits"

This reverts commit f264f80.

* remove output_router_logits

* make e_score_correction_bias as buffer

* skip tests not compatible

* make style

* make e_score_correction_bias as buffer

* use rope_interleave instead of load_hook

* skip tests not compatible with MLA

* add doc for rope_interleave

* fix typo

* remove torch.no_grad for selecting topk

* fix post merge issue

* merge with main and simplify

* nits

* final

* small fixes

* fix

* support TP better

* stash

* changes currently requires

* remove synch

* more fixes for TP

* temp fix for TP : some attention layers's FP8 scales are too small + shared is local colwise and anything is local if FP8 because weights are used

* updates to have generation work!

* push most of the changes

* reorder functions + call for contributions!

* update readme

* nits

* update

* ruff was updated on main

* merge with main and fix copies

* revert unrelated changes

* route all tokens to all experts when testing to avoid no gradient issues

* finish fixing all tests

* fixup

* nit

* clean config

* last readme changes

* nit

* do cnit

* typo

* last nit

* one more one more

---------

Co-authored-by: Arthur Zucker <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Successfully merging this pull request may close these issues:

DeepSeek V3 Support (#35425)