Issues: microsoft/DeepSpeed
#6913 [BUG] Cannot access local variable 'locations' where it is not associated with a value
Labels: bug, compression. Opened Dec 25, 2024 by Guodanding.
#6912 [BUG] FAILED: multi_tensor_adam.cuda.o with
Labels: bug, training. Opened Dec 24, 2024 by XueruiSu.
#6911 [BUG] Convergence Issue: Training BERT for Embedding with ZeRO-2 and ZeRO-3 compared to torchrun
Labels: bug, training. Opened Dec 24, 2024 by dawnik17.
#6910 [BUG] RuntimeError: The size of tensor a (2048) must match the size of tensor b (1024) at non-singleton dimension 2
Labels: bug, deepspeed-chat. Opened Dec 24, 2024 by Lowlowlowlowlowlow.
#6908 [REQUEST] Is FP8 training supported?
Labels: enhancement. Opened Dec 24, 2024 by janelu9.
#6906 [BUG] RuntimeError: Unable to JIT load the fp_quantizer op due to it not being compatible due to hardware/software issue. FP Quantizer is using an untested triton version (3.1.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
Labels: bug, compression. Opened Dec 23, 2024 by GHBigD.
#6902 [BUG] Triton kernel: loss 0, grad-norm NaN
Labels: bug, training. Opened Dec 22, 2024 by mdy666.
#6901 [REQUEST] Support for XLA/TPU
Labels: enhancement. Opened Dec 21, 2024 by radna0.
#6896 prterun noticed that process rank 7 with PID 0 on node gpu0304 exited on signal 6 (Aborted).
Opened Dec 19, 2024 by fabiogeraci.
#6892 DeepSpeed with ZeRO-3 strategy cannot build 'fused_adam'
Labels: bug, training. Opened Dec 18, 2024 by LeonardoZini.
#6878 How can DeepSpeed be configured to prevent the merging of parameter groups?
Opened Dec 16, 2024 by CLL112.
#6877 How do I know whether a stage-3 run succeeded when using DeepSpeed?
Labels: training. Opened Dec 16, 2024 by hwhyyds.
#6875 [BUG] Cannot use --hostfile to start multi-node training in Docker.
Labels: bug, training. Opened Dec 16, 2024 by Ind1x1.
#6871 Windows wheel build error - tried everything with all listed requirements
Labels: build, windows. Opened Dec 14, 2024 by FurkanGozukara.
#6870 [BUG] Invalidate trace cache @ step 10: expected module 11, but got module 19
Labels: bug, training. Opened Dec 14, 2024 by yafuly.
#6868 [BUG] Mismatch of model parameters when using Sequence Parallel
Labels: bug, training. Opened Dec 13, 2024 by chetwin-character.
#6857 [BUG] When fine-tuning an LLM, the following error occurs after training for some time: self.optimizer.param_groups[param_group_id]['params'] = [] IndexError: list index out of range
Labels: bug, training. Opened Dec 12, 2024 by tdtgi.
#6853 [BUG] Unable to Use quantization_setting for Customizing MoQ in DeepSpeed Inference
Labels: bug, compression. Opened Dec 11, 2024 by cyx96.
#6851 [QUESTIONS] Some questions about running Domino
Labels: enhancement. Opened Dec 11, 2024 by yingtongxiong.