Making an issue to track expected work for DeepSeek experimental:
1 - Integrate DeepGEMM support (contiguous) as an additional inference option - this uses groupwise/blockwise fp8 quantization (see the quantization sketch after this list) - completed (#1124)
1A - add Triton contiguous group GEMM (AMD compatible) - completed (#1154)
2 - refactor token processing to avoid code duplication - PR #1127
3 - add proper training loop support - initial working PR landed (see train_ds_real.py).
4 - add basic unit tests for check-ins
5 - review AMD port for Symmetric Memory (PR merged into PyTorch core; still need to verify it runs on AMD).
6 - finalize which group GEMMs we want to support long term (torch bf16 + DeepSeek for fp8?). AMD? Updates:
fix for the torch.group_gemm hang landed (#1166), so this now has full training support.
torch._scaled_mm with wrappers via torchAO, and thus fp8 rowwise, has been added for DeepSeek inference (#1142); see the rowwise sketch after this list.
7 - implement stats tracking for experts (exflow optimization) and, from that, more efficient expert placement. Update: initial token tracking is in place for topk==1; it still needs to be expanded to topk==6 (see the per-expert counting sketch after this list).
8 - large scale training runs to prove out everything.
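For reference on item 1: a rough sketch of the 1x128 groupwise fp8 quantization that a DeepGEMM-style contiguous group GEMM consumes. The block size, layout, and helper name here are illustrative assumptions, not the repo's actual API.

```python
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max
BLOCK = 128  # assumed elements per quantization group along the inner dim

def quantize_groupwise(x: torch.Tensor):
    """Quantize (M, K) activations to fp8 e4m3 with one scale per 1x128 block.

    Returns the fp8 tensor plus (M, K // BLOCK) dequantization scales.
    """
    m, k = x.shape
    assert k % BLOCK == 0, "inner dim must be a multiple of the block size"
    blocks = x.reshape(m, k // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax                 # maps each block into fp8 range
    x_fp8 = (blocks * scale).to(torch.float8_e4m3fn).reshape(m, k)
    return x_fp8, scale.reciprocal().squeeze(-1).float()
```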
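For item 6's update: a minimal sketch of fp8 rowwise quantization feeding torch._scaled_mm. The torchAO wrappers from #1142 hide these details; the helper names below are hypothetical, and the call assumes a recent PyTorch build plus an fp8-capable GPU where _scaled_mm accepts per-row / per-column scales.

```python
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_rowwise(x: torch.Tensor):
    """Quantize a 2D tensor to fp8 e4m3 with one scale per row."""
    amax = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale.reciprocal().float()    # dequantization scale, shape (rows, 1)

def fp8_rowwise_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """C = A @ B with rowwise-scaled fp8 inputs, bf16 output (sketch only)."""
    a_fp8, a_scale = quantize_rowwise(a)        # scales: (M, 1)
    b_fp8, b_scale = quantize_rowwise(b.t())    # quantize B per column; scales: (N, 1)
    return torch._scaled_mm(
        a_fp8,
        b_fp8.t(),                              # _scaled_mm expects mat2 column-major
        scale_a=a_scale,                        # (M, 1)
        scale_b=b_scale.t(),                    # (1, N)
        out_dtype=torch.bfloat16,
    )
```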
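For item 7: a small sketch of per-expert token counting that handles any top-k uniformly, so the same code covers topk==1 and topk==6. Function and variable names are illustrative, not the existing tracking code.

```python
import torch

def update_expert_counts(topk_ids: torch.Tensor, counts: torch.Tensor) -> torch.Tensor:
    """Accumulate how many tokens were routed to each expert.

    topk_ids: (num_tokens, k) expert indices from the router; k may be 1 or 6.
    counts:   (num_experts,) running totals, updated in place.
    """
    # Flattening the (tokens, k) index tensor makes k == 1 and k == 6 identical cases.
    counts += torch.bincount(topk_ids.reshape(-1), minlength=counts.numel())
    return counts

# Example: 4 tokens, top-2 routing over 8 experts.
counts = torch.zeros(8, dtype=torch.long)
topk_ids = torch.tensor([[0, 3], [3, 5], [1, 3], [0, 7]])
update_expert_counts(topk_ids, counts)
print(counts)  # tensor([2, 1, 0, 3, 0, 1, 0, 1])
```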