
doc and swap stuff #1


Open · wants to merge 323 commits into base: yz/softmax-misalignment

Conversation

Carol25170 (Owner)

No description provided.

aakhundov and others added 30 commits February 13, 2023 15:27
…A#790)

* Fix MHA kernel


* Extend DualGemm to support batched mode (NVIDIA#5)

Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set.
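The commit message names the relevant symbols; below is a minimal sketch of the described compatibility check, with the surrounding method assumed rather than taken from the actual examples/45_dual_gemm source:

```
// Sketch only: in the real device-level DualGemm this logic would live in
// a method such as can_implement(). Batched execution and serial split-K
// typically claim the same grid dimension, hence the mutual exclusion.
static cutlass::Status can_implement(Arguments const &args) {
  if (args.mode == DualGemmMode::kBatched && kSplitKSerial) {
    return cutlass::Status::kErrorInvalidProblem;
  }
  return cutlass::Status::kSuccess;
}
```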

* Decouple LayoutB0 and LayoutB1 in DualGemm

The DualGemm template assumed the same layout, LayoutB, for both right-hand operand matrices B0 and B1. This is problematic when the two matrices have different layouts. In particular, one of the matrices may be row-major while the other is a (column) vector that has to be broadcast in column-major with zero stride (e.g., as {B1.device_data(), 0}) for the DualGemm implementation to be able to process B0 and B1 simultaneously.

In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). Additionally, the batch strides of B0 and B1 are also decoupled to accommodate the column vector B1 case described above.
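To make the broadcast concrete, here is a sketch of the zero-stride binding mentioned above (the element type and pointer name are assumptions for illustration):

```
#include <cutlass/tensor_ref.h>
#include <cutlass/layout/matrix.h>
#include <cutlass/half.h>

// B1 holds a single M x 1 column vector. A column-major layout with
// stride 0 maps element (i, j) to offset i + j * 0 = i, so every column
// aliases the same vector and B1 is effectively broadcast across columns.
cutlass::TensorRef<cutlass::half_t, cutlass::layout::ColumnMajor>
    ref_B1(B1_vector_device_ptr, cutlass::layout::ColumnMajor(0));
```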

* Remove comment as no longer relevant

* Revert "Fix MHA kernel"

---------

Co-authored-by: mikeiovine <[email protected]>
fix the copyright of a new file.
* fix typo

* fix a dead link to code
copyright banner
* Changes to iterators to support s8 gemm with f16 outputs

* should work
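As a sketch of what an s8-in/f16-out configuration looks like at the device API level (tile sizes are left at defaults; the layout choices are illustrative assumptions, not taken from the commit):

```
#include <cutlass/gemm/device/gemm.h>
#include <cutlass/layout/matrix.h>
#include <cutlass/half.h>

// int8 operands, int32 accumulation, half-precision (f16) output.
using GemmS8F16 = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,           // A
    int8_t, cutlass::layout::ColumnMajor,        // B
    cutlass::half_t, cutlass::layout::RowMajor,  // C/D: f16 outputs
    int32_t>;                                    // accumulator
```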

---------

Co-authored-by: Sujan Gonugondla <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* fMHA: Add support for bias+dropout in FW

* Remove 'getMaximumSharedMemoryPerBlockKb'

* fix comments

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* ex24[gemm_grouped]: Allow to change layout/dtype

* Address suggestion from @jackkosaian

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Aniket Shivam <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* Hide streams and typeinfo from NVRTC

* Use __CUDACC_RTC__ instead of __CUDA_ARCH__ for the guard
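The distinction matters because __CUDA_ARCH__ is only defined during device-side compilation passes, while __CUDACC_RTC__ is defined for the whole translation unit whenever NVRTC is the compiler. A guard in the spirit of the commit:

```
#if !defined(__CUDACC_RTC__)
// Host-only facilities such as streams and typeinfo do not exist under
// NVRTC, so they are compiled out of the runtime-compiled path.
#include <iostream>
#include <typeinfo>
#endif
```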
* expose StoreT parameter for potential speed

* add storeT to more elementwise

---------

Co-authored-by: Haicheng Wu <[email protected]>
`std::vector<T>::size_type` is an unsigned type, so let's iterate with an unsigned type as well


Discovered while trying to enable building PyTorch without the `-Wno-sign-compare` warning suppression; see https://github.com/pytorch/pytorch/actions/runs/4418987999/jobs/7746850762#step:10:10532
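The fix pattern, shown on a hypothetical loop rather than a specific line from the change:

```
#include <vector>

void scale(std::vector<float> &v) {
  // An unsigned index (std::vector<float>::size_type, i.e. size_t) makes
  // the comparison against v.size() same-signedness, silencing
  // -Wsign-compare without a cast.
  for (std::vector<float>::size_type i = 0; i < v.size(); ++i) {
    v[i] *= 2.0f;
  }
}
```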
msft moe paper
* add bytetransformer

* update arxiv link

* re-order
* add guards for sm>=70

* drop guard to 530
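Such guards look roughly like this; 530 corresponds to SM 5.3, the first architecture with native half-precision arithmetic (the guarded body is a placeholder):

```
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530)
  // Code that requires SM 5.3+ features, e.g. native fp16 arithmetic.
#endif
```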
* [layout] Fix AffineRank2ColumnMajor::packed()

* correct affine2row::packed
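A plausible reading of the fix, written as a hypothetical sketch rather than the actual CUTLASS source: packed() for an affine rank-2 column-major layout should yield unit stride along the contiguous row dimension and a column stride equal to the row extent.

```
// Hypothetical sketch of the corrected semantics (not the real source).
static AffineRank2ColumnMajor packed(MatrixCoord const &extent) {
  // Contiguous down a column: row stride 1, column stride = rows.
  return AffineRank2ColumnMajor(1, extent.row());
}
```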

---------

Co-authored-by: Haicheng Wu <[email protected]>
Xinyu302 and others added 25 commits October 23, 2024 12:44
* Fix README

* Improve README

---------

Co-authored-by: Haicheng Wu <[email protected]>
* Include of regular_tile_iterator.h fixed for NVRTC

* More includes fixed for NVRTC
…s/gemm/device/gemm_universal.h" (NVIDIA#1569)

fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`
…_Traits support (NVIDIA#1856)

* fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and add support for m8n8k128 and m16n8k128 mma.and.popc in the MMA_Traits instantiations

* add "print" template for subbyte_reference<T>
…rs (NVIDIA#1931)

* move two warpgroup_wait

* merge main

---------

Co-authored-by: Siyuan Fu <[email protected]>
* Fix `cutlass` python library with cuda `12.6.2.post1`

Previously we had this error:
```
  File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
    _version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
                       ^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```

* Update sm90_utils.py

* Update generator.py

* Update python/cutlass_library/generator.py

Co-authored-by: Jack Kosaian <[email protected]>

* Update python/cutlass_library/sm90_utils.py

Co-authored-by: Jack Kosaian <[email protected]>

---------

Co-authored-by: Jack Kosaian <[email protected]>
vercel bot commented May 24, 2025: the cutlass preview deployment was Ready (updated May 28, 2025, 2:12am UTC).

Carol25170 changed the base branch from master to yz/softmax-applyshape on May 27, 2025 at 22:38.
Carol25170 changed the base branch from yz/softmax-applyshape to yz/softmax-misalignment on May 27, 2025 at 22:40.