forked from yzhaiustc/cutlass
doc and swap stuff #1
Open
Carol25170 wants to merge 323 commits into yz/softmax-misalignment from v3.6.0-update
Conversation

No description provided.
…A#790)
* Fix MHA kernel (Summary: ATT)
* Extend DualGemm to support batched mode (NVIDIA#5). Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set.
* Decouple LayoutB0 and LayoutB1 in DualGemm. The DualGemm template assumed the same layout, LayoutB, for both right-hand operand matrices B0 and B1. This is problematic if the layouts of the two matrices differ; in particular, one matrix may be row-major while the other is a (column) vector that has to be broadcast in column-major with zero stride (e.g., as {B1.device_data(), 0}) for the DualGemm implementation to process B0 and B1 simultaneously (see the zero-stride sketch after the commit list). In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). The batch strides of B0 and B1 are also decoupled to accommodate the column-vector B1 case described above.
* Remove a comment that is no longer relevant
* Revert "Fix MHA kernel"
Co-authored-by: mikeiovine <[email protected]>
fix the copyright of a new file.
Co-authored-by: Haicheng Wu <[email protected]>
* fix typo
* fix a dead link to the code copyright banner
* Changes to iterators to support s8 gemm with f16 outputs
* should work
Co-authored-by: Sujan Gonugondla <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* fMHA: Add support for bias+dropout in FW
* Remove 'getMaximumSharedMemoryPerBlockKb'
* fix comments
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* ex24[gemm_grouped]: Allow changing layout/dtype
* Address suggestion from @jackkosaian
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Aniket Shivam <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* Hide streams and typeinfo from NVRTC
* Use __CUDACC_RTC__ instead of __CUDA_ARCH__ for the guard (see the guard sketch after the commit list)
* expose StoreT parameter for potential speed
* add StoreT to more elementwise
Co-authored-by: Haicheng Wu <[email protected]>
`std::vector<T>::size_type` is an unsigned type, so iterate with an unsigned index as well (see the loop sketch after the commit list). Discovered while trying to enable building PyTorch without the `-Wno-sign-compare` warning suppression; see https://github.com/pytorch/pytorch/actions/runs/4418987999/jobs/7746850762#step:10:10532
msft moe paper
* add bytetransformer
* update arxiv link
* re-order
* add guards for sm>=70
* drop guard to 530 (see the architecture-guard sketch after the commit list)
* [layout] Fix AffineRank2ColumnMajor::packed()
* correct affine2row::packed
Co-authored-by: Haicheng Wu <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
* Fix README
* Improve README
Co-authored-by: Haicheng Wu <[email protected]>
* Include of regular_tile_iterator.h fixed for NVRTC
* More includes fixed for NVRTC
…s/gemm/device/gemm_universal.h" (NVIDIA#1569)
Fixes compilation with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`
…A#1894) Co-authored-by: Siyuan Fu <[email protected]>
…_Traits support (NVIDIA#1856)
* fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and add support for m8n8k128, m16n8k128 mma.and.popc in the MMA_Traits instantiation
* add a "print" template for subbyte_reference<T>
…rs (NVIDIA#1931)
* move two warpgroup_wait
* merge main
Co-authored-by: Siyuan Fu <[email protected]>
* Fix `cutlass` python library with cuda `12.6.2.post1`. Previously we had this error:
```
File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
  _version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
                     ^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```
* Update sm90_utils.py
* Update generator.py
* Update python/cutlass_library/generator.py (Co-authored-by: Jack Kosaian <[email protected]>)
* Update python/cutlass_library/sm90_utils.py (Co-authored-by: Jack Kosaian <[email protected]>)
Co-authored-by: Jack Kosaian <[email protected]>
* update
* fix a typo
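A minimal sketch (plain C++, not the DualGemm API; the sizes and the helper function are illustrative) of the zero-stride broadcast relied on by the "Decouple LayoutB0 and LayoutB1" commit above: with a column-major layout whose stride is 0, every column maps back onto the same length-M vector, which is why passing {B1.device_data(), 0} lets a column vector stand in for an M x N matrix B1.

```
#include <cstdio>
#include <vector>

// Column-major indexing: element (row, col) lives at offset row + col * stride.
// With stride == 0, all columns alias the same underlying vector.
long offset_column_major(long row, long col, long stride) {
  return row + col * stride;
}

int main() {
  const int M = 4, N = 3;
  std::vector<float> b1 = {1.f, 2.f, 3.f, 4.f};  // length-M column vector

  // Reading "matrix" B1 through a zero stride broadcasts b1 across all N columns.
  for (int col = 0; col < N; ++col) {
    for (int row = 0; row < M; ++row) {
      float value = b1[offset_column_major(row, col, /*stride=*/0)];
      std::printf("B1(%d,%d) = %g\n", row, col, value);
    }
  }
  return 0;
}
```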
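A sketch of the guard pattern named in the "Hide streams and typeinfo from NVRTC" commit; the helper function and its body are hypothetical. The point is only that NVRTC defines __CUDACC_RTC__ for every compilation it performs, whereas __CUDA_ARCH__ is defined only during device-code passes, so it is the wrong macro for hiding host-only headers.

```
// Host-only headers and typeid-based diagnostics are compiled out under NVRTC.
#if !defined(__CUDACC_RTC__)
#include <iostream>
#include <typeinfo>
#endif

// Hypothetical helper: prints the element type when compiled by a host compiler,
// and compiles to nothing under NVRTC.
template <typename T>
void report_element_type() {
#if !defined(__CUDACC_RTC__)
  std::cout << "element type: " << typeid(T).name() << "\n";
#endif
}
```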
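A sketch of the sign-compare fix from the `-Wno-sign-compare` commit (the function itself is made up): indexing with the container's unsigned size type keeps the comparison against `v.size()` between like types, so `-Wsign-compare` no longer fires.

```
#include <cstddef>
#include <vector>

float sum(const std::vector<float>& v) {
  float total = 0.f;
  // was: for (int i = 0; i < v.size(); ++i)  -> signed/unsigned comparison warning
  for (std::size_t i = 0; i < v.size(); ++i) {
    total += v[i];
  }
  return total;
}
```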
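A sketch of the architecture guard from the "add guards for sm>=70 / drop guard to 530" commit; the kernel is hypothetical and only shows the shape of the guard. Native fp16 arithmetic such as __hmul is available from compute capability 5.3, which is why a guard can be relaxed from 700 to 530.

```
#include <cuda_fp16.h>

// Scale a half-precision array in place; the body is compiled only for sm_53+.
__global__ void scale_half(__half* data, __half alpha, int n) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530)
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    data[i] = __hmul(data[i], alpha);  // __hmul requires __CUDA_ARCH__ >= 530
  }
#endif
}
```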