forked from Dao-AILab/flash-attention
[CI] use docker run #145
Draft
micmelesse wants to merge 30 commits into main_perf from micmelesse/actions_docker
Enable Fwd and Backward Enable Fwd and Backward Enable fwd and varlen_fwd on AMD (#63) * flash_attn_func works Compress This is a combination of 12 commits. add scripts save add our kernel import our kernel round trip use bshd layout figure out segfault fix show backward failure with prints save backward work run forward only test smallest config on everything add test fix remove pre commit install triton skip dropout pin d 32 factor d just run power of 2 remove timeout run serially clean up clean up 2 * Varlen works This is a combination of 6 commits. save some tests passing enable more enable everything move around alibi works * keep interface and kernel seperate * clean up enable flash_attn_with_kvcache (#68) * Compress kvcache work This is a combination of 11 commits. kvcache work This is a combination of 4 commits. kvcache is not supported save save decode save clean up merge save cases save save save save key mask on triton side fix q size issue test combos save * fix causal. use cache_seqlens * clean and test what works * some configs work on new_kv but fails on 1,8 * cache overwrite correct * new_kv works more or less * test local * work on paged kv attention * prefill paged attention * fix has_batch_idx and skip local and rotatary emb * save * save * save * save * handle new_kv when paged kv cache * all except has_batch_idx works * major options are green * test all * add tests * save * clean up * minor clean up * simplest config * save debug true * save * refactor slightly * save work * need key masking * force hip * use is_hip * save * fix cache_seq_len issue * work on new_kv * pass new_kv data * save * benchmark fwd only * disable debug * pandas pdf * save * set methods * record number of heads * use configs * flexiable dim, n-heads, headofdim * better benchmarking * basic inplace update working * works upto 64 * new_kv supported! * test case for has_batch_idx * has_batch_idx works! 
* save * save * save * save ref * fix mqa and gqa by duplicating * GQA and MQA working by kernel modifications * fix new_kv with gqa * cache index * deal with nans on fwd_splitk * save * causal working on basic case * causal works! * alibi works! * clean up * clean prefill changes * remove bwd stuff * limit decode test to test_op_fwd * add ref * use bfloat Fixes after rebase Fixes after rebase rebase fixes deal with kvcache failure new run for branch cancel-in-progress fix varlen_fwd bug enable packed layouts and all configs (#72) Clean up for Upstream (#81) * Clean Clean This is a combination of 4 commits. clean 1 clean 2 clean more match main typo fix * use is_hip() * clean up more * skip odd d only * fix bug * skip randomly * use Flag * update readme * remove quantization * remove bwd * minor * print * remove verbose print * qunatize zero's out the d stride Enable Vanilla Bwd and Refactor (#86) * Vanilla BWD Vanilla BWD This is a combination of 79 commits. save test_flash_attn_output use impl functions pass layout add ref move arround impls fix stride issue save oai kernel add baseline impl save bwd kernel working remove old impl remove block_ptrs from bwd pass padded dmodel and apply masking. the old test cases work but cases with small d don't work save save more prints rename to M to L save add notes add old_bwd back fa failure fails in kernels too isolate new bwd and keep old bwd in place clean up softmax_lse doesnot match refernce LOG flag softmax_lse with LN2 move qk_scale to loop pass ln2 to fwd just print kernel input test softmax output from forward test exp_scores_triton save all the ref create ref USE_EXP2 path return scores mask scores when returning them. Basic impl test passes scores and output match show max_diff return score needs to be adjusted as we find new maxes all good outputs. old style RCP2 example prep bwd_impl test save try openai save fix softmax_lse bug test_op_bwd_impl starting to work! new kernel. 
exp2 works but exp is faliing fix bwd exp2 add m and n masks. small cases still don't work match old and new kernel prints compare old and new print inputs save old kernel match on dv dq works compare to pytorch including softmax in forward fix bwd impl bug small sizes in bwd impl work old bwd test pass. Moving on to kernel tests dq, dk and dv are filled in place if given. Need to match cast to match fa fix non bug fix dv mismatch. use_exp2 was set to true in fwd fix case up 128 refactor and clean up a bit more issue is that dq and dk are not zeros dq must be zeroed out ignore segfaults fa ref and my ref match! all tests run use tolerance 1e-3 we need to figure out preprocessing save clean up save test delta diff move old impl out new preprocess function preprocessing_use_o flag working _bwd_preprocess_use_p basic cases pass all green fwd exp2 usage is done right before exp * refactor * refactor 2 * refactor 3 * fix bug * try ci * add flag * rename to utils * skip test_op_fwd_decode_int4_kv * reduce head size * try again * go back to old head sizes * Use Strides Use Strides This is a combination of 11 commits. use strides in bwd add layout test in forward fix shape layout function smaller tests save fix varlen error no headsize passed to bwd deal with varlen layout save save save save * use gen scripts * varlen fwd passing * core fwd ref impl * fix minor bugs * wrap varlen- launcher attention_forward_pytorch_ref_impl * varlen backward ref added * add offsets for varlen * fix delta bug * varlen bwd working * save * runs on Mi200 * just test basics * save * fix bug * fix varlen in64 bug * add ref * test_impl working with causal * fix qkvpacked issue * qkvpacked run tests * remove test_backward * save * just test output * dump into tensors * softmaxlse layout for varlen * small cases working * bwd thd green. although maybe some oom * forward out and lse are good. 
Something wrong with backward ref * make varlen ref work * save work, ref is working mostly * 91 failed, 6542 passed, 6336 skipped, 1 warning * ref is all green * debug flag in utils * found bad softmax_lse in varlen fwd * fix bug in softmax lse. strides in varlen werenot right * add causal tests and 32*32 bwd doesnot have segfault * save * fix oom by reducing block size for small heads * bwd ref with causal working * test impl * causal test passes * causal working * fix tests * nicer bench * fix qvpacked error * fix varlen qvpacked bug * fix minor bug * bench prefill and prefill_old using the same script * autotune configs for fwd * autotune flag * clean up decode impl * clean up * clean up more * bench everything by default and return time * clean up readmes REBASE: fix interface changes in rebase rename test to test_flash_attn_triton_amd REBASE: fix unpad diffs minor clean up in setup FLASH_ATTENTION_TRITON_AMD flags bench fwd and bwd fix sequence_parallel
* sequence_parallel working on bwd_impl test * fix qkv error * save * save * save * bwd 3 times faster * clean up * fix varlen bug * use copy back dict * fix qkvpacked bug * reduce bench sizes * print copy back
* Autotune off by default * rework tests
* ignore ck code * update triton
* Update README.md * fix readme
* simple failing test * ref is working * fix bug * save * find failing case * fowrad varlen mqa/gqa works * add mqa configs to bwd test * varlen bwd ref fixed * save failing case * GQA flag * ones passes * go back to values * save * bhsd working with mqa * remove repo * test layouts * clean up * test back to normal * clean up more * use zeros_like * zero out
* feat: added rotary support in kvcache * confirmed non-fused rotary passes all tests
* Add RDNA CI This is a combination of 4 commits. try navi try matrix small change try minimal change * limit navi tests * stop casting to fp32 which leads to oom on navi * enable all causal * revert all causal * skip compiler bug on navi
* Alex's work This is a combination of 11 commits. save fix: dropout=0.0 woorks feat: dropout restrictions removed. failing tests test: reduced tests to simple cases test: failure is due to query + key padding mask NOT varlen itself feat: varlen dropout fwd passes fix: varlen bwd dropout works! test: discovered bwd error for non-dropout cases for large seqlen save save use triton commit 3ca2f498e98ed7249b82722587c511a5610e00c4 -- now batched layout passes * Almost Everything works. This is a combination of 16 commits. Work so far This is a combination of 63 commits. pick test case save philox offsets into metadata pass offset to ref common dropout mask simple droput out mask start dropout ref. work on returning SD_Mask next with negative numbers refernce is working dropout bwd ref faling case transfer rng_state properly save changes one dropout mask function save save minizmize diff save use torch.where in backward save save save dk works! passes reference is working. TODO" attn_ref is broken varlen ref working attn failing case with ones. attn_ref matches. fails with randn. we are seeing failure with large sizes from dv. save skip attn matrices compare the masks and find failing case rm cdiv_fn put dropout and alibi in common save compare masks save save pytorch ref is using tiles save save tl_rand_ref cache ref dropout mask new generate_dropout_mask_ref using tiling issolate failing varlen case simple dropout loop on k print rng_outputs save fwd kernel works save dv passed close to dk simple ref save seperate droped and scaled in ref and triton kernel ref changes working delta with dp find failing dv failures find failing case due to delta save delta from dp working bwd impl green enable test fwd save save delete kernels save probably mask application mismatch dump forward dropout pass dropout mask tensor to bwd_core different dropout fraction in fwd and bwd mismatch found on columns greater than 64 fix dropout bug. 
philox was not offset run full suite stop debug and approximate delta fix drop_mask non issue skip attn check clean up common bad varlen config fix varlen bug save * fix datatype mismatch * clean up * use pytorch dropout * It works on MI300. * remove _bwd_preprocess_use_p * fix torch interface bug --------- Co-authored-by: Alex Kranias <[email protected]>
* disable navi * start test * test fp16 against fp8 * save scaling code so far * global scaling * add per_head_scaling * dump qk * save dumping q, k and qk to fp32 tensor * fix pointer bug * save reproducer * dump p and acc * fp8 working with my debug input * save * change api for dequant * pass descale_p * clean up * most working * save * save * varlen half way * some varlen examples work * improve varlen debug input * varlen mostly working * push working cases * fix ref bug * fix backward bug * fix varlen backward bug * use descale to set fp8 * check arch fp8 support * cache arch * try again * skip bad config on MI200 * skip decode nan config on MI200 * fix mistake * skip more * run full suit * Update amd_tests.yml * address comments * navi ci is broken * raise error tolerance to 2.5e-1 * target MI300 directly * show gfx * try again * don't fail matrix if one path fails * try upstream triton * just get MI300 working * Fix install bug This is a combination of 5 commits. try this use --no-build-isolation put route at .python run full suite remove triton * run ref on cpu * move ref test to navi machines * pin triton * add bench deps
* Clean up This is a combination of 4 commits. update base image disable navi for now all causal seems to work on MI300 skip MI200 causal bugs * remove MI200 skips * just run on prs or manually * add navi back * try again * update readme * mark flakey test * ref bug
…kernels (#122) * added the split file * overhauled split file, need to add new kernels * copied triton fa over for reference * added comments * preprocess and dkdv done * fixed dkdv, added dq * fixed assumption on q, kv length different, run but incorrect * added standalone test for split bwd kernel * minor change on the ptr arith * separated the dkdv and dq kernels * GQA works now, onto seqlen q != k * dk,dq working, dv still failing * fixed the masking and num_step calc, now q==k works * added debug print with interpreter, might not work entirely w/o next commit * fixed all issues with q != k * fixed varlen issue * fixup on debug print * fixed dropout, esp w/ varlen * added USE_EXP2 toggle * added noncausal kernel * updated internal test for noncausal and use_exp2 * formatting * fixed dropout from seed bug * added envvar USE_SPLIT to toggle btw bwd kernels * fixed the qkv pack issue and removed hack * added the split kernel into interface_fa.py * change USE_SPLIT to USE_SINGLE_BWD_KERNEL to make split default * removed redundant file * fixed missing import in test * fixed import in interface_fa.py * revert changes in flash_attn_interface.py * updated strides to adapt to various tensor init shape * fixed issue that dqkv not zero'd * disabled the AMD local test
* fix fp8 bug * fix type bug * forgot nones * docker file
* reenable * randomly sample * clean up ci * add pytest-randomly * try again
* update triton commit * disable navi
CI on push to main_perf fix bugs and update ci
* Update README.md * update second readme
* fp8 BWD after figuring out varlen problem This is a combination of 21 commits. fp8 BWD Enable BWD fp8 with split kernel Enable BWD fp8 with per block scale factors for p and ds This is a combination of 9 commits. Enable BWD fp8 This is a combination of 12 commits. add backward test case save clean up disable ci lse is good dv matches reduce diff use do fp8 for dv kinda working group size is a constexpr clean up a bit everything except mqa/gqa works skip mqa cases 20 cases have nan on dropout save what you have disable tests failing enable tests per block descale_p and descale_ds use max(abs(()) clean up tests a bit more fix bug disable ci for now pass variables add flags add alternate path. Still need to load descale factors dv working dk works save add type info for backward fix DEBUG flag bug fix bug with backward. Normal forward works with dropout. Segfault with causal. Varlen has some issues. Might be related to strides. pass descale strides test causal fix causal compiler assert. min head should be 32 remove descale_p save explict name as causal isolate bad case just run fp8 tests bench with autotune min changes cast_fp8 helper cast_varlen_to_fp8 save minor highlight failing configs increase test cases mark failing recategorize misc tests group failing gqa configs add more tests add vis code min ci changes dump folder single image per tensors add tensor comparison gen varlen tensor vis varlen tensors varlen diff nice varlen vis vis function show seqlen in varlen add vis_tensors function simplify add color bars rm vis from test set canvas size. descale values are optional add ck tests add flag to build ck rm ck test assert requires grad ensure q, k, and v require gradients split vis rm interp, 8k and 300 dpi slice per page disable ci for now add more vis code tensor per image is better for vis_close, don't vis if no error. 
also vis all failing varlen tests varlen failures due to different seqlens rm vis code * rm require grad * decast fp8 for ref input, use fp16 as input fix minor things match readme decast fp8 for ref input, use fp16 as input * disable causal * fix bug * pass strides * DEBUG modes work only with interp * zero out varlen bwd grads * zero out everything * varlen dropout and causal works * add descale factors to other apis * save * unify tests * add packing flag * fix copy grad bug * add types, flags for zeroing tensors and accumlating fp32 This is a combination of 5 commits. extend ci time clean more minimize difference add types ZERO_TENSORS and ACCUMLATE_FP32 flags * just pass the output tensors * accumlate forwad in fp32 * fp8 in and fp8 out * return descale factors works for out * start fp8 return for bwd * return dq, dv, dk descale factors * save what you have * custom fp8 api function * add varlen function * test backward with varlen * test fp8 * kv cache fix * clean up interface * add packed api * fix qkv bug * disable bench * run big tests at the end * run in parrallel * Update utils.py * Update amd_tests.yml * add train script * use local configs for testing
* test and bench work compressed enable more tests match test add tests add more tests add nightly and do triton 3.2.0 add deps for benching min diff with og test reset changes rm readme changes reduce splitkv cases enable deterministic, kvpacked, swap_sq_sk & disable local, bfloat increase timeout 720 disable kvpacked skip flaky test be verbose skip config with 1 n_groups use grad strides rename maxseqlen and nonvarlen input helper bench mark api directly min diffs * mv test_op_prefill_bwd_split_impl * save test * test ir for sanity * test qkv ir * use input helper * kvpacked benching added * output do from the lower level functions * clean up packing input changing * clean up bwd * add qkv packed * add causal and dropout as a config * test all normal configs * add types * gen configs * improve configs * fix varlen bug * bench fp8 functions * combine benches * add varlen casting triton kernel * save varlen dataset * debug new cast * 2d casting kernel start & fix layout stride issue * basic cases passing in 2d kernel * all basic cases working * everything working * show correct mode for kvcache * train non varlen * update nightly tests * just latest torch * help text * skip new tests for now * add fns * match tests to main_perf * swap_sq_sk = False * limit to 8 workers * combine when bench fns are more than 1 * start on expanding casting kernel * bshd path for casting kernel * fix casting bshd bug * casting kernel working * Update interface_fa.py * clean up * run all bench marks * Update amd_tests.yml * remove -n 2 from fp8 tests * fix oom configs * remove all -n
* FP8 Bench work pass fp8 dtype gen fp8 values pass descale factors with inputs start work on fp8 output kernel output descale_o * fp8 seems slower * clean up newer benching code. fp8 is slower * output markdown and multiple types * bench all supported_dtypes for function by default * add dockerignore * need the .git for submodule update * ignore training data * get ready for ck * forward ck bench working * triton versus ck works * tuned triton perf comp * collect env flags * bench varlen and kvcache * function configs * show relative percentage diff * postive means triton faster negative means ck is faster * save * add new decode impl with switch flag * batch 1 and nheads 1 seems to work * autotune by default * simple stride calc in old impl * fixed bug due to strides are bhsd * rename the dim_k * clean up * old path works * rm block ptrs for q * rm block_ptrs for k * rm block_ptrs for v * rm block_ptrs from o * disable debug on bench * clean up * clean up names * compute offs_k properly * pass padded head to reduce kernel * fix o_mask bug * rm old impl * lambda grid * save final * ignore git stuff * add inference params to prefill * cache seqlens working * most cases work except newkv * fix minor bugs when runing fwd and bwd * check for backend * don't ignore .git * add modes * bench bwd * add llama configs * test fwd impl * run bwd_impl * move fp8 code * use Decode kernel for kvcache * fix fp8 import bug * fix bug * add arch in report * clean up test suite * fix fp8 typos * run ci * add fused kernel * add one kernel * update ci and readme * report ratios and remove split impl test expand bwd impl test * use split kernel * get one kernel working * use flag to switch bwd mode * clean up test_ir * one kernel has its own copy of the bwd kernels * autotune stub * pass og metaparams by default * add autotune configs * add tuning configs * update fused kernel code * use jingning * no auto tune for bwd * simpler varlen branching * fix constexpr bug * fix varlen fp8 * 
qkv fp8 working * fp8 qkv varlen green * fix bench functions * pick bench functions * bench defaults set * fix bug * add bench deps * bench env variations * per backend env configs * fix bug * add improved fused kernel * fix bug * final clean up
* test alibi * isolate failure * simpler test * clean up alibi * pass alibi to kernels * add stub code for actual alibi computation * add debug input * clean up ref. Use it to dev alibi first * add alibi in fwd ref * save * use compute_alibi_tensor_ref * normal fa works with alibi ref * alibi works on varlen ref * compare with ref * clean up ref prints * fix alibi none issue and use delta do o for ref * don't use alibi helper * alibi is green * run ci * fix test.py bug and update readme
* Fused with Good perf and stride fixed Fix fused bugs isolate failing case fix bug bring back test cases rm split impl in fused use exp2 is global variable now try oom fix save make fused the default limit to reproduce failure return default to split fix head size bug use exp2 back to true * new grid * BLK_SLICE_FACTOR = 1 * add tflops * new commit * test in parrallel * strides added by jusson * disable alibi * fix bugs again * default to fused * add bwd options for varlen * backend filter * default to jingning and batch 4 * best fwd config * fix TRITON_PRINT_AUTOTUNING flag bug * tune * Tuning fwd prefill * add if else * use flag * Minor mask fix * FLIP GRID * use best config for default * print when autotuning * test bfloat16 * fix k and v stride bugs * skip bfloat16 * test kvpacked * disable internal tests * pick default config based on arch * Add alibi in the new bwd kernel (#139) * enable alibi for jinging kernel enable alibi for jinging kernel match * save bad configs * fix alibi and causal bug * disable autotune by default * auto tune when benching is good * set best config * remove env var * Update amd_tests.yml * upgrad to triton==3.3.0 * increase shm * use 64 x 64 for now * save * handle 1d alibi * Add fp8 to fused kernel (#140) * fp8 stuff find test case compute delta fp8 basic fp8 config passing non causal path works * isolate bad case * fix fp8 bug * didnot fix fp8 bug * back to failing test * fp8 tests passing * skip * skip ref tests --------- Co-authored-by: Aliasger Zaidy <[email protected]>
* save * rm keys * fix keys * use GHA_RENDER_DEVICES * normal docker
0c7881d to 805c04a
Use docker run in the GitHub Actions CI.
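As a rough illustration of what a CI step built around docker run might assemble, here is a minimal sketch. It is not taken from this PR's workflow file: the helper name build_docker_run, the image tag, the mount paths, the test path, and the idea of passing an explicit list of render devices are all assumptions made for the example. The device and group flags follow the usual pattern for running ROCm containers.

```python
import shlex


def build_docker_run(image, command, render_devices=(), shm_size="16g",
                     host_dir=".", workdir="/workspace"):
    """Assemble an argv list for `docker run` on a ROCm host.

    Hypothetical helper for illustration; the real workflow may differ.
    """
    argv = ["docker", "run", "--rm", "--device=/dev/kfd"]
    # Expose only the listed render nodes, or all of /dev/dri if none are given.
    if render_devices:
        argv += [f"--device={dev}" for dev in render_devices]
    else:
        argv.append("--device=/dev/dri")
    # GPU access group and a larger shared-memory segment for the test run.
    argv += ["--group-add", "video", f"--shm-size={shm_size}"]
    # Mount the checked-out repository and run the command from inside it.
    argv += ["-v", f"{host_dir}:{workdir}", "-w", workdir]
    argv += [image, "bash", "-c", command]
    return argv


if __name__ == "__main__":
    argv = build_docker_run(
        image="rocm/pytorch:latest",  # placeholder image tag
        command="pytest tests/test_flash_attn_triton_amd.py",  # assumed path
        render_devices=["/dev/dri/renderD128"],  # assumed device node
    )
    print(shlex.join(argv))
```

Building the command as a list and quoting it once with shlex.join keeps the test command intact when the step passes it through a shell.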