add avx2 intrinsics maybe #269

Open
wants to merge 1 commit into
base: master

karpathy
Owner

Roughly minimal changes to maybe add AVX2 intrinsics (unaligned-load version). On my Linux box this speeds things up ~27%, but the gap shrinks a lot when I build with OpenMP.
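For readers who have not seen the diff, here is a minimal sketch of what an unaligned AVX2 matmul along these lines could look like. It is not the code from this PR: the helper name `dot` and the `__AVX2__` guard are assumptions made for illustration (the PR's make target passes `-DLLAMAC_AVX2`, which suggests its own gating macro), and the `matmul` signature follows the `W (d,n) @ x (n,) -> xout (d,)` convention used in run.c. It uses separate multiply and add rather than FMA, because `-mavx2` alone does not enable FMA.

```c
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Dot product of two float arrays of length n.
 * With AVX2 available, processes 8 floats per iteration using unaligned
 * loads, then finishes any remaining elements with a scalar loop. */
static float dot(const float *a, const float *b, int n) {
    float sum = 0.0f;
    int j = 0;
#if defined(__AVX2__)
    __m256 acc = _mm256_setzero_ps();
    for (; j + 8 <= n; j += 8) {
        __m256 va = _mm256_loadu_ps(a + j);   /* unaligned load */
        __m256 vb = _mm256_loadu_ps(b + j);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    /* Horizontal sum: spill the 8 lanes and add them up. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int k = 0; k < 8; k++) sum += lanes[k];
#endif
    /* Scalar tail (or the whole loop when AVX2 is not compiled in). */
    for (; j < n; j++) sum += a[j] * b[j];
    return sum;
}

/* W (d,n) @ x (n,) -> xout (d,), with w stored row-major. */
void matmul(float *xout, const float *x, const float *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        xout[i] = dot(w + (size_t)i * n, x, n);
    }
}
```

The scalar fallback keeps the file compiling on machines without AVX2, at the cost of the two paths summing in a different order, so results can differ in the last bits.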

@karpathy
Owner Author

Doesn't seem to work on my Macbook sadly.

(base) karpathy@Andrejs-MacBook-Air llama2.c % make runavx2                                      
gcc -Ofast -march=native -mavx2 -DLLAMAC_AVX2 -o run run.c -lm
clang: error: the clang compiler does not support '-march=native'
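The error above is consistent with an ARM-based Mac, where AVX2 does not exist at all. That is an assumption, since the thread does not say which chip this MacBook Air has. Either way, one way to keep a single run.c building everywhere is to select the SIMD path at compile time from macros the compiler defines itself. The sketch below only reports which path would be chosen; `simd_path` is a hypothetical helper, not something in llama2.c, and `LLAMAC_AVX2` is the macro taken from the `make runavx2` command line.

```c
/* Report which code path a build of this file would take.
 * __AVX2__ is predefined by gcc/clang when AVX2 code generation is enabled
 * (for example via -mavx2), and __ARM_NEON on ARM targets with NEON. */
const char *simd_path(void) {
#if defined(LLAMAC_AVX2) && defined(__AVX2__)
    return "avx2";    /* explicit intrinsics requested and supported */
#elif defined(__ARM_NEON)
    return "neon";    /* ARM target: AVX2 intrinsics cannot be used here */
#else
    return "scalar";  /* portable fallback */
#endif
}
```

Guarding on both macros means a stray `-DLLAMAC_AVX2` on a non-x86 target falls back to another path instead of failing on missing intrinsics.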

@cgbur
Contributor

cgbur commented Aug 10, 2023

On an AMD 5900x

 ❯ make runfast
gcc -Ofast -o run run.c -lm
llama2.c feature/avx2*​ 
 ❯ while true;  ./run stories15M.bin -t 0 | rg "tok/s"; end
achieved tok/s: 405.504587
achieved tok/s: 398.916968
achieved tok/s: 404.021938
achieved tok/s: 393.939394
achieved tok/s: 405.504587
achieved tok/s: 404.021938
achieved tok/s: 399.638336
achieved tok/s: 405.504587
achieved tok/s: 398.198198
achieved tok/s: 401.818182
^C⏎                                                                                                

llama2.c feature/avx2*​ 5s 
 ❯ make runavx2 
gcc -Ofast -march=native -mavx2 -DLLAMAC_AVX2 -o run run.c -lm

llama2.c feature/avx2*​ 
 ❯ while true;  ./run stories15M.bin -t 0 | rg "tok/s"; end
achieved tok/s: 533.816425
achieved tok/s: 518.779343
achieved tok/s: 521.226415
achieved tok/s: 508.045977
achieved tok/s: 513.953488
achieved tok/s: 531.250000
achieved tok/s: 522.458629
achieved tok/s: 524.940618
achieved tok/s: 524.940618
achieved tok/s: 527.446301
achieved tok/s: 522.458629

Then, modifying runfast to include -march=native:

llama2.c feature/avx2*​ 4s 
 ❯ make runfast
gcc -Ofast -o run run.c -lm -march=native

llama2.c feature/avx2*​​ 
 ❯ while true;  ./run stories15M.bin -t 0 | rg "tok/s"; end
achieved tok/s: 535.108959
achieved tok/s: 524.940618
achieved tok/s: 532.530120
achieved tok/s: 529.976019
achieved tok/s: 528.708134
achieved tok/s: 532.530120
achieved tok/s: 526.190476
achieved tok/s: 527.446301
achieved tok/s: 523.696682
achieved tok/s: 531.250000

So I think runfast with native compilation already picks up the optimization without the explicit intrinsics. I would probably make -march=native the default and add a "portable" make target instead, since you already need to specify the avx2 feature in the Makefile too.
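To put a number on the three runs above, here is a small sketch that averages the pasted tok/s samples and expresses each build as a ratio over plain runfast. The array and function names are made up for illustration; the values are copied from the terminal output in this comment.

```c
#include <stddef.h>

/* Arithmetic mean of n samples (0 if n is 0). */
static double mean(const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return n ? s / (double)n : 0.0;
}

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* tok/s samples from the three builds, in the order they were posted. */
static const double runfast[] = {
    405.504587, 398.916968, 404.021938, 393.939394, 405.504587,
    404.021938, 399.638336, 405.504587, 398.198198, 401.818182};
static const double runavx2[] = {
    533.816425, 518.779343, 521.226415, 508.045977, 513.953488, 531.250000,
    522.458629, 524.940618, 524.940618, 527.446301, 522.458629};
static const double runfast_native[] = {
    535.108959, 524.940618, 532.530120, 529.976019, 528.708134,
    532.530120, 526.190476, 527.446301, 523.696682, 531.250000};

/* Mean throughput of the explicit-AVX2 build relative to plain runfast. */
double speedup_avx2(void) {
    return mean(runavx2, COUNT(runavx2)) / mean(runfast, COUNT(runfast));
}

/* Mean throughput of runfast with -march=native relative to plain runfast. */
double speedup_native(void) {
    return mean(runfast_native, COUNT(runfast_native)) /
           mean(runfast, COUNT(runfast));
}
```

On these samples both ratios come out at roughly 1.3x, within a couple of percent of each other, which matches the reading that `-march=native` alone recovers the gain on this machine.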

@twobob

twobob commented Aug 10, 2023

-mavx2 gives me no output (no text is generated), fwiw, when used with clang or x86_64-w64-mingw32-gcc on Windows. I was, weirdly, testing this just today.
cl also chokes when passed its equivalent argument for AVX2 (as in, no text is generated).
gcc is unaffected. Seemed moot, as outlined above.

@twobob

twobob commented Aug 13, 2023

https://godbolt.org/z/M1vcaddqc pretty surprising how different the compilations are

Did some more poking about but have not yet pinpointed where clang, mingw and msvc bomb compared with the gcc build.
So far only gcc on Windows seems to swallow it. Perhaps others will have a different experience.


4 participants