add avx2 intrinsics maybe #269

karpathy · 2023-08-10T15:04:00Z

Roughly minimal changes to maybe add AVX2 intrinsics (unaligned version). On my Linux box this speeds things up ~27%, but the gap shrinks a lot when I also build with OpenMP.
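For readers without the diff at hand, here is a minimal sketch of what an unaligned AVX2 path in a matrix-vector product can look like. It is not the code from this PR: the `matmul` signature is an assumed generic `xout = W @ x` routine, and the guard uses the compiler-defined `__AVX2__` macro rather than the `LLAMAC_AVX2` define from the Makefile, so the file still builds where AVX2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical signature: W is (d, n) row-major, x is (n,), xout is (d,). */
void matmul(float *xout, const float *x, const float *w, int n, int d) {
    assert(n >= 0 && d >= 0);
    for (int i = 0; i < d; i++) {
        const float *row = w + (size_t)i * n;
        float val = 0.0f;
        int j = 0;
#if defined(__AVX2__)
        /* Process 8 floats per iteration with unaligned loads, so neither
           the weights nor x need 32-byte alignment. */
        __m256 acc = _mm256_setzero_ps();
        for (; j + 8 <= n; j += 8) {
            __m256 wv = _mm256_loadu_ps(row + j);
            __m256 xv = _mm256_loadu_ps(x + j);
            /* Separate multiply and add: -mavx2 alone does not enable FMA. */
            acc = _mm256_add_ps(acc, _mm256_mul_ps(wv, xv));
        }
        /* Horizontal sum of the 8 partial sums. */
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        for (int k = 0; k < 8; k++) val += lanes[k];
#endif
        /* Scalar tail; this is also the whole loop when AVX2 is not enabled. */
        for (; j < n; j++) val += row[j] * x[j];
        xout[i] = val;
    }
}
```

Built without `-mavx2` or `-march=native`, the guarded block is compiled out and only the scalar loop remains, which is one way to keep a single source file usable on machines without AVX2.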

karpathy · 2023-08-10T15:07:27Z

Doesn't seem to work on my MacBook, sadly.

(base) karpathy@Andrejs-MacBook-Air llama2.c % make runavx2                                      
gcc -Ofast -march=native -mavx2 -DLLAMAC_AVX2 -o run run.c -lm
clang: error: the clang compiler does not support '-march=native'

cgbur · 2023-08-10T18:07:34Z

On an AMD 5900x

 ❯ make runfast
gcc -Ofast -o run run.c -lm
llama2.c feature/avx2* 
 ❯ while true;  ./run stories15M.bin -t 0 | rg "tok/s"; end
achieved tok/s: 405.504587
achieved tok/s: 398.916968
achieved tok/s: 404.021938
achieved tok/s: 393.939394
achieved tok/s: 405.504587
achieved tok/s: 404.021938
achieved tok/s: 399.638336
achieved tok/s: 405.504587
achieved tok/s: 398.198198
achieved tok/s: 401.818182
^C⏎                                                                                                

llama2.c feature/avx2* 5s 
 ❯ make runavx2 
gcc -Ofast -march=native -mavx2 -DLLAMAC_AVX2 -o run run.c -lm

llama2.c feature/avx2* 
 ❯ while true;  ./run stories15M.bin -t 0 | rg "tok/s"; end
achieved tok/s: 533.816425
achieved tok/s: 518.779343
achieved tok/s: 521.226415
achieved tok/s: 508.045977
achieved tok/s: 513.953488
achieved tok/s: 531.250000
achieved tok/s: 522.458629
achieved tok/s: 524.940618
achieved tok/s: 524.940618
achieved tok/s: 527.446301
achieved tok/s: 522.458629

Then, after modifying runfast to include -march=native:

llama2.c feature/avx2* 4s 
 ❯ make runfast
gcc -Ofast -o run run.c -lm -march=native

llama2.c feature/avx2* 
 ❯ while true;  ./run stories15M.bin -t 0 | rg "tok/s"; end
achieved tok/s: 535.108959
achieved tok/s: 524.940618
achieved tok/s: 532.530120
achieved tok/s: 529.976019
achieved tok/s: 528.708134
achieved tok/s: 532.530120
achieved tok/s: 526.190476
achieved tok/s: 527.446301
achieved tok/s: 523.696682
achieved tok/s: 531.250000

So I think runfast with -march=native already picks up the optimization, without the explicit intrinsics. I would probably make -march=native the default and add a "portable" make target instead, since you already need to specify the AVX2 feature in the Makefile anyway.
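A rough illustration of why that can happen: a plain scalar reduction like the one below is something GCC and clang can typically auto-vectorize when floating-point reassociation is allowed (which `-Ofast` permits), and `-march=native` lets them use the widest vector extension the build machine has instead of the x86-64 baseline of SSE2. The names `dot` and `simd_level` are made up for this sketch; the feature macros are the ones GCC and clang predefine, so other compilers may fall through to the last branch.

```c
#include <assert.h>

/* Plain scalar dot product: no intrinsics, the compiler chooses the
   vector width based on the target flags. */
float dot(const float *a, const float *b, int n) {
    assert(n >= 0);
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Reports which SIMD extension the compiler was allowed to target,
   based on predefined macros (GCC/clang naming). */
const char *simd_level(void) {
#if defined(__AVX512F__)
    return "avx512f";
#elif defined(__AVX2__)
    return "avx2";
#elif defined(__AVX__)
    return "avx";
#elif defined(__SSE2__)
    return "sse2";
#elif defined(__ARM_NEON)
    return "neon";
#else
    return "none detected";
#endif
}
```

Printing `simd_level()` at startup in both a `-Ofast` build and a `-Ofast -march=native` build is a quick way to confirm that the two binaries really target different instruction sets before comparing tok/s.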

twobob · 2023-08-10T20:17:45Z

-mavx2 gives me no output (no text is generated), fwiw, when used with clang or x86_64-w64-mingw32-gcc on Windows. I was, weirdly, testing this just today.
cl also chokes when passed its equivalent argument for AVX2 (as in, no text is generated).
gcc is unaffected. Seemed moot, as outlined above.

twobob · 2023-08-13T02:39:24Z

https://godbolt.org/z/M1vcaddqc pretty surprising how different the compilations are

Did some more poking about but have not yet pinpointed where clang, mingw and msvc bomb compared with the gcc build.
So far only gcc on Windows seems to swallow it. Perhaps others will have a different experience.

add avx2 intrinsics maybe

d0309ab

atamurad mentioned this pull request Aug 19, 2023

Quantization Brainstorming #277

Open
