Add fused matrix vector multiplication #271

Closed · cgbur wants to merge 1 commit

Conversation

@cgbur (Contributor) commented Aug 11, 2023

This PR introduces a fused version of matrix multiplication for the case where the same x vector is multiplied against several different W matrices to produce several outputs. Fusing the loops improves speed by reusing x while it is still hot in the cache.

Single-threaded speed sees a nice ~10% boost. Unfortunately, multi-threaded performance dropped about 6% on the 42M model, which makes me hesitant to call this a clear win. I have therefore made the macro generate plain matmul calls when OpenMP is enabled, so the fused path is only used in single-threaded mode.

This idea was inspired by the work of Foundation42 in #94 and is based on the fused matrix multiplication I had previously implemented in llama2.zig.

The use of macros in C is quite ugly; if it does not fit in this repo, let it serve as an example for others to learn from and improve upon. I have tried to keep the code changes to a minimum.

Make commands

.PHONY: runomp
runomp: run.c
	$(CC) -Ofast -fopenmp -march=native run.c  -lm  -o run

.PHONY: runfast
runfast: run.c
	$(CC) -Ofast -o run run.c -lm -march=native

Results

Numbers were taken by averaging the runs shown below. CPU is an AMD Ryzen 9 5900X.

| model      | command | master tok/s | fused tok/s | % change |
|------------|---------|--------------|-------------|----------|
| stories15M | runfast | 508          | 563         | +10.8%   |
| stories15M | runomp  | 2243         | 2589        | +15.4%   |
| stories42M | runfast | 184          | 207         | +12.5%   |
| stories42M | runomp  | 336          | 314         | -6.5%    |

Runfast Master

while true;  ./runfast_master stories15M.bin -t 0 | rg "tok" ; end
achieved tok/s: 504.566210
achieved tok/s: 508.045977
achieved tok/s: 505.720824
achieved tok/s: 511.574074
achieved tok/s: 512.761021
achieved tok/s: 508.045977
achieved tok/s: 516.355140
achieved tok/s: 510.392610
achieved tok/s: 515.151515
achieved tok/s: 502.272727
achieved tok/s: 510.392610
achieved tok/s: 498.871332
achieved tok/s: 512.761021
achieved tok/s: 501.133787
achieved tok/s: 509.216590
achieved tok/s: 502.272727

awk one-liner used to compute the averages:

echo "..." | awk '{ total += $3; count++ } END { print total/count }'

Runfast Fused

while true; ./runfast_fused stories15M.bin -t 0 | rg "tok" ; end
achieved tok/s: 572.538860
achieved tok/s: 568.123393
achieved tok/s: 571.059432
achieved tok/s: 563.775510
achieved tok/s: 565.217391
achieved tok/s: 565.217391
achieved tok/s: 556.675063
achieved tok/s: 565.217391
achieved tok/s: 553.884712
achieved tok/s: 571.059432
achieved tok/s: 559.493671
achieved tok/s: 556.675063
achieved tok/s: 565.217391
achieved tok/s: 549.751244

Runomp Master

while true; ./runomp_master stories15M.bin -t 0 | rg "tok" ; end
achieved tok/s: 2600.000000
achieved tok/s: 2278.350515
achieved tok/s: 2351.063830
achieved tok/s: 1921.739130
achieved tok/s: 2483.146067
achieved tok/s: 2278.350515
achieved tok/s: 2351.063830
achieved tok/s: 2027.522936
achieved tok/s: 2255.102041
achieved tok/s: 2232.323232
achieved tok/s: 2232.323232
achieved tok/s: 2255.102041
achieved tok/s: 1811.475410
achieved tok/s: 2326.315789
achieved tok/s: 2210.000000
achieved tok/s: 2376.344086
achieved tok/s: 2145.631068

Runomp Fused

while true; ./runomp_fused stories15M.bin -t 0 | rg "tok" ; end
achieved tok/s: 2662.650602
achieved tok/s: 2630.952381
achieved tok/s: 2728.395062
achieved tok/s: 2511.363636
achieved tok/s: 2540.229885
achieved tok/s: 2797.468354
achieved tok/s: 2662.650602
achieved tok/s: 2695.121951
achieved tok/s: 2695.121951
achieved tok/s: 2662.650602
achieved tok/s: 2540.229885
achieved tok/s: 2695.121951
achieved tok/s: 2728.395062
achieved tok/s: 2728.395062
achieved tok/s: 2728.395062
achieved tok/s: 1613.138686
achieved tok/s: 2600.000000
achieved tok/s: 2402.173913
achieved tok/s: 2511.363636
achieved tok/s: 2662.650602
achieved tok/s: 2376.344086
achieved tok/s: 2695.121951
achieved tok/s: 2695.121951

Extra data on stories42M.bin

while true;  ./runfast_master stories42M.bin -t 0 | rg "tok" ; end
achieved tok/s: 185.701831
achieved tok/s: 183.620690
achieved tok/s: 184.735473
achieved tok/s: 185.378590
achieved tok/s: 183.147034
achieved tok/s: 182.989691
achieved tok/s: 183.462532
while true;  ./runfast_fused stories42M.bin -t 0 | rg "tok" ; end
achieved tok/s: 204.807692
achieved tok/s: 209.645669
achieved tok/s: 208.414873
achieved tok/s: 209.439528
achieved tok/s: 205.400193
achieved tok/s: 206.595538
achieved tok/s: 209.852217
achieved tok/s: 208.211144
achieved tok/s: 206.997085
while true;  ./runomp_master stories42M.bin -t 0 | rg "tok" ; end
achieved tok/s: 359.797297
achieved tok/s: 321.752266
achieved tok/s: 347.471452
achieved tok/s: 357.382550
achieved tok/s: 325.688073
achieved tok/s: 353.233831
achieved tok/s: 333.333333
achieved tok/s: 315.088757
achieved tok/s: 346.905537
achieved tok/s: 308.695652
achieved tok/s: 331.259720
while true;  ./runomp_fused stories42M.bin -t 0 | rg "tok" ; end
achieved tok/s: 328.197227
achieved tok/s: 335.433071
achieved tok/s: 309.143687
achieved tok/s: 309.593023
achieved tok/s: 333.333333
achieved tok/s: 309.593023
achieved tok/s: 286.675639
achieved tok/s: 324.200913
achieved tok/s: 299.157303
achieved tok/s: 294.198895
achieved tok/s: 314.623338
achieved tok/s: 334.379906

@karpathy (Owner) commented:

Cute! I learned something. But yes this is ugly. Thanks for the PR though!

@karpathy closed this Aug 14, 2023