
run.c - Speed up with output buffering #193

Closed
wants to merge 11 commits

Conversation

@trholding (Contributor) commented Jul 31, 2023

Previously, much of the time was spent writing to the screen, which is relatively slow.

By enabling output buffering, more work can be done per write: groups of computed tokens are written to a buffer, which is relatively fast, and the buffer is flushed to the screen/console periodically.

Testing with the smallest model, an interactive tokens/s speed-up of ~14% on standard builds and up to ~84% on OpenMP builds has been achieved.

Usage:

run <checkpoint_file> [temperature] [steps] [prompt] [buffer_tokens]

Where buffer_tokens is the number of tokens to buffer between flushes.

Powers of 2 seem to be ideal; 64 worked well for my use case on a low-end machine.

The speed-up may depend on model size and OS.

Example:

./run model.bin 0 0 "A car" 64
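For reference, a minimal sketch of this scheme (illustrative only; OUT_BUF_SIZE and the token loop are assumed stand-ins, not the PR's exact code):

    #include <stdio.h>
    #include <stdlib.h>

    #define OUT_BUF_SIZE 32768          /* ~ context size * (avg token len + margin) */
    static char outbuff[OUT_BUF_SIZE];  /* static: zero-initialized, outlives stdout use */

    int main(void) {
        /* must run before any output: stdout becomes fully buffered into outbuff */
        if (setvbuf(stdout, outbuff, _IOFBF, sizeof(outbuff)) != 0) {
            fprintf(stderr, "Error: buffer allocation!\n");
            exit(EXIT_FAILURE);
        }
        int buffer_tokens = 64;               /* flush every n tokens */
        for (int pos = 1; pos <= 256; pos++) {
            printf("tok%d ", pos);            /* stands in for the decoded token */
            if (pos % buffer_tokens == 0) fflush(stdout);
        }
        fflush(stdout);                       /* final flush */
        return 0;
    }

The actual run.c change tracks the next flush position in a counter (bufferflush) rather than using a modulo, as the diff further down shows, but the effect is the same: tokens accumulate in outbuff and reach the terminal in batches.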

@trholding (Contributor Author) commented:

Output:


Speed comparison without token buffering and with buffering via buffer_tokens sizes 16, 32, 64

[linux@linux llama2.c]$ make runomp
gcc -Ofast -fopenmp -march=native run.c  -lm  -o run
[linux@linux llama2.c]$ time ./run /tmp/out/model.bin 0 0 "A car" 
<s>
A car was driving down the road. He was carrying a big box. Suddenly, he heard a loud noise. He stopped and looked around. He saw a big truck coming towards him. He was scared.
The truck stopped and the driver got out. He said, "What are you doing here?"
The driver said, "I'm here to deliver this box. I'm going to take it to the store."
The driver was very scared. He said, "I'm sorry. I didn't know it was yours."
The driver said, "It's okay. I'm just going to deliver it to the store."
The driver drove away. The truck drove slowly and carefully. The driver was very careful. He drove slowly and carefully.
When the truck arrived, the driver opened the box. Inside was a big surprise! It was a big cake!
The driver was so happy. He said, "Thank you for delivering this cake. I'm so glad I could help."
The driver smiled and said, "You're welcome. I'm glad I could help."
<s>
Once upon a time, there was a little girl named
achieved tok/s: 90.297450

real    0m2.844s
user    0m4.414s
sys     0m0.075s
[linux@linux llama2.c]$ time ./run /tmp/out/model.bin 0 0 "A car" 16
<s>
A car was driving down the road. He was carrying a big box. Suddenly, he heard a loud noise. He stopped and looked around. He saw a big truck coming towards him. He was scared.
The truck stopped and the driver got out. He said, "What are you doing here?"
The driver said, "I'm here to deliver this box. I'm going to take it to the store."
The driver was very scared. He said, "I'm sorry. I didn't know it was yours."
The driver said, "It's okay. I'm just going to deliver it to the store."
The driver drove away. The truck drove slowly and carefully. The driver was very careful. He drove slowly and carefully.
When the truck arrived, the driver opened the box. Inside was a big surprise! It was a big cake!
The driver was so happy. He said, "Thank you for delivering this cake. I'm so glad I could help."
The driver smiled and said, "You're welcome. I'm glad I could help."
<s>
Once upon a time, there was a little girl named
achieved tok/s: 128.269618

real    0m2.058s
user    0m3.666s
sys     0m0.050s
[linux@linux llama2.c]$ time ./run /tmp/out/model.bin 0 0 "A car" 32
<s>
A car was driving down the road. He was carrying a big box. Suddenly, he heard a loud noise. He stopped and looked around. He saw a big truck coming towards him. He was scared.
The truck stopped and the driver got out. He said, "What are you doing here?"
The driver said, "I'm here to deliver this box. I'm going to take it to the store."
The driver was very scared. He said, "I'm sorry. I didn't know it was yours."
The driver said, "It's okay. I'm just going to deliver it to the store."
The driver drove away. The truck drove slowly and carefully. The driver was very careful. He drove slowly and carefully.
When the truck arrived, the driver opened the box. Inside was a big surprise! It was a big cake!
The driver was so happy. He said, "Thank you for delivering this cake. I'm so glad I could help."
The driver smiled and said, "You're welcome. I'm glad I could help."
<s>
Once upon a time, there was a little girl named
achieved tok/s: 138.511678

real    0m1.862s
user    0m3.498s
sys     0m0.033s
[linux@linux llama2.c]$ time ./run /tmp/out/model.bin 0 0 "A car" 64
<s>
A car was driving down the road. He was carrying a big box. Suddenly, he heard a loud noise. He stopped and looked around. He saw a big truck coming towards him. He was scared.
The truck stopped and the driver got out. He said, "What are you doing here?"
The driver said, "I'm here to deliver this box. I'm going to take it to the store."
The driver was very scared. He said, "I'm sorry. I didn't know it was yours."
The driver said, "It's okay. I'm just going to deliver it to the store."
The driver drove away. The truck drove slowly and carefully. The driver was very careful. He drove slowly and carefully.
When the truck arrived, the driver opened the box. Inside was a big surprise! It was a big cake!
The driver was so happy. He said, "Thank you for delivering this cake. I'm so glad I could help."
The driver smiled and said, "You're welcome. I'm glad I could help."
<s>
Once upon a time, there was a little girl named
achieved tok/s: 139.802632

real    0m1.846s
user    0m3.495s
sys     0m0.029s

@trholding changed the title from "run.c - Output buffering" to "run.c - Speed up with output buffering" on Jul 31, 2023
@twobob commented Jul 31, 2023

yeah. I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack:
https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1. Can confirm that output is obviously a bottleneck on some lumbering systems. It felt like the best solution might be waiting for a space, punctuation character, or BOS before emitting the next token.
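A rough sketch of that boundary-gated idea (hypothetical; token == 1 as BOS follows run.c's convention, everything else here is assumed):

    /* print the decoded token, but only flush at a word/sentence boundary */
    printf("%s", token_str);
    size_t n = strlen(token_str);
    char last = (n > 0) ? token_str[n - 1] : '\0';
    if (last == ' ' || last == '.' || last == '!' || last == '?' || token == 1 /* BOS */) {
        fflush(stdout);
    }

As noted in the replies below, the flush cadence then depends on the text itself, which makes the speed-up variable.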

@trholding (Contributor Author) commented:

> yeah. I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1 can confirm that output is obviously a bottleneck on some lumbering systems. It felt like the best solution might be waiting for a space, punctuation character, or BOS before emitting the next token.

I can confirm that only-dump is fast when redirecting output to a file.

Waiting for punctuation could make it variable-speed; the speed-up may vary wildly across outputs. I tested a similar approach with line buffering: as the length of the lines varied, so did the speed-up.

@trholding (Contributor Author) commented:

> I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack
> https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1

I think that file output is a good idea; maybe take a filename as the last parameter and write to that file.

@twobob commented Jul 31, 2023

> > yeah. I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1 can confirm that output is obviously a bottleneck on some lumbering systems. It felt like the best solution might be waiting for a space, punctuation character, or BOS before emitting the next token.
>
> I can confirm that only-dump is fast when redirecting output to a file.
>
> Waiting for punctuation could make it variable-speed; the speed-up may vary wildly across outputs. I tested a similar approach with line buffering: as the length of the lines varied, so did the speed-up.

yeah, that was my finding. Obviously, as the token-length processing debt increases, so does the time to output, so my naive metric was likely conflating two problems and should really be calculated with the relevant non-linear function in mind. That aside, anything that speeds up the output gets a thumbs up from me.

@twobob commented Jul 31, 2023

> > I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack
> > https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1
>
> I think that file output is a good idea; maybe take a filename as the last parameter and write to that file.

I was thinking a reversible, filesystem-safe name automatically generated from the prompt, mixed with a prefixed timestamp, might be nice.

Original: inbox/20230731142853_?This could be a prompt.txt
Encoded: aW5ib3gvMjAyMzA3MzExNDI4NTNfP19UaGlzIGNvdWxkIGJlIGEgcHJvbXB0LnR4dA==
Decoded: inbox/20230731142853_?_This could be a prompt.txt

sort of thing or whatever. anyway, sorry for the OT noise.

@twobob commented Aug 1, 2023

built with clang -Ofast -fopenmp -D_WIN32 -o run.exe -I. run.c win.c

set OMP_NUM_THREADS=4 && run.exe ../out/model110M.bin 0 0 "and away they went" False 2
achieved tok/s: 67.6 for first BOS

set OMP_NUM_THREADS=4 && run.exe ../out/model110M.bin 0 0 "and away they went" False 4
achieved tok/s: 67.6 for first BOS

set OMP_NUM_THREADS=4 && run.exe ../out/model110M.bin 0 0 "and away they went" False 8
achieved tok/s: 67.0 for first BOS

set OMP_NUM_THREADS=4 && run.exe ../out/model110M.bin 0 0 "and away they went" False 1
achieved tok/s: 66.9 for first BOS

With the latest updates as of 01/08/2023, perhaps this has become more moot.

Increased output buffer from 2048 to 4096

This fixes an output buffer overflow and garbled output when the output is larger than the buffer, in rare cases such as when blocks of token sequences are repeated during inference.

./run stories110M.bin 0 0 "A big dog" 256
@trholding (Contributor Author) commented:

> run.exe ../out/model110M.bin 0 0 "and away they went" False 1

What is the "False" parameter for?

Okay, I tested on a server (aarch64 / VPS, 4 cores):

./run storiesXXXM.bin 0 0 "a car" 256

On the 110M model, this does not improve speed: achieved tok/s 75.459172 vs 74.443312 with buffering enabled (256).
On 42M, negligible speed-up: achieved tok/s 184.957512 vs 189.620019 with buffering enabled (256).
On 15M, some speed-up: achieved tok/s 615.942029 vs 648.854962 with buffering enabled (256).

On the lowest-end device (x86_64 / dual core):

./run /tmp/out/model.bin 0 0 "a car" 256

On 15M, nearly 2x the speed: achieved tok/s 76.119403 vs 140.109890 with buffering enabled (256).

Making outbuff static auto-initializes it to zero, so the memset can be avoided.

Ref: karpathy#193 (comment)
run.c (outdated review thread):
@@ -570,7 +577,7 @@ int main(int argc, char *argv[]) {
// following BOS token (1), sentencepiece decoder strips any leading whitespace (see PR #89)
char *token_str = (token == 1 && vocab[next][0] == ' ') ? vocab[next]+1 : vocab[next];
printf("%s", token_str);
fflush(stdout);
if (bufferflush==pos && strlen(outbuff)<=4096) { fflush(stdout); bufferflush+=buffertokens; } // flush after every n tokens
Contributor:

  1. outbuff is not necessarily zero-terminated
  2. flush at the end

Suggested change:
- if (bufferflush==pos && strlen(outbuff)<=4096) { fflush(stdout); bufferflush+=buffertokens; } // flush after every n tokens
+ if (bufferflush==pos || pos == steps) { fflush(stdout); bufferflush+=buffertokens; } // flush after every n tokens

Contributor Author (trholding):

Logic:

  1. On certain systems, writing to the console/screen is relatively (and variably) slow compared to the computation.
  2. Currently each token is written to the screen and flushed, so it inherits the cost of 1.
  3. We speed things up by batching the number of tokens flushed to the screen.
  4. We expose the number of tokens to batch as an argument, for flexibility.
  5. We need an appropriately sized buffer so that we do not encounter a buffer overflow.
  6. We do not know beforehand the batch size, the size of each token, or the context size.
  7. We estimate a safe buffer size as: context size * (average token length + safe margin).
  8. Context size is determined from the model's config.seq_len parameter.
  9. BPE subword units are usually 2-6 chars. Better to err on the safe side: I used 6 as the average token length and 2 as the safe margin.
  10. That explicitly sized buffer and its calculated size are passed to setvbuf to buffer the output; setvbuf will round it up to the next power of 2.
  11. We are not asking setvbuf to create a buffer for us (the NULL parameter), nor are we passing BUFSIZ, because our purpose is a buffer that can hold context size * (average token length + safe margin) in the maximum case, i.e. batch size = context size.
  12. Once that is set, each time the buffer fills with a batch of the required number of tokens, it is flushed. (A sketch of this sizing is shown below.)
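A sketch of that sizing logic (illustrative only: AVG_TOKEN_LEN and SAFE_MARGIN are assumed names; config.seq_len mirrors run.c; the PR ultimately used a static array instead, since MSVC lacks VLAs):

    #define AVG_TOKEN_LEN 6   /* BPE subwords are usually 2-6 chars */
    #define SAFE_MARGIN   2

    /* worst case: every position in the context emits a token */
    size_t bufsize = (size_t)config.seq_len * (AVG_TOKEN_LEN + SAFE_MARGIN);
    char *outbuff = malloc(bufsize);  /* must stay valid while stdout is in use */
    /* setvbuf must run before any output on stdout */
    if (outbuff == NULL || setvbuf(stdout, outbuff, _IOFBF, bufsize) != 0) {
        fprintf(stderr, "Error: buffer allocation!\n");
        exit(EXIT_FAILURE);
    }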

Your suggestions were helpful for the latest commit. These are my findings:

In if (bufferflush==pos && strlen(outbuff)<=4096), the && strlen(outbuff)<=4096 check is not required and is wrong; besides, reading a buffer managed by setvbuf is undefined behaviour.

In if (bufferflush==pos || pos == steps), the || pos == steps would not be needed, as the buffer can now already hold the maximum number of steps (config.seq_len).

You were right about a bug, though.

The bug was a hard-coded buffer size that did not take context size * (average token length + safe margin) into account. Even the increase to 4096 would have caused an overflow on Meta Llama 2 7B.

Thank you.

Contributor Author (trholding):

Okay, this doesn't build with MSVC. Need to figure out an alternative.

Contributor:

"The default buffer size BUFSIZ is expected to be the most efficient buffer size for file I/O on the implementation", you need at least some benchmarks to second-guess it. For purity, safe buffer size would be context size * (max token length). There is no guarantee that it's smaller than 4096, which is an unnamed constant you use multiple times. But it's not clear that such purity is needed. In fact, I am not sure even sure that flushing after max tokens vs max bytes (which would have a much simpler implementation) is a right approach.

My proposed || pos == steps is there to flush at the end of the loop, although I just realized it's not needed (sorry about that), because of:

llama2.c/run.c, line 584 in af8708d:

printf("\nachieved tok/s: %f\n", (steps-1) / (double)(end-start)*1000);

Contributor Author (trholding):

If you have benchmarks on BUFSIZ, please share them here; it would be of great help! Also, kindly test with BUFSIZ, steps 0, and various buffer_tokens parameters such as 4, 8, 16, 32, 64, ...

Usage:

run <checkpoint_file> [temperature] [steps] [prompt] [buffer_tokens]

Where buffer_tokens is the number of tokens to be buffered.

Fixed a buffer overflow bug by changing the hardcoded size of outbuff to one computed at runtime.

MSVC does not support VLAs, so the outbuff size is now a hard-coded value again:

Context size of Meta Llama 2 * (avg token length + margin) = 4096 * (6 + 2) = 32768 bytes

meh
@karpathy (Owner) commented Aug 5, 2023

Surprising to me that printing is slow :(
Can we not do something much simpler and just have some buffer that we write to, and then print it all at once using the existing code?

There would be no need for complexity like this

    if (setvbuf(stdout, outbuff, _IOFBF, sizeof(outbuff)) != 0) {
        puts("Error: Buffer allocation!"); exit(EXIT_FAILURE);
    }

@trholding (Contributor Author) commented:

> Surprising to me that printing is slow :( Can we not do something much simpler and just have some buffer that we write to, and then print it all at once using the existing code?
>
> There would be no need for complexity like this:
>
>     if (setvbuf(stdout, outbuff, _IOFBF, sizeof(outbuff)) != 0) {
>         puts("Error: Buffer allocation!"); exit(EXIT_FAILURE);
>     }

In practice this check is not required, as buffer allocation never failed in testing...

setvbuf(stdout, outbuff, _IOFBF, sizeof(outbuff)) worked fine.

I am thinking. I'll experiment, check, and see if there is another way with fewer lines of code.

@karpathy (Owner) commented Aug 5, 2023

The complexity is from: what is setvbuf, and what is _IOFBF? Are these on all platforms? Any external calls or constants are a risk.

@trholding (Contributor Author) commented:

setvbuf has been part of the C standard for a long time. It is available on all platforms, hence portable.

setvbuf is used to supply an output buffer of the required size;
printf and other functions then automatically use this buffer,
and one can flush as needed to print to the screen.

_IOFBF means full I/O buffering.

Nothing is at risk, as the bugs have been fixed. But the buffer has to be a static char array.

Ref: https://www.gnu.org/software/libc/manual/html_node/Controlling-Buffering.html
Ref: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setvbuf?view=msvc-170

C17 standard (ISO/IEC 9899:2018): 7.21.5.6 The setvbuf function (p: 225)
C11 standard (ISO/IEC 9899:2011): 7.21.5.6 The setvbuf function (p: 308)
C99 standard (ISO/IEC 9899:1999): 7.19.5.6 The setvbuf function (p: 273-274)
C89/C90 standard (ISO/IEC 9899:1990): 4.9.5.6 The setvbuf function

@twobob commented Aug 6, 2023

> > run.exe ../out/model110M.bin 0 0 "and away they went" False 1
>
> What is the "False" parameter for?
>
> Okay, I tested on a server (aarch64 / VPS, 4 cores):
>
> ./run storiesXXXM.bin 0 0 "a car" 256
>
> On the 110M model, this does not improve speed: achieved tok/s 75.459172 vs 74.443312 with buffering enabled (256). On 42M, negligible speed-up: achieved tok/s 184.957512 vs 189.620019 with buffering enabled (256). On 15M, some speed-up: achieved tok/s 615.942029 vs 648.854962 with buffering enabled (256).
>
> On the lowest-end device (x86_64 / dual core):
>
> ./run /tmp/out/model.bin 0 0 "a car" 256
>
> On 15M, nearly 2x the speed: achieved tok/s 76.119403 vs 140.109890 with buffering enabled (256).

I have the option to save files on my builds...

./run
Usage: run [options]
Example: run model.bin -t 0.9 -n 256 -p "Once upon a time"
Options:
-t temperature, default 0.9
-s random seed, default time(NULL)
-n number of steps to run for, default 256. 0 = max_seq_len
-p prompt string, default none
-o top_p, default 1.0
-k print_tokens, 1 or 0 flag default 1
-f saveFileBool, 1 or 0 flag default 1
-l saveLogBool, 1 or 0 flag saves timestamps for tokens gen default 0
-b singleBOS, 1 or 0 flag default 1
-d dirname string, default none

the per-token logs let me make helpful graphs

[image: per-token timing graph]

15M does seem to have a very slight curve to it that the others do not, where it "ramps back up"; not sure if that leans into your figures above. Inconclusive from the limited data I've run so far.

Okay, I'll shove this back in the mix now it's been updated and run some more tests tomorrow.
Thanks

ozabluda added a commit to ozabluda/llama2.c that referenced this pull request Aug 6, 2023
@twobob commented Aug 9, 2023

> > I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack
> > https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1
>
> I think that file output is a good idea; maybe take a filename as the last parameter and write to that file.

Nothing fancy, but it passes my "assert that the decoded string matches the first 75 characters of the original prompt" test and does the job I needed. Might tidy it up, but something like that:
https://github.com/twobob/llama2.c/blob/master/base64_utils.c#L70
You end up with pseudo-base64 encoding, enough for a recoverable 75-char front index should that ever matter.
[image: example of encoded filenames]
Gonna go with fast and dirty until something official rocks up.

@karpathy (Owner) commented:

Could we not just sprintf into a buffer and when the buffer grows beyond a threshold printf it in chunks? Isn't that much simpler?
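A minimal sketch of that simpler scheme (hypothetical names like emit; not the PR's code):

    #include <stdio.h>
    #include <string.h>

    static char out[8192];    /* plain output buffer */
    static size_t used = 0;

    /* append one decoded token; print the buffer in a chunk past a threshold */
    void emit(const char *token_str) {
        used += (size_t)sprintf(out + used, "%s", token_str);
        if (used > sizeof(out) - 512) {   /* keep headroom for the next token */
            fwrite(out, 1, used, stdout);
            used = 0;
        }
    }
    /* call fwrite(out, 1, used, stdout) once more after the last token */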

@twobob commented Aug 13, 2023

> Could we not just sprintf into a buffer and when the buffer grows beyond a threshold printf it in chunks? Isn't that much simpler?

All for that; it's what I do. 2-16 works fine. FWIW, I also do the same when logging timestamps, for the same reasons.

Roughly this, or similar; not horrible code:

    log_buffer[log_buffer_count] = timestamp;
    log_buffer_count++;
    if (log_buffer_count == LOG_BUFFER_SIZE) {
        for (size_t i = 0; i < LOG_BUFFER_SIZE; i++) {
            fprintf(timelog_file, "%ld,", log_buffer[i]);
        }
        fflush(timelog_file);
        log_buffer_count = 0;
    }

These more fine-grained efforts obviously have merit; however, that would seem good enough (give or take the missing code).

@karpathy (Owner) commented:

I'll punt on this PR. I think this is only an issue with the tiniest models, and for those we're going so fast that this doesn't even matter. I think.

@karpathy closed this Aug 14, 2023