
run.c - Speed up with output buffering #193

Closed
wants to merge 11 commits

Conversation

@trholding (Contributor) commented Jul 31, 2023

Previously, much of the time was spent writing to the screen, which is relatively slow.

By enabling output buffering, more work can be done per write: groups of computed tokens are written to a buffer, which is relatively fast, and the buffer is flushed to the screen/console periodically.

Testing with the smallest model, an interactive tokens/s speed-up of ~14% on standard builds and up to ~84% on OpenMP builds has been achieved.

Usage:

run <checkpoint_file> [temperature] [steps] [prompt] [buffer_tokens]

Where buffer_tokens is the number of tokens to buffer between flushes.

Powers of 2 seem to be ideal; 64 worked well for my use case on a low-end machine.

The speed-up may depend on model size and OS.

Example:

./run model.bin 0 0 "A car" 64
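For reference, a minimal sketch of this scheme (illustrative only; OUT_BUF_SIZE and the token loop are assumed stand-ins, not the PR's exact code):

    #include <stdio.h>
    #include <stdlib.h>

    #define OUT_BUF_SIZE 32768          /* ~ context size * (avg token len + margin) */
    static char outbuff[OUT_BUF_SIZE];  /* static: zero-initialized, outlives stdout use */

    int main(void) {
        /* must run before any output: stdout becomes fully buffered into outbuff */
        if (setvbuf(stdout, outbuff, _IOFBF, sizeof(outbuff)) != 0) {
            fprintf(stderr, "Error: buffer allocation!\n");
            exit(EXIT_FAILURE);
        }
        int buffer_tokens = 64;               /* flush every n tokens */
        for (int pos = 1; pos <= 256; pos++) {
            printf("tok%d ", pos);            /* stands in for the decoded token */
            if (pos % buffer_tokens == 0) fflush(stdout);
        }
        fflush(stdout);                       /* final flush */
        return 0;
    }

The actual run.c change tracks the next flush position in a counter (bufferflush) rather than using a modulo, as the diff further down shows, but the effect is the same: tokens accumulate in outbuff and reach the terminal in batches.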

@trholding (Contributor Author) commented:

Output:


Speed comparison without token buffering and with buffering via buffer_tokens sizes 16, 32, 64

[linux@linux llama2.c]$ make runomp
gcc -Ofast -fopenmp -march=native run.c  -lm  -o run
[linux@linux llama2.c]$ time ./run /tmp/out/model.bin 0 0 "A car" 
<s>
A car was driving down the road. He was carrying a big box. Suddenly, he heard a loud noise. He stopped and looked around. He saw a big truck coming towards him. He was scared.
The truck stopped and the driver got out. He said, "What are you doing here?"
The driver said, "I'm here to deliver this box. I'm going to take it to the store."
The driver was very scared. He said, "I'm sorry. I didn't know it was yours."
The driver said, "It's okay. I'm just going to deliver it to the store."
The driver drove away. The truck drove slowly and carefully. The driver was very careful. He drove slowly and carefully.
When the truck arrived, the driver opened the box. Inside was a big surprise! It was a big cake!
The driver was so happy. He said, "Thank you for delivering this cake. I'm so glad I could help."
The driver smiled and said, "You're welcome. I'm glad I could help."
<s>
Once upon a time, there was a little girl named
achieved tok/s: 90.297450

real    0m2.844s
user    0m4.414s
sys     0m0.075s
[linux@linux llama2.c]$ time ./run /tmp/out/model.bin 0 0 "A car" 16
<s>
A car was driving down the road. He was carrying a big box. Suddenly, he heard a loud noise. He stopped and looked around. He saw a big truck coming towards him. He was scared.
The truck stopped and the driver got out. He said, "What are you doing here?"
The driver said, "I'm here to deliver this box. I'm going to take it to the store."
The driver was very scared. He said, "I'm sorry. I didn't know it was yours."
The driver said, "It's okay. I'm just going to deliver it to the store."
The driver drove away. The truck drove slowly and carefully. The driver was very careful. He drove slowly and carefully.
When the truck arrived, the driver opened the box. Inside was a big surprise! It was a big cake!
The driver was so happy. He said, "Thank you for delivering this cake. I'm so glad I could help."
The driver smiled and said, "You're welcome. I'm glad I could help."
<s>
Once upon a time, there was a little girl named
achieved tok/s: 128.269618

real    0m2.058s
user    0m3.666s
sys     0m0.050s
[linux@linux llama2.c]$ time ./run /tmp/out/model.bin 0 0 "A car" 32
<s>
A car was driving down the road. He was carrying a big box. Suddenly, he heard a loud noise. He stopped and looked around. He saw a big truck coming towards him. He was scared.
The truck stopped and the driver got out. He said, "What are you doing here?"
The driver said, "I'm here to deliver this box. I'm going to take it to the store."
The driver was very scared. He said, "I'm sorry. I didn't know it was yours."
The driver said, "It's okay. I'm just going to deliver it to the store."
The driver drove away. The truck drove slowly and carefully. The driver was very careful. He drove slowly and carefully.
When the truck arrived, the driver opened the box. Inside was a big surprise! It was a big cake!
The driver was so happy. He said, "Thank you for delivering this cake. I'm so glad I could help."
The driver smiled and said, "You're welcome. I'm glad I could help."
<s>
Once upon a time, there was a little girl named
achieved tok/s: 138.511678

real    0m1.862s
user    0m3.498s
sys     0m0.033s
[linux@linux llama2.c]$ time ./run /tmp/out/model.bin 0 0 "A car" 64
<s>
A car was driving down the road. He was carrying a big box. Suddenly, he heard a loud noise. He stopped and looked around. He saw a big truck coming towards him. He was scared.
The truck stopped and the driver got out. He said, "What are you doing here?"
The driver said, "I'm here to deliver this box. I'm going to take it to the store."
The driver was very scared. He said, "I'm sorry. I didn't know it was yours."
The driver said, "It's okay. I'm just going to deliver it to the store."
The driver drove away. The truck drove slowly and carefully. The driver was very careful. He drove slowly and carefully.
When the truck arrived, the driver opened the box. Inside was a big surprise! It was a big cake!
The driver was so happy. He said, "Thank you for delivering this cake. I'm so glad I could help."
The driver smiled and said, "You're welcome. I'm glad I could help."
<s>
Once upon a time, there was a little girl named
achieved tok/s: 139.802632

real    0m1.846s
user    0m3.495s
sys     0m0.029s

@trholding changed the title from "run.c - Output buffering" to "run.c - Speed up with output buffering" on Jul 31, 2023
@twobob commented Jul 31, 2023

yeah. I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack:
https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1. Can confirm that output is obviously a bottleneck on some lumbering systems. It felt like the best solution might be waiting for a space, punctuation character, or BOS before emitting the next token.
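A rough sketch of that boundary-gated idea (hypothetical; token == 1 as BOS follows run.c's convention, everything else here is assumed):

    /* print the decoded token, but only flush at a word/sentence boundary */
    printf("%s", token_str);
    size_t n = strlen(token_str);
    char last = (n > 0) ? token_str[n - 1] : '\0';
    if (last == ' ' || last == '.' || last == '!' || last == '?' || token == 1 /* BOS */) {
        fflush(stdout);
    }

As noted in the replies below, the flush cadence then depends on the text itself, which makes the speed-up variable.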

@trholding (Contributor Author) commented:

> yeah. I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1 can confirm that output is obviously a bottleneck on some lumbering systems. It felt like the best solution might be waiting for a space, punctuation character, or BOS before emitting the next token.

I can confirm that only-dump is fast when redirecting output to a file.

Waiting for punctuation could make it variable-speed; the speed-up may vary wildly across outputs. I tested a similar approach with line buffering: as the length of the lines varied, so did the speed-up.

@trholding (Contributor Author) commented:

> I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack
> https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1

I think that file output is a good idea; maybe take a filename as the last parameter and write to that file.

@twobob commented Jul 31, 2023

> > yeah. I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1 can confirm that output is obviously a bottleneck on some lumbering systems. It felt like the best solution might be waiting for a space, punctuation character, or BOS before emitting the next token.
>
> I can confirm that only-dump is fast when redirecting output to a file.
>
> Waiting for punctuation could make it variable-speed; the speed-up may vary wildly across outputs. I tested a similar approach with line buffering: as the length of the lines varied, so did the speed-up.

yeah, that was my finding. Obviously, as the token-length processing debt increases, so does the time to output, so my naive metric was likely conflating two problems and should really be calculated with the relevant non-linear function in mind. That aside, anything that speeds up the output gets a thumbs up from me.

@twobob commented Jul 31, 2023

> > I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack
> > https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1
>
> I think that file output is a good idea; maybe take a filename as the last parameter and write to that file.

I was thinking a reversible, filesystem-safe name automatically generated from the prompt, mixed with a prefixed timestamp, might be nice.

Original: inbox/20230731142853_?This could be a prompt.txt
Encoded: aW5ib3gvMjAyMzA3MzExNDI4NTNfP19UaGlzIGNvdWxkIGJlIGEgcHJvbXB0LnR4dA==
Decoded: inbox/20230731142853_?_This could be a prompt.txt

sort of thing or whatever. anyway, sorry for the OT noise.

@twobob commented Aug 1, 2023

built with clang -Ofast -fopenmp -D_WIN32 -o run.exe -I. run.c win.c

set OMP_NUM_THREADS=4 && run.exe ../out/model110M.bin 0 0 "and away they went" False 2
achieved tok/s: 67.6 for first BOS

set OMP_NUM_THREADS=4 && run.exe ../out/model110M.bin 0 0 "and away they went" False 4
achieved tok/s: 67.6 for first BOS

set OMP_NUM_THREADS=4 && run.exe ../out/model110M.bin 0 0 "and away they went" False 8
achieved tok/s: 67.0 for first BOS

set OMP_NUM_THREADS=4 && run.exe ../out/model110M.bin 0 0 "and away they went" False 1
achieved tok/s: 66.9 for first BOS

With the latest updates as of 01/08/2023, perhaps this has become more moot.

Increased output buffer from 2048 to 4096

This fixes an output buffer overflow and garbled output when the output is larger than the buffer, in rare cases such as when blocks of token sequences are repeated during inference.

./run stories110M.bin 0 0 "A big dog" 256
@trholding (Contributor Author) commented:

> run.exe ../out/model110M.bin 0 0 "and away they went" False 1

What is the "False" parameter for?

Okay, I tested on a server (aarch64 / VPS, 4 cores):

./run storiesXXXM.bin 0 0 "a car" 256

On the 110M model, this does not improve speed: achieved tok/s 75.459172 vs 74.443312 with buffering enabled (256).
On 42M, negligible speed-up: achieved tok/s 184.957512 vs 189.620019 with buffering enabled (256).
On 15M, some speed-up: achieved tok/s 615.942029 vs 648.854962 with buffering enabled (256).

On the lowest-end device (x86_64 / dual core):

./run /tmp/out/model.bin 0 0 "a car" 256

On 15M, nearly 2x the speed: achieved tok/s 76.119403 vs 140.109890 with buffering enabled (256).

Making outbuff static auto-initializes it to zero, so the memset can be avoided.

Ref: karpathy#193 (comment)
run.c (outdated review thread):
@@ -570,7 +577,7 @@ int main(int argc, char *argv[]) {
// following BOS token (1), sentencepiece decoder strips any leading whitespace (see PR #89)
char *token_str = (token == 1 && vocab[next][0] == ' ') ? vocab[next]+1 : vocab[next];
printf("%s", token_str);
fflush(stdout);
if (bufferflush==pos && strlen(outbuff)<=4096) { fflush(stdout); bufferflush+=buffertokens; } // flush after every n tokens
Contributor:

  1. outbuff is not necessarily zero-terminated
  2. flush at the end

Suggested change:
- if (bufferflush==pos && strlen(outbuff)<=4096) { fflush(stdout); bufferflush+=buffertokens; } // flush after every n tokens
+ if (bufferflush==pos || pos == steps) { fflush(stdout); bufferflush+=buffertokens; } // flush after every n tokens

Contributor Author (trholding):

Logic:

  1. On certain systems, writing to the console/screen is relatively (and variably) slow compared to the computation.
  2. Currently each token is written to the screen and flushed, so it inherits the cost of 1.
  3. We speed things up by batching the number of tokens flushed to the screen.
  4. We expose the number of tokens to batch as an argument, for flexibility.
  5. We need an appropriately sized buffer so that we do not encounter a buffer overflow.
  6. We do not know beforehand the batch size, the size of each token, or the context size.
  7. We estimate a safe buffer size as: context size * (average token length + safe margin).
  8. Context size is determined from the model's config.seq_len parameter.
  9. BPE subword units are usually 2-6 chars. Better to err on the safe side: I used 6 as the average token length and 2 as the safe margin.
  10. That explicitly sized buffer and its calculated size are passed to setvbuf to buffer the output; setvbuf will round it up to the next power of 2.
  11. We are not asking setvbuf to create a buffer for us (the NULL parameter), nor are we passing BUFSIZ, because our purpose is a buffer that can hold context size * (average token length + safe margin) in the maximum case, i.e. batch size = context size.
  12. Once that is set, each time the buffer fills with a batch of the required number of tokens, it is flushed. (A sketch of this sizing is shown below.)
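A sketch of that sizing logic (illustrative only: AVG_TOKEN_LEN and SAFE_MARGIN are assumed names; config.seq_len mirrors run.c; the PR ultimately used a static array instead, since MSVC lacks VLAs):

    #define AVG_TOKEN_LEN 6   /* BPE subwords are usually 2-6 chars */
    #define SAFE_MARGIN   2

    /* worst case: every position in the context emits a token */
    size_t bufsize = (size_t)config.seq_len * (AVG_TOKEN_LEN + SAFE_MARGIN);
    char *outbuff = malloc(bufsize);  /* must stay valid while stdout is in use */
    /* setvbuf must run before any output on stdout */
    if (outbuff == NULL || setvbuf(stdout, outbuff, _IOFBF, bufsize) != 0) {
        fprintf(stderr, "Error: buffer allocation!\n");
        exit(EXIT_FAILURE);
    }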

Your suggestions were helpful for the latest commit. These are my findings:

In if (bufferflush==pos && strlen(outbuff)<=4096), the && strlen(outbuff)<=4096 check is not required and is wrong; besides, reading a buffer managed by setvbuf is undefined behaviour.

In if (bufferflush==pos || pos == steps), the || pos == steps would not be needed, as the buffer can now already hold the maximum number of steps (config.seq_len).

You were right about a bug, though.

The bug was a hard-coded buffer size that did not take context size * (average token length + safe margin) into account. Even the increase to 4096 would have caused an overflow on Meta Llama 2 7B.

Thank you.

Contributor Author (trholding):

Okay, this doesn't build with MSVC. Need to figure out an alternative.

Contributor:

"The default buffer size BUFSIZ is expected to be the most efficient buffer size for file I/O on the implementation", you need at least some benchmarks to second-guess it. For purity, safe buffer size would be context size * (max token length). There is no guarantee that it's smaller than 4096, which is an unnamed constant you use multiple times. But it's not clear that such purity is needed. In fact, I am not sure even sure that flushing after max tokens vs max bytes (which would have a much simpler implementation) is a right approach.

My proposed || pos == steps is there to flush at the end of the loop, although I just realized it's not needed (sorry about that), because of:

llama2.c/run.c, line 584 in af8708d:

printf("\nachieved tok/s: %f\n", (steps-1) / (double)(end-start)*1000);

Contributor Author (trholding):

If you have benchmarks on BUFSIZ, please share them here; it would be of great help! Also, kindly test with BUFSIZ, steps 0, and various buffer_tokens parameters such as 4, 8, 16, 32, 64, ...

Usage:

run <checkpoint_file> [temperature] [steps] [prompt] [buffer_tokens]

Where buffer_tokens is the number of tokens to be buffered.

Fixed a buffer overflow bug by changing the hardcoded size of outbuff to one computed at runtime.

MSVC does not support VLAs, so the outbuff size is now a hard-coded value again:

Context size of Meta Llama 2 * (avg token length + margin) = 4096 * (6 + 2) = 32768 bytes

meh
@karpathy (Owner) commented Aug 5, 2023

Surprising to me that printing is slow :(
Can we not do something much simpler and just have some buffer that we write to, and then print it all at once using the existing code?

There would be no need for complexity like this

    if (setvbuf(stdout, outbuff, _IOFBF, sizeof(outbuff)) != 0) {
        puts("Error: Buffer allocation!"); exit(EXIT_FAILURE);
    }

@trholding (Contributor Author) commented:

> Surprising to me that printing is slow :( Can we not do something much simpler and just have some buffer that we write to, and then print it all at once using the existing code?
>
> There would be no need for complexity like this:
>
>     if (setvbuf(stdout, outbuff, _IOFBF, sizeof(outbuff)) != 0) {
>         puts("Error: Buffer allocation!"); exit(EXIT_FAILURE);
>     }

In practice this check is not required, as buffer allocation never failed in testing...

setvbuf(stdout, outbuff, _IOFBF, sizeof(outbuff)) worked fine.

I am thinking. I'll experiment, check, and see if there is another way with fewer lines of code.

@karpathy (Owner) commented Aug 5, 2023

The complexity is from: what is setvbuf, and what is _IOFBF? Are these on all platforms? Any external calls or constants are a risk.

@trholding (Contributor Author) commented:

setvbuf has been part of the C standard for a long time. It is available on all platforms, hence portable.

setvbuf is used to supply an output buffer of the required size;
printf and other functions then automatically use this buffer,
and one can flush as needed to print to the screen.

_IOFBF means full I/O buffering.

Nothing is at risk, as the bugs have been fixed. But the buffer has to be a static char array.

Ref: https://www.gnu.org/software/libc/manual/html_node/Controlling-Buffering.html
Ref: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setvbuf?view=msvc-170

C17 standard (ISO/IEC 9899:2018): 7.21.5.6 The setvbuf function (p: 225)
C11 standard (ISO/IEC 9899:2011): 7.21.5.6 The setvbuf function (p: 308)
C99 standard (ISO/IEC 9899:1999): 7.19.5.6 The setvbuf function (p: 273-274)
C89/C90 standard (ISO/IEC 9899:1990): 4.9.5.6 The setvbuf function

@twobob commented Aug 6, 2023

> > run.exe ../out/model110M.bin 0 0 "and away they went" False 1
>
> What is the "False" parameter for?
>
> Okay, I tested on a server (aarch64 / VPS, 4 cores):
>
> ./run storiesXXXM.bin 0 0 "a car" 256
>
> On the 110M model, this does not improve speed: achieved tok/s 75.459172 vs 74.443312 with buffering enabled (256). On 42M, negligible speed-up: achieved tok/s 184.957512 vs 189.620019 with buffering enabled (256). On 15M, some speed-up: achieved tok/s 615.942029 vs 648.854962 with buffering enabled (256).
>
> On the lowest-end device (x86_64 / dual core):
>
> ./run /tmp/out/model.bin 0 0 "a car" 256
>
> On 15M, nearly 2x the speed: achieved tok/s 76.119403 vs 140.109890 with buffering enabled (256).

I have the option to save files on my builds...

./run
Usage: run [options]
Example: run model.bin -t 0.9 -n 256 -p "Once upon a time"
Options:
-t temperature, default 0.9
-s random seed, default time(NULL)
-n number of steps to run for, default 256. 0 = max_seq_len
-p prompt string, default none
-o top_p, default 1.0
-k print_tokens, 1 or 0 flag default 1
-f saveFileBool, 1 or 0 flag default 1
-l saveLogBool, 1 or 0 flag saves timestamps for tokens gen default 0
-b singleBOS, 1 or 0 flag default 1
-d dirname string, default none

the per-token logs let me make helpful graphs

[image: per-token timing graph]

15M does seem to have a very slight curve to it that the others do not, where it "ramps back up"; not sure if that leans into your figures above. Inconclusive from the limited data I've run so far.

Okay, I'll shove this back in the mix now it's been updated and run some more tests tomorrow.
Thanks

ozabluda added a commit to ozabluda/llama2.c that referenced this pull request Aug 6, 2023
@twobob commented Aug 9, 2023

> > I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack
> > https://gist.github.com/twobob/8f00f184bf3e9d89b70ba58156ac14d1
>
> I think that file output is a good idea; maybe take a filename as the last parameter and write to that file.

Nothing fancy, but it passes my "assert that the decoded string matches the first 75 characters of the original prompt" test and does the job I needed. Might tidy it up, but something like that:
https://github.com/twobob/llama2.c/blob/master/base64_utils.c#L70
You end up with pseudo-base64 encoding, enough for a recoverable 75-char front index should that ever matter.
[image: example of encoded filenames]
Gonna go with fast and dirty until something official rocks up.

@karpathy (Owner) commented:

Could we not just sprintf into a buffer and when the buffer grows beyond a threshold printf it in chunks? Isn't that much simpler?
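A minimal sketch of that simpler scheme (hypothetical names like emit; not the PR's code):

    #include <stdio.h>
    #include <string.h>

    static char out[8192];    /* plain output buffer */
    static size_t used = 0;

    /* append one decoded token; print the buffer in a chunk past a threshold */
    void emit(const char *token_str) {
        used += (size_t)sprintf(out + used, "%s", token_str);
        if (used > sizeof(out) - 512) {   /* keep headroom for the next token */
            fwrite(out, 1, used, stdout);
            used = 0;
        }
    }
    /* call fwrite(out, 1, used, stdout) once more after the last token */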

@twobob commented Aug 13, 2023

> Could we not just sprintf into a buffer and when the buffer grows beyond a threshold printf it in chunks? Isn't that much simpler?

All for that; it's what I do. 2-16 works fine. FWIW, I also do the same when logging timestamps, for the same reasons.

Roughly this, or similar; not horrible code:

    log_buffer[log_buffer_count] = timestamp;
    log_buffer_count++;
    if (log_buffer_count == LOG_BUFFER_SIZE) {
        for (size_t i = 0; i < LOG_BUFFER_SIZE; i++) {
            fprintf(timelog_file, "%ld,", log_buffer[i]);
        }
        fflush(timelog_file);
        log_buffer_count = 0;
    }

These more fine-grained efforts obviously have merit; however, that would seem good enough (give or take the missing code).

@karpathy (Owner) commented:

I'll punt on this PR. I think this is only an issue with the tiniest models, and for those we're going so fast that this doesn't even matter. I think.

@karpathy closed this Aug 14, 2023