run.c - Speed up with output buffering #193
Conversation
Previously, much of the time was spent writing to the screen, which is relatively slow. By enabling output buffering, more work can be performed: groups of computed tokens are written to the buffer, which is relatively fast, and the buffer is then flushed periodically to the screen/console. Testing with the smallest model, an interactive tokens/s speed-up of ~14% on standard builds to ~84% on OpenMP builds has been achieved.
Usage: run <checkpoint_file> [temperature] [steps] [prompt] [buffer_tokens]
Where buffer_tokens is the number of tokens to be buffered. Multiples of 2 seem to be ideal; 64 worked well for my use case on a low-end machine. The speed-up may depend on model size and OS.
Example: ./run model.bin 0 0 "A car" 64
Output: Speed comparison without token buffering and with buffering via buffer_tokens sizes 16, 32, 64
Yeah. I could get like 200+ tok/s on the original release that ran at 8 tok/s on the grodder test machine with a very basic "only dump the output" hack.
I can confirm that only-dump is fast when redirecting output to a file. Waiting for punctuation could make the speed variable; the speed-up may vary wildly across outputs. I tested a similar approach with line buffering: as the length of the lines varied, so did the speed-up.
I think that file output is a good idea; maybe take the filename as the last parameter and write to that file.
Yeah, that was my finding. Obviously, as the token-length processing debt increases, so does the time to output, so my naive metric was likely conflating two problems and should really be calculated with the relevant non-linear function in mind. That aside, anything that speeds up the output gets a thumbs up from me.
I was thinking a reversible, file-safe name automatically generated from the prompt, combined with a prefixed timestamp, might be nice. Original: inbox/20230731142853_?This could be a prompt.txt
built with clang -Ofast -fopenmp -D_WIN32 -o run.exe -I. run.c win.c
With the latest updates as of 01/08/2023, perhaps this has become moot.
Increased the output buffer from 2048 to 4096. This fixes output buffer overflow and garbled output when the output is larger than the buffer, in rare cases such as when blocks of token sequences are repeated during inference.
./run stories110M.bin 0 0 "A big dog" 256
What is the "False" parameter for?
Okay, I tested on a server (AArch64 / 4-core VPS): ./run storiesXXXM.bin 0 0 "a car" 256
On the 110M model, this does not improve speed: achieved tok/s 75.459172 vs 74.443312 with buffer_tokens 256 enabled.
On the lowest-end device (x86_64 / dual core): ./run /tmp/out/model.bin 0 0 "a car" 256
On the 15M model, nearly 2x speed: achieved tok/s 76.119403 vs 140.109890 with buffer_tokens 256 enabled.
Making outbuff static auto-initializes it to zero, so the memset can be avoided. Ref: karpathy#193 (comment)
run.c
@@ -570,7 +577,7 @@ int main(int argc, char *argv[]) {
    // following BOS token (1), sentencepiece decoder strips any leading whitespace (see PR #89)
    char *token_str = (token == 1 && vocab[next][0] == ' ') ? vocab[next]+1 : vocab[next];
    printf("%s", token_str);
-   fflush(stdout);
+   if (bufferflush==pos && strlen(outbuff)<=4096) { fflush(stdout); bufferflush+=buffertokens; } // flush after every n tokens
- outbuf is not necessarily zero terminated
- flush at the end
Suggested change:
if (bufferflush==pos && strlen(outbuff)<=4096) { fflush(stdout); bufferflush+=buffertokens; } // flush after every n tokens
if (bufferflush==pos || pos == steps) { fflush(stdout); bufferflush+=buffertokens; } // flush after every n tokens
Logic:
- On certain systems, writing to the console/screen is slow, and variably so, compared to the computation.
- Currently each token is written to the screen and flushed, so it inherits the slowness described in the first point.
- We speed things up by batching the number of tokens flushed to the screen.
- We expose the number of tokens to batch as an argument, for flexibility.
- We need to size an appropriate buffer so that we do not encounter a buffer overflow.
- We do not know beforehand the batch size, the size of each token, or the context size.
- We estimate a safe buffer size as: context size * (average token length + safe margin).
- The context size is determined from the model's config.seq_len parameter.
- BPE subword units are usually 2-6 chars. Better to err on the large side: I used 6 as the average token length and 2 as the safe margin.
- That explicitly sized buffer and its calculated size are passed as parameters to setvbuf to buffer the output. setvbuf will round it up to the next power of 2.
- We are not requesting that setvbuf allocate a buffer for us (with the NULL parameter), nor are we passing BUFSIZ, because our purpose is a buffer that can hold context size * (average token length + safe margin) for the maximum case, i.e. batch size = context size.
- Once that is set, each time the buffer is filled with a batch of the required number of tokens, it is flushed (see the sketch after this list).
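A minimal, self-contained sketch of the scheme described above (not the actual run.c code): seq_len stands in for config.seq_len from the checkpoint, buffertokens would come from argv, and the loop body stands in for the real token printing.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int seq_len = 256;                      /* context size (config.seq_len)          */
    int buffertokens = 64;                  /* tokens per flush batch (CLI argument)  */
    int avg_token_len = 6, margin = 2;      /* BPE pieces are usually 2-6 chars       */
    size_t outbuff_size = (size_t)seq_len * (avg_token_len + margin);

    char *outbuff = malloc(outbuff_size);   /* lives until process exit               */
    if (outbuff) setvbuf(stdout, outbuff, _IOFBF, outbuff_size);  /* full buffering   */

    int bufferflush = buffertokens;         /* next position that triggers a flush    */
    for (int pos = 1; pos <= seq_len; pos++) {
        printf("tok ");                     /* stands in for printf("%s", token_str)  */
        if (bufferflush == pos) { fflush(stdout); bufferflush += buffertokens; }
    }
    printf("\n");                           /* anything remaining is flushed at exit  */
    return 0;
}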
Your suggestions were helpful for the latest commit. These are my findings:
In if (bufferflush==pos && strlen(outbuff)<=4096), the && strlen(outbuff)<=4096 check is not required and is wrong; also, reading a buffer managed by setvbuf is undefined behaviour.
In if (bufferflush==pos || pos == steps), the || pos == steps is not needed, as the buffer can now hold the maximum steps (config.seq_len).
You were right about a bug. The bug was a hard-coded buffer size that did not take into account context size * (average token length + safe margin). Even the increase to 4096 would have caused the overflow on Meta Llama 2 7B: with its context size of 4096, the estimate is 4096 * (6 + 2) = 32768 bytes, far larger than a 4096-byte buffer.
Thank you.
Okay, this doesn't build with MSVC. Need to figure out an alternative.
"The default buffer size BUFSIZ is expected to be the most efficient buffer size for file I/O on the implementation", you need at least some benchmarks to second-guess it. For purity, safe buffer size would be context size * (max token length). There is no guarantee that it's smaller than 4096, which is an unnamed constant you use multiple times. But it's not clear that such purity is needed. In fact, I am not sure even sure that flushing after max tokens vs max bytes (which would have a much simpler implementation) is a right approach.
My proposed || pos == steps is there to flush at the end of the loop, although I just realized it's not needed (sorry about that) because of:
Line 584 in af8708d
printf("\nachieved tok/s: %f\n", (steps-1) / (double)(end-start)*1000);
If you have benchmarks on BUFSIZ, please share them here. It would be of great help! Also kindly test with BUFSIZ, steps 0, and various buffer_tokens parameters such as 4, 8, 16, 32, 64, ...
Usage:
run <checkpoint_file> [temperature] [steps] [prompt] [buffer_tokens]
Where buffer_tokens is the number of tokens to be buffered.
Fixed a buffer overflow bug by changing the hardcoded size of outbuff to one computed at runtime.
MSVC does not support VLAs, so the outbuff size is now a hard-coded value: context size of Meta Llama 2 * (avg token length + margin) = 4096 * (6 + 2). Meh.
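For illustration, a minimal sketch of such a hard-coded, MSVC-friendly buffer (the constant names are made up here; only the 4096 * (6 + 2) sizing comes from the comment above). Static storage is zero-initialized, so no memset is needed:

#include <stdio.h>

enum { SEQ_LEN = 4096, AVG_TOKEN_LEN = 6, MARGIN = 2 };   /* Llama 2 context, size estimate */
static char outbuff[SEQ_LEN * (AVG_TOKEN_LEN + MARGIN)];  /* 32768 bytes, no VLA, zeroed    */

int main(void) {
    setvbuf(stdout, outbuff, _IOFBF, sizeof(outbuff));    /* install full output buffering  */
    printf("hello, buffered world\n");
    return 0;
}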
Surprising to me that printing is slow :( There would be no need for complexity like this.
In practice this check is not required as buffer allocation never failed in testing...
I am thinking it over. I'll experiment and see if there is another way, with fewer lines of code.
The complexity is from: what is ...
setvbuf has been part of the C standard for a long time and is available on all platforms, hence portable. setvbuf is used to set an output buffer of the required size; _IOFBF means full I/O buffering. Nothing is at risk, as the bugs have been fixed. But the buffer has to be a static char array. Ref: https://www.gnu.org/software/libc/manual/html_node/Controlling-Buffering.html
I have the option to save files on my builds... ./run ... The per-token logs let me make helpful graphs. 15M does seem to have a very slight curve to it that the others do not, where it "ramps back up"; not sure if that leans into your figures above. Inconclusive from the limited data I've run so far. Okay, I'll shove this back in the mix now that it's been updated and run some more tests tomorrow.
Inspired by karpathy#193, see that for details.
Nothing fancy, but it passes my "assert that the decoded string matches the first 75 characters of the original prompt" check and does the job I needed. Might tidy it up, but something like that.
Could we not just sprintf into a buffer and, when the buffer grows beyond a threshold, printf it in chunks? Isn't that much simpler?
All for that; it's what I do. 2-16 works fine. FWIW I also do the same when logging timestamps, for the same reasons. Roughly this, or similar, not horrible code.
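The snippet referenced above isn't included in this thread. As a hypothetical sketch of the "sprintf into a buffer, print it in chunks" idea (the names emit/chunk and the 4096-byte threshold are illustrative, not from the PR):

#include <stdio.h>
#include <string.h>

static char chunk[8192];         /* accumulation buffer   */
static size_t chunk_len = 0;     /* bytes currently held  */

static void emit(const char *token_str) {
    size_t n = strlen(token_str);
    /* print the accumulated text once it passes the threshold (or would overflow) */
    if (chunk_len >= 4096 || chunk_len + n + 1 >= sizeof(chunk)) {
        printf("%s", chunk);
        fflush(stdout);
        chunk_len = 0;
    }
    chunk_len += sprintf(chunk + chunk_len, "%s", token_str);  /* append the token */
}

int main(void) {
    for (int i = 0; i < 1000; i++) emit("tok ");   /* stand-in for the decode loop */
    printf("%s\n", chunk);                         /* print whatever remains       */
    return 0;
}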
These more fine-grained efforts obviously have merit; however, that would seem good enough (give or take the missing code).
I'll punt on this PR. I think this is only an issue with the tiniest models, and for those we're going so fast that this doesn't even matter. I think.