Batch size consistency #11985
-
I have been working on an extension (TextDescriptives w. @HLasse) where we wish to calculate surprise (pseudo-perplexity) using masked language models. However, we noticed that the current setup for creating batches does not seem to produce consistent batch sizes. As far as I understand, when you use the `transformer` component, a batch is made up of a fixed number of texts, and each text is then split into spans that are all passed to the model together.
This seems to be problematic, as it might explode the actual batch size (the one passed to the model) given very long documents, thus leading to an uneven load on GPU memory. I might be misunderstanding something or have missed a crucial step somewhere.
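(For context, by pseudo-perplexity I mean masking each token in turn and averaging the negative log-probability of the true token under the masked LM. A rough standalone sketch using Hugging Face `transformers` directly, not the actual TextDescriptives code; the model name is just an example:)

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Example model only; any masked LM works the same way.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()


def pseudo_perplexity(text: str) -> float:
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nlls = []
    # Skip the special tokens at the start and end ([CLS]/[SEP] for BERT-style models).
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nlls.append(-log_probs[input_ids[i]].item())
    # exp of the mean negative log-likelihood over all masked positions
    return float(torch.exp(torch.tensor(sum(nlls) / len(nlls))))


print(pseudo_perplexity("The quick brown fox jumps over the lazy dog."))
```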
-
Yes, the memory load can be uneven if the text lengths vary a lot.
Currently, the smallest unit that `nlp.pipe` uses is a single text and it only has a setting to make batches with the same number of texts, so the presence of one very long text can lead to OOM errors for the batch containing that text. If you want to batch texts differently, you'd currently have to do it outside of `nlp.pipe`.
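For illustration, a minimal sketch of what batching outside of `nlp.pipe` could look like: group texts by length yourself and pass each group to `nlp.pipe`. The pipeline name and batch size below are just placeholders.

```python
import spacy

# Placeholder pipeline; anything with a transformer component applies.
nlp = spacy.load("en_core_web_trf")

texts = ["a short text", "another short one", "a much longer document " * 500]


def length_sorted_batches(texts, batch_size):
    # Sort by character length so very long texts end up together and you can
    # see (and control) which batches are heavy. Note that this loses the
    # original order, so keep track of indices if you need it.
    ordered = sorted(texts, key=len)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i : i + batch_size]


docs = []
for batch in length_sorted_batches(texts, batch_size=8):
    docs.extend(nlp.pipe(batch))
```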
The `transformer` is the only built-in component that splits texts up into spans for processing, and all other components like `ner` process each text as a whole.
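(For reference, the spans mentioned above are controlled by the transformer component's span getter; in the stock trf pipelines this is `spacy-transformers.strided_spans.v1` with `window = 128` and `stride = 96`. A sketch of overriding those values when loading a pipeline, assuming a trf pipeline is installed; the values are just examples:)

```python
import spacy

# Override the strided-span settings of the transformer component when loading.
# The keys follow the [components.transformer.model.get_spans] section of the
# default transformer config.
nlp = spacy.load(
    "en_core_web_trf",
    config={
        "components.transformer.model.get_spans.window": 64,
        "components.transformer.model.get_spans.stride": 48,
    },
)
```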
If you want more even memory usage, our current advice is to split your input into similar-sized texts, or just to avoid OOM, implement a max text length and split very long texts if necessary. It gets a little tricky because the memory usage depends on the number of tokens, and the number of tokens can be wildly different for transformer tokenizers vs. spaCy tokenizers (and by language), and for speed we don't want to run the transformer tokenizer in advance to do anything more flexible with the span lengths. So if […] The overlapping strided spans approach is basically okay for […]

As a side note, there is also the setting […]
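To make the max text length advice above concrete, a naive sketch (the character limit and the hard character split are illustrative; in practice you'd want to split at sentence or paragraph boundaries):

```python
import spacy

nlp = spacy.load("en_core_web_trf")  # placeholder pipeline

MAX_CHARS = 5_000  # illustrative limit; tune for your GPU


def split_long_text(text, max_chars=MAX_CHARS):
    # Naive hard split on a character limit; in practice, prefer splitting at
    # sentence or paragraph boundaries so the pieces stay meaningful.
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]


texts = ["a short text", "a very long document " * 2_000]
chunks = [chunk for text in texts for chunk in split_long_text(text)]
docs = list(nlp.pipe(chunks, batch_size=8))
```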