OOM GPU with en_core_web_trf #12305
Unanswered
jogonba2 asked this question in Help: Other Questions
Hi!

I'm trying to train the en_core_web_trf model for NER with custom labels on a GPU with 48 GB of memory. In tests with a small sample of 50 examples, 16 labels, and batches of 8 examples, GPU memory is exhausted. Sequence lengths are around 100 tokens, with some long docs of more than 3000 tokens. The OS is Ubuntu 22.04 with CUDA 11.6. This is my training code:
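The code block itself did not survive this capture. As a stand-in, here is a minimal sketch of the kind of update loop the question describes (en_core_web_trf, custom NER labels, batches of 8); `train_data` and its layout are assumptions for illustration, not the author's actual code:

```python
import random

import spacy
from spacy.training import Example

spacy.require_gpu()
nlp = spacy.load("en_core_web_trf")
ner = nlp.get_pipe("ner")

# Placeholder corpus; the real data (50 examples, 16 custom labels) is not shown.
train_data = [
    ("Alice moved to Berlin.", {"entities": [(0, 5, "PER_X"), (15, 21, "LOC_X")]}),
]

# Register the custom labels with the pretrained NER component.
for _, annotations in train_data:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.resume_training()
batch_size = 8

for epoch in range(10):
    random.shuffle(train_data)
    for i in range(0, len(train_data), batch_size):
        batch = train_data[i : i + batch_size]
        examples = [
            Example.from_dict(nlp.make_doc(text), annotations)
            for text, annotations in batch
        ]
        losses = {}
        nlp.update(examples, sgd=optimizer, losses=losses)
```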
I can't find the cause of such high memory consumption. After truncating all the examples to 200 characters and calling `torch.cuda.empty_cache()` after each batch, memory usage is around 10 GB. Is this expected? That seems like rather heavy resource usage. Thanks!
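For concreteness, the truncation-plus-`empty_cache()` workaround described above might look like this, continuing from the sketch earlier. The 200-character cutoff is from the question; everything else is assumed. Note that entities ending past the cutoff have to be dropped so the character offsets still fit the shortened doc:

```python
import torch
from spacy.training import Example

# Continues from the sketch above (nlp, optimizer, train_data, batch_size).
MAX_CHARS = 200  # cutoff mentioned in the question

for i in range(0, len(train_data), batch_size):
    batch = train_data[i : i + batch_size]
    examples = []
    for text, annotations in batch:
        # Entities ending past the cutoff no longer fit the shortened doc,
        # so drop them to keep the character offsets valid.
        kept = [
            (start, end, label)
            for start, end, label in annotations["entities"]
            if end <= MAX_CHARS
        ]
        doc = nlp.make_doc(text[:MAX_CHARS])
        examples.append(Example.from_dict(doc, {"entities": kept}))
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)
    torch.cuda.empty_cache()  # return PyTorch's cached GPU blocks to the driver
```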
Replies (2 comments, 3 replies):

- Hi @jogonba2! Can you try adding a …

- Note also that it looks like you're loading all of your data in memory in this loop. …
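The second reply is cut off, but the point it raises (the whole corpus held in a list) suggests streaming examples from disk instead. A sketch of that, assuming a JSONL file with `text` and `entities` fields; the file name and record layout are made up for illustration:

```python
import json
from typing import Dict, Iterator, Tuple

import spacy
from spacy.training import Example
from spacy.util import minibatch

spacy.require_gpu()
nlp = spacy.load("en_core_web_trf")
optimizer = nlp.resume_training()


def stream_records(path: str) -> Iterator[Tuple[str, Dict]]:
    """Yield one (text, annotations) pair at a time instead of building a list."""
    with open(path, encoding="utf8") as f:
        for line in f:
            record = json.loads(line)
            yield record["text"], {"entities": record["entities"]}


# minibatch() pulls from the generator lazily, so only one batch of raw
# records sits in Python memory at a time.
for batch in minibatch(stream_records("train.jsonl"), size=8):
    examples = [
        Example.from_dict(nlp.make_doc(text), annotations)
        for text, annotations in batch
    ]
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)
```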