
Cannot use eds.sentences results in deep-learning training #427

@theoimbert-aphp

Description


When training a TrainableComponent on GPU, using span.sent does not work, even with eds.sentences() in the nlp pipeline.
When setting context_getter=lambda span: span.sent in the trainable component, the training stops with the following error:

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: 
`nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence 
boundaries by setting `doc[i].is_sent_start`.
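
For context, span.sent raises this error whenever no component has set is_sent_start on the document's tokens. A minimal standalone illustration (not part of the original report) with a bare pipeline:

import edsnlp

nlp = edsnlp.blank("eds")  # no sentencizer in this pipeline
doc = nlp("First sentence. Second sentence.")
doc[0:2].sent  # raises ValueError [E030]: sentence boundaries unset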

How to reproduce the bug

import torch

import edsnlp
import edsnlp.pipes as eds
from edsnlp.utils.batching import stat_batchify

# Pipeline definition
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(
    eds.span_classifier(
        embedding=eds.span_pooler(
            pooling_mode="mean",
            embedding=eds.transformer(
                model="prajjwal1/bert-tiny",
            ),
        ),
        span_getter=["ents", "sc"],
        attributes=[
            "_.negation",
        ],
        context_getter=lambda span: span.sent,
    ),
    name="span_classifier",
)

training_data = (
    ...
)

nlp.post_init(training_data)
device = "cuda" if torch.cuda.is_available() else "cpu"
batches = (
    training_data.loop()
    .shuffle("dataset")
    .map(nlp.preprocess, kwargs={"supervision": True})
    .batchify(batch_size=32 * 128, batch_by=stat_batchify("tokens"))
    .map(nlp.collate, kwargs={"device": device})
)
batches = batches.set_processing(num_cpu_workers=1, process_start_method="spawn")
# Move the model to the GPU
nlp.to(device)

optimizer = torch.optim.AdamW(
    params=nlp.parameters(),
    lr=3e-4,
)

iterator = iter(batches)

max_steps = 1000  # training step budget; value not specified in the original snippet

# Training loop: accumulate each trainable component's loss and backpropagate
for step in range(max_steps):
    batch = next(iterator)
    optimizer.zero_grad()
    with nlp.cache():
        loss = torch.zeros((), device=device)
        for name, component in nlp.torch_components():
            output = component(batch[name])
            if "loss" in output:
                loss += output["loss"]
        loss.backward()
        optimizer.step()

From my understanding, nlp.to(device) only sends the trainable components to the GPU, and because of that eds.sentences is never actually called on the training data. I imagine it would be possible to work around this by preparing training_data with sentence boundaries already set, i.e. by running another pipeline on it before using it in the training loop (see the sketch below). However, I have not really tried that yet, since issue #426 describes another problem with using functions as context_getter, for which I have not found a solution so far.
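
For what it's worth, a minimal sketch of that workaround, assuming the stream can be mapped through a second pipeline before preprocessing (untested; the map_pipeline placement here is my assumption, not a confirmed fix):

# Hypothetical workaround (untested): pre-annotate sentence boundaries on the
# stream with a separate rule-based pipeline, so that span.sent is already
# defined when nlp.preprocess runs.
sentencizer_nlp = edsnlp.blank("eds")
sentencizer_nlp.add_pipe(eds.sentences())

batches = (
    training_data.loop()
    .shuffle("dataset")
    .map_pipeline(sentencizer_nlp)  # set is_sent_start before preprocessing
    .map(nlp.preprocess, kwargs={"supervision": True})
    .batchify(batch_size=32 * 128, batch_by=stat_batchify("tokens"))
    .map(nlp.collate, kwargs={"device": device})
)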

Your Environment

  • Python Version Used: 3.7.16
  • EDS-NLP Version Used: 0.17.2
  • spaCy: 3.7.5
