## Description
When training a TrainableComponent on the GPU, using `span.sent` does not work, even with `eds.sentences()` in the `nlp` pipeline.
When setting `context_getter=lambda span: span.sent` in the trainable component, training stops with the error:

```
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with:
`nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence
boundaries by setting `doc[i].is_sent_start`.
```
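For context, this E030 error is easy to reproduce in plain spaCy whenever sentence boundaries are missing; a minimal illustration (independent of EDS-NLP):

```python
import spacy

# A blank pipeline sets no sentence boundaries, so `span.sent` raises E030.
doc = spacy.blank("en")("No sentencizer has run on this text.")
span = doc[0:2]
span.sent  # ValueError: [E030] Sentence boundaries unset...
```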
## How to reproduce the bug
```python
import edsnlp
import edsnlp.pipes as eds
import torch
from edsnlp.utils.batching import stat_batchify

# Pipeline definition
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(
    eds.span_classifier(
        embedding=eds.span_pooler(
            pooling_mode="mean",
            embedding=eds.transformer(
                model="prajjwal1/bert-tiny",
            ),
        ),
        span_getter=["ents", "sc"],
        attributes=[
            "_.negation",
        ],
        context_getter=lambda span: span.sent,
    ),
    name="span_classifier",
)

training_data = (
    ...
)
nlp.post_init(training_data)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Preprocess, batch and collate the training stream
batches = (
    training_data.loop()
    .shuffle("dataset")
    .map(nlp.preprocess, kwargs={"supervision": True})
    .batchify(batch_size=32 * 128, batch_by=stat_batchify("tokens"))
    .map(nlp.collate, kwargs={"device": device})
)
batches = batches.set_processing(num_cpu_workers=1, process_start_method="spawn")

# Move the model to the GPU
nlp.to(device)

optimizer = torch.optim.AdamW(
    params=nlp.parameters(),
    lr=3e-4,
)

max_steps = 1000  # example value

# Training loop: accumulate the loss of every trainable component, then update
iterator = iter(batches)
for step in range(max_steps):
    batch = next(iterator)
    optimizer.zero_grad()
    with nlp.cache():
        loss = torch.zeros((), device=device)
        for name, component in nlp.torch_components():
            output = component(batch[name])
            if "loss" in output:
                loss += output["loss"]
    loss.backward()
    optimizer.step()
```
From my understanding, `nlp.to(device)` only moves the trainable components to the GPU, and since the training loop only calls `nlp.preprocess` and the trainable components themselves, `eds.sentences` is never actually applied to the training data. I imagine it would be possible to work around this by running another pipeline over the data beforehand, so that sentence boundaries are already set when it is used as training data for the loop (see the sketch below). However, I have not really tried that yet, since issue #426 describes another problem with using functions as `context_getter`, for which I have not found a solution so far.
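Something like the following (untested) sketch is what I have in mind, assuming `training_data` is an EDS-NLP stream of documents and that `map_pipeline` can be used to apply a second pipeline to it lazily:

```python
import edsnlp
import edsnlp.pipes as eds

# Hypothetical workaround: set sentence boundaries with a separate, lightweight
# pipeline before the training stream is preprocessed, so that `span.sent` is
# already defined when the context_getter runs.
sent_nlp = edsnlp.blank("eds")
sent_nlp.add_pipe(eds.sentences())

# Lazily apply the sentencizer to every document in the stream.
training_data = training_data.map_pipeline(sent_nlp)
```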
## Your Environment

- Python Version Used: 3.7.16
- EDS-NLP Version Used: 0.17.2
- spaCy Version Used: 3.7.5