Wrong TRF alignment indices #10794
I noticed a strange behavior for the tokens returned by the transformer. The indices for some tokens are sometimes incorrectly assigned multiple times.
How to reproduce the behaviour
This will lead to the following results:
Your Environment
Replies: 1 comment 3 replies
This is a little unexpected, but not directly a bug. It's allowed for transformer tokens to align to more than one spacy token.
To support both slow and fast transformers tokenizers in the same way in spacy-transformers, we're using a generic alignment algorithm (from spacy-alignments) to align the transformer tokens with the spacy tokens. For example, the tokens that are being aligned look like this:
['He', 'is', 'the', 'recipient', 'of', 'multiple', 'accolades', ',', 'including', 'a', 'Golden', 'Globe', 'Award']
['<s>', 'He', 'Ġis', 'Ġthe', 'Ġrecipient', 'Ġof', 'Ġmultiple', 'Ġaccol', 'ades', ',', 'Ġincluding', 'Ġa', 'ĠGolden', 'ĠGlobe', 'ĠAward', '</s>']
Unfortunately the roberta tokenizer uses the character … For fast tokenizers only, it would be possible to use the alignment that is returned by the tokenizer, but we haven't implemented this in …
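To illustrate what aligning the two token lists means, here is a minimal stdlib-only sketch. It is not the algorithm used by spacy-alignments: it simply assumes that both tokenizations spell out the same text once `<s>`, `</s>` and the `Ġ` marker are dropped, and then matches tokens by overlapping character spans. The helper names `char_spans` and `align` are hypothetical and exist only for this example.

```python
from typing import List, Tuple

# Assumptions for this sketch only: these markers carry no surface text.
SPECIAL_TOKENS = {"<s>", "</s>"}
SPACE_MARKER = "Ġ"


def char_spans(tokens: List[str], strip_markers: bool = False) -> List[Tuple[int, int]]:
    """Assign each token a (start, end) span in the whitespace-free text."""
    spans = []
    pos = 0
    for tok in tokens:
        text = tok
        if strip_markers:
            # Special tokens become empty; the space marker is dropped.
            text = "" if tok in SPECIAL_TOKENS else tok.replace(SPACE_MARKER, "")
        spans.append((pos, pos + len(text)))
        pos += len(text)
    return spans


def align(a_spans: List[Tuple[int, int]], b_spans: List[Tuple[int, int]]) -> List[List[int]]:
    """For each span in a, list the indices of the spans in b that overlap it."""
    result = []
    for a_start, a_end in a_spans:
        result.append([
            j for j, (b_start, b_end) in enumerate(b_spans)
            if a_start < b_end and b_start < a_end  # non-empty overlap
        ])
    return result


spacy_tokens = ['He', 'is', 'the', 'recipient', 'of', 'multiple', 'accolades',
                ',', 'including', 'a', 'Golden', 'Globe', 'Award']
trf_tokens = ['<s>', 'He', 'Ġis', 'Ġthe', 'Ġrecipient', 'Ġof', 'Ġmultiple',
              'Ġaccol', 'ades', ',', 'Ġincluding', 'Ġa', 'ĠGolden', 'ĠGlobe',
              'ĠAward', '</s>']

spacy_spans = char_spans(spacy_tokens)
trf_spans = char_spans(trf_tokens, strip_markers=True)

spacy2trf = align(spacy_spans, trf_spans)  # spacy token index -> transformer token indices
trf2spacy = align(trf_spans, spacy_spans)  # transformer token index -> spacy token indices

for i, token in enumerate(spacy_tokens):
    print(token, [trf_tokens[j] for j in spacy2trf[i]])
```

In this simplified version, 'accolades' maps to the two wordpieces 'Ġaccol' and 'ades', and the special tokens map to nothing. A generic alignment algorithm that does not know about tokenizer-specific markers has to decide what to do with them, so its output can differ from this sketch.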