Wrong TRF alignment indices #10794

azwierzc · 2022-05-12T12:40:06Z

I noticed strange behavior in the alignment data returned by the transformer component. Some transformer token indices are assigned to more than one spaCy token, which looks incorrect.

How to reproduce the behaviour

import spacy
nlp = spacy.load("en_core_web_trf")

text = """
John Christopher Depp II (born June 9, 1963) is an American actor, producer, musician and painter. He is the recipient of multiple accolades, including a Golden Globe Award and a Screen Actors Guild Award, in addition to nominations for three Academy Awards and two BAFTAs.

Depp made his debut in the horror film A Nightmare on Elm Street (1984), before rising to prominence as a teen idol on the television series 21 Jump Street (1987–1990). In the 1990s, Depp acted mostly in independent films, often playing eccentric characters. These included What's Eating Gilbert Grape (1993), Benny and Joon (1993), Dead Man (1995), Donnie Brasco (1997), and Fear and Loathing in Las Vegas (1998). Depp also began collaborating with director Tim Burton, starring in Edward Scissorhands (1990), Ed Wood (1994), and Sleepy Hollow (1999).

In the 2000s, Depp became one of the most commercially successful film stars by playing Captain Jack Sparrow in the Walt Disney swashbuckler film series Pirates of the Caribbean (2003–2017). He received critical praise for Finding Neverland (2004), and continued his commercially successful collaboration with Tim Burton with the films Charlie and the Chocolate Factory (2005), where he portrayed Willy Wonka, Corpse Bride (2005), Sweeney Todd: The Demon Barber of Fleet Street (2007), and Alice in Wonderland (2010).
"""

doc = nlp(text)
for i in range(len(doc)):
    print(doc._.trf_data.align[i].dataXd.T[0], doc[i])

This will lead to the following results:

[34 35] including
[35] a
...
[82 83] rising
[83] to

Your Environment

Operating System: Google Colab
Python Version Used: 3.7.13
spaCy Version Used: 3.2.3
Pipelines: en_core_web_trf (3.2.0)

Answered by adrianeboyd

adrianeboyd · 2022-05-13T07:11:41Z

This is a little unexpected, but not directly a bug. A transformer token is allowed to align to more than one spaCy token.

To support both slow and fast transformers tokenizers in the same way in spacy-transformers, we're using a generic alignment algorithm (from spacy-alignments) to align the transformer tokens with the spacy tokens. For example, the tokens that are being aligned look like this:

spacy: ['He', 'is', 'the', 'recipient', 'of', 'multiple', 'accolades', ',', 'including', 'a', 'Golden', 'Globe', 'Award']
transformer: ['<s>', 'He', 'Ġis', 'Ġthe', 'Ġrecipient', 'Ġof', 'Ġmultiple', 'Ġaccol', 'ades', ',', 'Ġincluding', 'Ġa', 'ĠGolden', 'ĠGlobe', 'ĠAward', '</s>']

Unfortunately, the RoBERTa tokenizer uses the character Ġ as a special symbol, and the generic alignment algorithm treats it as G-like. The g at the end of including and the Ġ at the beginning of Ġa look similar enough that the algorithm aligns the transformer token 'Ġa' to both spaCy tokens, which is why index 35 is listed for both including and a in the output above.
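Why Ġ can look like g to a generic aligner can be illustrated with the standard library alone. The sketch below assumes that the aligner normalizes tokens with Unicode NFKD and lowercases them before comparing characters; that is an assumption about the alignment library's internals, not something confirmed in this thread, and `normalize_like_aligner` is a hypothetical helper name.

```python
import unicodedata


def normalize_like_aligner(token: str) -> str:
    """Hypothetical helper: NFKD-normalize and lowercase a token.

    Assumption: the generic aligner applies a normalization of this kind
    before comparing characters.
    """
    return unicodedata.normalize("NFKD", token).lower()


for piece in ["Ġincluding", "Ġa"]:
    normalized = normalize_like_aligner(piece)
    # Show the Unicode names of the first two code points after normalization.
    names = [unicodedata.name(ch) for ch in normalized[:2]]
    print(repr(piece), "->", names)

# Under this normalization, Ġ (U+0120) decomposes into a plain "g"
# followed by a combining dot above (U+0307).
first_char = normalize_like_aligner("Ġa")[0]
```

Under that assumption, the first character of 'Ġa' compares equal to the final g of including, which is consistent with the shared index reported above.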

For fast tokenizers only, it would be possible to use the alignment that is returned by the tokenizer, but we haven't implemented this in spacy-transformers yet.

3 replies

k-sap May 13, 2022

I think this behavior causes trouble.

Different models use tokenizers with different special symbols. As a result, alignment for English has issues with g, the Polish BERT tokenizer has issues with w, and so on.

My use case adds an extra pipe with a transformers model and merges my annotation layer into the Doc object through an additional doc._. field. In such a use case, doc._.trf_data.align is confusing, so I need to align the spaCy and transformer tokens on my own (using spacy-alignments).

Is using special symbols in alignment necessary?

adrianeboyd May 13, 2022

I understand that it's frustrating, and we don't really like this situation, either.

spacy-transformers tries to handle whatever arbitrary wordpieces it might get back from an arbitrary slow tokenizer. These characters are just part of the tokenizer vocab and not something specified by spacy-transformers. As far as we've seen, it doesn't seem to be a huge issue in practice for models listening to a transformer component, since the transformer model will be fine-tuned based on this exact alignment.

spacy-transformers does have some hacks around the special symbols like <s> so that they're not aligned to s if at all possible (but you can't tell <s> as a special symbol from <s> in the input text). However, the additional characters coming from the tokenizer vocab like Ġ aren't encoded in a general-purpose way in the tokenizers.

The GPT2/RoBERTa tokenizer has many cases where the vocab strings shouldn't be used for alignment at all, because they're just an encoding from bytes to an arbitrary string representation rather than being related to the actual text (this is where byte 32, the space character, is mapped to 'Ġ').
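The byte-to-string encoding mentioned above can be sketched in a few lines. This is a reconstruction of the byte-level table as commonly described for GPT-2 style tokenizers, written from memory and not copied from spacy-transformers; the function and variable names are hypothetical.

```python
def gpt2_byte_to_unicode() -> dict:
    """Sketch of a GPT-2 style byte-to-character table (hypothetical helper)."""
    # Printable bytes keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Every remaining byte (control characters, space, ...) is shifted
    # to the next unused code point starting at 256.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = gpt2_byte_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def decode_piece(piece: str) -> str:
    """Map a vocab string back to the text it encodes."""
    raw = bytes(CHAR_TO_BYTE[c] for c in piece)
    return raw.decode("utf-8", errors="replace")


print(repr(BYTE_TO_CHAR[32]))           # the character standing in for a space
print(repr(decode_piece("Ġincluding")))  # the piece decoded back to text
```

Decoding the pieces back to text before aligning avoids comparing Ġ with real letters, at the cost of handling pieces that split a multi-byte character.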

If you can redo the alignment in postprocessing for your specific task / tokenizer, this sounds like an acceptable workaround, although I understand that it's a hassle and a bit slower overall.
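One way such a postprocessing step might look is to align by character offsets instead of by comparing token strings. The sketch below is not the spacy-transformers implementation; `align_by_offsets` is a hypothetical helper, and the offsets are hard-coded for illustration. In practice the token spans could come from `token.idx` and `len(token)`, and the wordpiece spans from a fast tokenizer called with `return_offsets_mapping=True`, which is assumed here to report offsets without the leading space.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def align_by_offsets(token_spans: List[Span], piece_spans: List[Span]) -> List[List[int]]:
    """For each token span, list the indices of the wordpieces overlapping it."""
    alignment = []
    for t_start, t_end in token_spans:
        overlapping = [
            j
            for j, (p_start, p_end) in enumerate(piece_spans)
            # Skip empty spans (special tokens) and require a real overlap.
            if p_start < p_end and p_start < t_end and t_start < p_end
        ]
        alignment.append(overlapping)
    return alignment


text = "accolades, including a Golden"
# Hypothetical spaCy tokens: accolades | , | including | a | Golden
token_spans = [(0, 9), (9, 10), (11, 20), (21, 22), (23, 29)]
# Hypothetical wordpieces: <s> | accol | ades | , | Ġincluding | Ġa | ĠGolden | </s>
piece_spans = [(0, 0), (0, 5), (5, 9), (9, 10), (11, 20), (21, 22), (23, 29), (0, 0)]

alignment = align_by_offsets(token_spans, piece_spans)
for (start, end), pieces in zip(token_spans, alignment):
    print(text[start:end], pieces)
```

Because only character positions are compared, the wordpiece for a is never attached to including, regardless of which special symbols the tokenizer vocab uses.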

Overall a better solution would be to use the alignments from the fast tokenizers, but many users are still using slow tokenizers (including some of our own trained pipelines), so we don't want to switch to supporting only fast tokenizers, and I'm a little hesitant to have different results from spacy-transformers for slow vs. fast tokenizers if the settings are otherwise the same. But it has been on my to-do list for a while to implement enough support for fast tokenizers to see if using the direct alignment makes a difference for the performance of pipelines like en_core_web_trf.

k-sap May 17, 2022

Yes, I have an acceptable realignment in postprocessing. It sometimes has issues too, but it's almost fine and much better than trf_data.align.
Probably the most difficult part for me was understanding that I needed to give up on trf_data.align; after that I could use spacy-alignments.

However, I hope it will be resolved somehow in the future.
