No alignment for [UNK] tokens #12023
-
I have been using the alignment with great success, but there seems to be no alignment for [UNK] tokens:

import spacy
from spacy_transformers import Transformer
from spacy_transformers.pipeline_component import DEFAULT_CONFIG
# use an example transformer
DEFAULT_CONFIG["transformer"]["model"]["name"] = "bert-base-uncased"
nlp = spacy.blank("en")
trf = nlp.add_pipe("transformer", config=DEFAULT_CONFIG["transformer"])
trf.initialize(get_examples=lambda: [])
# forward pass
doc = nlp("My name is Wolfgang 🚀 and I live in Berlin.")
# inspect model output for '🚀'
doc._.trf_data.align[4]
# Ragged(data=array([], shape=(0, 0), dtype=int32), lengths=array([0], dtype=int32), data_shape=(-1, 1), starts_ends=None)
doc._.trf_data.wordpieces.strings[0][5] # +1 due to [CLS]
# '[UNK]'
doc._.trf_data.wordpieces
# WordpieceBatch(
# strings=[['[CLS]', 'my', 'name', 'is', 'wolfgang', '[UNK]', 'and',
# 'i', 'live', 'in', 'berlin', '.', '[SEP]']],
# input_ids=array([[ 101, 2026, 2171, 2003, 13865, 100, 1998, 1045, 2444,
# 1999, 4068, 1012, 102]], dtype=int32),
# attention_mask=array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
# dtype=float32),
# lengths=[13],
# token_type_ids=array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
# )

Naturally, one solution is simply to ignore the [UNK] tokens, but as seen here the [UNK] token embedding might contain useful information about the token 🚀. I am wondering if there is a principled way to match up the [UNK] tokens? In this case you can estimate it (tokens 3 and 5 match up with wordpiece embeddings 4 and 6, so token 4 must match up with embedding 5), but that breaks down even in simple cases such as repeating [UNK] tokens.
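For illustration only (this sketch is not from the original discussion), the neighbour-based estimate described above can be written against plain Python lists of wordpiece indices rather than the actual Ragged alignment data. The helper guess_unk_alignment is hypothetical; it fills in a missing alignment only when exactly one wordpiece sits in the gap between the neighbouring tokens' alignments, which is why adjacent unaligned tokens stay unresolved:

# Hypothetical sketch of the "estimate from neighbours" idea, not spaCy API.
def guess_unk_alignment(aligned):
    # aligned[i] is the list of wordpiece indices aligned to token i
    guessed = [list(a) for a in aligned]
    for i, a in enumerate(guessed):
        if a:
            continue
        prev_end = max(guessed[i - 1]) + 1 if i > 0 and guessed[i - 1] else None
        next_start = min(guessed[i + 1]) if i + 1 < len(guessed) and guessed[i + 1] else None
        # Fill the gap only when exactly one wordpiece fits between the neighbours.
        if prev_end is not None and next_start is not None and next_start - prev_end == 1:
            guessed[i] = [prev_end]
    return guessed

# A single unaligned token between aligned neighbours can be recovered ...
print(guess_unk_alignment([[1], [2], [3], [4], [], [6]]))
# [[1], [2], [3], [4], [5], [6]]
# ... but with two adjacent unaligned tokens the single-gap rule no longer
# applies and the gaps stay empty, which is the breakdown described above.
print(guess_unk_alignment([[1], [2], [3], [4], [], [], [7]]))
# [[1], [2], [3], [4], [], [], [7]]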
-
This is a weird case, thanks for bringing it to our attention. At first I had trouble understanding your issue since the number of tokens matched, but you're right that if you use token lengths the alignment is off. On the other hand, the information to get a proper alignment still seems to be present. Consider this code:

import spacy
from spacy_transformers import Transformer
from spacy_transformers.pipeline_component import DEFAULT_CONFIG
# use an example transformer
DEFAULT_CONFIG["transformer"]["model"]["name"] = "bert-base-uncased"
nlp = spacy.blank("en")
trf = nlp.add_pipe("transformer", config=DEFAULT_CONFIG["transformer"])
trf.initialize(get_examples=lambda: [])
# forward pass
doc = nlp("My name is Wolfgang 🚀🚀 and I live in Berlin.")
# inspect model output for '🚀'
trf_data = doc._.trf_data
strings = trf_data.wordpieces.strings[0]
offset = 1
for tok in doc:
    wplen = trf_data.align.lengths[tok.i]
    offset_based = strings[offset : offset + wplen]
    align_idx = 0
    if len(trf_data.align[tok.i].data):
        align_idx = trf_data.align[tok.i].data[0][0]
    align_based = strings[align_idx : align_idx + wplen]
    print(tok.i, tok, offset_based, align_based, sep="\t")
    offset += wplen

Output:
Here you see that using just offsets things get weird, but if you use the start values from trf_data.align the wordpieces line up correctly.
-
I think it's that spacy-alignments doesn't align the string [UNK] to the string 🚀. If you are feeling adventurous, you can try installing spacy-transformers from master (should currently be v1.2.0.dev0), which has been updated to use the alignments directly from fast tokenizers instead of the alignments from spacy-alignments. (spacy-alignments is still used for slow tokenizers.)
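For context (this sketch is not part of the original reply), the alignment a fast tokenizer exposes can be inspected directly with the Hugging Face tokenizers API: word_ids() reports which input word each wordpiece came from, so an [UNK] piece still maps back to the word that produced it:

from transformers import AutoTokenizer

# bert-base-uncased loads as a fast tokenizer by default
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("My name is Wolfgang 🚀 and I live in Berlin.")

# word_ids() gives the index of the originating word for every wordpiece
# (None for special tokens like [CLS] and [SEP]), so the '[UNK]' piece is
# still linked back to the word '🚀'.
for piece, word_id in zip(enc.tokens(), enc.word_ids()):
    print(piece, word_id, sep="\t")

Note that the word indices here are the tokenizer's own pre-tokenized words rather than spaCy tokens; the point is only that fast tokenizers keep this mapping even for [UNK] pieces.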
-
Thanks for the responses @adrianeboyd and @polm. Switching to the newest version of spacy-transformers does indeed resolve the problem.
This leads to the intended output:

import spacy
import spacy_wrap
nlp = spacy.blank("en")
# specify model from the hf hub
config = {"model": {"name": "dslim/bert-base-NER"}}
# add it to the pipe
nlp.add_pipe("token_classification_transformer", config=config)
# test it on two samples
doc = nlp("My name is Wolfgang 🚀 and I live in Berlin.")
# [('Wolfgang 🚀', 'PER'), ('Berlin', 'LOC')]
print([(ent.text, ent.label_) for ent in doc.ents])
doc = nlp("My name is Wolfgang 🚀🚀 🚀 and I live in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Wolfgang 🚀🚀 🚀', 'PER'), ('Berlin', 'LOC')]

Thanks for the help on this!