Error calling tokenizer.get_vocab() (Codegen2.5) #85

Open
@ShushanArakelyan

Description

I wanted to check whether Codegen2.5 uses the same vocabulary as Codegen2 (a question for the authors: does it?) and noticed that calling get_vocab() on the tokenizer raises an error.

How to reproduce:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True)
tokenizer.get_vocab()

The expected output is a dictionary containing the vocabulary.
Instead, I get:

UnicodeDecodeError                        Traceback (most recent call last)
Cell In[18], line 1
----> 1 tokenizer.get_vocab()

File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:153, in CodeGen25Tokenizer.get_vocab(self)
    151 def get_vocab(self):
    152     """Returns vocab as a dict"""
--> 153     vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
    154     return vocab

File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:153, in <dictcomp>(.0)
    151 def get_vocab(self):
    152     """Returns vocab as a dict"""
--> 153     vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
    154     return vocab

File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:169, in CodeGen25Tokenizer._convert_id_to_token(self, index)
    167 def _convert_id_to_token(self, index):
    168     """Converts an index (integer) in a token (str) using the vocab."""
--> 169     return self.encoder.decode_single_token_bytes(index).decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte
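
My reading of the traceback, offered as a guess rather than a confirmed diagnosis: the tokenizer seems to store each token as raw bytes, and a single byte-level token such as 0xa1 is not valid UTF-8 on its own, so the strict .decode("utf-8") at line 169 fails. The sketch below reproduces that failure without downloading the model and shows one possible workaround, decoding with latin-1 so every byte maps to exactly one character. The names sample_tokens, strict_decode, lossless_decode, and build_vocab are hypothetical and are not part of tokenization_codegen25.py.

```python
# Hypothetical stand-in for a byte-level vocabulary; these are NOT the real
# Codegen2.5 tokens, only examples chosen to trigger the same error.
sample_tokens = [b"hello", b"\xa1", b"\xc2\xa1"]


def strict_decode(raw: bytes) -> str:
    """Mirror the failing line: strict UTF-8 decoding of a single token."""
    return raw.decode("utf-8")


def lossless_decode(raw: bytes) -> str:
    """Map every byte to one code point so no byte sequence can fail."""
    return raw.decode("latin-1")


def build_vocab(tokens, decode):
    """Build a token-string -> id mapping, as get_vocab() is meant to."""
    return {decode(raw): i for i, raw in enumerate(tokens)}


try:
    build_vocab(sample_tokens, strict_decode)
except UnicodeDecodeError as exc:
    # Same error class and message as in the traceback above.
    print("strict decoding failed:", exc)

vocab = build_vocab(sample_tokens, lossless_decode)
print(len(vocab), "tokens decoded without error")
```

With latin-1 the mapping is reversible (key.encode("latin-1") returns the original bytes), so distinct tokens never collide. The keys for non-ASCII tokens will not look like the text they represent, though, which may or may not be acceptable for comparing vocabularies.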
