Error calling tokenizer.get_vocab() (Codegen2.5) #85

I wanted to check whether Codegen2.5 uses the same vocabulary as Codegen2 (a question for the authors: does it?) and noticed that calling .get_vocab() on the tokenizer raises an error.

How to reproduce:
```
from transformers import AutoTokenizer

# trust_remote_code=True is needed because the tokenizer class ships with the model repo
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True)
tokenizer.get_vocab()  # raises UnicodeDecodeError
```

The expected output is a dictionary mapping each token to its id.
The output I get instead is:


> UnicodeDecodeError                        Traceback (most recent call last)
> Cell In[18], line 1
> ----> 1 tokenizer.get_vocab()
> 
> File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:153, in CodeGen25Tokenizer.get_vocab(self)
>     151 def get_vocab(self):
>     152     """Returns vocab as a dict"""
> --> 153     vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
>     154     return vocab
> 
> File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:153, in <dictcomp>(.0)
>     151 def get_vocab(self):
>     152     """Returns vocab as a dict"""
> --> 153     vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
>     154     return vocab
> 
> File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:169, in CodeGen25Tokenizer._convert_id_to_token(self, index)
>     167 def _convert_id_to_token(self, index):
>     168     """Converts an index (integer) in a token (str) using the vocab."""
> --> 169     return self.encoder.decode_single_token_bytes(index).decode("utf-8")
> 
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte
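
My reading of the traceback, which may be incomplete: byte 0xa1 is a UTF-8 continuation byte, so a token consisting of only that byte is a fragment of a multi-byte character and cannot be decoded on its own. Below is a minimal sketch of a workaround that keys the vocabulary by raw bytes instead of strings. The helper name `build_byte_vocab` and the toy token table are hypothetical and only stand in for the real tokenizer; this is not the library's implementation.

```python
def build_byte_vocab(decode_token_bytes, vocab_size):
    """Map each token's raw bytes to its id, without assuming valid UTF-8.

    decode_token_bytes: callable taking a token id and returning bytes.
    """
    return {decode_token_bytes(i): i for i in range(vocab_size)}


# Toy stand-in for a byte-level vocabulary (hypothetical data).
# Index 2 is a lone continuation byte, like the one in the traceback.
toy_tokens = [b"def", b" return", b"\xa1", b"\xc3\xa9"]

# Decoding the lone byte as UTF-8 fails, mirroring the reported error.
try:
    toy_tokens[2].decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Keeping bytes as keys is lossless and never raises a decode error.
byte_vocab = build_byte_vocab(lambda i: toy_tokens[i], len(toy_tokens))
```

With the real tokenizer, I would expect something like `build_byte_vocab(tokenizer.encoder.decode_single_token_bytes, tokenizer.vocab_size)` to work, since both attributes appear in the traceback, but I have not verified this, and ids with no assigned token might need separate handling. I chose bytes keys over `decode("utf-8", errors="replace")` because replacement would collapse distinct undecodable tokens onto the same string key.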
