Repeated `bos_token_id` is added when using tokenizer

The code I use is:
```python
from transformers import AutoTokenizer

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"}
]

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
# first usage
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
decode_text = tokenizer.decode(inputs[0])
print(f"Decoded text: {decode_text}")
# second usage
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt")
decode_text = tokenizer.decode(inputs['input_ids'][0])
print(f"Decoded text: {decode_text}")
```
And the results are:
```txt
Decoded text: <s>[INST] Hello![/INST] Hi there!</s>[INST] How are you?[/INST]
Decoded text: <s><s>[INST] Hello![/INST] Hi there!</s>[INST] How are you?[/INST]
```
In the second usage, an extra `<s>` is prepended, so the input starts with two `bos_token_id` tokens instead of one.

I'm not sure whether this is expected behaviour, but it really confused me.
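A possible explanation, as far as I can tell: the string rendered by `apply_chat_template(..., tokenize=False)` already contains `<s>`, and calling `tokenizer(prompt)` adds special tokens again by default. Under that assumption, passing `add_special_tokens=False` in the second usage should avoid the duplicate. The sketch below shows that call as a comment (it needs the model's tokenizer to be downloaded), plus a small hypothetical helper, `drop_repeated_bos`, which is not part of `transformers` and only illustrates the clean-up on plain token-id lists.

```python
# Likely workaround for the second usage (assumption: the chat template
# already emits the BOS token in the rendered string):
#
#   prompt = tokenizer.apply_chat_template(
#       messages, add_generation_prompt=True, tokenize=False)
#   inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)


def drop_repeated_bos(input_ids, bos_token_id):
    """Collapse a run of leading BOS ids down to a single one.

    Hypothetical helper for illustration only; not a transformers API.
    """
    ids = list(input_ids)
    start = 0
    # Advance while the current id and the next one are both BOS.
    while (
        start + 1 < len(ids)
        and ids[start] == bos_token_id
        and ids[start + 1] == bos_token_id
    ):
        start += 1
    return ids[start:]


# Toy ids; 1 stands in for bos_token_id here.
print(drop_repeated_bos([1, 1, 733, 16289], bos_token_id=1))
```

With a real tokenizer, `tokenizer.bos_token_id` would be passed as the second argument, and comparing `inputs["input_ids"][0][:2]` against it is a quick way to check whether the duplicate is present.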

Repeated `bos_token_id` is added when using tokenizer #42221