@yashwantbezawada
Contributor

@yashwantbezawada yashwantbezawada commented Nov 6, 2025

What does this PR do?

Fixes #42024

I found this while testing tokenizers: modifying model_input_names on one tokenizer instance also changed it on every other instance of the same tokenizer class.

The problem is that model_input_names is defined as a class-level list. In the __init__ method (tokenization_utils_base.py, line 1417), when no custom model_input_names is passed, the instance attribute was bound to the class attribute itself instead of a copy.

So all instances were sharing the same list object. This is a classic Python gotcha with mutable class attributes.
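To make the gotcha concrete, here is a minimal sketch that does not use transformers at all. `ToyTokenizer` is a hypothetical stand-in class, not the library's code; it only mimics the shape of the `kwargs.pop(...)` line quoted in this PR.

```python
class ToyTokenizer:
    # Hypothetical stand-in: a class-level mutable list, like model_input_names.
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(self, **kwargs):
        # Same shape as the pre-fix line: the fallback value is the
        # class-level list itself, so no new list is created.
        self.model_input_names = kwargs.pop("model_input_names", self.model_input_names)


first = ToyTokenizer()
second = ToyTokenizer()

# Mutating through one instance mutates the single shared list.
first.model_input_names.append("token_type_ids")

shared = first.model_input_names is ToyTokenizer.model_input_names
leaked = "token_type_ids" in second.model_input_names
print(shared, leaked)  # True True
```

Note that the leak only shows up with in-place mutation such as `append`; rebinding the attribute (`first.model_input_names = [...]`) would create a separate instance attribute and leave the others alone.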

The fix is simple - wrap it in list() to create a new list for each instance:

Before:

```python
self.model_input_names = kwargs.pop("model_input_names", self.model_input_names)
```

After:

```python
self.model_input_names = list(kwargs.pop("model_input_names", self.model_input_names))
```

This ensures each tokenizer instance gets its own independent copy of the list. Now modifications to one instance won't affect others.

The reproduction from the issue shows the problem clearly - with the fix, the second tokenizer instance will have the original list values instead of inheriting the modifications from the first instance.
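As a rough illustration of the expected behaviour after the change, the sketch below applies the same `list(...)` wrapper to a hypothetical stand-in class (`FixedToyTokenizer`). It is not the real `PreTrainedTokenizerBase` and is not the reproduction from the issue, just the same pattern in isolation.

```python
class FixedToyTokenizer:
    # Hypothetical stand-in: a class-level mutable list, like model_input_names.
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(self, **kwargs):
        # list(...) builds a new (shallow) list whether the value came
        # from kwargs or from the class-level default.
        self.model_input_names = list(kwargs.pop("model_input_names", self.model_input_names))


first = FixedToyTokenizer()
first.model_input_names.append("token_type_ids")
second = FixedToyTokenizer()

# A caller-supplied list is copied too, so the caller's object is untouched.
custom = ["input_ids"]
third = FixedToyTokenizer(model_input_names=custom)
third.model_input_names.append("attention_mask")

print(second.model_input_names)             # ['input_ids', 'attention_mask']
print(FixedToyTokenizer.model_input_names)  # ['input_ids', 'attention_mask']
print(custom)                               # ['input_ids']
```

The copy is shallow, which should be enough here because the elements are strings and strings are immutable.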

Fixes huggingface#42024

The model_input_names attribute was defined as a class-level list, and
when initializing tokenizer instances, they were all pointing to the same
list object. This meant modifying model_input_names on one instance would
affect all other instances.

The issue was in tokenization_utils_base.py line 1417:
```python
self.model_input_names = kwargs.pop("model_input_names", self.model_input_names)
```

When no model_input_names is passed in kwargs, it would use the class
attribute directly (self.model_input_names), creating a reference to the
shared list instead of creating a new list for the instance.

Fixed by wrapping it in list() to ensure each instance gets its own copy:
```python
self.model_input_names = list(kwargs.pop("model_input_names", self.model_input_names))
```

Copying the list is a standard way to keep a mutable class-level default
from being shared across instances in Python.

@yashwantbezawada yashwantbezawada force-pushed the fix/model-input-names-singleton-42024 branch from 4d018bf to 6ba1ffb Compare November 6, 2025 02:17
@Rocketknight1
Member

This solution LGTM but cc @ArthurZucker @itazap!

Development

Successfully merging this pull request may close these issues.

PreTrainedTokenizerBase.model_input_names is a singleton