Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve API to use a custom batch_sampler in a SentenceTransformerTrainer #3152

Open
alonme opened this issue Jan 3, 2025 · 3 comments · May be fixed by #3162
Open

Improve API to use a custom batch_sampler in a SentenceTransformerTrainer #3152

alonme opened this issue Jan 3, 2025 · 3 comments · May be fixed by #3162
Assignees
Labels
enhancement New feature or request

Comments

@alonme
Copy link

alonme commented Jan 3, 2025

Currently only specific batch_sampler values are possible
There seems to be a need to enable users to create custom batch samplers

Examples from issues:

  1. Batch sampler #3123
  2. Using DataLoader in v3? #2707
  3. MNRL with Multiple hard negatives per query and NoDuplicatesBatchSampler #2954

I believe that the current solution (that i found the hard way) that is suggested in the issues is to subclass SentenceTransformerTrainer and override get_batch_sampler, is not documented well enough and isn't straightforward

Is there any reason not to just accept anything that inherits from DefaultBatchSampler (for example) as a parameter?

@tomaarsen
Copy link
Collaborator

Hello!

The current implementation is just an extension of transformers TrainingArguments, where e.g. optimizer options are also provided by passing a string to optim with one of these strings: https://github.com/huggingface/transformers/blob/12ba96aa3cb3e4ed2a3ffb77b59f53f8ce9ac1fa/src/transformers/training_args.py#L143-L185

Having said that, the transformers Trainer class does accept an optimizers instance where you can pass initialized instances.

I think I would be okay with updating the API such that a subclass of DefaultBatchSampler can be used. We'd first have to extend DefaultBatchSampler so it also accepts valid_label_columns and generator (but keeps them unused), and then people who subclass it can use those still.
I think I can have a look at this soon.

  • Tom Aarsen

@alonme
Copy link
Author

alonme commented Jan 7, 2025

@tomaarsen Hey! Thanks for the reply,

I Assume i can also try and tackle that, would that be ok?

@tomaarsen
Copy link
Collaborator

Definitely!
If possible, perhaps the updated DefaultBatchSampler can be the superclass of the other batch samplers as well - that would make it more obvious that it should be subclassed.
If you're interested, we can also apply this approach for the multi-dataset batch samplers.

  • Tom Aarsen

@tomaarsen tomaarsen added the enhancement New feature or request label Jan 7, 2025
@alonme alonme linked a pull request Jan 11, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants