hotchpotch commented Jan 9, 2026

The current NoDuplicatesBatchSampler can become significantly slow on datasets with many duplicate values across the query / positive / negatives columns, especially at large batch sizes (e.g., bs=8192). This is particularly noticeable with triplet or hard-negatives data.

Summary of Changes

This PR adds NoDuplicatesFastBatchSampler, which speeds up duplicate checking by pre-computing xxhash 64-bit values for each sample using datasets.map(). It maintains the same batch construction policy as NoDuplicatesBatchSampler (avoiding duplicates within a batch) while significantly improving performance.

Since this approach increases memory usage, both options are provided (see the usage sketch after the list):

  • NO_DUPLICATES: Existing sampler (memory-efficient)
  • NO_DUPLICATES_FAST: New sampler (faster, but uses more memory)
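
For illustration, here is how the choice would look in training code. This is a hedged sketch: BatchSamplers.NO_DUPLICATES already exists in sentence-transformers, while NO_DUPLICATES_FAST is the value proposed by this PR (assumed to live on the same BatchSamplers enum); the output_dir and batch size are placeholders.

# Sketch: switching samplers via the existing training-arguments API.
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output/my-model",                       # placeholder
    per_device_train_batch_size=8192,                   # large batches benefit most from the fast sampler
    batch_sampler=BatchSamplers.NO_DUPLICATES,          # existing, memory-efficient
    # batch_sampler=BatchSamplers.NO_DUPLICATES_FAST,   # proposed in this PR: faster, uses more memory
)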

Benchmarks (MS MARCO)

Benchmarked using the following HuggingFace datasets:

  • sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1 / triplet-hard
  • sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1 / triplet-50

Conditions

  • Batch size: 128 and 8192
  • Hash parallelization: num_proc=8
  • Progress bar disabled (--no-progress-bar)

The table below summarizes execution time, memory usage, and batch counts. Memory is measured using USS (Unique Set Size). The fast sampler stores hash values as NumPy int64 arrays, which accounts for the increased memory usage. The original NO_DUPLICATES checks values on-the-fly and does not increase memory usage.

| dataset | sampler | bs | total_time | hash_time | hash_uss_current | hash_uss_peak | batches (ideal/delta) |
|---|---|---|---|---|---|---|---|
| triplet-50 | NO_DUPLICATES | 128 | 71.386s | n/a | n/a | n/a | 3929 (ideal=3929, delta=0) |
| triplet-50 | NO_DUPLICATES_FAST | 128 | 3.496s | 3.724s | 211.61MiB | 211.62MiB | 3929 (ideal=3929, delta=0) |
| triplet-50 | NO_DUPLICATES | 8192 | 283.215s | n/a | n/a | n/a | 58 (ideal=61, delta=3) |
| triplet-50 | NO_DUPLICATES_FAST | 8192 | 6.835s | 3.723s | 201.52MiB | 201.54MiB | 58 (ideal=61, delta=3) |
| triplet-hard | NO_DUPLICATES | 128 | 405.658s | n/a | n/a | n/a | 91114 (ideal=91114, delta=0) |
| triplet-hard | NO_DUPLICATES_FAST | 128 | 261.424s | 4.674s | 314.26MiB | 510.76MiB | 91114 (ideal=91114, delta=0) |
| triplet-hard | NO_DUPLICATES | 8192 | 171.853s | n/a | n/a | n/a | 1423 (ideal=1423, delta=0) |
| triplet-hard | NO_DUPLICATES_FAST | 8192 | 21.567s | 4.579s | 313.82MiB | 526.93MiB | 1423 (ideal=1423, delta=0) |

Environment: Ryzen 9 7950 (num_proc=8), Ubuntu 24
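
As a side note on the USS numbers above: USS can be sampled in Python with psutil, roughly as below. This only illustrates the metric; the benchmark script's exact measurement code may differ, and the helper name is mine.

# Illustrative USS measurement with psutil (the benchmark script's approach may differ).
import psutil

def current_uss_mib() -> float:
    # memory_full_info().uss is the memory unique to this process (Unique Set Size).
    return psutil.Process().memory_full_info().uss / (1024 ** 2)

before = current_uss_mib()
# ... construct the sampler / precompute hashes here ...
after = current_uss_mib()
print(f"USS delta: {after - before:.2f} MiB")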

Memory Considerations

This implementation stores one 64-bit hash (8 bytes) per text as NumPy int64 ndarrays, which increases memory usage compared to the current NoDuplicatesBatchSampler.

For reference, using sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1:

  • triplet-50 (503,302 rows): ~200MiB additional memory
  • triplet-hard (11,662,655 rows): ~314MiB additional memory

Since one 8-byte hash is stored per text, the overhead is roughly 8 bytes × (number of texts); for example, if each triplet-50 row holds a query, a positive, and 50 negatives (52 texts), that is 503,302 × 52 × 8 B ≈ 200 MiB, matching the figure above.

Therefore, users can choose between:

  • NO_DUPLICATES: Memory-efficient (existing)
  • NO_DUPLICATES_FAST: Faster (new)

How It Works

  1. On first iteration only: Use datasets.map() to retrieve all values from query / positive / negatives columns
  2. Hash strings using xxhash 64-bit
  3. Store hash arrays per row as NumPy arrays (assumes fixed-length rows, which is valid since query, positive, and negatives columns are consistent within a dataset)
  4. In __iter__, use the hash arrays for fast duplicate checking while constructing batches (a minimal sketch follows this list)
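
For concreteness, a minimal sketch of the idea (not the PR's actual code): the column names, the uint64 storage, and the greedy loop below are simplifications, and shuffling / drop_last handling is omitted.

# Minimal sketch of the approach; requires the datasets, numpy, and xxhash packages.
import numpy as np
import xxhash
from datasets import Dataset

def add_hashes(batch):
    # Hash every text in a row (query, positive, all negatives) with xxhash 64-bit.
    # Stored as unsigned 64-bit here for simplicity; the PR stores int64 (same 8 bytes per hash).
    rows = []
    for query, positive, negatives in zip(batch["query"], batch["positive"], batch["negatives"]):
        texts = [query, positive, *negatives]
        rows.append(np.array([xxhash.xxh64_intdigest(t) for t in texts], dtype=np.uint64))
    return {"hashes": rows}

def iter_no_duplicate_batches(row_hashes, batch_size):
    # Greedy construction mirroring NoDuplicatesBatchSampler's policy:
    # a row joins the current batch only if none of its hashes were already seen in it.
    remaining = list(range(len(row_hashes)))
    while remaining:
        batch, seen, deferred = [], set(), []
        it = iter(remaining)
        for idx in it:
            values = row_hashes[idx].tolist()
            if seen.isdisjoint(values):
                batch.append(idx)
                seen.update(values)
                if len(batch) == batch_size:
                    break
            else:
                deferred.append(idx)
        yield batch  # shuffling and drop_last handling omitted
        remaining = deferred + list(it)

# Tiny runnable example: rows 0 and 2 share the positive "p1", so they never share a batch.
toy = Dataset.from_dict({
    "query": ["q1", "q2", "q3"],
    "positive": ["p1", "p2", "p1"],
    "negatives": [["n1", "n2"], ["n3", "n4"], ["n5", "n6"]],
})
hashed = toy.map(add_hashes, batched=True)  # add num_proc=8 for large datasets
row_hashes = [np.asarray(h, dtype=np.uint64) for h in hashed["hashes"]]
print(list(iter_no_duplicate_batches(row_hashes, batch_size=2)))  # [[0, 1], [2]]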

Implementation Notes

  • While xxhash64 can theoretically produce hash collisions, the probability is extremely low. Even if a collision occurs, it would only cause a non-duplicate sample to be excluded from a batch, which has minimal impact on training, so this is considered negligible in practice (a rough bound is sketched after this list).
  • Hashing is parallelized using datasets.map(..., num_proc=N) for speed.
  • I haven't found other places in this project that use multiprocessing in a similar way. If a different implementation style is preferred, or if parallelization should be avoided, please let me know.
  • The number of parallel workers is capped at 8 even on machines with more cores. Feedback on whether this default is appropriate is welcome.
  • Suggestions for better optimization approaches or alternative implementations are also welcome.
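
To put a rough number on the collision point above (my own estimate, not from the PR): by the birthday approximation, hashing n distinct texts into 64 bits gives a collision probability of about n(n − 1) / 2^65, which stays tiny even at the scale of these datasets.

# Back-of-envelope birthday bound for xxhash64 collisions (rough estimate, not from the PR).
n = 10**8  # generous upper bound on the number of distinct texts hashed
p_any_collision = n * (n - 1) / 2**65
print(f"P(any collision) ~ {p_any_collision:.1e}")  # ~2.7e-04 across the whole dataset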

Benchmark Commands
# triplet-50, bs=128
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-50 --batch-size 128 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-50, bs=8192
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-50 --batch-size 8192 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-hard, bs=128
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-hard --batch-size 128 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-hard, bs=8192
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-hard --batch-size 8192 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

Feedback and suggestions are appreciated!

hotchpotch marked this pull request as ready for review January 9, 2026 07:44