hotchpotch commented Jan 9, 2026

The current NoDuplicatesBatchSampler can become significantly slow on datasets with many duplicate values across the query / positive / negatives columns, especially at large batch sizes (e.g., bs=8192). This is particularly noticeable with triplet or hard-negatives data.

Summary of Changes

This PR adds NoDuplicatesFastBatchSampler, which speeds up duplicate checking by pre-computing xxhash 64-bit values for each sample using datasets.map(). It maintains the same batch construction policy as NoDuplicatesBatchSampler (avoiding duplicates within a batch) while significantly improving performance.

Since this approach increases memory usage, both options are provided (see the usage sketch after the list):

  • NO_DUPLICATES: Existing sampler (memory-efficient)
  • NO_DUPLICATES_FAST: New sampler (faster, but uses more memory)
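
For illustration, here is how the choice would look in training code. This is a hedged sketch: BatchSamplers.NO_DUPLICATES already exists in sentence-transformers, while NO_DUPLICATES_FAST is the value proposed by this PR (assumed to live on the same BatchSamplers enum); the output_dir and batch size are placeholders.

# Sketch: switching samplers via the existing training-arguments API.
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output/my-model",                       # placeholder
    per_device_train_batch_size=8192,                   # large batches benefit most from the fast sampler
    batch_sampler=BatchSamplers.NO_DUPLICATES,          # existing, memory-efficient
    # batch_sampler=BatchSamplers.NO_DUPLICATES_FAST,   # proposed in this PR: faster, uses more memory
)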

Benchmarks (MS MARCO)

Benchmarked using the following HuggingFace datasets:

  • sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1 / triplet-hard
  • sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1 / triplet-50

Conditions

  • Batch size: 128 and 8192
  • Hash parallelization: num_proc=8
  • Progress bar disabled (--no-progress-bar)

The table below summarizes execution time, memory usage, and batch counts. Memory is measured using USS (Unique Set Size). The fast sampler stores hash values as NumPy int64 arrays, which accounts for the increased memory usage. The original NO_DUPLICATES checks values on-the-fly and does not increase memory usage.

| dataset | sampler | bs | total_time | hash_time | hash_uss_current | hash_uss_peak | batches (ideal/delta) |
|---|---|---|---|---|---|---|---|
| triplet-50 | NO_DUPLICATES | 128 | 71.386s | n/a | n/a | n/a | 3929 (ideal=3929, delta=0) |
| triplet-50 | NO_DUPLICATES_FAST | 128 | 3.496s | 3.724s | 211.61MiB | 211.62MiB | 3929 (ideal=3929, delta=0) |
| triplet-50 | NO_DUPLICATES | 8192 | 283.215s | n/a | n/a | n/a | 58 (ideal=61, delta=3) |
| triplet-50 | NO_DUPLICATES_FAST | 8192 | 6.835s | 3.723s | 201.52MiB | 201.54MiB | 58 (ideal=61, delta=3) |
| triplet-hard | NO_DUPLICATES | 128 | 405.658s | n/a | n/a | n/a | 91114 (ideal=91114, delta=0) |
| triplet-hard | NO_DUPLICATES_FAST | 128 | 261.424s | 4.674s | 314.26MiB | 510.76MiB | 91114 (ideal=91114, delta=0) |
| triplet-hard | NO_DUPLICATES | 8192 | 171.853s | n/a | n/a | n/a | 1423 (ideal=1423, delta=0) |
| triplet-hard | NO_DUPLICATES_FAST | 8192 | 21.567s | 4.579s | 313.82MiB | 526.93MiB | 1423 (ideal=1423, delta=0) |

Environment: Ryzen 9 7950 (num_proc=8), Ubuntu 24
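
As a side note on the USS numbers above: USS can be sampled in Python with psutil, roughly as below. This only illustrates the metric; the benchmark script's exact measurement code may differ, and the helper name is mine.

# Illustrative USS measurement with psutil (the benchmark script's approach may differ).
import psutil

def current_uss_mib() -> float:
    # memory_full_info().uss is the memory unique to this process (Unique Set Size).
    return psutil.Process().memory_full_info().uss / (1024 ** 2)

before = current_uss_mib()
# ... construct the sampler / precompute hashes here ...
after = current_uss_mib()
print(f"USS delta: {after - before:.2f} MiB")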

Memory Considerations

This implementation stores one 64-bit hash (8 bytes) per text as NumPy int64 ndarrays, which increases memory usage compared to the current NoDuplicatesBatchSampler.

For reference, using sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1:

  • triplet-50 (503,302 rows): ~200MiB additional memory
  • triplet-hard (11,662,655 rows): ~314MiB additional memory

Since one 8-byte hash is stored per text, the overhead is roughly 8 bytes × (number of texts); for example, if each triplet-50 row holds a query, a positive, and 50 negatives (52 texts), that is 503,302 × 52 × 8 B ≈ 200 MiB, matching the figure above.

Therefore, users can choose between:

  • NO_DUPLICATES: Memory-efficient (existing)
  • NO_DUPLICATES_FAST: Faster (new)

How It Works

  1. On first iteration only: Use datasets.map() to retrieve all values from query / positive / negatives columns
  2. Hash strings using xxhash 64-bit
  3. Store hash arrays per row as NumPy arrays (assumes fixed-length rows, which is valid since query, positive, and negatives columns are consistent within a dataset)
  4. In __iter__, use the hash arrays for fast duplicate checking while constructing batches (a minimal sketch follows this list)
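
For concreteness, a minimal sketch of the idea (not the PR's actual code): the column names, the uint64 storage, and the greedy loop below are simplifications, and shuffling / drop_last handling is omitted.

# Minimal sketch of the approach; requires the datasets, numpy, and xxhash packages.
import numpy as np
import xxhash
from datasets import Dataset

def add_hashes(batch):
    # Hash every text in a row (query, positive, all negatives) with xxhash 64-bit.
    # Stored as unsigned 64-bit here for simplicity; the PR stores int64 (same 8 bytes per hash).
    rows = []
    for query, positive, negatives in zip(batch["query"], batch["positive"], batch["negatives"]):
        texts = [query, positive, *negatives]
        rows.append(np.array([xxhash.xxh64_intdigest(t) for t in texts], dtype=np.uint64))
    return {"hashes": rows}

def iter_no_duplicate_batches(row_hashes, batch_size):
    # Greedy construction mirroring NoDuplicatesBatchSampler's policy:
    # a row joins the current batch only if none of its hashes were already seen in it.
    remaining = list(range(len(row_hashes)))
    while remaining:
        batch, seen, deferred = [], set(), []
        it = iter(remaining)
        for idx in it:
            values = row_hashes[idx].tolist()
            if seen.isdisjoint(values):
                batch.append(idx)
                seen.update(values)
                if len(batch) == batch_size:
                    break
            else:
                deferred.append(idx)
        yield batch  # shuffling and drop_last handling omitted
        remaining = deferred + list(it)

# Tiny runnable example: rows 0 and 2 share the positive "p1", so they never share a batch.
toy = Dataset.from_dict({
    "query": ["q1", "q2", "q3"],
    "positive": ["p1", "p2", "p1"],
    "negatives": [["n1", "n2"], ["n3", "n4"], ["n5", "n6"]],
})
hashed = toy.map(add_hashes, batched=True)  # add num_proc=8 for large datasets
row_hashes = [np.asarray(h, dtype=np.uint64) for h in hashed["hashes"]]
print(list(iter_no_duplicate_batches(row_hashes, batch_size=2)))  # [[0, 1], [2]]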

Implementation Notes

  • While xxhash64 can theoretically produce hash collisions, the probability is extremely low. Even if a collision occurs, it would only cause a non-duplicate sample to be excluded from a batch, which has minimal impact on training, so this is considered negligible in practice (a rough bound is sketched after this list).
  • Hashing is parallelized using datasets.map(..., num_proc=N) for speed.
  • I haven't found other places in this project that use multiprocessing in a similar way. If a different implementation style is preferred, or if parallelization should be avoided, please let me know.
  • The number of parallel workers is capped at 8 even on machines with more cores. Feedback on whether this default is appropriate is welcome.
  • Suggestions for better optimization approaches or alternative implementations are also welcome.
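
To put a rough number on the collision point above (my own estimate, not from the PR): by the birthday approximation, hashing n distinct texts into 64 bits gives a collision probability of about n(n − 1) / 2^65, which stays tiny even at the scale of these datasets.

# Back-of-envelope birthday bound for xxhash64 collisions (rough estimate, not from the PR).
n = 10**8  # generous upper bound on the number of distinct texts hashed
p_any_collision = n * (n - 1) / 2**65
print(f"P(any collision) ~ {p_any_collision:.1e}")  # ~2.7e-04 across the whole dataset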

Benchmark Commands
# triplet-50, bs=128
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-50 --batch-size 128 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-50, bs=8192
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-50 --batch-size 8192 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-hard, bs=128
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-hard --batch-size 128 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

# triplet-hard, bs=8192
python examples/sentence_transformer/evaluation/evaluation_no_dup_batch_sampler_speed.py \
  --dataset-subset triplet-hard --batch-size 8192 --target default --target fast --no-progress-bar --measure-hash-uss --num-proc 8

Feedback and suggestions are appreciated!

hotchpotch marked this pull request as ready for review January 9, 2026 07:44