Update LogMergePolicy to skip to a target number of documents
#2627
base: main
Conversation
```rust
batch_docs = 0;
candidates.push(MergeCandidate(
    // drain to reuse the buffer
    batch.drain(..).map(|seg| seg.id()).collect(),
```
Nothing prevents this merge from having 1000 segments.
Should that even be prevented? The logic here only runs when there are enough unmerged documents to create a segment above `target_segment_size`, so it will keep adding segments to the merge until the operation hits `target_segment_size`. This is an optimisation to prevent documents in those small segments from being merged over and over again.
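To make the trade-off concrete, here is a minimal self-contained sketch of the batching being described. The function name `build_skip_batches` and the simplified types are assumptions for illustration, not the actual patch. It shows that with many single-doc segments, one merge candidate can indeed contain on the order of `target_segment_size` segments, which is the concern raised above:

```rust
// Hypothetical sketch of the skip-batch idea (not the real tantivy code):
// pop the smallest segments into a batch until the batch reaches the
// target doc count, then emit the batch as one merge candidate.
fn build_skip_batches(mut doc_counts: Vec<u64>, target: u64) -> Vec<Vec<u64>> {
    // Sort descending so pop() always yields the smallest segment.
    doc_counts.sort_unstable_by(|a, b| b.cmp(a));
    let mut unmerged: u64 = doc_counts.iter().sum();
    let mut batches: Vec<Vec<u64>> = Vec::new();
    let mut batch: Vec<u64> = Vec::new();
    let mut batch_docs: u64 = 0;
    while unmerged > target {
        let seg = doc_counts.pop().expect("unmerged > 0 implies segments remain");
        batch.push(seg);
        batch_docs += seg;
        if batch_docs >= target {
            // Batch reached the target size: emit it as one merge candidate.
            unmerged -= batch_docs;
            batch_docs = 0;
            batches.push(std::mem::take(&mut batch));
        }
    }
    batches
}

fn main() {
    // 1500 single-doc segments with a target of 1000 docs: the single
    // candidate produced contains 1000 segments.
    let batches = build_skip_batches(vec![1; 1500], 1000);
    assert_eq!(batches.len(), 1);
    assert_eq!(batches[0].len(), 1000);
}
```

Note that nothing in the loop bounds the number of segments per batch, only the number of documents, which is exactly the point under discussion.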
```rust
// If there aren't enough documents to create another segment of the target size
// then break
if unmerged_docs <= self.target_segment_size {
    break;
```
OK, we break here, which means we keep going through the rest of the function. But the levels are now empty, because we popped them... This is terrible logic.
aaaaaaaaaaaarg the original merge policy has the same flaw!
We don't pop all the segments; this loop only pops segments while there are enough unmerged docs to create a segment above `target_segment_size`. That is the exit condition for deciding whether we need to go around again after creating a segment at the target size. Breaking here means that the remaining segments, which are all smaller than `target_segment_size`, can go through the log merge logic rather than this skip logic.
src/indexer/log_merge_policy.rs
Outdated
```rust
    .collect()
    .into_iter()
    .for_each(|(_, group)| {
        let mut hit_delete_threshold = false;
```
A `for` loop would be nicer than `for_each` here.
Sure, will update
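For illustration, here is a hedged before/after sketch of the suggested change, using toy data and assumed function names rather than the actual tantivy code. Both versions compute the same thing; the `for` loop avoids mutating captured state inside nested closures:

```rust
// Iterator style, as in the snippet above: mutating `total` from inside
// nested closures works, but obscures the control flow.
fn sum_with_for_each(groups: &[(usize, Vec<u64>)]) -> u64 {
    let mut total = 0;
    groups.iter().for_each(|(_, group)| {
        group.iter().for_each(|docs| total += docs);
    });
    total
}

// The reviewer's suggestion: a plain `for` loop with identical behavior.
fn sum_with_for(groups: &[(usize, Vec<u64>)]) -> u64 {
    let mut total = 0;
    for (_, group) in groups {
        for docs in group {
            total += docs;
        }
    }
    total
}

fn main() {
    let groups = vec![(0, vec![100, 200]), (1, vec![300])];
    assert_eq!(sum_with_for_each(&groups), sum_with_for(&groups));
}
```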
src/indexer/log_merge_policy.rs
Outdated
```rust
    .into_iter()
    .for_each(|(_, group)| {
        let mut hit_delete_threshold = false;
        group.for_each(|(_, seg)| {
```
Same. Iterators are great; abusing them just makes the code less readable.
As above, will update.
src/indexer/log_merge_policy.rs
Outdated
```rust
// Filter for segments that have less than the target number of docs, count total unmerged
// docs, and sort in descending order
let mut unmerged_docs = 0;
let mut levels = segments
```
This vec does not contain "levels" anymore. Why is it named `levels`?
Yeah, it shouldn't be called that anymore; I'll rename it.
Have added a commit addressing the feedback @fulmicoton. It looks like there might need to be some clarification on how the logic works though, so I'll run through an example input starting from here: `tantivy/src/indexer/log_merge_policy.rs`, line 105 in d516fc5.

First, the condition at `tantivy/src/indexer/log_merge_policy.rs`, line 106 in d516fc5.

Next, the while loop will pop the smallest segments off the end to make the merge candidate. Since it needs 1000 docs, it will pop the last 10 segments with 100 docs each, so on the 10th iteration of the loop we'll be in this state when we hit the check at `tantivy/src/indexer/log_merge_policy.rs`, line 116 in d516fc5.

Next, the skip batch is created and a new merge candidate with all the segments in the batch will be added to `candidates`. `unmerged_docs` will be reduced by the amount in the batch, which will make it 500, causing the check at `tantivy/src/indexer/log_merge_policy.rs`, line 125 in d516fc5 to break out of the loop.
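The walkthrough above can be sketched as a small self-contained program. The names (`skip_candidates`, `target_segment_size`) follow the PR's identifiers, but the simplified types and structure are assumptions for illustration, not the actual tantivy implementation:

```rust
// Hedged sketch of the skip logic being walked through: segments are
// represented only by their doc counts, sorted descending so pop()
// yields the smallest.
fn skip_candidates(small_segments: &mut Vec<u64>, target_segment_size: u64) -> Vec<Vec<u64>> {
    let mut unmerged_docs: u64 = small_segments.iter().sum();
    let mut candidates: Vec<Vec<u64>> = Vec::new();
    let mut batch: Vec<u64> = Vec::new();
    let mut batch_docs: u64 = 0;
    loop {
        // If there aren't enough documents left to create another segment of
        // the target size, break; the remainder falls through to log merging.
        if unmerged_docs <= target_segment_size {
            break;
        }
        let seg = small_segments.pop().expect("doc total > 0 implies non-empty");
        batch.push(seg);
        batch_docs += seg;
        if batch_docs >= target_segment_size {
            unmerged_docs -= batch_docs;
            batch_docs = 0;
            candidates.push(std::mem::take(&mut batch)); // drain-style buffer reuse
        }
    }
    candidates
}

fn main() {
    // 15 segments of 100 docs each: 1500 unmerged docs against a target of 1000.
    let mut segments = vec![100u64; 15];
    let candidates = skip_candidates(&mut segments, 1000);
    assert_eq!(candidates.len(), 1);     // one skip batch...
    assert_eq!(candidates[0].len(), 10); // ...of the 10 smallest segments
    assert_eq!(segments.len(), 5);       // 500 docs remain for normal log merging
}
```

With these inputs the loop pops ten 100-doc segments, emits one candidate, reduces `unmerged_docs` to 500, and then breaks, leaving the remaining five segments for the ordinary log merge logic.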
From discord conversation:
https://discord.com/channels/908281611840282624/915785344396439552/1341705839668625468
This updated logic makes `LogMergePolicy` aim for a specific target number of documents, and opportunistically skip merge operations to reach that target document count.

Pros:

- … (`max_docs_before_merge`)
- … `(target_segment_size * 2) - 2` …

Cons:

- If an index has fewer than `target_segment_size` total docs then it may get merged to a single segment and thus not parallelize well when searching