Conversation

@kryesh (Contributor) commented Apr 18, 2025

From a Discord conversation:
https://discord.com/channels/908281611840282624/915785344396439552/1341705839668625468

This updated logic makes LogMergePolicy aim for a specific target number of documents per segment, and opportunistically skip intermediate merge operations on the way to that target.
Pros:

  • Reduced IO/CPU usage from skipping intermediate merge operations
  • No longer susceptible to creating huge merge operations that combine many large segments into a single segment many times larger than the target size (previously max_docs_before_merge)
    • Oversized merge operations can lead to other issues such as Attempt to multiply with overflow #2577
    • Limits applied to the size of individual input segments aren't effective if many segments are being merged at once
    • The theoretical maximum size of a segment with this updated logic is (target_segment_size * 2) - 2 (see the sketch below)

Cons:

  • If an index has a little over target_segment_size total docs, it may get merged into a single segment and therefore not parallelize well when searching
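
A worked sketch of where the (target_segment_size * 2) - 2 bound comes from, assuming (as the filtering in the diff suggests) that only segments with fewer than target_segment_size docs are merge candidates, and that a batch stops growing as soon as it reaches the target. The function name is hypothetical, for illustration only:

fn max_batch_docs(target: u32) -> u32 {
    // Worst case: the batch sits one doc below the target...
    let batch_docs = target - 1;
    // ...and the largest admissible segment (target - 1 docs) is then
    // added before the size check fires.
    batch_docs + (target - 1)
}

fn main() {
    assert_eq!(max_batch_docs(1000), 1998); // (1000 * 2) - 2
}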

batch_docs = 0;
candidates.push(MergeCandidate(
    // drain to reuse the buffer
    batch.drain(..).map(|seg| seg.id()).collect(),
Collaborator:

Nothing prevents this merge from having 1000 segments.

Contributor Author:

Should that even be prevented? The logic here only runs when there are enough unmerged documents to create a segment above target_segment_size, so it will keep adding segments to the merge until the operation hits target_segment_size. This is an optimisation to prevent documents in those small segments from being merged over and over again. (A sketch of the loop follows.)
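
A minimal self-contained sketch of that loop (hypothetical shape and names, not the exact PR code; plain u32 doc counts stand in for segment metadata). Segments are sorted largest-first, so pop() takes the smallest; whatever survives the loop falls through to the normal log merge logic:

fn skip_merges(mut sorted_desc: Vec<u32>, target: u32) -> (Vec<Vec<u32>>, Vec<u32>) {
    let mut unmerged_docs: u32 = sorted_desc.iter().sum();
    let mut candidates: Vec<Vec<u32>> = Vec::new();
    let mut batch: Vec<u32> = Vec::new();
    let mut batch_docs: u32 = 0;
    if unmerged_docs >= target {
        // Pop the smallest remaining segments into a batch until the batch
        // alone reaches the target, then emit one merge candidate.
        while let Some(seg_docs) = sorted_desc.pop() {
            batch.push(seg_docs);
            batch_docs += seg_docs;
            if batch_docs >= target {
                unmerged_docs -= batch_docs;
                batch_docs = 0;
                candidates.push(batch.drain(..).collect());
                // Not enough docs left for another target-sized segment:
                // leave the rest to the log merge logic.
                if unmerged_docs <= target {
                    break;
                }
            }
        }
    }
    (candidates, sorted_desc)
}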

// If there aren't enough documents to create another segment of the target size
// then break
if unmerged_docs <= self.target_segment_size {
    break;
Collaborator:

Ok, we break here, which means we keep going through the rest of the function.

But the levels are now empty, because we popped them... This is terrible logic.

Collaborator:

aaaaaaaaaaaarg the original merge policy has the same flaw!

Contributor Author:

We don't pop all the segments; this loop only pops segments while there are enough unmerged docs to create a segment above target_segment_size. This is the exit condition for whether we need to go around again after creating a segment at the target size. Breaking here means that the remaining segments smaller than target_segment_size can go through the log merge logic rather than this skip logic.

.collect()
.into_iter()
.for_each(|(_, group)| {
    let mut hit_delete_threshold = false;
Collaborator:

A for loop would be nicer than for_each here.
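
A sketch of the suggested rewrite (hypothetical types and names, not the PR code): plain for loops instead of chained for_each calls, which also makes mutating a local flag like hit_delete_threshold more direct:

fn walk_groups(groups: Vec<(u8, Vec<u32>)>) {
    for (_, group) in groups {
        let mut hit_delete_threshold = false;
        for num_deleted in group {
            if num_deleted > 0 {
                hit_delete_threshold = true;
            }
        }
        let _ = hit_delete_threshold; // use the flag here
    }
}

A plain for loop also allows break, continue, and ? in the body, none of which work inside a closure passed to for_each.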

Contributor Author:

Sure, will update

.into_iter()
.for_each(|(_, group)| {
    let mut hit_delete_threshold = false;
    group.for_each(|(_, seg)| {
Collaborator:

Same. Iterators are great. Abusing them just makes code less readable.

Contributor Author:

As above, will update.

// Filter for segments that have less than the target number of docs, count total unmerged
// docs, and sort in descending order
let mut unmerged_docs = 0;
let mut levels = segments
Collaborator:

This vec does not contain "levels" anymore. Why is it named levels?

Contributor Author:

Yeah, it shouldn't be called that anymore; I'll rename it.

@kryesh (Contributor Author) commented Jan 2, 2026

I've added a commit addressing the feedback @fulmicoton. It looks like some clarification on how the logic works might help, though, so I'll run through an example input starting from here:

let mut candidates = Vec::new();

target_segment_size = 1000
min_num_segments = 2
sorted_segments = vec![250, 250, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
unmerged_docs = 1500

First, the condition at

if unmerged_docs >= self.target_segment_size {

will be met, so it starts building a batch.

Next, the while loop will pop the smallest segments off the end to build the merge candidate. Since it needs 1000 docs, it will pop the last 10 segments of 100 docs each, so on the 10th iteration of the loop we'll be in this state when we hit the check at

if batch_docs >= self.target_segment_size {

sorted_segments = vec![250, 250]
unmerged_docs = 1500
batch = vec![100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
batch_docs = 1000

Next, the skip batch is created and a new merge candidate with all the segments in the batch is added to candidates. unmerged_docs is reduced by the batch size, which makes it 500, so the check at

if unmerged_docs <= self.target_segment_size {

breaks out of the while loop. Then the remaining 2 segments will be merged by the log merge logic.
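
Feeding this example through the skip_merges sketch from earlier (hypothetical, not the exact PR code) reproduces the trace:

fn main() {
    let sorted_segments = vec![250, 250, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100];
    let (candidates, remaining) = skip_merges(sorted_segments, 1000);
    assert_eq!(candidates, vec![vec![100u32; 10]]); // one skip candidate of 1000 docs
    assert_eq!(remaining, vec![250, 250]);          // left for the log merge logic
}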
