Update LogMergePolicy to skip to a target number of documents
#2627
base: main
Conversation
```rust
batch_docs = 0;
candidates.push(MergeCandidate(
    // drain to reuse the buffer
    batch.drain(..).map(|seg| seg.id()).collect(),
```
Nothing prevents this merge from having 1000 segments.
Should that even be prevented? The logic here only runs when there are enough unmerged documents to create a segment above `target_segment_size`, so it will keep adding segments to the merge until the operation hits `target_segment_size`. This is an optimisation to prevent documents in those small segments from being merged over and over again.
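To make the trade-off concrete, here is a minimal self-contained sketch of the batching being described. The function name `build_skip_batches` and the simplified types are assumptions for illustration, not the actual patch. It shows that with many single-doc segments, one merge candidate can indeed contain on the order of `target_segment_size` segments, which is the concern raised above:

```rust
// Hypothetical sketch of the skip-batch idea (not the real tantivy code):
// pop the smallest segments into a batch until the batch reaches the
// target doc count, then emit the batch as one merge candidate.
fn build_skip_batches(mut doc_counts: Vec<u64>, target: u64) -> Vec<Vec<u64>> {
    // Sort descending so pop() always yields the smallest segment.
    doc_counts.sort_unstable_by(|a, b| b.cmp(a));
    let mut unmerged: u64 = doc_counts.iter().sum();
    let mut batches: Vec<Vec<u64>> = Vec::new();
    let mut batch: Vec<u64> = Vec::new();
    let mut batch_docs: u64 = 0;
    while unmerged > target {
        let seg = doc_counts.pop().expect("unmerged > 0 implies segments remain");
        batch.push(seg);
        batch_docs += seg;
        if batch_docs >= target {
            // Batch reached the target size: emit it as one merge candidate.
            unmerged -= batch_docs;
            batch_docs = 0;
            batches.push(std::mem::take(&mut batch));
        }
    }
    batches
}

fn main() {
    // 1500 single-doc segments with a target of 1000 docs: the single
    // candidate produced contains 1000 segments.
    let batches = build_skip_batches(vec![1; 1500], 1000);
    assert_eq!(batches.len(), 1);
    assert_eq!(batches[0].len(), 1000);
}
```

Note that nothing in the loop bounds the number of segments per batch, only the number of documents, which is exactly the point under discussion.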
```rust
// If there aren't enough documents to create another segment of the target size
// then break
if unmerged_docs <= self.target_segment_size {
    break;
```
OK, we break here, which means we keep going through the rest of the function. But the levels are now empty, because we popped them... This is terrible logic.
aaaaaaaaaaaarg the original merge policy has the same flaw!
We don't pop all the segments; this loop only pops segments while there are enough unmerged docs to create a segment above `target_segment_size`. That is the exit condition for deciding whether we need to go around again after creating a segment at the target size. Breaking here means that the remaining segments, which are all smaller than `target_segment_size`, can go through the log merge logic rather than this skip logic.
src/indexer/log_merge_policy.rs
Outdated
```rust
    .collect()
    .into_iter()
    .for_each(|(_, group)| {
        let mut hit_delete_threshold = false;
```
A `for` loop would be nicer than `for_each` here.
Sure, will update
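For illustration, here is a hedged before/after sketch of the suggested change, using toy data and assumed function names rather than the actual tantivy code. Both versions compute the same thing; the `for` loop avoids mutating captured state inside nested closures:

```rust
// Iterator style, as in the snippet above: mutating `total` from inside
// nested closures works, but obscures the control flow.
fn sum_with_for_each(groups: &[(usize, Vec<u64>)]) -> u64 {
    let mut total = 0;
    groups.iter().for_each(|(_, group)| {
        group.iter().for_each(|docs| total += docs);
    });
    total
}

// The reviewer's suggestion: a plain `for` loop with identical behavior.
fn sum_with_for(groups: &[(usize, Vec<u64>)]) -> u64 {
    let mut total = 0;
    for (_, group) in groups {
        for docs in group {
            total += docs;
        }
    }
    total
}

fn main() {
    let groups = vec![(0, vec![100, 200]), (1, vec![300])];
    assert_eq!(sum_with_for_each(&groups), sum_with_for(&groups));
}
```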
src/indexer/log_merge_policy.rs
Outdated
```rust
    .into_iter()
    .for_each(|(_, group)| {
        let mut hit_delete_threshold = false;
        group.for_each(|(_, seg)| {
```
Same. Iterators are great; abusing them just makes the code less readable.
As above, will update.
src/indexer/log_merge_policy.rs
Outdated
```rust
// Filter for segments that have less than the target number of docs, count total unmerged
// docs, and sort in descending order
let mut unmerged_docs = 0;
let mut levels = segments
```
This vec does not contain "levels" anymore. Why is it named `levels`?
Yeah, it shouldn't be called that anymore; I'll rename it.
Have added a commit addressing the feedback @fulmicoton. It looks like there might need to be some clarification on how the logic works though, so I'll run through an example input starting from here: `tantivy/src/indexer/log_merge_policy.rs`, line 105 in d516fc5.

First, the condition at `tantivy/src/indexer/log_merge_policy.rs`, line 106 in d516fc5.

Next, the while loop will pop the smallest segments off the end to make the merge candidate. Since it needs 1000 docs, it will pop the last 10 segments with 100 docs each, so on the 10th iteration of the loop we'll be in this state when we hit the check at `tantivy/src/indexer/log_merge_policy.rs`, line 116 in d516fc5.

Next, the skip batch is created and a new merge candidate with all the segments in the batch will be added to `candidates`. `unmerged_docs` will be reduced by the amount in the batch, which will make it 500, causing the check at `tantivy/src/indexer/log_merge_policy.rs`, line 125 in d516fc5 to break out of the loop.
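The walkthrough above can be sketched as a small self-contained program. The names (`skip_candidates`, `target_segment_size`) follow the PR's identifiers, but the simplified types and structure are assumptions for illustration, not the actual tantivy implementation:

```rust
// Hedged sketch of the skip logic being walked through: segments are
// represented only by their doc counts, sorted descending so pop()
// yields the smallest.
fn skip_candidates(small_segments: &mut Vec<u64>, target_segment_size: u64) -> Vec<Vec<u64>> {
    let mut unmerged_docs: u64 = small_segments.iter().sum();
    let mut candidates: Vec<Vec<u64>> = Vec::new();
    let mut batch: Vec<u64> = Vec::new();
    let mut batch_docs: u64 = 0;
    loop {
        // If there aren't enough documents left to create another segment of
        // the target size, break; the remainder falls through to log merging.
        if unmerged_docs <= target_segment_size {
            break;
        }
        let seg = small_segments.pop().expect("doc total > 0 implies non-empty");
        batch.push(seg);
        batch_docs += seg;
        if batch_docs >= target_segment_size {
            unmerged_docs -= batch_docs;
            batch_docs = 0;
            candidates.push(std::mem::take(&mut batch)); // drain-style buffer reuse
        }
    }
    candidates
}

fn main() {
    // 15 segments of 100 docs each: 1500 unmerged docs against a target of 1000.
    let mut segments = vec![100u64; 15];
    let candidates = skip_candidates(&mut segments, 1000);
    assert_eq!(candidates.len(), 1);     // one skip batch...
    assert_eq!(candidates[0].len(), 10); // ...of the 10 smallest segments
    assert_eq!(segments.len(), 5);       // 500 docs remain for normal log merging
}
```

With these inputs the loop pops ten 100-doc segments, emits one candidate, reduces `unmerged_docs` to 500, and then breaks, leaving the remaining five segments for the ordinary log merge logic.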
From discord conversation:
https://discord.com/channels/908281611840282624/915785344396439552/1341705839668625468
This updated logic makes `LogMergePolicy` aim for a specific target number of documents, and opportunistically skip merge operations to reach that target document count.

Pros:

- … (`max_docs_before_merge`)
- … `(target_segment_size * 2) - 2` …

Cons:

- If an index has fewer than `target_segment_size` total docs then it may get merged to a single segment and thus not parallelize well when searching