Implement two-stage vocabulary merging to avoid file descriptor limits #2428
Draft
joka921 wants to merge 12 commits into ad-freiburg:master from joka921:two-stage-vocab-merging
Conversation
This commit adds batched vocabulary merging to handle cases where the number of partial vocabulary files exceeds the operating system's file descriptor limit (typically 1024-4096).

**Problem:** The vocabulary merger previously opened all input files and all output mapping files simultaneously. With thousands of input files (e.g., 5000+), this exceeded OS file descriptor limits.

**Solution:** Implement two-stage merging when numFiles > MAX_NUM_FILES_FOR_DIRECT_MERGE (default: 2000):

Stage 1: Merge input files in batches
- Process batches of up to 2000 files at a time
- Each batch is merged into intermediate files (same format as the originals)
- Create internal ID mappings (original file ID → batch-local ID)

Stage 2: Merge batch results into the final vocabulary
- Merge the batch files (far fewer than the original file count)
- Generate batch-to-global ID mappings

Stage 3: Compose the final ID mappings
- For each original file: compose (original → batch) and (batch → global) to create the (original → global) mapping
- Clean up all intermediate files

**Optimization:** When numFiles ≤ MAX_NUM_FILES_FOR_DIRECT_MERGE, the original single-pass algorithm is used (no overhead).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
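The batching scheme described above can be sketched in a few lines. QLever itself is C++; the following is a hedged Python sketch of the control flow only. It omits deduplication and the ID-mapping bookkeeping the commit describes, assumes sorted text files with one word per line, and assumes (as the two-stage design does) that one level of batching brings the file count under the limit; the function names are invented for illustration:

```python
import heapq
import os

def merge_files(paths, out_path):
    """Single-pass k-way merge of sorted text files.

    Opens len(paths) file descriptors at once -- this is the step that
    fails when the input count exceeds the OS descriptor limit.
    """
    files = [open(p) for p in paths]
    try:
        with open(out_path, "w") as out:
            for line in heapq.merge(*files):
                out.write(line)
    finally:
        for f in files:
            f.close()

def two_stage_merge(paths, out_path, max_files_per_batch=2000):
    """Merge in batches so at most max_files_per_batch inputs are open."""
    if len(paths) <= max_files_per_batch:
        # Optimization from the commit: small inputs take the
        # original single-pass path with no extra overhead.
        merge_files(paths, out_path)
        return
    # Stage 1: merge batches into intermediate files (same format).
    intermediates = []
    for i in range(0, len(paths), max_files_per_batch):
        tmp = f"{out_path}.batch{i // max_files_per_batch}"
        merge_files(paths[i:i + max_files_per_batch], tmp)
        intermediates.append(tmp)
    # Stage 2: merge the (far fewer) intermediates into the final file.
    merge_files(intermediates, out_path)
    # Stage 3 in the real code also composes ID maps; here we only
    # clean up the intermediate files.
    for tmp in intermediates:
        os.remove(tmp)
```

Because each intermediate file has the same format as the originals, Stage 2 can reuse the same merge routine as Stage 1.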
Extract common patterns into reusable helper functions to reduce duplication and improve maintainability:

**Extracted helpers:**
- `makeComparators()`: Creates lessThan predicates from the word comparator
- `makePartialVocabGenerator()`: Creates an input stream from a partial vocab file
- `makeBatchVocabGenerator()`: Creates an input stream from a batch vocab file
- `processQueueWord()`: Unified word-processing logic for all merge paths

**Impact:**
- Eliminated ~140 lines of duplicated code
- Comparator creation logic: 3 copies → 1 function
- File generator logic: 2 copies → 2 specialized functions
- Word processing logic: 2 copies → 1 function
- Main merge, batch merge, and two-stage merge now share common code

**Benefits:**
- Easier to maintain (changes in one place)
- More consistent behavior across merge strategies
- Better testability of individual components
- Clearer separation of concerns

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Johannes Kalmbach <[email protected]>
**Refactoring improvements:**
1. Extract `getTargetIdForLastComponent()` - eliminates duplication of ID computation logic across merge paths
2. Extract `setupParallelMerge()` - a single place for parallel merge setup
3. Extract `composeIdMappings()` - Stage 3 ID composition as a separate function
4. Add a `BATCH_TO_GLOBAL_IDMAP_INFIX` constant for temporary file naming

**Configuration:**
- Make `maxFilesPerBatch` a configurable parameter (default: 2000)
- Propagated through the mergeVocabulary() signature
- Enables testing with smaller batch sizes

**Benefits:**
- Further reduces code duplication (60+ lines eliminated)
- Better separation of concerns
- Easier to test individual components
- More flexible configuration for testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
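The Stage 3 composition that `composeIdMappings()` performs reduces to composing two functions: look an original ID up in the (original → batch) map, then map the result through the (batch → global) map. A hedged sketch with plain dicts (the real code operates on ID-map files, and this function name mirrors, but is not, the C++ implementation):

```python
def compose_id_maps(original_to_batch, batch_to_global):
    """Compose (original -> batch-local) with (batch-local -> global)
    to obtain the final (original -> global) ID mapping."""
    return {orig: batch_to_global[b]
            for orig, b in original_to_batch.items()}
```

Two original IDs that collapsed onto the same batch-local ID (a duplicate word within a batch) automatically receive the same global ID after composition.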
Add a `mergeVocabularyTwoStage` test that forces two-stage merging by setting `maxFilesPerBatch` to 1. This tests that:
- The two-stage merge produces identical results to the single-stage merge
- ID mappings are correctly composed through the batch intermediate files
- All metadata (language tags, internal entities) is preserved
- It works with small batch sizes (edge case testing)

The test reuses the existing `MergeVocabularyTest` fixture and verifies that the output vocabulary and ID mappings match the expected values from the single-stage merge test.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Ensure batch-to-global ID map files are closed before reading them back in Stage 3.

Previously, the IdMapWriter objects were still in scope when composeIdMappings tried to read the files, causing read failures.

Fix: Clear the batchToGlobalIdMaps vector before Stage 3 to properly flush and close all file handles.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Add a `MergeVocabularyMultiFileTest` fixture with three comprehensive test cases:

**1. twoFilesPerBatch Test:**
- 4 files total, batch size 2 (2 batches)
- Tests intra-batch duplicates: "delta" in files 0 & 1, "foxtrot" in files 2 & 3
- Verifies correct ID mapping composition within batches

**2. wordAcrossBatches Test:**
- 4 files total, batch size 2 (2 batches)
- Tests cross-batch duplicates: "shared" appears in batch 0 (file 0) and batch 1 (file 2)
- Verifies correct ID mapping composition across batches

**3. complexMultiBatchScenario Test:**
- 6 files total, batch size 2 (3 batches)
- Tests complex overlapping:
  - "shared1" appears in files 0, 1 (batch 0) and file 4 (batch 2)
  - "shared2" appears in file 0 (batch 0) and file 2 (batch 1)
  - "shared3" appears in files 2, 3 (batch 1) and file 5 (batch 2)
- Verifies all three levels of ID resolution: local file ID → batch-local ID → global ID

All tests verify that:
- The vocabulary is correctly merged and sorted
- Duplicate words receive the same global ID regardless of which file/batch they appear in
- ID mapping composition works correctly through all stages

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
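The invariant these tests check is that a word shared between files, in the same batch or different batches, ends up with one global ID. A hedged Python toy model of that contract, assuming the global IDs are simply positions in the sorted union of all words (the function and scenario below are illustrative, not the C++ test code):

```python
def global_ids(word_lists):
    """Toy model of the merger's contract: the sorted union of all
    words defines the global IDs; each file gets a map from its own
    local ID (position in its sorted, deduplicated word list) to the
    global ID."""
    vocab = sorted({w for words in word_lists for w in words})
    gid = {w: i for i, w in enumerate(vocab)}
    maps = [{j: gid[w] for j, w in enumerate(sorted(set(words)))}
            for words in word_lists]
    return vocab, maps
```

With batch size 2 this mirrors the wordAcrossBatches scenario: "shared" in file 0 (batch 0) and file 2 (batch 1) must resolve to the same global ID whichever batch it was merged in.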
Add a --max-files-per-batch command-line parameter to IndexBuilderMain and propagate it through the configuration chain:
- Added to IndexBuilderConfig in Qlever.h (default: 2000)
- Propagated to IndexImpl via the setMaxFilesPerBatch() setter
- Passed to the mergeVocabulary() call in IndexImpl.cpp

Also fix the log message in mergeSingleBatch() to show the correct upper bound (endFile - 1, since endFile is exclusive).

This allows users to tune the batch size for two-stage vocabulary merging based on their system's file descriptor limits.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
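The flag name and its default come from the commit; everything else below is a hypothetical Python stand-in for how such a parameter is typically parsed and handed to the config chain (IndexBuilderMain itself is C++ and does not use argparse):

```python
import argparse

def parse_args(argv):
    """Hypothetical sketch: parse the new flag with the default
    (2000) stated in the commit message."""
    parser = argparse.ArgumentParser(prog="IndexBuilderMain")
    parser.add_argument(
        "--max-files-per-batch", type=int, default=2000,
        help="upper bound on simultaneously opened partial "
             "vocabulary files during two-stage merging")
    return parser.parse_args(argv)
```

Users on systems with a low descriptor limit (e.g. a soft limit of 1024) would pass a correspondingly smaller value.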
Signed-off-by: Johannes Kalmbach <[email protected]>
Codecov Report ❌ Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #2428      +/-   ##
==========================================
- Coverage   91.44%   91.40%    -0.05%
==========================================
  Files         462      462
  Lines       46907    47201     +294
  Branches     5240     5270      +30
==========================================
+ Hits        42896    43142     +246
- Misses       2511     2522      +11
- Partials     1500     1537      +37
Signed-off-by: Johannes Kalmbach <[email protected]>
…nto two-stage-vocab-merging
Overview
Conformance check passed ✅ No test result changes.
This commit adds batched vocabulary merging to handle cases where the number of partial vocabulary files exceeds the operating system's file descriptor limit (typically 1024-4096).
Problem:
The vocabulary merger previously opened all input files and all output mapping files simultaneously. With thousands of input files (e.g., 5000+), this exceeded OS file descriptor limits.
Solution:
Implement two-stage merging when numFiles > MAX_NUM_FILES_FOR_DIRECT_MERGE (default: 2000):
Stage 1: Merge input files in batches
Stage 2: Merge batch results into final vocabulary
Stage 3: Compose final ID mappings
Optimization:
When numFiles ≤ MAX_NUM_FILES_FOR_DIRECT_MERGE, the original single-pass algorithm is used (no overhead).
🤖 Generated with Claude Code
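The "typically 1024-4096" limit mentioned above is the per-process soft limit on open file descriptors, which can be queried at runtime. A hedged sketch of the dispatch decision between the direct and two-stage paths; the function name and the reserve of 64 descriptors (for outputs, logs, and stdio) are invented for illustration:

```python
import resource

def can_merge_directly(num_files, reserve=64, soft_limit=None):
    """Return True if opening num_files inputs at once, plus a reserve
    for output and bookkeeping descriptors, fits under the process's
    soft RLIMIT_NOFILE. soft_limit can be injected for testing."""
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return num_files + reserve <= soft_limit
```

The PR instead uses a fixed threshold (MAX_NUM_FILES_FOR_DIRECT_MERGE = 2000), which is simpler and portable; a runtime query like this is one possible alternative, not what the PR does.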