Implement two-stage vocabulary merging to avoid file descriptor limits #2428
Draft
joka921 wants to merge 12 commits into ad-freiburg:master from joka921:two-stage-vocab-merging
Conversation
This commit adds batched vocabulary merging to handle cases where the number of partial vocabulary files exceeds the operating system's file descriptor limit (typically 1024-4096).

**Problem:** The vocabulary merger previously opened all input files and all output mapping files simultaneously. With thousands of input files (e.g., 5000+), this exceeded OS file descriptor limits.

**Solution:** Implement two-stage merging when numFiles > MAX_NUM_FILES_FOR_DIRECT_MERGE (default: 2000):

Stage 1: Merge input files in batches
- Process batches of up to 2000 files at a time
- Each batch is merged into intermediate files (same format as the originals)
- Create internal ID mappings (original file ID → batch-local ID)

Stage 2: Merge batch results into the final vocabulary
- Merge the batch files (far fewer than the original file count)
- Generate batch-to-global ID mappings

Stage 3: Compose the final ID mappings
- For each original file: compose (original → batch) and (batch → global) to create the (original → global) mapping
- Clean up all intermediate files

**Optimization:** When numFiles ≤ MAX_NUM_FILES_FOR_DIRECT_MERGE, the original single-pass algorithm is used (no overhead).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
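The batching scheme described above can be sketched in a few lines. QLever itself is C++; the following is a hedged Python sketch of the control flow only. It omits deduplication and the ID-mapping bookkeeping the commit describes, assumes sorted text files with one word per line, and assumes (as the two-stage design does) that one level of batching brings the file count under the limit; the function names are invented for illustration:

```python
import heapq
import os

def merge_files(paths, out_path):
    """Single-pass k-way merge of sorted text files.

    Opens len(paths) file descriptors at once -- this is the step that
    fails when the input count exceeds the OS descriptor limit.
    """
    files = [open(p) for p in paths]
    try:
        with open(out_path, "w") as out:
            for line in heapq.merge(*files):
                out.write(line)
    finally:
        for f in files:
            f.close()

def two_stage_merge(paths, out_path, max_files_per_batch=2000):
    """Merge in batches so at most max_files_per_batch inputs are open."""
    if len(paths) <= max_files_per_batch:
        # Optimization from the commit: small inputs take the
        # original single-pass path with no extra overhead.
        merge_files(paths, out_path)
        return
    # Stage 1: merge batches into intermediate files (same format).
    intermediates = []
    for i in range(0, len(paths), max_files_per_batch):
        tmp = f"{out_path}.batch{i // max_files_per_batch}"
        merge_files(paths[i:i + max_files_per_batch], tmp)
        intermediates.append(tmp)
    # Stage 2: merge the (far fewer) intermediates into the final file.
    merge_files(intermediates, out_path)
    # Stage 3 in the real code also composes ID maps; here we only
    # clean up the intermediate files.
    for tmp in intermediates:
        os.remove(tmp)
```

Because each intermediate file has the same format as the originals, Stage 2 can reuse the same merge routine as Stage 1.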
Extract common patterns into reusable helper functions to reduce duplication and improve maintainability:

**Extracted helpers:**
- `makeComparators()`: Creates lessThan predicates from the word comparator
- `makePartialVocabGenerator()`: Creates an input stream from a partial vocab file
- `makeBatchVocabGenerator()`: Creates an input stream from a batch vocab file
- `processQueueWord()`: Unified word-processing logic for all merge paths

**Impact:**
- Eliminated ~140 lines of duplicated code
- Comparator creation logic: 3 copies → 1 function
- File generator logic: 2 copies → 2 specialized functions
- Word processing logic: 2 copies → 1 function
- Main merge, batch merge, and two-stage merge now share common code

**Benefits:**
- Easier to maintain (changes in one place)
- More consistent behavior across merge strategies
- Better testability of individual components
- Clearer separation of concerns

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Johannes Kalmbach <[email protected]>
**Refactoring improvements:**
1. Extract `getTargetIdForLastComponent()` - eliminates duplication of ID computation logic across merge paths
2. Extract `setupParallelMerge()` - a single place for parallel merge setup
3. Extract `composeIdMappings()` - Stage 3 ID composition as a separate function
4. Add a `BATCH_TO_GLOBAL_IDMAP_INFIX` constant for temporary file naming

**Configuration:**
- Make `maxFilesPerBatch` a configurable parameter (default: 2000)
- Propagated through the mergeVocabulary() signature
- Enables testing with smaller batch sizes

**Benefits:**
- Further reduces code duplication (60+ lines eliminated)
- Better separation of concerns
- Easier to test individual components
- More flexible configuration for testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
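The Stage 3 composition that `composeIdMappings()` performs reduces to composing two functions: look an original ID up in the (original → batch) map, then map the result through the (batch → global) map. A hedged sketch with plain dicts (the real code operates on ID-map files, and this function name mirrors, but is not, the C++ implementation):

```python
def compose_id_maps(original_to_batch, batch_to_global):
    """Compose (original -> batch-local) with (batch-local -> global)
    to obtain the final (original -> global) ID mapping."""
    return {orig: batch_to_global[b]
            for orig, b in original_to_batch.items()}
```

Two original IDs that collapsed onto the same batch-local ID (a duplicate word within a batch) automatically receive the same global ID after composition.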
Add a `mergeVocabularyTwoStage` test that forces two-stage merging by setting `maxFilesPerBatch` to 1. This tests that:
- The two-stage merge produces identical results to the single-stage merge
- ID mappings are correctly composed through the batch intermediate files
- All metadata (language tags, internal entities) is preserved
- It works with small batch sizes (edge case testing)

The test reuses the existing `MergeVocabularyTest` fixture and verifies that the output vocabulary and ID mappings match the expected values from the single-stage merge test.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Ensure batch-to-global ID map files are closed before reading them back in Stage 3.

Previously, the IdMapWriter objects were still in scope when composeIdMappings tried to read the files, causing read failures.

Fix: Clear the batchToGlobalIdMaps vector before Stage 3 to properly flush and close all file handles.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Add a `MergeVocabularyMultiFileTest` fixture with three comprehensive test cases:

**1. twoFilesPerBatch Test:**
- 4 files total, batch size 2 (2 batches)
- Tests intra-batch duplicates: "delta" in files 0 & 1, "foxtrot" in files 2 & 3
- Verifies correct ID mapping composition within batches

**2. wordAcrossBatches Test:**
- 4 files total, batch size 2 (2 batches)
- Tests cross-batch duplicates: "shared" appears in batch 0 (file 0) and batch 1 (file 2)
- Verifies correct ID mapping composition across batches

**3. complexMultiBatchScenario Test:**
- 6 files total, batch size 2 (3 batches)
- Tests complex overlapping:
  - "shared1" appears in files 0, 1 (batch 0) and file 4 (batch 2)
  - "shared2" appears in file 0 (batch 0) and file 2 (batch 1)
  - "shared3" appears in files 2, 3 (batch 1) and file 5 (batch 2)
- Verifies all three levels of ID resolution: local file ID → batch-local ID → global ID

All tests verify that:
- The vocabulary is correctly merged and sorted
- Duplicate words receive the same global ID regardless of which file/batch they appear in
- ID mapping composition works correctly through all stages

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
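The invariant these tests check is that a word shared between files, in the same batch or different batches, ends up with one global ID. A hedged Python toy model of that contract, assuming the global IDs are simply positions in the sorted union of all words (the function and scenario below are illustrative, not the C++ test code):

```python
def global_ids(word_lists):
    """Toy model of the merger's contract: the sorted union of all
    words defines the global IDs; each file gets a map from its own
    local ID (position in its sorted, deduplicated word list) to the
    global ID."""
    vocab = sorted({w for words in word_lists for w in words})
    gid = {w: i for i, w in enumerate(vocab)}
    maps = [{j: gid[w] for j, w in enumerate(sorted(set(words)))}
            for words in word_lists]
    return vocab, maps
```

With batch size 2 this mirrors the wordAcrossBatches scenario: "shared" in file 0 (batch 0) and file 2 (batch 1) must resolve to the same global ID whichever batch it was merged in.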
Add a --max-files-per-batch command-line parameter to IndexBuilderMain and propagate it through the configuration chain:
- Added to IndexBuilderConfig in Qlever.h (default: 2000)
- Propagated to IndexImpl via the setMaxFilesPerBatch() setter
- Passed to the mergeVocabulary() call in IndexImpl.cpp

Also fix the log message in mergeSingleBatch() to show the correct upper bound (endFile - 1, since endFile is exclusive).

This allows users to tune the batch size for two-stage vocabulary merging based on their system's file descriptor limits.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
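The flag name and its default come from the commit; everything else below is a hypothetical Python stand-in for how such a parameter is typically parsed and handed to the config chain (IndexBuilderMain itself is C++ and does not use argparse):

```python
import argparse

def parse_args(argv):
    """Hypothetical sketch: parse the new flag with the default
    (2000) stated in the commit message."""
    parser = argparse.ArgumentParser(prog="IndexBuilderMain")
    parser.add_argument(
        "--max-files-per-batch", type=int, default=2000,
        help="upper bound on simultaneously opened partial "
             "vocabulary files during two-stage merging")
    return parser.parse_args(argv)
```

Users on systems with a low descriptor limit (e.g. a soft limit of 1024) would pass a correspondingly smaller value.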
Signed-off-by: Johannes Kalmbach <[email protected]>
Codecov Report ❌ Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #2428      +/-   ##
==========================================
- Coverage   91.44%   91.40%    -0.05%
==========================================
  Files         462      462
  Lines       46907    47201     +294
  Branches     5240     5270      +30
==========================================
+ Hits        42896    43142     +246
- Misses       2511     2522      +11
- Partials     1500     1537      +37
Signed-off-by: Johannes Kalmbach <[email protected]>
…nto two-stage-vocab-merging
Overview
Conformance check passed ✅ No test result changes.
This commit adds batched vocabulary merging to handle cases where the number of partial vocabulary files exceeds the operating system's file descriptor limit (typically 1024-4096).
Problem:
The vocabulary merger previously opened all input files and all output mapping files simultaneously. With thousands of input files (e.g., 5000+), this exceeded OS file descriptor limits.
Solution:
Implement two-stage merging when numFiles > MAX_NUM_FILES_FOR_DIRECT_MERGE (default: 2000):
Stage 1: Merge input files in batches
Stage 2: Merge batch results into final vocabulary
Stage 3: Compose final ID mappings
Optimization:
When numFiles ≤ MAX_NUM_FILES_FOR_DIRECT_MERGE, the original single-pass algorithm is used (no overhead).
🤖 Generated with Claude Code
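The "typically 1024-4096" limit mentioned above is the per-process soft limit on open file descriptors, which can be queried at runtime. A hedged sketch of the dispatch decision between the direct and two-stage paths; the function name and the reserve of 64 descriptors (for outputs, logs, and stdio) are invented for illustration:

```python
import resource

def can_merge_directly(num_files, reserve=64, soft_limit=None):
    """Return True if opening num_files inputs at once, plus a reserve
    for output and bookkeeping descriptors, fits under the process's
    soft RLIMIT_NOFILE. soft_limit can be injected for testing."""
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return num_files + reserve <= soft_limit
```

The PR instead uses a fixed threshold (MAX_NUM_FILES_FOR_DIRECT_MERGE = 2000), which is simpler and portable; a runtime query like this is one possible alternative, not what the PR does.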