Skip to content

Conversation

@joka921
Copy link
Member

@joka921 joka921 commented Oct 13, 2025

This commit adds batched vocabulary merging to handle cases where the number of partial vocabulary files exceeds the operating system's file descriptor limit (typically 1024-4096).

Problem:
The vocabulary merger previously opened all input files and all output mapping files simultaneously. With thousands of input files (e.g., 5000+), this exceeded OS file descriptor limits.

Solution:
Implement two-stage merging when numFiles > MAX_NUM_FILES_FOR_DIRECT_MERGE (default: 2000):

Stage 1: Merge input files in batches

  • Process batches of up to 2000 files at a time
  • Each batch is merged into intermediate files (same format as originals)
  • Create internal ID mappings (original file ID → batch-local ID)

Stage 2: Merge batch results into final vocabulary

  • Merge the batch files (much fewer than original file count)
  • Generate batch-to-global ID mappings

Stage 3: Compose final ID mappings

  • For each original file: compose (original → batch) and (batch → global) to create (original → global) mappings
  • Clean up all intermediate files

Optimization:
When numFiles ≤ MAX_NUM_FILES_FOR_DIRECT_MERGE, the original single-pass algorithm is used (no overhead).

🤖 Generated with Claude Code

joka921 and others added 9 commits October 13, 2025 10:37
This commit adds batched vocabulary merging to handle cases where the
number of partial vocabulary files exceeds the operating system's file
descriptor limit (typically 1024-4096).

**Problem:**
The vocabulary merger previously opened all input files and all output
mapping files simultaneously. With thousands of input files (e.g., 5000+),
this exceeded OS file descriptor limits.

**Solution:**
Implement two-stage merging when numFiles > MAX_NUM_FILES_FOR_DIRECT_MERGE
(default: 2000):

Stage 1: Merge input files in batches
- Process batches of up to 2000 files at a time
- Each batch is merged into intermediate files (same format as originals)
- Create internal ID mappings (original file ID → batch-local ID)

Stage 2: Merge batch results into final vocabulary
- Merge the batch files (much fewer than original file count)
- Generate batch-to-global ID mappings

Stage 3: Compose final ID mappings
- For each original file: compose (original → batch) and (batch → global)
  to create (original → global) mappings
- Clean up all intermediate files

**Optimization:**
When numFiles ≤ MAX_NUM_FILES_FOR_DIRECT_MERGE, the original single-pass
algorithm is used (no overhead).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Extract common patterns into reusable helper functions to reduce
duplication and improve maintainability:

**Extracted helpers:**
- `makeComparators()`: Creates lessThan predicates from word comparator
- `makePartialVocabGenerator()`: Creates input stream from partial vocab file
- `makeBatchVocabGenerator()`: Creates input stream from batch vocab file
- `processQueueWord()`: Unified word processing logic for all merge paths

**Impact:**
- Eliminated ~140 lines of duplicated code
- Comparator creation logic: 3 copies → 1 function
- File generator logic: 2 copies → 2 specialized functions
- Word processing logic: 2 copies → 1 function
- Main merge, batch merge, and two-stage merge now share common code

**Benefits:**
- Easier to maintain (changes in one place)
- More consistent behavior across merge strategies
- Better testability of individual components
- Clearer separation of concerns

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
**Refactoring improvements:**
1. Extract `getTargetIdForLastComponent()` - Eliminates duplication of ID
   computation logic across merge paths
2. Extract `setupParallelMerge()` - Single place for parallel merge setup
3. Extract `composeIdMappings()` - Stage 3 ID composition as separate function
4. Add `BATCH_TO_GLOBAL_IDMAP_INFIX` constant for temporary file naming

**Configuration:**
- Make `maxFilesPerBatch` configurable parameter (default: 2000)
- Propagated through mergeVocabulary() signature
- Enables testing with smaller batch sizes

**Benefits:**
- Further reduces code duplication (60+ lines eliminated)
- Better separation of concerns
- Easier to test individual components
- More flexible configuration for testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add `mergeVocabularyTwoStage` test that forces two-stage merging by
setting `maxFilesPerBatch` to 1. This tests:

- Two-stage merge produces identical results to single-stage merge
- ID mappings are correctly composed through batch intermediate files
- All metadata (language tags, internal entities) is preserved
- Works with small batch sizes (edge case testing)

The test reuses the existing `MergeVocabularyTest` fixture and verifies
that the output vocabulary and ID mappings match the expected values
from the single-stage merge test.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Ensure batch-to-global ID map files are closed before reading them
back in Stage 3. Previously, the IdMapWriter objects were still in
scope when composeIdMappings tried to read the files, causing read
failures.

Fix: Clear batchToGlobalIdMaps vector before Stage 3 to properly
flush and close all file handles.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
EOF
)
Add `MergeVocabularyMultiFileTest` fixture with three comprehensive test cases:

**1. twoFilesPerBatch Test:**
- 4 files total, batch size 2 (2 batches)
- Tests intra-batch duplicates: "delta" in files 0&1, "foxtrot" in files 2&3
- Verifies correct ID mapping composition within batches

**2. wordAcrossBatches Test:**
- 4 files total, batch size 2 (2 batches)
- Tests cross-batch duplicates: "shared" appears in batch 0 (file 0) and
  batch 1 (file 2)
- Verifies correct ID mapping composition across batches

**3. complexMultiBatchScenario Test:**
- 6 files total, batch size 2 (3 batches)
- Tests complex overlapping:
  - "shared1" appears in files 0, 1 (batch 0) and file 4 (batch 2)
  - "shared2" appears in file 0 (batch 0) and file 2 (batch 1)
  - "shared3" appears in files 2, 3 (batch 1) and file 5 (batch 2)
- Verifies all three levels of ID resolution:
  - Local file ID → Batch-local ID → Global ID

All tests verify that:
- Vocabulary is correctly merged and sorted
- Duplicate words receive the same global ID regardless of which file/batch
  they appear in
- ID mapping composition works correctly through all stages

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add --max-files-per-batch command-line parameter to IndexBuilderMain
and propagate it through the configuration chain:
- Added to IndexBuilderConfig in Qlever.h (default: 2000)
- Propagated to IndexImpl via setMaxFilesPerBatch() setter
- Passed to mergeVocabulary() function call in IndexImpl.cpp

Also fix log message in mergeSingleBatch() to show correct upper bound
(endFile - 1, since endFile is exclusive).

This allows users to tune the batch size for two-stage vocabulary
merging based on their system's file descriptor limits.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Johannes Kalmbach <[email protected]>
@codecov
Copy link

codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 86.08696% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.40%. Comparing base (463d9ab) to head (8f81555).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
src/index/VocabularyMergerImpl.h 86.01% 12 Missing and 35 partials ⚠️
src/index/VocabularyMerger.h 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2428      +/-   ##
==========================================
- Coverage   91.44%   91.40%   -0.05%     
==========================================
  Files         462      462              
  Lines       46907    47201     +294     
  Branches     5240     5270      +30     
==========================================
+ Hits        42896    43142     +246     
- Misses       2511     2522      +11     
- Partials     1500     1537      +37     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sparql-conformance
Copy link

Overview

Number of Tests Passed ✅ Failed ❌ Intended ⚠️ Not tested
522 439 16 67 0

Conformance check passed ✅

No test result changes.

Details: https://qlever.dev/sparql-conformance-ui?cur=8f81555ef96004eaa3fbf8d3d3c8dbad75002ea2&prev=b105427beac0ccd8ed85dd3eb6000786f7219f22

@sonarqubecloud
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
5.1% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant