Add block prefiltering for OptionalJoin, MultiColumnJoin, and Minus#2673
Add block prefiltering for OptionalJoin, MultiColumnJoin, and Minus#2673joka921 wants to merge 10 commits intoad-freiburg:masterfrom
Conversation
Extends the existing block-level prefiltering mechanism (currently only in Join) to three additional join-like operations: OptionalJoin, MultiColumnJoin, and Minus. Key changes: - Extended CompressedRelationReader::getBlocksForJoin to support multi-column filtering (2-3 columns) - Created JoinWithIndexScanHelpers.h with shared infrastructure for prefiltering across different join semantics - Implemented prefiltering for OptionalJoin (only right child can be prefiltered due to semantic constraints) - Implemented prefiltering for MultiColumnJoin (both children can be prefiltered) - Implemented prefiltering for Minus (only right child can be prefiltered due to semantic constraints) - Made IndexScan methods (getMetadataForScan, getLazyScan) public for use by join operations - Added comprehensive unit tests for all three operations Technical details: - Multi-column block filtering uses tuple-based comparison on block metadata - Prefiltering is applied when one or both children are IndexScans (detected as direct children) - For OPTIONAL and MINUS, only the right child is prefiltered to maintain correct semantics - For inner joins (MultiColumnJoin), both sides can be prefiltered - Supports both lazy and materialized inputs with appropriate prefiltering strategies Tests added: - OptionalJoin::prefilteringWithTwoIndexScans - MultiColumnJoin::prefilteringWithTwoIndexScans - Minus::prefilteringWithTwoIndexScans All tests verify correctness of results and confirm IndexScan prefiltering is applied. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #2673 +/- ##
==========================================
- Coverage 91.56% 90.56% -1.00%
==========================================
Files 480 481 +1
Lines 41275 42014 +739
Branches 5491 5576 +85
==========================================
+ Hits 37793 38050 +257
- Misses 1904 2370 +466
- Partials 1578 1594 +16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Extends OptionalJoin prefiltering to handle the case where the left input is lazy and the right is an IndexScan. This enables prefiltering of the right IndexScan while ensuring ALL left input is re-yielded (maintaining OPTIONAL semantics). Key changes: - Added IndexScan::prefilterTablesForOptional() method that ensures all left input is re-yielded without filtering - Added IndexScan::createPrefilteredJoinSideForOptional() helper that passes through all input unchanged - Implemented OptionalJoin::computeResultForIndexScanOnRightLazy() to use prefiltering instead of falling back to regular lazy optional join - The mechanism uses a state machine similar to Join's prefilterTables but guarantees all left rows are output (critical for OPTIONAL semantics) Technical details: - For OPTIONAL semantics, the left side generator re-yields ALL input (never skips any rows, even those without matching blocks) - The right IndexScan is still prefiltered based on the left's join column values, reducing unnecessary block reads - Uses shared_ptr wrappers for generators to enable const lambda capture in runLazyJoinAndConvertToGenerator - Only supports single join column for now (multi-column support can be added later) Test added: - OptionalJoin::prefilteringWithLazyLeftAndIndexScanRight Verifies correctness with lazy left input and IndexScan right, ensuring all 20 left rows are output with 10 matches and 10 UNDEFs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Similar to the OptionalJoin implementation, this enables prefiltering when the left input is lazy and the right is an IndexScan. The key difference from regular joins is that for MINUS semantics, we must process ALL left input to correctly determine which rows should be excluded. This implementation reuses the IndexScan::prefilterTablesForOptional method which passes through all left rows while prefiltering the right IndexScan based on block metadata. The MINUS vs OPTIONAL semantics difference is handled by the MinusRowHandler, not by the block-level prefiltering. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
joka921
left a comment
There was a problem hiding this comment.
Some chances to reduce code duplication.
Change function signatures for computeResultForIndexScanOnRight and computeResultForIndexScanOnRightLazy in OptionalJoin and Minus to accept `const IndexScan&` instead of `std::shared_ptr<IndexScan>`. This makes it clearer that the dynamic_pointer_cast has already been performed by the caller, and follows better C++ practices by avoiding unnecessary shared_ptr copies. The const_cast is needed because some IndexScan methods (like getResult and prefilterTablesForOptional) are not const, but this is acceptable as the operations do modify the IndexScan's runtime information. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Add `convertPrefilteredGenerators` helper function in JoinWithIndexScanHelpers to eliminate code duplication between OptionalJoin and Minus when handling lazy left + IndexScan right joins. This helper handles the common pattern of: - Creating identity permutation for left side (all columns) - Creating join column permutation for right side - Converting Result::LazyResult generators to CachingTransformInputRange with IdTableAndFirstCol format Reduces code duplication by 36 lines while maintaining identical functionality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Add `firstRowHasUndef` helper function in JoinWithIndexScanHelpers to eliminate duplicate UNDEF checking logic. This consolidates the pattern of checking if the first row of a table contains UNDEF values in any of the join columns, which was duplicated in: - OptionalJoin::computeResultForIndexScanOnRight - getBlocksForJoinOfColumnsWithScan (for 1, 2, and 3 column cases) The helper simplifies the code and makes it more maintainable by providing a single, clear function for this common check. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
…oIndexScans Change `computeResultForTwoIndexScans` in OptionalJoin, Minus, and MultiColumnJoin to accept `const IndexScan&` parameters instead of performing dynamic_pointer_cast internally. This addresses PR review feedback by: - Making it clear that the cast is done by the caller - Eliminating duplicate casts inside the functions - Following the same pattern as computeResultForIndexScanOnRight(Lazy) The functions now receive references to already-cast IndexScan objects, reducing code duplication and improving clarity about responsibilities. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
|
src/engine/OptionalJoin.cpp
Outdated
| if (!leftMetaBlocks.has_value()) { | ||
| // If no metadata, fall back to regular computation by returning to caller | ||
| // Caller will handle the regular path | ||
| return {IdTable{getResultWidth(), allocator()}, resultSortedOn(), | ||
| LocalVocab{}}; | ||
| } |
There was a problem hiding this comment.
Not needed, can be removed.
src/engine/OptionalJoin.cpp
Outdated
| leftBlocks.details().numBlocksAll_ = | ||
| leftMetaBlocks.value().sizeBlockMetadata_; | ||
|
|
There was a problem hiding this comment.
Also not needed, can be removed.
src/engine/OptionalJoin.cpp
Outdated
| // Get filtered blocks for the right (optional) side based on left's ranges | ||
| auto rightBlocks = | ||
| getBlocksForJoinOfTwoScans(leftScan, rightScan, _joinColumns.size()); | ||
|
|
There was a problem hiding this comment.
The right blocks are only the second entry in the result, directly extract the index here.
src/engine/OptionalJoin.cpp
Outdated
| // Create result generator | ||
| // Wrap generators in shared_ptr to allow const lambda capture | ||
| auto leftBlocksPtr = | ||
| std::make_shared<CompressedRelationReader::IdTableGeneratorInputRange>( | ||
| std::move(leftBlocks)); | ||
| auto rightBlocksPtr = | ||
| std::make_shared<CompressedRelationReader::IdTableGeneratorInputRange>( | ||
| std::move(rightBlocks[1])); |
There was a problem hiding this comment.
In the src/util directory, add a helper function
toSharedPtr, s.t. you can write auto rightBlocksPtr = ad_utility::toSharedPtr(std::move(leftBlocks)).
src/engine/OptionalJoin.cpp
Outdated
| Result OptionalJoin::computeResultForTwoIndexScans( | ||
| bool requestLaziness, const IndexScan& leftScan, | ||
| const IndexScan& rightScan) const { |
There was a problem hiding this comment.
The argument should not be const IndexScan& but IndexScan&, then you can get rid of all the ugly const_casts below.
src/engine/OptionalJoin.cpp
Outdated
| auto leftConverted = qlever::joinWithIndexScanHelpers::convertGenerator( | ||
| std::move(*leftBlocksPtr), const_cast<IndexScan&>(leftScan)); | ||
| auto rightConverted = qlever::joinWithIndexScanHelpers::convertGenerator( |
There was a problem hiding this comment.
with a using namespace qlever::join...Helpers, this gets even more readable.
src/engine/OptionalJoin.cpp
Outdated
| const_cast<IndexScan&>(leftScan).runtimeInfo().status_ = | ||
| RuntimeInformation::Status::lazilyMaterializedCompleted; | ||
| const_cast<IndexScan&>(rightScan).runtimeInfo().status_ = | ||
| RuntimeInformation::Status::lazilyMaterializedCompleted; |
There was a problem hiding this comment.
include a helper function in the joinHelpers class used above
s.t. you can write setStatusToLazilyCompleted(leftScan, rightScan) instead of this four redundant lines.
Overview
Conformance check failed ❌Test Status Changes 📊
|


Summary
Extends the existing block-level prefiltering mechanism (currently only in
Join) to three additional join-like operations:OptionalJoin,MultiColumnJoin, andMinus.This enables significant performance improvements when these operations have IndexScan children by filtering index blocks before reading them, reducing unnecessary I/O.
Key Changes
Infrastructure
CompressedRelationReader::getBlocksForJointo support multi-column filtering (2-3 columns) using tuple-based comparisonJoinWithIndexScanHelpers.hwith shared infrastructure for prefiltering across different join semantics (Inner/Optional/Minus)IndexScanmethods public (getMetadataForScan,getLazyScan) for use by join operationsImplementation
OptionalJoin prefiltering: Only right child can be prefiltered (left must be complete due to optional semantics)
computeResultForTwoIndexScans: Both children are IndexScanscomputeResultForIndexScanOnRight: Right is IndexScan, left materializedcomputeResultForIndexScanOnRightLazy: Right is IndexScan, left lazyMultiColumnJoin prefiltering: Both children can be prefiltered (inner join semantics)
Minus prefiltering: Only right child can be prefiltered (similar constraints to Optional)
Technical Details
firstTriple,lastTriple)shared_ptrwrappers for move-only generator types in lambdasTests
Added comprehensive unit tests that verify both correctness and prefiltering application:
OptionalJoin::prefilteringWithTwoIndexScansMultiColumnJoin::prefilteringWithTwoIndexScansMinus::prefilteringWithTwoIndexScansAll tests create datasets where prefiltering can reduce blocks read, verify result correctness, and confirm runtime information is properly tracked.
Files Modified
src/index/CompressedRelation.{h,cpp}- Multi-column block filteringsrc/engine/JoinWithIndexScanHelpers.h- New shared helper infrastructuresrc/engine/IndexScan.h- Made methods publicsrc/engine/OptionalJoin.{h,cpp}- Prefiltering implementationsrc/engine/MultiColumnJoin.{h,cpp}- Prefiltering implementationsrc/engine/Minus.{h,cpp}- Prefiltering implementationtest/engine/OptionalJoinTest.cpp- New testtest/MultiColumnJoinTest.cpp- New testtest/MinusTest.cpp- New testTesting
All new tests pass:
🤖 Generated with Claude Code
Co-Authored-By: Claude [email protected]