
Add block prefiltering for OptionalJoin, MultiColumnJoin, and Minus #2673

Open
joka921 wants to merge 10 commits into ad-freiburg:master from joka921:prefiltered-optional

@joka921 joka921 commented Jan 28, 2026

Summary

Extends the existing block-level prefiltering mechanism (currently only in Join) to three additional join-like operations: OptionalJoin, MultiColumnJoin, and Minus.

When these operations have IndexScan children, index blocks that cannot contribute to the result are filtered out before they are read, which reduces unnecessary I/O.

Key Changes

Infrastructure

  • Extended CompressedRelationReader::getBlocksForJoin to support multi-column filtering (2-3 columns) using tuple-based comparison
  • Created JoinWithIndexScanHelpers.h with shared infrastructure for prefiltering across different join semantics (Inner/Optional/Minus)
  • Made IndexScan methods public (getMetadataForScan, getLazyScan) for use by join operations

Implementation

  • OptionalJoin prefiltering: Only right child can be prefiltered (left must be complete due to optional semantics)

    • computeResultForTwoIndexScans: Both children are IndexScans
    • computeResultForIndexScanOnRight: Right is IndexScan, left materialized
    • computeResultForIndexScanOnRightLazy: Right is IndexScan, left lazy
  • MultiColumnJoin prefiltering: Both children can be prefiltered (inner join semantics)

    • Same method structure as OptionalJoin
    • Filters both sides when both are IndexScans
  • Minus prefiltering: Only right child can be prefiltered (similar constraints to Optional)

    • Same method structure as OptionalJoin
    • Only right side filtered to maintain MINUS semantics

Technical Details

  • Multi-column block filtering uses tuple-based comparison on block metadata (firstTriple, lastTriple)
  • Prefiltering is applied when one or both children are IndexScans (detected as direct children)
  • Semantic constraints respected: For OPTIONAL and MINUS, only the right child is prefiltered to maintain correct semantics
  • Supports both lazy and materialized inputs with appropriate prefiltering strategies
  • Const-correctness handled via shared_ptr wrappers for move-only generator types in lambdas
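The tuple-based comparison on block metadata mentioned above can be illustrated with a minimal sketch. The `BlockMetadata` struct, the `Id` alias, and `blocksThatMayContain` are hypothetical stand-ins for this illustration and are not the types or functions from `CompressedRelation.{h,cpp}`; only the names `firstTriple` and `lastTriple` are taken from the description. The sketch assumes a two-column join key and relies on the lexicographic ordering of `std::tuple`.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical stand-ins; the real QLever metadata types differ.
using Id = std::uint64_t;
using Triple = std::tuple<Id, Id, Id>;

struct BlockMetadata {
  Triple firstTriple;  // smallest triple stored in the block
  Triple lastTriple;   // largest triple stored in the block
};

// Project a triple onto its first two columns (the assumed join columns).
inline std::tuple<Id, Id> firstTwo(const Triple& t) {
  return std::make_tuple(std::get<0>(t), std::get<1>(t));
}

// Return the indices of all blocks whose [first, last] range, restricted to
// two columns, can contain `key`. std::tuple compares lexicographically.
inline std::vector<std::size_t> blocksThatMayContain(
    const std::vector<BlockMetadata>& blocks, const std::tuple<Id, Id>& key) {
  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < blocks.size(); ++i) {
    if (firstTwo(blocks[i].firstTriple) <= key &&
        key <= firstTwo(blocks[i].lastTriple)) {
      result.push_back(i);
    }
  }
  return result;
}
```

A block is kept whenever the key lies inside its range, so the filter can only produce false positives (blocks that are read but contain no match), never false negatives.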

Tests

Added comprehensive unit tests that verify both correctness and prefiltering application:

  • OptionalJoin::prefilteringWithTwoIndexScans
  • MultiColumnJoin::prefilteringWithTwoIndexScans
  • Minus::prefilteringWithTwoIndexScans

All tests create datasets where prefiltering can reduce blocks read, verify result correctness, and confirm runtime information is properly tracked.

Files Modified

  • src/index/CompressedRelation.{h,cpp} - Multi-column block filtering
  • src/engine/JoinWithIndexScanHelpers.h - New shared helper infrastructure
  • src/engine/IndexScan.h - Made methods public
  • src/engine/OptionalJoin.{h,cpp} - Prefiltering implementation
  • src/engine/MultiColumnJoin.{h,cpp} - Prefiltering implementation
  • src/engine/Minus.{h,cpp} - Prefiltering implementation
  • test/engine/OptionalJoinTest.cpp - New test
  • test/MultiColumnJoinTest.cpp - New test
  • test/MinusTest.cpp - New test

Testing

All new tests pass:

[  PASSED  ] OptionalJoin.prefilteringWithTwoIndexScans (68 ms)
[  PASSED  ] MultiColumnJoin.prefilteringWithTwoIndexScans (68 ms)
[  PASSED  ] Minus.prefilteringWithTwoIndexScans (65 ms)

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

Extends the existing block-level prefiltering mechanism (currently only in Join) to three additional join-like operations: OptionalJoin, MultiColumnJoin, and Minus.

Key changes:
- Extended CompressedRelationReader::getBlocksForJoin to support multi-column filtering (2-3 columns)
- Created JoinWithIndexScanHelpers.h with shared infrastructure for prefiltering across different join semantics
- Implemented prefiltering for OptionalJoin (only right child can be prefiltered due to semantic constraints)
- Implemented prefiltering for MultiColumnJoin (both children can be prefiltered)
- Implemented prefiltering for Minus (only right child can be prefiltered due to semantic constraints)
- Made IndexScan methods (getMetadataForScan, getLazyScan) public for use by join operations
- Added comprehensive unit tests for all three operations

Technical details:
- Multi-column block filtering uses tuple-based comparison on block metadata
- Prefiltering is applied when one or both children are IndexScans (detected as direct children)
- For OPTIONAL and MINUS, only the right child is prefiltered to maintain correct semantics
- For inner joins (MultiColumnJoin), both sides can be prefiltered
- Supports both lazy and materialized inputs with appropriate prefiltering strategies

Tests added:
- OptionalJoin::prefilteringWithTwoIndexScans
- MultiColumnJoin::prefilteringWithTwoIndexScans
- Minus::prefilteringWithTwoIndexScans

All tests verify correctness of results and confirm IndexScan prefiltering is applied.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

codecov bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 31.15438% with 495 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.56%. Comparing base (27eef33) to head (8c5a93b).
⚠️ Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/engine/OptionalJoin.cpp | 30.24% | 109 Missing and 4 partials ⚠️ |
| src/engine/MultiColumnJoin.cpp | 28.38% | 108 Missing and 3 partials ⚠️ |
| src/engine/Minus.cpp | 30.40% | 100 Missing and 3 partials ⚠️ |
| src/index/CompressedRelation.cpp | 36.05% | 87 Missing and 7 partials ⚠️ |
| src/engine/JoinWithIndexScanHelpers.h | 37.50% | 54 Missing and 1 partial ⚠️ |
| src/engine/IndexScan.cpp | 0.00% | 19 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2673      +/-   ##
==========================================
- Coverage   91.56%   90.56%   -1.00%     
==========================================
  Files         480      481       +1     
  Lines       41275    42014     +739     
  Branches     5491     5576      +85     
==========================================
+ Hits        37793    38050     +257     
- Misses       1904     2370     +466     
- Partials     1578     1594      +16     

joka921 and others added 2 commits January 28, 2026 16:47
Extends OptionalJoin prefiltering to handle the case where the left input
is lazy and the right is an IndexScan. This enables prefiltering of the
right IndexScan while ensuring ALL left input is re-yielded (maintaining
OPTIONAL semantics).

Key changes:
- Added IndexScan::prefilterTablesForOptional() method that ensures all
  left input is re-yielded without filtering
- Added IndexScan::createPrefilteredJoinSideForOptional() helper that
  passes through all input unchanged
- Implemented OptionalJoin::computeResultForIndexScanOnRightLazy() to use
  prefiltering instead of falling back to regular lazy optional join
- The mechanism uses a state machine similar to Join's prefilterTables but
  guarantees all left rows are output (critical for OPTIONAL semantics)

Technical details:
- For OPTIONAL semantics, the left side generator re-yields ALL input
  (never skips any rows, even those without matching blocks)
- The right IndexScan is still prefiltered based on the left's join column
  values, reducing unnecessary block reads
- Uses shared_ptr wrappers for generators to enable const lambda capture in
  runLazyJoinAndConvertToGenerator
- Only supports single join column for now (multi-column support can be
  added later)
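
As a rough illustration of the idea described in this commit, the following sketch selects the right-hand blocks that can contain at least one left join value, while the left input itself is never filtered. `BlockRange` and `selectRightBlocks` are hypothetical names for this example and do not reflect the actual `IndexScan::prefilterTablesForOptional` implementation. It assumes a single join column, sorted left values, and blocks sorted by their first value.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Id = std::uint64_t;

// Hypothetical per-block metadata: smallest and largest join-column value.
struct BlockRange {
  Id first;
  Id last;
};

// Return the indices of right blocks that may contain a left join value.
// Precondition (assumed): `leftJoinValues` is sorted ascending and
// `rightBlocks` is sorted by `first`.
// Note that the left values are only inspected, never dropped: for OPTIONAL
// semantics every left row must still reach the join.
inline std::vector<std::size_t> selectRightBlocks(
    const std::vector<Id>& leftJoinValues,
    const std::vector<BlockRange>& rightBlocks) {
  std::vector<std::size_t> selected;
  std::size_t l = 0;
  for (std::size_t b = 0; b < rightBlocks.size(); ++b) {
    // Advance past left values that lie before this block's range.
    while (l < leftJoinValues.size() &&
           leftJoinValues[l] < rightBlocks[b].first) {
      ++l;
    }
    if (l == leftJoinValues.size()) break;  // no left values remain
    if (leftJoinValues[l] <= rightBlocks[b].last) selected.push_back(b);
  }
  return selected;
}
```

With left values `{1, 5, 20}` and right block ranges `[0,2]`, `[3,4]`, `[5,9]`, `[10,15]`, `[18,25]`, only the first, third, and fifth blocks would be read.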

Test added:
- OptionalJoin::prefilteringWithLazyLeftAndIndexScanRight
  Verifies correctness with lazy left input and IndexScan right, ensuring
  all 20 left rows are output with 10 matches and 10 UNDEFs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Similar to the OptionalJoin implementation, this enables prefiltering
when the left input is lazy and the right is an IndexScan. The key
difference from regular joins is that for MINUS semantics, we must
process ALL left input to correctly determine which rows should be
excluded.

This implementation reuses the IndexScan::prefilterTablesForOptional
method which passes through all left rows while prefiltering the right
IndexScan based on block metadata. The MINUS vs OPTIONAL semantics
difference is handled by the MinusRowHandler, not by the block-level
prefiltering.
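
The point that OPTIONAL and MINUS differ only in how rows are handled, not in which left rows are consumed, can be shown with a toy sketch. The functions `optionalRows` and `minusRows` are hypothetical and operate on plain values instead of QLever's tables; they are not the `MinusRowHandler` from the code base.

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

using Id = std::uint64_t;

// OPTIONAL: every left value is emitted; the flag records whether a match
// exists on the right (a real engine would emit the match or UNDEF).
inline std::vector<std::pair<Id, bool>> optionalRows(
    const std::vector<Id>& left, const std::set<Id>& right) {
  std::vector<std::pair<Id, bool>> out;
  for (Id v : left) out.emplace_back(v, right.count(v) > 0);
  return out;
}

// MINUS: a left value is emitted only if it has no match on the right.
inline std::vector<Id> minusRows(const std::vector<Id>& left,
                                 const std::set<Id>& right) {
  std::vector<Id> out;
  for (Id v : left) {
    if (right.count(v) == 0) out.push_back(v);
  }
  return out;
}
```

Both loops visit every left value, which is why the same pass-through prefiltering of the left side can serve both operations in this simplified picture.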

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

@joka921 joka921 left a comment

Some chances to reduce code duplication.

joka921 and others added 4 commits January 29, 2026 08:29
Change function signatures for computeResultForIndexScanOnRight and
computeResultForIndexScanOnRightLazy in OptionalJoin and Minus to accept
`const IndexScan&` instead of `std::shared_ptr<IndexScan>`.

This makes it clearer that the dynamic_pointer_cast has already been
performed by the caller, and follows better C++ practices by avoiding
unnecessary shared_ptr copies.

The const_cast is needed because some IndexScan methods (like getResult
and prefilterTablesForOptional) are not const, but this is acceptable
as the operations do modify the IndexScan's runtime information.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add `convertPrefilteredGenerators` helper function in
JoinWithIndexScanHelpers to eliminate code duplication between
OptionalJoin and Minus when handling lazy left + IndexScan right joins.

This helper handles the common pattern of:
- Creating identity permutation for left side (all columns)
- Creating join column permutation for right side
- Converting Result::LazyResult generators to CachingTransformInputRange
  with IdTableAndFirstCol format

Reduces code duplication by 36 lines while maintaining identical
functionality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add `firstRowHasUndef` helper function in JoinWithIndexScanHelpers to
eliminate duplicate UNDEF checking logic.

This consolidates the pattern of checking if the first row of a table
contains UNDEF values in any of the join columns, which was duplicated
in:
- OptionalJoin::computeResultForIndexScanOnRight
- getBlocksForJoinOfColumnsWithScan (for 1, 2, and 3 column cases)

The helper simplifies the code and makes it more maintainable by
providing a single, clear function for this common check.
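
A helper of this kind might look roughly like the sketch below. The row-major `Table` alias and the `kUndef` sentinel are assumptions made for this example; the actual helper in `JoinWithIndexScanHelpers.h` works on QLever's own table and ID types and may have a different signature.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using Id = std::uint64_t;
// Hypothetical UNDEF sentinel; QLever encodes UNDEF differently.
constexpr Id kUndef = std::numeric_limits<Id>::max();
// Hypothetical row-major table: one inner vector per row.
using Table = std::vector<std::vector<Id>>;

// Return true if the first row of `table` holds UNDEF in any of the given
// join columns. An empty table has no first row, so the result is false.
inline bool firstRowHasUndef(const Table& table,
                             const std::vector<std::size_t>& joinColumns) {
  if (table.empty()) return false;
  const auto& row = table.front();
  return std::any_of(joinColumns.begin(), joinColumns.end(),
                     [&row](std::size_t col) { return row.at(col) == kUndef; });
}
```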

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…oIndexScans

Change `computeResultForTwoIndexScans` in OptionalJoin, Minus, and
MultiColumnJoin to accept `const IndexScan&` parameters instead of
performing dynamic_pointer_cast internally.

This addresses PR review feedback by:
- Making it clear that the cast is done by the caller
- Eliminating duplicate casts inside the functions
- Following the same pattern as computeResultForIndexScanOnRight(Lazy)

The functions now receive references to already-cast IndexScan objects,
reducing code duplication and improving clarity about responsibilities.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@sonarqubecloud

Quality Gate failed

Failed conditions
9.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@joka921 joka921 left a comment

Some comments.

Comment on lines 122 to 127
if (!leftMetaBlocks.has_value()) {
  // If no metadata, fall back to regular computation by returning to caller
  // Caller will handle the regular path
  return {IdTable{getResultWidth(), allocator()}, resultSortedOn(),
          LocalVocab{}};
}

Not needed, can be removed.

Comment on lines 130 to 132
leftBlocks.details().numBlocksAll_ =
    leftMetaBlocks.value().sizeBlockMetadata_;

Also not needed, can be removed.

Comment on lines 133 to 136
// Get filtered blocks for the right (optional) side based on left's ranges
auto rightBlocks =
    getBlocksForJoinOfTwoScans(leftScan, rightScan, _joinColumns.size());

The right blocks are only the second entry in the result; extract that index directly here.

Comment on lines 139 to 146
// Create result generator
// Wrap generators in shared_ptr to allow const lambda capture
auto leftBlocksPtr =
    std::make_shared<CompressedRelationReader::IdTableGeneratorInputRange>(
        std::move(leftBlocks));
auto rightBlocksPtr =
    std::make_shared<CompressedRelationReader::IdTableGeneratorInputRange>(
        std::move(rightBlocks[1]));

In the src/util directory, add a helper function
toSharedPtr, so that you can write auto rightBlocksPtr = ad_utility::toSharedPtr(std::move(leftBlocks)).
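
One possible shape for such a helper is sketched below. This is only an assumption about what the reviewer has in mind; the `sketch` namespace is used here to avoid suggesting that this is the actual `ad_utility` code. The helper deduces the value type, so call sites no longer need to spell out the long generator type.

```cpp
#include <memory>
#include <type_traits>
#include <utility>

namespace sketch {
// Move (or copy) `value` into a freshly allocated shared_ptr. The element
// type is deduced, which also works for move-only types such as generators.
template <typename T>
std::shared_ptr<std::decay_t<T>> toSharedPtr(T&& value) {
  return std::make_shared<std::decay_t<T>>(std::forward<T>(value));
}
}  // namespace sketch

// Usage with a move-only type:
//   auto ptr = sketch::toSharedPtr(std::make_unique<int>(7));
//   const auto f = [ptr]() { return **ptr; };  // copyable, const-callable
```

Because the lambda captures a `shared_ptr` instead of the move-only object itself, the lambda stays copyable and can be invoked through a `const` call operator.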

Comment on lines 109 to 111
Result OptionalJoin::computeResultForTwoIndexScans(
    bool requestLaziness, const IndexScan& leftScan,
    const IndexScan& rightScan) const {

The argument should not be const IndexScan& but IndexScan&, then you can get rid of all the ugly const_casts below.

Comment on lines 155 to 157
auto leftConverted = qlever::joinWithIndexScanHelpers::convertGenerator(
    std::move(*leftBlocksPtr), const_cast<IndexScan&>(leftScan));
auto rightConverted = qlever::joinWithIndexScanHelpers::convertGenerator(

With a using namespace qlever::join...Helpers, this gets even more readable.

Comment on lines 164 to 167
const_cast<IndexScan&>(leftScan).runtimeInfo().status_ =
    RuntimeInformation::Status::lazilyMaterializedCompleted;
const_cast<IndexScan&>(rightScan).runtimeInfo().status_ =
    RuntimeInformation::Status::lazilyMaterializedCompleted;

Include a helper function in the joinHelpers class used above
so that you can write setStatusToLazilyCompleted(leftScan, rightScan) instead of these four redundant lines.
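
A sketch of how such a helper could be written, assuming the scans are passed as non-const references as suggested in the earlier comment. The `Status`, `RuntimeInfo`, and `Scan` types below are simplified, hypothetical stand-ins for `RuntimeInformation` and `IndexScan`; only the helper name and the `runtimeInfo().status_` access pattern come from the review.

```cpp
// Hypothetical, simplified stand-ins for the real QLever classes.
enum class Status { notStarted, lazilyMaterializedCompleted };

struct RuntimeInfo {
  Status status_ = Status::notStarted;
};

struct Scan {
  RuntimeInfo info_;
  RuntimeInfo& runtimeInfo() { return info_; }
};

// Set the status of every passed operation. Taking non-const references
// makes const_cast unnecessary at the call site. Uses a C++17 fold
// expression over the comma operator.
template <typename... Ops>
void setStatusToLazilyCompleted(Ops&... ops) {
  ((ops.runtimeInfo().status_ = Status::lazilyMaterializedCompleted), ...);
}
```

The variadic form handles one, two, or more operations with a single call, for example `setStatusToLazilyCompleted(leftScan, rightScan)`.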

@sparql-conformance

Overview

| Number of Tests | Passed ✅ | Intended ✅ | Failed ❌ | Not tested |
|---|---|---|---|---|
| 547 | 445 | 68 | 34 | 0 |

Conformance check failed ❌

Test Status Changes 📊

| Number of Tests | Previous Status | Current Status |
|---|---|---|
| 5 | Passed | Failed |
| 5 | Intended | Failed |

Details: https://qlever.dev/sparql-conformance-ui?cur=661eb655231675b1f613f39e757ab0a02fcf7c3f&prev=8c2d7c0ae8710cd555004525bedd27ffac060b1b
