feat(native-pos) Use HashTable caching in Broadcast joins #26806

shrinidhijoshi · 2025-12-15T05:38:28Z

Summary:
Velox introduced the ability to cache hash tables in the HashBuild operator.
This is useful for broadcast joins because the hash table can be built once per worker
and reused by all the join tasks that land on that worker.

Velox PRs: facebookincubator/velox#15754
and facebookincubator/velox#15768

This diff sets useCachedHashTable=true during
Velox HashBuild node construction when the join is a
broadcast join.

Differential Revision: D88900941
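
To make the "build once per worker, reuse across tasks" idea concrete, here is a minimal sketch of a build-once cache. It is not Velox's implementation: `HashTable`, `HashTableCache`, `getOrBuild`, and the string cache key are all hypothetical names chosen for illustration, and the real caching lives inside the Velox HashBuild operator (see the Velox PRs above).

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a join hash table; not the Velox type.
struct HashTable {
  int numRows;
};

// Hypothetical per-worker cache: the first caller for a key builds the table,
// later callers for the same key get the already-built instance.
class HashTableCache {
 public:
  using TablePtr = std::shared_ptr<const HashTable>;

  TablePtr getOrBuild(
      const std::string& key,
      const std::function<TablePtr()>& build) {
    // Holding the lock during the build keeps the sketch simple; it also
    // serializes builds for unrelated keys, which a real cache would avoid.
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = tables_.find(key);
    if (it != tables_.end()) {
      return it->second;
    }
    TablePtr table = build();
    tables_.emplace(key, table);
    return table;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, TablePtr> tables_;
};
```

In this sketch every join task on a worker would call `getOrBuild` with the same key, so only the first call pays the build cost. How Velox actually keys and evicts cached tables is not described in this PR.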

sourcery-ai · 2025-12-15T05:38:38Z

Reviewer's Guide

Enables Velox hash table caching for broadcast (replicated) joins in Presto native execution by passing a useCachedHashTable flag on HashJoinNodes for appropriate join types, and wires up new system configs to control broadcast hash table caching and exchange eager fetch behavior, exposing them to Velox query configuration.

Sequence diagram for broadcast join planning with hash table caching

sequenceDiagram
  actor User
  participant PrestoPlanner
  participant VeloxQueryPlanConverterBase as PlanConverter
  participant SystemConfig
  participant core_HashJoinNode as HashJoinNode
  participant VeloxEngine as VeloxEngine

  User->>PrestoPlanner: submit query with broadcast join
  PrestoPlanner->>PlanConverter: toVeloxQueryPlan(joinNode)

  PlanConverter->>SystemConfig: broadcastJoinTableCachingEnabled()
  SystemConfig-->>PlanConverter: bool enabled

  PlanConverter->>PlanConverter: detect REPLICATED distribution
  PlanConverter->>PlanConverter: compute useCachedHashTable = isBroadcastJoin && enabled

  alt useCachedHashTable true
    PlanConverter->>HashJoinNode: new HashJoinNode(..., useCachedHashTable=true)
  else useCachedHashTable false
    PlanConverter->>HashJoinNode: new HashJoinNode(..., useCachedHashTable omitted)
  end

  PlanConverter-->>PrestoPlanner: HashJoinNode wrapped in ProjectNode
  PrestoPlanner-->>VeloxEngine: submit Velox plan

  VeloxEngine->>VeloxEngine: HashBuild uses cached hash table per worker when useCachedHashTable=true
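
The decision step in the diagram (`useCachedHashTable = isBroadcastJoin && enabled`) can be sketched as a small predicate. This is a sketch under assumptions: `JoinDistributionType` and `FakeSystemConfig` are simplified stand-ins for the Presto protocol enum and `SystemConfig`, and `shouldUseCachedHashTable` is a hypothetical name that does not appear in the PR.

```cpp
#include <optional>

// Hypothetical stand-in for the Presto protocol distribution type.
enum class JoinDistributionType { PARTITIONED, REPLICATED };

// Hypothetical stand-in for SystemConfig; the real accessor is
// SystemConfig::broadcastJoinTableCachingEnabled().
struct FakeSystemConfig {
  bool broadcastJoinTableCachingEnabled;
};

// True only when the join is a broadcast (REPLICATED) join and the caching
// flag is on. A missing distribution type is treated as "not broadcast".
inline bool shouldUseCachedHashTable(
    const std::optional<JoinDistributionType>& distributionType,
    const FakeSystemConfig& config) {
  const bool isBroadcastJoin = distributionType.has_value() &&
      *distributionType == JoinDistributionType::REPLICATED;
  return isBroadcastJoin && config.broadcastJoinTableCachingEnabled;
}
```

With the flag on, only a `REPLICATED` join yields `true`; a `PARTITIONED` join or an unset distribution type yields `false`, which corresponds to the "useCachedHashTable omitted" branch in the diagram.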

Class diagram for updated SystemConfig and HashJoinNode usage

classDiagram
  class SystemConfig {
    +static kBroadcastJoinTableCachingEnabled : string_view
    +static kExchangeEagerFetchEnabled : string_view
    +broadcastJoinTableCachingEnabled() bool
    +exchangeEagerFetchEnabled() bool
  }

  class VeloxQueryPlanConverterBase {
    +toVeloxQueryPlan(node SemiJoinNodePtr, tableWriteInfo TableWriteInfo, taskId PrestoTaskId) core_PlanNodePtr
    +toVeloxQueryPlan(node JoinNodePtr, tableWriteInfo TableWriteInfo, taskId PrestoTaskId) core_PlanNodePtr
    +toVeloxQueryPlan(node LeftSemiJoinNodePtr, tableWriteInfo TableWriteInfo, taskId PrestoTaskId) core_PlanNodePtr
    -exprConverter_ ExprConverter
    -typeParser_ TypeParser
  }

  class core_HashJoinNode {
    +id : PlanNodeId
    +joinType : JoinType
    +nullAware : bool
    +leftKeys : vector_ExprPtr
    +rightKeys : vector_ExprPtr
    +filter : ExprPtr
    +left : core_PlanNodePtr
    +right : core_PlanNodePtr
    +outputType : RowTypePtr
    +useCachedHashTable : bool
    +HashJoinNode(id PlanNodeId, joinType JoinType, nullAware bool, leftKeys vector_ExprPtr, rightKeys vector_ExprPtr, filter ExprPtr, left core_PlanNodePtr, right core_PlanNodePtr, outputType RowTypePtr)
    +HashJoinNode(id PlanNodeId, joinType JoinType, nullAware bool, leftKeys vector_ExprPtr, rightKeys vector_ExprPtr, filter ExprPtr, left core_PlanNodePtr, right core_PlanNodePtr, outputType RowTypePtr, useCachedHashTable bool)
  }

  VeloxQueryPlanConverterBase --> SystemConfig : uses
  VeloxQueryPlanConverterBase --> core_HashJoinNode : constructs

  class SemiJoinNode {
    +id : PlanNodeId
    +distributionType : optional_DistributionType
  }

  class JoinNode {
    +id : PlanNodeId
    +distributionType : optional_JoinDistributionType
    +filter : optional_Expr
    +left : PlanNodePtr
    +right : PlanNodePtr
    +outputVariables : vector_Variable
  }

  class LeftSemiJoinNode {
    +id : PlanNodeId
    +distributionType : optional_DistributionType
    +left : PlanNodePtr
    +right : PlanNodePtr
  }

  VeloxQueryPlanConverterBase --> SemiJoinNode : converts
  VeloxQueryPlanConverterBase --> JoinNode : converts
  VeloxQueryPlanConverterBase --> LeftSemiJoinNode : converts

Flow diagram for system config mapping to Velox query config

flowchart LR
  A(SystemConfig properties)
  B(kBroadcastJoinTableCachingEnabled)
  C(kExchangeEagerFetchEnabled)
  D(PrestoToVeloxQueryConfig.updateFromSystemConfigs)
  E(Velox core_QueryConfig)
  F(core_HashBuild and exchanges)

  A --> B
  A --> C
  B --> D
  C --> D
  D --> E
  E --> F
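
The flow above amounts to copying selected system properties into the Velox query config under their Velox names. A minimal sketch of that mapping follows, assuming plain string maps on both sides. Only the `prestoSystemConfig` field name appears in the diff; the `veloxConfig` field, the `ConfigMap` alias, and this free-function form of `updateFromSystemConfigs` are assumptions made for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// One mapping entry. The second field name is hypothetical.
struct ConfigMapping {
  std::string prestoSystemConfig; // key read from the worker's system config
  std::string veloxConfig; // key written into the Velox query config
};

using ConfigMap = std::unordered_map<std::string, std::string>;

// Copies each mapped system property that is present into the result,
// renamed to its Velox key. Unmapped properties are ignored.
inline ConfigMap updateFromSystemConfigs(
    const ConfigMap& systemConfig,
    const std::vector<ConfigMapping>& mappings) {
  ConfigMap veloxQueryConfig;
  for (const auto& mapping : mappings) {
    const auto it = systemConfig.find(mapping.prestoSystemConfig);
    if (it != systemConfig.end()) {
      veloxQueryConfig[mapping.veloxConfig] = it->second;
    }
  }
  return veloxQueryConfig;
}
```

The real code in `PrestoToVeloxQueryConfig.cpp` may also apply defaults or type conversions; none of that is modeled here.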

File-Level Changes

Change	Details	Files
Enable useCachedHashTable on HashJoinNode for broadcast semi/anti joins based on distribution type and system config	Detect semi/anti broadcast joins via DistributionType::REPLICATED on the semi join node; gate hash table caching on a new SystemConfig::broadcastJoinTableCachingEnabled flag; construct HashJoinNode instances with the useCachedHashTable boolean parameter for eligible semi/anti joins; refactor project node construction to reuse a pre-built HashJoinNode pointer instead of inlining the constructor call	`presto-native-execution/presto_cpp/main/types/PrestoToVeloxQueryPlan.cpp`
Enable useCachedHashTable on HashJoinNode for regular broadcast joins based on distribution type and system config	Detect regular broadcast joins via JoinDistributionType::REPLICATED on the join node; gate hash table caching on SystemConfig::broadcastJoinTableCachingEnabled; return a HashJoinNode constructed with useCachedHashTable enabled when conditions are met	`presto-native-execution/presto_cpp/main/types/PrestoToVeloxQueryPlan.cpp`
Enable useCachedHashTable on HashJoinNode for broadcast semi-project joins based on distribution type and system config	Detect broadcast semi-project joins via DistributionType::REPLICATED on the join node; gate hash table caching on SystemConfig::broadcastJoinTableCachingEnabled; return a HashJoinNode with JoinType::kLeftSemiProject and useCachedHashTable enabled when applicable	`presto-native-execution/presto_cpp/main/types/PrestoToVeloxQueryPlan.cpp`
Introduce new system configuration flags for broadcast join hash table caching and exchange eager fetch and expose them to Velox query config	Add kBroadcastJoinTableCachingEnabled and kExchangeEagerFetchEnabled keys to SystemConfig and default them to true; provide SystemConfig::broadcastJoinTableCachingEnabled and SystemConfig::exchangeEagerFetchEnabled accessors; map the Presto exchange-eager-fetch-enabled system property into Velox core::QueryConfig::kExchangeEagerFetchEnabled in PrestoToVeloxQueryConfig	`presto-native-execution/presto_cpp/main/common/Configs.h` `presto-native-execution/presto_cpp/main/common/Configs.cpp` `presto-native-execution/presto_cpp/main/PrestoToVeloxQueryConfig.cpp`

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

The broadcast-join useCachedHashTable logic is duplicated in several places (semi/anti join, regular join, semi-project); consider extracting a small helper or utility function to encapsulate the isBroadcastJoin && broadcastJoinTableCachingEnabled() check and HashJoinNode construction to keep the plan converter logic DRY and easier to maintain.
The new #include "presto_cpp/main/types/PrestoTaskId.h" in PrestoToVeloxQueryPlan.cpp does not appear to be used in this diff; if it’s unnecessary, removing it would keep dependencies minimal.
In the new HashJoinNode constructions, the expression joinType == core::JoinType::kAnti ? true : false can be simplified to joinType == core::JoinType::kAnti to reduce noise and improve readability.
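
The third comment can be checked in isolation: the ternary and the bare comparison produce the same `bool`. The sketch below uses a reduced stand-in for `core::JoinType` with two values only, and the function names are hypothetical.

```cpp
// Hypothetical stand-in for velox core::JoinType, reduced to two values.
namespace core {
enum class JoinType { kLeftSemiFilter, kAnti };
} // namespace core

// Form flagged in the review: the ternary only restates the comparison.
inline bool isAntiVerbose(core::JoinType joinType) {
  return joinType == core::JoinType::kAnti ? true : false;
}

// Simplified form: the comparison already yields a bool.
inline bool isAnti(core::JoinType joinType) {
  return joinType == core::JoinType::kAnti;
}
```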


yingsu00 · 2025-12-18T05:39:50Z

presto-native-execution/presto_cpp/main/PrestoToVeloxQueryConfig.cpp

           }},
+
+      {.prestoSystemConfig =
+           std::string(SystemConfig::kExchangeEagerFetchEnabled),


Why is kExchangeEagerFetchEnabled in this PR?

yingsu00 · 2025-12-18T05:42:15Z

presto-native-execution/presto_cpp/main/common/Configs.h

      "order-by-spill-enabled"};
  static constexpr std::string_view kMaxSpillBytes{"max-spill-bytes"};

+  /// When enabled, hash tables built for broadcast joins are cached and reused


Why is kBroadcastJoinTableCachingEnabled in this PR? Is HashTable caching dependent on it? It seems this change belongs to another PR.

yingsu00 · 2025-12-18T05:53:06Z

presto-native-execution/presto_cpp/main/types/PrestoToVeloxQueryPlan.cpp

    rightKeys.emplace_back(exprConverter_.toVeloxExpr(right));
  }

+  // Check if this is a broadcast join (REPLICATED distribution)


This logic is repeated 3 times. Is it possible to extract a common function e.g. createHashJoinNode(,...) for it?
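
One possible shape for such a helper is sketched below. It is an assumption-laden illustration, not the code in this PR: `HashJoinNode` here is a four-field stand-in for `core::HashJoinNode` (whose real constructor also takes keys, filter, sources, and output type), `DistributionType` and `JoinType` are simplified enums, and the `createHashJoinNode` signature is hypothetical.

```cpp
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-ins for the Presto protocol and Velox types.
enum class DistributionType { PARTITIONED, REPLICATED };
enum class JoinType { kInner, kLeftSemiFilter, kLeftSemiProject, kAnti };

struct HashJoinNode {
  std::string id;
  JoinType joinType;
  bool nullAware;
  bool useCachedHashTable;
};

// Single place that detects a broadcast join, applies the config gate, and
// constructs the node, so the three converters share one code path.
inline std::shared_ptr<const HashJoinNode> createHashJoinNode(
    std::string id,
    JoinType joinType,
    bool nullAware,
    const std::optional<DistributionType>& distributionType,
    bool cachingEnabled) {
  // Comparing an empty optional with a value yields false, so a missing
  // distribution type is treated as "not broadcast".
  const bool isBroadcastJoin =
      distributionType == DistributionType::REPLICATED;
  return std::make_shared<HashJoinNode>(HashJoinNode{
      std::move(id), joinType, nullAware, isBroadcastJoin && cachingEnabled});
}
```

Each of the three `toVeloxQueryPlan` overloads would then pass its own join type and distribution type, with `cachingEnabled` read from `SystemConfig::broadcastJoinTableCachingEnabled()`.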

shrinidhijoshi requested review from a team as code owners December 15, 2025 05:38

prestodb-ci added the from:Meta PR from Meta label Dec 15, 2025

facebook-github-bot added fb-exported meta-exported labels Dec 15, 2025

shrinidhijoshi changed the title ~~[presto_cpp] use hashTable caching in Broadcast joins~~ (native) use hashTable caching in Broadcast joins Dec 15, 2025

shrinidhijoshi changed the title ~~(native) use hashTable caching in Broadcast joins~~ (native-pos) use hashTable caching in Broadcast joins Dec 15, 2025

shrinidhijoshi changed the title ~~(native-pos) use hashTable caching in Broadcast joins~~ feat(native-pos) Use HashTable caching in Broadcast joins Dec 15, 2025

shrinidhijoshi force-pushed the export-D88900941 branch from 4525c88 to 63c1e5a Compare December 15, 2025 05:43

sourcery-ai bot reviewed Dec 15, 2025

yingsu00 reviewed Dec 18, 2025

4 participants
