[feature](reader) Optimize Complex Type Column Reading with Column Pruning #57204
base: master
1c99dc6 to d47ffd5
run buildall
5642997 to 3fc502e
run buildall
3627661 to 3647221
run buildall
ClickBench: Total hot run time: 29.15 s
34a95f7 to 087f4e0
run buildall
ClickBench: Total hot run time: 28.24 s
0d12c7d to 33c5e80
run buildall
33c5e80 to f059d14
run buildall
ClickBench: Total hot run time: 27.8 s
f059d14 to bb96ea9
TPC-H: Total hot run time: 34014 ms
TPC-DS: Total hot run time: 187859 ms
ClickBench: Total hot run time: 28.3 s
Pull Request Overview
This PR implements column pruning for complex types (Struct, Array, Map) to optimize read performance by selectively reading only the required sub-columns instead of reading entire complex type fields.
Key changes:
- Added FE logic to calculate and track access paths for complex type fields
- Implemented BE selective reading using columnAccessPath information from FE
- Added session variable enable_prune_nested_column to control the feature
- Added thrift structures (TColumnAccessPath, TDataAccessPath, TMetaAccessPath) to pass pruning information between FE and BE
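To make the FE-side idea of "access paths" concrete, here is a minimal Java sketch of collecting them from sub-column references. It is an illustration only: the class `AccessPathSketch`, its methods, and the dotted-string input format are hypothetical and are not taken from this PR, which passes paths through the thrift structures listed above. The sketch assumes that a reference to a whole column (or to a parent field) makes any deeper path under it redundant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not the classes used in the PR. */
final class AccessPathSketch {
    /** Splits a dotted reference such as "s.a" into path segments ["s", "a"]. */
    static List<String> toPath(String dottedReference) {
        return Arrays.asList(dottedReference.split("\\."));
    }

    /** Returns true if prefix is a leading sub-list of path (or equal to it). */
    static boolean isPrefix(List<String> prefix, List<String> path) {
        return prefix.size() <= path.size()
                && path.subList(0, prefix.size()).equals(prefix);
    }

    /**
     * Keeps only the minimal set of paths: a path is dropped when a strictly
     * shorter path already covers it, or when it repeats an earlier path.
     */
    static List<List<String>> minimalPaths(List<String> references) {
        List<List<String>> paths = new ArrayList<>();
        for (String ref : references) {
            paths.add(toPath(ref));
        }
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < paths.size(); i++) {
            List<String> candidate = paths.get(i);
            boolean covered = false;
            for (int j = 0; j < paths.size() && !covered; j++) {
                if (i == j) {
                    continue;
                }
                List<String> other = paths.get(j);
                boolean coveredByParent = other.size() < candidate.size() && isPrefix(other, candidate);
                boolean duplicateSeenEarlier = j < i && other.equals(candidate);
                covered = coveredByParent || duplicateSeenEarlier;
            }
            if (!covered) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

Under these assumptions, passing the references s.a, s.a.x, t and s.a leaves two paths, one for s.a and one for t, because s.a.x is covered by its parent and the second s.a is a repeat.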
Reviewed Changes
Copilot reviewed 126 out of 153 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| Descriptors.thrift | Added access path structures for column pruning |
| descriptors.proto | Added protobuf definitions for access paths |
| SessionVariable.java | Added enable_prune_nested_column session variable |
| PlanNode.java | Added printNestedColumns method and renamed typo method |
| SlotDescriptor.java | Added access path fields and getters/setters |
| LogicalOlapScan.java | Implemented SupportPruneNestedColumn interface |
| LogicalFileScan.java | Added nested column pruning support |
| Multiple FE rules | Added PushDownProject and NestedColumnPruning rules |
| Multiple BE files | Implemented selective column reading logic |
Files not reviewed (1)
- .idea/vcs.xml: Language not supported
Comments suppressed due to low confidence (9)
gensrc/thrift/Descriptors.thrift:1
- The comment states 'only access the keys' but should say 'only access the values' for the VALUES case.
gensrc/proto/descriptors.proto:1
- The comment states 'only access the keys' but should say 'only access the values' for the VALUES case.
fe/fe-core/src/main/java/org/apache/doris/planner/PlanNode.java:1
- Corrected typo in method name from 'getplanNodeExplainString' to 'getPlanNodeExplainString'.
fe/fe-core/src/main/java/org/apache/doris/planner/PlanNode.java:1
- The null check for slot.getDisplayAllAccessPaths() is duplicated on lines 944 and 945. Remove one of the duplicate checks.
fe/fe-core/src/main/java/org/apache/doris/planner/PlanNode.java:1
- The null check for slot.getDisplayPredicateAccessPaths() is duplicated on lines 966 and 967. Remove one of the duplicate checks.
fe/fe-core/src/main/java/org/apache/doris/nereids/trees/TreeNode.java:1
- The method was incorrectly calling foreach instead of foreachUp, causing infinite recursion or incorrect behavior. The fix correctly calls foreachUp to maintain the bottom-up traversal order.
be/test/vec/exec/format/table/hive/hive_reader_test.cpp:1
- Comment contains Chinese characters. Should be in English: 'profile uses STRUCT type'.
be/test/vec/exec/format/table/hive/hive_reader_test.cpp:1
- Comment contains Chinese characters. Should be in English: 'profile uses STRUCT type'.
be/test/vec/exec/format/table/hive/hive_reader_test.cpp:1
- Test name contains typo 'rrc' instead of 'orc'. Should be 'read_hive_orc_file'.
| @@ -1,7 +0,0 @@ | |||
| <?xml version="1.0" encoding="UTF-8"?> | |||
| <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> | |||
| <svg xmlns="http://www.w3.org/2000/svg" version="1.1" width="512px" height="512px" style="shape-rendering:geometricPrecision; text-rendering:geometricPrecision; image-rendering:optimizeQuality; fill-rule:evenodd; clip-rule:evenodd" xmlns:xlink="http://www.w3.org/1999/xlink"> | |||
Add this folder back.
| /** AccessPathInfo */ | ||
| @Data | ||
| @AllArgsConstructor |
It is better not to use Lombok.
| private List<TColumnAccessPath> allAccessPaths; | ||
| private List<TColumnAccessPath> predicateAccessPaths; |
Add a comment explaining what allAccessPaths and predicateAccessPaths are.
| private List<TColumnAccessPath> allAccessPaths; | ||
| private List<TColumnAccessPath> predicateAccessPaths; | ||
| private List<TColumnAccessPath> displayAllAccessPaths; | ||
| private List<TColumnAccessPath> displayPredicateAccessPaths; |
Could they be final?
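The three suggestions above (no Lombok, documented fields, final fields) can be combined in a small hand-written holder. The following is only a sketch under stated assumptions: `AccessPathInfoSketch` is a hypothetical stand-in for the class under review, the type parameter `P` stands in for TColumnAccessPath so the example compiles on its own, and the field descriptions are guesses based on the field names, not confirmed semantics.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for AccessPathInfo; P stands in for TColumnAccessPath. */
final class AccessPathInfoSketch<P> {
    /** Assumed meaning: every path the query reads from the column. */
    private final List<P> allAccessPaths;
    /** Assumed meaning: the subset of paths referenced by predicates. */
    private final List<P> predicateAccessPaths;

    AccessPathInfoSketch(List<P> allAccessPaths, List<P> predicateAccessPaths) {
        // Defensive copies so later changes to the caller's lists are not visible here.
        this.allAccessPaths = Collections.unmodifiableList(new ArrayList<>(allAccessPaths));
        this.predicateAccessPaths = Collections.unmodifiableList(new ArrayList<>(predicateAccessPaths));
    }

    List<P> getAllAccessPaths() {
        return allAccessPaths;
    }

    List<P> getPredicateAccessPaths() {
        return predicateAccessPaths;
    }
}
```

Whether the real fields can be final depends on whether the planner fills them in after construction, which this excerpt does not show.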
| @@ -185,6 +186,8 @@ public int compare(TFileRangeDesc o1, TFileRangeDesc o2) { | |||
| } | |||
| output.append(String.format("numNodes=%s", numNodes)).append("\n"); | |||
| printNestedColumns(output, prefix, getTupleDesc()); | |||
Remove this print.
| private <C extends Collection<E>, E extends Expression> Pair<Boolean, C> replaceExpressions( | ||
| C expressions, boolean propagateType, boolean fillAccessPaths) { | ||
| ImmutableCollection.Builder<E> newExprs; | ||
| if (expressions instanceof List) { |
Do we need to check that C is not a queue?
| } | ||
| private Expression replaceSlot(Expression e, boolean fillAccessPath) { | ||
| return MoreFieldsThread.keepFunctionSignature(false, |
So this may change the function's signature? What happens if we encounter round(struct_element(x, 'a'), 3) where x comes from another struct_element?
| StatementContext statementContext = jobContext.getCascadesContext().getStatementContext(); | ||
| SessionVariable sessionVariable = statementContext.getConnectContext().getSessionVariable(); |
Do we need a null check here?
| return result; | ||
| } | ||
| /** DataTypeAccessTree */ |
Add a comment explaining each variable.
| } | ||
| /** DataTypeAccessTree */ | ||
| public static class DataTypeAccessTree { |
This class needs a unit test.
| "Disable debug points. please check config::enable_debug_points"); | ||
| } | ||
| std::string result = status.to_json(); | ||
| LOG(INFO) << "handle request result:" << result; |
Why was this code deleted?
| Status MapFileColumnIterator::init(const ColumnIteratorOptions& opts) { | ||
| if (_reading_flag == ReadingFlag::SKIP_READING) { | ||
| LOG(INFO) << "Map column iterator column " << _column_name << " skip reading."; |
Better to use DLOG(INFO).
| create_block_with_nested_columns(Block(arguments), numbers, false); | ||
| auto return_type = get_return_type_impl( | ||
| ColumnsWithTypeAndName(nested_block.begin(), nested_block.end())); | ||
| if (!return_type) { |
In what case is the return type nullptr?
| @@ -0,0 +1,172 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
The license header of this file is wrong
.gitmodules
Outdated
| path = contrib/apache-orc | ||
| url = https://github.com/apache/doris-thirdparty.git | ||
| branch = orc | ||
| branch = cq_nested_column_prune_external_table |
Need to merge into orc's main branch?
yes
1f09aae to 69d64fd
run buildall
69d64fd to ecbd3dd
run buildall
…uning. Co-authored-by: 924060929 <[email protected]> Co-authored-by: Jerry Hu <[email protected]>
ecbd3dd to f50cf6b
run buildall
What problem does this PR solve?
Problem Summary:
Release note
Optimize Complex Type Column Reading with Column Pruning
Description
This PR implements column pruning for complex types (Struct, Array, Map) to optimize read performance. Previously, Doris would read entire complex type fields before processing, which was simple to implement but inefficient when only specific sub-columns were needed.
Key changes:
FE (Frontend): Added column access path calculation and type pruning
BE (Backend): Added selective column reading
Why
Performance Improvement: When a struct contains hundreds or thousands of columns but the query only accesses a few sub-columns, this optimization can significantly reduce I/O and improve query performance. For example, with struct<int a, int b> s, when only s.a is referenced, we can avoid reading s.b entirely.
Technical Benefits: Reduces unnecessary data scanning and decoding overhead for complex types, aligning with Doris's continuous performance optimization goals.
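The struct<int a, int b> example can be sketched as a small type-pruning routine. This is a hedged illustration, not the PR's implementation: `StructPruneSketch` and its map-based type model (field name mapped to either a primitive type name or a nested map) are hypothetical, and paths are given relative to the struct, so s.a becomes the single-segment path a.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a struct is a map from field name to a type name or a nested struct map. */
final class StructPruneSketch {
    /**
     * Returns a copy of struct that keeps only the fields reached by some path.
     * A path that ends at a field keeps that field whole; a longer path recurses
     * into the nested struct and prunes it in turn.
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> prune(Map<String, Object> struct, List<List<String>> paths) {
        Map<String, Object> pruned = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : struct.entrySet()) {
            List<List<String>> remainders = new ArrayList<>();
            boolean touched = false;
            boolean keepWhole = false;
            for (List<String> path : paths) {
                if (path.isEmpty() || !path.get(0).equals(field.getKey())) {
                    continue;
                }
                touched = true;
                if (path.size() == 1) {
                    keepWhole = true;
                } else {
                    remainders.add(path.subList(1, path.size()));
                }
            }
            if (!touched) {
                continue; // never referenced, so the reader could skip this sub-column
            }
            Object type = field.getValue();
            if (keepWhole || !(type instanceof Map)) {
                pruned.put(field.getKey(), type);
            } else {
                pruned.put(field.getKey(), prune((Map<String, Object>) type, remainders));
            }
        }
        return pruned;
    }
}
```

With a struct holding int fields a and b and the single path a, the result contains only a, which corresponds to skipping s.b in the description above. Array and Map types would need their own handling, which this sketch leaves out.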
TODO & Future Optimizations
- array_size() operations
- != null checks
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)