-
Notifications
You must be signed in to change notification settings - Fork 3.6k
[enhancement](jni)Adjust the statistical time of JNI appenddata. #58224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 35174 ms |
TPC-DS: Total hot run time: 187734 ms |
ClickBench: Total hot run time: 27.24 s |
FE UT Coverage ReportIncrement line coverage `` 🎉 |
| break; | ||
| } | ||
| StructLike row = reader.next(); | ||
| records.add(reader.next()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May this consume more memory? Cause we have to save all records before doing the append?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think it will consume much memory, because at most there will only be one batch of data at a time. The reason for storing this data before calling append is that I think some data-retrieval methods might be lazily loaded, so I separated the reader.next() call from the append method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes the statistical time measurement for JNI appendData operations across all JniScanner implementations. Previously, PaimonJniScanner called System.nanoTime() multiple times per batch (rows × columns × 2), which significantly increased CPU usage. The refactoring changes the timing approach to measure once per batch, and extends consistent timing measurement to all JniScanner instances.
Key changes:
- Refactored Paimon scanners to separate data reading from data appending, enabling batch-level timing measurement
- Added
appendDataTimetiming statistics to all JniScanner implementations (MaxCompute, Mock, Iceberg, Hudi, Avro, Trino) - Reduced timing overhead from O(rows × columns) System.nanoTime() calls to O(batches) calls
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| TrinoConnectorJniScanner.java | Aggregates per-column timing statistics into total appendDataTime in close() method |
| PaimonSysTableJniScanner.java | Refactored to separate record reading (readNextBatch()) from processing, enabling batch-level timing |
| PaimonJniScanner.java | Similar refactoring as PaimonSysTableJniScanner, with batch-level timing measurement |
| MaxComputeJniScanner.java | Added batch-level timing measurement around the data appending loop |
| MockJniScanner.java | Added batch-level timing measurement for test/mock scanner |
| IcebergSysTableJniScanner.java | Refactored to collect records first, then append with batch-level timing |
| HadoopHudiJniScanner.java | Refactored to separate record deserialization from appending, with batch-level timing |
| AvroJNIScanner.java | Refactored with new readNextBatch() method and batch-level timing measurement |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // for (int i = 0; i < requiredFields.length; i++) { | ||
| // Object fieldData = rowRecord.get(requiredFields[i]); | ||
| // if (fieldData == null) { | ||
| // appendData(i, null); | ||
| // } else { | ||
| // AvroColumnValue fieldValue = new AvroColumnValue(fieldInspectors[i], fieldData); | ||
| // appendData(i, fieldValue); | ||
| // } | ||
| // } |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented-out code. This code block appears to be old implementation that has been replaced by the refactored version in the getNext() method. Commented code should be removed rather than left in the codebase.
| // for (int i = 0; i < requiredFields.length; i++) { | |
| // Object fieldData = rowRecord.get(requiredFields[i]); | |
| // if (fieldData == null) { | |
| // appendData(i, null); | |
| // } else { | |
| // AvroColumnValue fieldValue = new AvroColumnValue(fieldInspectors[i], fieldData); | |
| // appendData(i, fieldValue); | |
| // } | |
| // } |
| int rows = 0; | ||
| private List<InternalRow> readNextBatch() throws IOException { | ||
| List<InternalRow> records = new ArrayList<>(); | ||
| // int rows = 0; |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented-out code. The comment // int rows = 0; serves no purpose and should be deleted.
| // int rows = 0; |
| List<Object> records = new ArrayList<>(); | ||
| int numRows = 0; |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The records list is created but remains empty when fields.length == 0 (virtual table case). While this doesn't cause incorrect behavior, it's inefficient to create an unused list. Consider moving the list creation inside the if (fields.length > 0) block to avoid unnecessary allocation.
| List<Object> records = new ArrayList<>(); | |
| int numRows = 0; | |
| int numRows = 0; | |
| List<Object> records = null; | |
| if (fields.length > 0) { | |
| records = new ArrayList<>(); | |
| } |
What problem does this PR solve?
Problem Summary:
PR #49688 adds a profile statistics interface for
appendDataTimeto JNI.This PR has two main benefits:
JniScannerinstances;PaimonJniScanner. The old method calledSystem.nanoTimemultiple times (rows * columns *2 times), which significantly increased CPU usage and reduced overall performance.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)