
Conversation

@hantangwangd
Member

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== NO RELEASE NOTE ==

@sourcery-ai
Contributor

sourcery-ai bot commented Dec 14, 2025

Reviewer's Guide

Adds support for an optional sorted_by argument to the Iceberg rewrite_data_files distributed procedure. The requested sort order is validated for compatibility with the table’s internal sort order and carried through the distributed procedure handle, so rewritten files are produced in that order. Tests are extended to cover sort-order behavior and the new argument wiring.

Sequence diagram for the begin flow of rewrite_data_files with sorted_by

sequenceDiagram
    participant U as User
    participant S as SQLParserPlanner
    participant P as TableDataRewriteDistributedProcedure
    participant R as RewriteDataFilesProcedure
    participant Ctx as IcebergProcedureContext
    participant L as IcebergTableLayoutHandle
    participant T as Table
    participant H as IcebergDistributedProcedureHandle

    U->>S: CALL iceberg.system.rewrite_data_files(schema, table, filter, sorted_by, options)
    S->>P: begin(session, procedureContext, tableLayoutHandle, arguments)
    P->>P: locate schemaIndex, tableNameIndex, filterIndex, sortOrderIndex
    P->>R: begin(session, procedureContext, tableLayoutHandle, arguments, sortOrderIndex)

    R->>Ctx: getTable()
    Ctx-->>R: Table
    R->>L: getTable()
    L-->>R: IcebergTableHandle

    R->>T: sortOrder()
    T-->>R: SortOrder tableSortOrder

    alt sorted_by argument present
        R->>R: read arguments[sortOrderIndex]
        R->>R: parseSortFields(schema, sortFieldStrings)
        R->>R: specifiedSortOrder.satisfies(tableSortOrder)
        alt compatible sort order
            R->>R: sortOrder = specifiedSortOrder
        else incompatible
            R->>R: throw PrestoException(NOT_SUPPORTED)
        end
    else no sorted_by argument
        R->>R: sortOrder = tableSortOrder or empty
    end

    R->>R: getSupportedSortFields(schema, sortOrder)
    R->>H: new IcebergDistributedProcedureHandle(..., sortFields, tableLayoutHandle, relevantData)
    R-->>P: ConnectorDistributedProcedureHandle
    P-->>S: ConnectorDistributedProcedureHandle
    S-->>U: Distributed procedure handle for rewrite task execution
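The alt/else branches of the begin flow can be sketched in plain Java. This is a minimal illustration, not the PR's code: `SortKey` and `SortOrderResolver` are hypothetical stand-ins for Iceberg's sort classes, and the sketch assumes that "compatible" means the table's sort fields form a prefix of the requested ones, which matches the accepted (extended order) and rejected (conflicting direction or leading column) cases listed in the tests.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified sort field: column name plus direction only.
// Real Iceberg sort fields also carry a transform and a null order.
final class SortKey {
    final String column;
    final boolean ascending;

    SortKey(String column, boolean ascending) {
        this.column = column;
        this.ascending = ascending;
    }

    boolean matches(SortKey other) {
        return column.equals(other.column) && ascending == other.ascending;
    }
}

final class SortOrderResolver {
    private SortOrderResolver() {}

    // True when `specified` begins with every key of `required`, in order.
    // An empty `required` order is satisfied by any order.
    static boolean satisfies(List<SortKey> specified, List<SortKey> required) {
        if (required.size() > specified.size()) {
            return false;
        }
        for (int i = 0; i < required.size(); i++) {
            if (!specified.get(i).matches(required.get(i))) {
                return false;
            }
        }
        return true;
    }

    // With sorted_by present: validate it against the table order.
    // Without it: fall back to the table's own order (possibly empty).
    static List<SortKey> resolve(Optional<List<SortKey>> sortedBy, List<SortKey> tableOrder) {
        if (!sortedBy.isPresent()) {
            return tableOrder;
        }
        List<SortKey> specified = sortedBy.get();
        if (!satisfies(specified, tableOrder)) {
            // The procedure raises NOT_SUPPORTED; a standard exception stands in for it here.
            throw new UnsupportedOperationException(
                    "Specified sort order is incompatible with the table sort order");
        }
        return specified;
    }
}
```

Under these assumptions, a table sorted by `a ASC` accepts `a ASC, b DESC` but rejects `a DESC` or `b ASC, a ASC`.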

Class diagram for the updated rewrite_data_files distributed procedure and handle

classDiagram
    class TableDataRewriteDistributedProcedure {
        <<class>>
        +static String SCHEMA
        +static String TABLE_NAME
        +static String FILTER
        +static String SORT_ORDER
        -BeginCallDistributedProcedure beginCallDistributedProcedure
        -FinishCallDistributedProcedure finishCallDistributedProcedure
        -int schemaIndex
        -int tableNameIndex
        -OptionalInt filterIndex
        -OptionalInt sortOrderIndex
        +TableDataRewriteDistributedProcedure(String schema, String name, List~Argument~ arguments, BeginCallDistributedProcedure beginCallDistributedProcedure, FinishCallDistributedProcedure finishCallDistributedProcedure)
        +ConnectorDistributedProcedureHandle begin(ConnectorSession session, ConnectorProcedureContext procedureContext, ConnectorTableLayoutHandle tableLayoutHandle, Object[] arguments)
        +String getSchema(Object[] parameters)
        +String getTableName(Object[] parameters)
        +String getFilter(Object[] parameters)
        +OptionalInt getSortOrderIndex()
    }

    class BeginCallDistributedProcedure {
        <<interface>>
        +ConnectorDistributedProcedureHandle begin(ConnectorSession session, ConnectorProcedureContext procedureContext, ConnectorTableLayoutHandle tableLayoutHandle, Object[] arguments, OptionalInt sortOrderIndex)
    }

    class FinishCallDistributedProcedure {
        <<interface>>
        +void finish(ConnectorSession session, ConnectorProcedureContext procedureContext, ConnectorTableHandle tableHandle, Collection~ShardInfo~ fragments)
    }

    class RewriteDataFilesProcedure {
        <<class>>
        +DistributedProcedure get()
        -ConnectorDistributedProcedureHandle beginCallDistributedProcedure(ConnectorSession session, IcebergProcedureContext procedureContext, IcebergTableLayoutHandle layoutHandle, Object[] arguments, OptionalInt sortOrderIndex)
    }

    class IcebergDistributedProcedureHandle {
        <<class>>
        -IcebergTableLayoutHandle tableLayoutHandle
        -Map~String, String~ relevantData
        +IcebergDistributedProcedureHandle(String schemaName, String tableName, String tableLocation, List~String~ dataColumns, List~String~ partitionColumns, String fileFormat, HiveCompressionCodec compressionCodec, Map~String, String~ storageProperties, List~SortField~ sortOrder, IcebergTableLayoutHandle tableLayoutHandle, Map~String, String~ relevantData)
    }

    class SortOrder {
        <<class>>
        +boolean satisfies(SortOrder other)
    }

    class SortField {
        <<class>>
    }

    class IcebergProcedureContext {
        <<class>>
        +Table getTable()
    }

    class IcebergTableLayoutHandle {
        <<class>>
        +IcebergTableHandle getTable()
    }

    class IcebergTableHandle {
        <<class>>
        +String getSchemaName()
        +String getIcebergTableName()
        +String getTableLocation()
    }

    class Table {
        <<class>>
        +SortOrder sortOrder()
        +Schema schema()
        +Map~String, String~ properties()
    }

    TableDataRewriteDistributedProcedure ..> BeginCallDistributedProcedure : uses
    TableDataRewriteDistributedProcedure ..> FinishCallDistributedProcedure : uses
    BeginCallDistributedProcedure <|.. RewriteDataFilesProcedure : implements

    RewriteDataFilesProcedure ..> IcebergProcedureContext : uses
    RewriteDataFilesProcedure ..> IcebergTableLayoutHandle : uses
    RewriteDataFilesProcedure ..> IcebergDistributedProcedureHandle : creates
    RewriteDataFilesProcedure ..> SortOrder : uses
    RewriteDataFilesProcedure ..> SortField : uses
    RewriteDataFilesProcedure ..> Table : uses

    IcebergProcedureContext ..> Table : returns
    IcebergTableLayoutHandle ..> IcebergTableHandle : returns
    Table ..> SortOrder : returns

    IcebergDistributedProcedureHandle ..> IcebergTableLayoutHandle : has
    IcebergDistributedProcedureHandle ..> SortField : has
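The way the sorted_by position travels from the procedure definition into the begin callback can be sketched as follows. This is a reduced illustration under stated assumptions: `BeginCallback` and `RewriteProcedureSketch` are hypothetical names, the real callback also receives the session, procedure context, and table layout handle, and the argument names other than `sorted_by` are guesses used only for the example.

```java
import java.util.List;
import java.util.OptionalInt;

// Hypothetical, reduced form of the begin callback: only the raw arguments
// and the optional position of sorted_by are kept.
@FunctionalInterface
interface BeginCallback {
    String begin(Object[] arguments, OptionalInt sortOrderIndex);
}

final class RewriteProcedureSketch {
    static final String SORT_ORDER = "sorted_by";

    private final OptionalInt sortOrderIndex;
    private final BeginCallback callback;

    RewriteProcedureSketch(List<String> argumentNames, BeginCallback callback) {
        // Record where sorted_by sits in the declared argument list, if it is declared at all.
        int index = argumentNames.indexOf(SORT_ORDER);
        this.sortOrderIndex = index >= 0 ? OptionalInt.of(index) : OptionalInt.empty();
        this.callback = callback;
    }

    // Forward the raw arguments together with the resolved index.
    String begin(Object[] arguments) {
        return callback.begin(arguments, sortOrderIndex);
    }

    // Example connector-side callback: read the argument only when the index is present.
    static String describe(Object[] arguments, OptionalInt sortOrderIndex) {
        if (sortOrderIndex.isPresent() && arguments[sortOrderIndex.getAsInt()] != null) {
            return "sorted by " + arguments[sortOrderIndex.getAsInt()];
        }
        return "table sort order";
    }
}
```

Keeping the index as an `OptionalInt` lets procedures that never declare sorted_by reuse the same callback shape without a sentinel value such as -1.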

File-Level Changes

Change Details Files
Support optional sorted_by argument for table data rewrite distributed procedures and thread its index through the begin callback.
  • Extend TableDataRewriteDistributedProcedure to recognize a new sorted_by argument name and store its position as an OptionalInt
  • Modify the begin() method to pass the resolved sortOrderIndex into the connector-specific begin callback
  • Update the BeginCallDistributedProcedure functional interface signature to include the OptionalInt sortOrderIndex parameter
  • Adjust various test-only TableDataRewriteDistributedProcedure usages and lambdas to accept the new parameter
presto-spi/src/main/java/com/facebook/presto/spi/procedure/TableDataRewriteDistributedProcedure.java
presto-tests/src/test/java/com/facebook/presto/tests/TestProcedureCreation.java
presto-analyzer/src/test/java/com/facebook/presto/sql/analyzer/TestBuiltInQueryPreparer.java
presto-main-base/src/test/java/com/facebook/presto/sql/analyzer/AbstractAnalyzerTest.java
presto-main-base/src/test/java/com/facebook/presto/sql/planner/TestLogicalPlanner.java
Implement sorted_by handling in Iceberg rewrite_data_files procedure with validation against the table’s internal sort order and propagation to the distributed procedure handle.
  • Add sorted_by array(varchar) as an optional argument to the Iceberg rewrite_data_files procedure definition
  • Change the Iceberg beginCallDistributedProcedure callback signature to accept the OptionalInt sortOrderIndex and extract the corresponding argument from the raw arguments array
  • Parse user-provided sort field strings into an Iceberg SortOrder and validate that it satisfies the table’s existing internal sort order, otherwise throw a NOT_SUPPORTED error
  • Derive supported SortField list from the resolved SortOrder and pass it into the IcebergDistributedProcedureHandle constructor so downstream rewrite can use it
presto-iceberg/src/main/java/com/facebook/presto/iceberg/procedure/RewriteDataFilesProcedure.java
Extend IcebergDistributedProcedureHandle to carry the chosen sort order fields instead of always passing an empty list.
  • Update the IcebergDistributedProcedureHandle constructor to accept a List<SortField> sortOrder JSON property
  • Pass the provided sortOrder list to the super constructor instead of an empty ImmutableList
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergDistributedProcedureHandle.java
Add integration tests to verify rewrite_data_files behavior with sorted_by on regular, partitioned, and pre-sorted Iceberg tables, including compatibility checks.
  • Add tests that call system.rewrite_data_files with sorted_by on non-partitioned tables and assert that all data files are merged into one and sorted ascending or descending as requested
  • Add tests on partitioned tables to verify each partition’s files are individually sorted according to the specified order and the file count per partition remains expected
  • Add a test ensuring that when the table already has a compatible internal sort order, specifying a compatible extended sort order is accepted and preserves the primary sort column ordering
  • Add a test asserting that specifying an incompatible sorted_by (conflicting direction or leading column) fails with the expected NOT_SUPPORTED message
presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java
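
The per-file ordering assertions described for the tests come down to checking that consecutive rows never violate the requested order. A minimal sketch of such a check, assuming the rows of one data file have already been read into a list (`SortedCheck` is a hypothetical helper, not part of the test base):

```java
import java.util.Comparator;
import java.util.List;

final class SortedCheck {
    private SortedCheck() {}

    // True when no adjacent pair of rows is out of order under `order`.
    // Equal neighbours are allowed, so duplicate keys do not fail the check.
    static <T> boolean isSorted(List<T> rows, Comparator<? super T> order) {
        for (int i = 1; i < rows.size(); i++) {
            if (order.compare(rows.get(i - 1), rows.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }
}
```

For a descending sorted_by, the same helper can be called with `Comparator.reverseOrder()`; for partitioned tables it would be applied to each file separately, since ordering is only expected within a file.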


@hantangwangd force-pushed the support_sort_order_for_rewrite branch 2 times, most recently from cf6071b to abbc90a, on December 18, 2025 08:17
@hantangwangd linked an issue on Dec 18, 2025 that may be closed by this pull request
@hantangwangd force-pushed the support_sort_order_for_rewrite branch from abbc90a to 4ff8a33 on December 20, 2025 10:28

Development

Successfully merging this pull request may close these issues.

Support sorted_by argument for rewrite_data_files procedure

1 participant