
Conversation

@murali-db commented Jan 22, 2026

Summary

Implement proper handling of column names containing dots (literal column names vs. nested fields) for Iceberg server-side planning. This PR adds backtick escaping for literal dotted columns in projections while correctly using raw field names for filters (where Iceberg's binding logic handles disambiguation automatically).

Background

Column names can contain dots in various scenarios:

  • Unity Catalog tables with dots in column names
  • Flattened nested schemas
  • Data from systems that allow dots in column names (e.g., address.city as a single column name)

It's important to distinguish between two cases (see the schema sketch below):

  • Literal dotted column names: address.city is a single top-level column
  • Nested field references: address.intCol refers to field intCol within struct address
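
For concreteness, here is a minimal Spark schema sketch containing both cases (the field layout is illustrative, not the PR's actual test schema):

```scala
import org.apache.spark.sql.types._

// Illustrative schema: "address.city" is one top-level column, while
// "address" is a struct whose field is reached as address.intCol.
val schema = StructType(Seq(
  StructField("address.city", StringType),     // literal dotted column name
  StructField("address", StructType(Seq(
    StructField("intCol", IntegerType)         // nested field reference
  )))
))
```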

Key Insights

After thorough investigation of both Spark and Iceberg behavior:

  1. Projections: Need backtick escaping to distinguish the two cases in the JSON projection list

    • Example: ["`address.city`", "address.intCol"]
    • Escaped literal columns vs. unescaped nested field paths
  2. Filters: Don't need escaping; Iceberg's schema-aware binding handles it (see the sketch after this list)

    • Iceberg Expression API uses raw field names: Expressions.equal("address.city", value)
    • Binder.bind(schema, expression) resolves ambiguity using schema structure
    • Binding checks for a literal field match first, then falls back to nested-path parsing
    • JSON sent over HTTP also uses raw field names (no escaping)
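
A sketch of that binding behavior against Iceberg's expression API; the schema and field IDs are illustrative, and the comments restate this PR's finding about binding order rather than a documented Iceberg guarantee:

```scala
import org.apache.iceberg.Schema
import org.apache.iceberg.expressions.{Binder, Expressions}
import org.apache.iceberg.types.Types

// Illustrative Iceberg schema with both a literal dotted column and a nested struct.
val schema = new Schema(
  Types.NestedField.required(1, "address.city", Types.StringType.get()),
  Types.NestedField.optional(2, "address",
    Types.StructType.of(Types.NestedField.required(3, "intCol", Types.IntegerType.get())))
)

// Raw field name, no backticks; per this PR's finding, binding tries the
// literal field first and only then falls back to nested-path parsing.
val unbound = Expressions.equal("address.city", "Amsterdam")
val bound = Binder.bind(schema.asStruct(), unbound, true /* caseSensitive */)
```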

Implementation

Projection Column Escaping

File: ServerSidePlannedTable.scala (only file changed)

Added logic to escape literal dotted columns in projections (sketched after this list):

  • escapeProjectedColumns(): Processes required schema fields
  • escapeColumnNameIfNeeded(): Recursively handles nested structures
  • Logic: if a column name contains dots and exists as a top-level field in the schema, wrap it in backticks
  • Nested field access (e.g., parent.child) is handled recursively without escaping the parent
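
A minimal sketch of these two methods; only the names escapeProjectedColumns and escapeColumnNameIfNeeded come from this PR, and the bodies below are assumed (they follow the "escape any dotted segment" rule the later commits describe):

```scala
import org.apache.spark.sql.types.{StructField, StructType}

object ProjectionEscapingSketch {
  // Flatten the required schema into the projection strings sent to planScan().
  def escapeProjectedColumns(requiredSchema: StructType): Seq[String] =
    requiredSchema.fields.toSeq.flatMap(f => escapeColumnNameIfNeeded(f, prefix = ""))

  // Recurse through structs, backtick-escaping any name segment that contains a dot.
  def escapeColumnNameIfNeeded(field: StructField, prefix: String): Seq[String] = {
    val segment = if (field.name.contains(".")) s"`${field.name}`" else field.name
    val path = if (prefix.isEmpty) segment else s"$prefix.$segment"
    field.dataType match {
      case struct: StructType =>
        // Nested access stays plain dot notation (parent.child); dotted leaves
        // become parent.`child.name`.
        struct.fields.toSeq.flatMap(child => escapeColumnNameIfNeeded(child, path))
      case _ => Seq(path)
    }
  }
}
```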

Filter Handling

File: SparkToIcebergExpressionConverter.scala (no changes needed)

Filters correctly use raw field names (no escaping):

  • All filter operators use plain column names as-is
  • Iceberg's Binder.bind() resolves literal vs. nested using schema
  • Server-side binding prioritizes literal field names over nested parsing
  • No client-side escaping needed

Test Coverage

Comprehensive Test Suite ✅

  1. TestSchemas.scala: Defines schema with BOTH literal dotted columns AND nested structs

    • Literal: address.city, a.b.c, location.state, etc.
    • Nested: address struct with intCol, metadata struct with stringCol
  2. SparkToIcebergExpressionConverterSuite.scala:

    • 17 test cases covering all operators with dotted columns
    • Verifies raw field names are used (no backticks in Expression objects)
  3. IcebergRESTCatalogPlanningClientSuite.scala:

    • Tests filter/projection sent to REST server
    • Verifies Binder.bind() correctly resolves dotted names
    • Tests both literal and nested dotted columns
  4. ServerSidePlannedTableSuite.scala:

    • End-to-end tests with filter and projection pushdown
    • 12 tests covering full query execution

Testing

✅ All unit tests pass:

  • 22 iceberg tests (SparkToIcebergExpressionConverterSuite, IcebergRESTCatalogPlanningClientSuite)
  • 12 spark tests (ServerSidePlannedTableSuite)
  • Total: 34 tests passing
  • Scalastyle checks: 0 errors, 0 warnings
  • Tested with Java 17

Files Changed

Single file modified:

  • ServerSidePlannedTable.scala: Added projection escaping logic (+51 lines)

Total: 1 file changed, 51 insertions(+), 1 deletion(-)

…n Iceberg filter conversion

Add comprehensive test coverage for Iceberg filter conversion when column
names contain dots. Validates that Iceberg correctly handles both nested
field access and literal column names containing dots.

Test Coverage:
- Literal column names with dots (e.g., "address.city" as single column name)
- All filter operators: equality, comparison, null checks, IN, string operations
- Logical operators (AND, OR) with mixed column names
- Distinction between nested field access vs literal column names with dots

Key Finding:
Iceberg's expression API already correctly handles literal column names
containing dots without requiring special escaping. The schema can contain
both nested structs (address.intCol) and literal dotted names (address.city),
and Iceberg distinguishes them correctly.

Changes:
- Add test schema with literal column names containing dots
- Add 17 new test cases covering all filter operators with dotted names
- Update test documentation to clarify nested vs literal column names
@murali-db force-pushed the escape-dots-in-column-names branch from 79b7a6e to 3c20876 on January 22, 2026 12:50
@murali-db changed the title from "[Server-Side Planning] Escape dots in column names for Iceberg filter pushdown" to "[Server-Side Planning] Add test coverage for column names with dots in Iceberg filter conversion" on Jan 22, 2026
@murali-db changed the title from "[Server-Side Planning] Add test coverage for column names with dots in Iceberg filter conversion" to "[Server-Side Planning] Escape dots in column names when sending to Iceberg REST API" on Jan 22, 2026
…umn names

Add test coverage for column names containing dots (literal column names, not nested fields).
Tests verify that Iceberg correctly handles both literal dotted column names (e.g., address.city)
and nested field references (e.g., address.intCol) without requiring backtick escaping.

Key insight: Iceberg's internal schema and REST API handle dotted column names natively.
Backticks are only needed in SQL/parser contexts, not in Iceberg expressions or REST protocol.

Changes:
- Add literal dotted column names to TestSchemas (address.city, a.b.c, location.state, etc.)
- Add 17 test cases in SparkToIcebergExpressionConverterSuite covering all operators
- Update IcebergRESTCatalogPlanningClientSuite.populateTestData to include all 21 fields
- Add test case for literal dotted column name in filter+projection

All tests pass with Java 17.
@murali-db force-pushed the escape-dots-in-column-names branch from 48aa85e to fb9ad0e on January 22, 2026 14:04
@murali-db changed the title from "[Server-Side Planning] Escape dots in column names when sending to Iceberg REST API" to "[Server-Side Planning] Add comprehensive test coverage for dotted column names" on Jan 22, 2026
@murali-db changed the title from "[Server-Side Planning] Add comprehensive test coverage for dotted column names" to "[Server-Side Planning] Column names containing period" on Jan 22, 2026
@murali-db force-pushed the escape-dots-in-column-names branch from a0cba46 to 06c4eaf on January 23, 2026 12:23
…n names

Add backtick escaping for column names containing dots when sending
projections to Iceberg REST API. This distinguishes between:
- Literal dotted columns: "address.city" as a single field -> "`address.city`"
- Nested field access: address.intCol (parent.child) -> "address.intCol"

Implementation:
- Added escapeProjectedColumns() to process required schema fields
- Added escapeColumnNameIfNeeded() for recursive nested field handling
- Escaping happens in ServerSidePlannedTable before calling planScan()
- No changes to filter conversion (Iceberg's Binder handles disambiguation)

All tests passing (34 total: 22 iceberg + 12 spark)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@murali-db force-pushed the escape-dots-in-column-names branch from 06c4eaf to 8dee476 on January 23, 2026 12:26
murali-db and others added 4 commits January 23, 2026 12:34
Use a.b.c (which is unambiguous - no struct named 'a') instead of address.city
for the literal dotted column test case to make the test clearer.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
…Type directly

Instead of using fieldNames (which only gives top-level names), traverse
the StructType directly to build flattened dot-notation paths. This correctly
handles nested structs by recursively flattening them.

Example: If requiredSchema has struct 'address' with field 'intCol',
we now correctly generate "address.intCol" instead of just "address".

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
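
With the hypothetical ProjectionEscapingSketch from the description above, the difference this commit describes looks like:

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// A required schema containing struct 'address' with field 'intCol'.
val required = StructType(Seq(
  StructField("address", StructType(Seq(StructField("intCol", IntegerType))))
))
// Traversing the StructType yields the full path, not just the top-level name:
ProjectionEscapingSketch.escapeProjectedColumns(required)  // Seq("address.intCol")
```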
…s with dots

CRITICAL FIX: The previous implementation only escaped dotted field names at the
top level. This fails for nested structs that have fields with dots in their names.

Example: parent (struct) { "child.name" (string) }
Previous: would generate "parent.child.name" (ambiguous!)
Fixed: now generates "parent.`child.name`" (correctly escaped)

Changes:
1. Simplified flattenSchema() logic: ANY field with dots gets escaped,
   regardless of nesting level
2. Simplified test schema: removed redundant dotted columns, added critical
   test case for nested field with dots (parent."child.name")
3. Updated all test cases to reference fields that exist in simplified schema

All tests passing (34 total: 22 iceberg + 12 spark)
Scalastyle: 0 errors, 0 warnings

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
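
The same hypothetical sketch, applied to this commit's nested-dotted-leaf example:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Struct 'parent' containing a field literally named "child.name".
val tricky = StructType(Seq(
  StructField("parent", StructType(Seq(StructField("child.name", StringType))))
))
// Escaping at every nesting level yields "parent.`child.name`",
// never the ambiguous "parent.child.name":
ProjectionEscapingSketch.escapeProjectedColumns(tricky)  // Seq("parent.`child.name`")
```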
…ility

- Move escapeProjectedColumns and flattenSchema methods to companion object
- Make methods package-private for direct unit testing
- Add comprehensive unit test covering essential dotted field patterns:
  * Top-level with dots (e.g., `a.b.c`)
  * Normal nested (e.g., parent.child)
  * Multi-level nested (e.g., level1.level2.level3)
  * Nested with dotted leaf (e.g., data.`field.name`)
  * Struct with dots (e.g., `root.struct`.value)
- Add test to verify Spark's behavior with struct columns
- Restore metadata.stringCol test case alongside parent.child.name
- All tests passing (14 spark + 22 iceberg = 36 total)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
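
A hedged ScalaTest-style sketch of such a unit test; the suite name and expected values are assumptions built on the hypothetical ProjectionEscapingSketch above, not the actual test file:

```scala
import org.apache.spark.sql.types._
import org.scalatest.funsuite.AnyFunSuite

class ProjectionEscapingSketchSuite extends AnyFunSuite {
  test("essential dotted field patterns") {
    val schema = StructType(Seq(
      StructField("a.b.c", IntegerType),                  // top-level with dots
      StructField("parent", StructType(Seq(
        StructField("child", StringType)))),              // normal nested
      StructField("data", StructType(Seq(
        StructField("field.name", StringType)))),         // nested with dotted leaf
      StructField("root.struct", StructType(Seq(
        StructField("value", IntegerType))))              // struct with dots
    ))
    assert(ProjectionEscapingSketch.escapeProjectedColumns(schema) ===
      Seq("`a.b.c`", "parent.child", "data.`field.name`", "`root.struct`.value"))
  }
}
```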