Spark, Flink: Add null engineSchema fallback for format model writers #15688

Merged
pvary merged 8 commits into apache:main from joyhaldar:engine-schema-null-fallback
Mar 20, 2026

joyhaldar (Contributor) commented Mar 19, 2026

When engineSchema is not set, format model writers now derive it from the Iceberg schema instead of failing with a null error.

Discussed in Add TCK for File Format API.

Spark (4.1, 4.0, 3.5, 3.4): Added a SparkAvroWriter constructor with a null fallback and updated SparkFormatModels. Parquet already had this fallback; ORC ignores engineSchema.

Flink (2.1, 2.0, 1.20): Added null fallback in FlinkAvroWriter, FlinkParquetWriters, and FlinkOrcWriter. Updated FlinkFormatModels.

Tests: Added testDataWriterEngineWriteWithoutEngineSchema and testEqualityDeleteWriterEngineWriteWithoutEngineSchema in BaseFormatModelTests to cover the null engineSchema path. Existing tests with explicit engineSchema are kept.

Part of #15415
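
A minimal sketch of the fallback described above, written in isolation so it runs without Iceberg, Spark, or Flink on the classpath. This is an illustration under stated assumptions, not the code from this PR: `TableSchema`, `EngineSchema`, `SketchWriter`, and `convert` are hypothetical stand-ins for the Iceberg schema, the engine schema, a format model writer, and the schema conversion utility.

```java
import java.util.List;
import java.util.Objects;

final class NullFallbackSketch {
  // Hypothetical stand-in for an Iceberg schema: an ordered list of column names.
  static final class TableSchema {
    private final List<String> columns;

    TableSchema(List<String> columns) {
      this.columns = List.copyOf(columns);
    }

    List<String> columns() {
      return columns;
    }
  }

  // Hypothetical stand-in for an engine schema such as Spark's StructType.
  static final class EngineSchema {
    private final List<String> fields;

    EngineSchema(List<String> fields) {
      this.fields = List.copyOf(fields);
    }

    List<String> fields() {
      return fields;
    }
  }

  // Hypothetical writer that needs an engine schema to do its work.
  static final class SketchWriter {
    private final EngineSchema engineSchema;

    SketchWriter(EngineSchema engineSchema) {
      this.engineSchema = Objects.requireNonNull(engineSchema, "engineSchema");
    }

    // Use the caller's engine schema when present; otherwise derive one
    // from the table schema instead of failing on null.
    SketchWriter(TableSchema tableSchema, EngineSchema engineSchema) {
      this(engineSchema != null ? engineSchema : convert(tableSchema));
    }

    // Assumed conversion: one engine field per table column, same name and order.
    static EngineSchema convert(TableSchema tableSchema) {
      return new EngineSchema(tableSchema.columns());
    }

    EngineSchema engineSchema() {
      return engineSchema;
    }
  }

  public static void main(String[] args) {
    TableSchema table = new TableSchema(List.of("id", "data"));
    SketchWriter writer = new SketchWriter(table, null);
    System.out.println(writer.engineSchema().fields()); // prints [id, data]
  }
}
```

Keeping the single-argument constructor strict and placing the fallback only in the two-argument overload means callers that already pass an engine schema see no change in behavior.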

pvary (Contributor) commented Mar 19, 2026

Update the tests to remove the unnecessary engineSchema settings

joyhaldar (Contributor, Author) commented

> Update the tests to remove the unnecessary engineSchema settings

Removing .engineSchema() from BaseFormatModelTests will also break older Spark and Flink versions that don't have the fallback yet. Should I wait until backports are done?

  }

  public SparkAvroWriter(org.apache.iceberg.Schema icebergSchema, StructType engineSchema) {
    this(engineSchema != null ? engineSchema : SparkSchemaUtil.convert(icebergSchema));

Contributor commented

Do we have tests to cover both the null and not null path?

joyhaldar (Contributor, Author) commented Mar 20, 2026

Thank you for the review @huaxingao.

I have added two new tests in BaseFormatModelTests that cover the null path based on your and @pvary's suggestions, testDataWriterEngineWriteWithoutEngineSchema and testEqualityDeleteWriterEngineWriteWithoutEngineSchema. The existing tests with explicit engineSchema are also present.
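
As a rough illustration of covering both paths, the sketch below runs the same write once with an explicit engine schema and once with null, then checks that both produce the same output. Everything in it is hypothetical: `write`, `check`, and the name=value row encoding are stand-ins invented for this example and are not the test code in BaseFormatModelTests.

```java
import java.util.ArrayList;
import java.util.List;

final class EngineSchemaPathsSketch {
  static final List<String> COLUMNS = List.of("id", "data");
  static final List<List<Object>> ROWS =
      List.of(List.<Object>of(1, "a"), List.<Object>of(2, "b"));
  static final List<String> EXPECTED = List.of("id=1,data=a", "id=2,data=b");

  // Hypothetical writer: renders each row as name=value pairs, taking the
  // field names from the engine schema or, when it is null, from the table columns.
  static List<String> write(
      List<String> tableColumns, List<String> engineFields, List<List<Object>> rows) {
    List<String> fields = engineFields != null ? engineFields : tableColumns;
    List<String> out = new ArrayList<>();
    for (List<Object> row : rows) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < fields.size(); i++) {
        if (i > 0) {
          sb.append(',');
        }
        sb.append(fields.get(i)).append('=').append(row.get(i));
      }
      out.add(sb.toString());
    }
    return out;
  }

  static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  // Non-null path: the caller supplies the engine schema explicitly.
  static void testWriteWithEngineSchema() {
    check(EXPECTED.equals(write(COLUMNS, List.of("id", "data"), ROWS)), "explicit schema path");
  }

  // Null path: the engine schema is omitted and must be derived.
  static void testWriteWithoutEngineSchema() {
    check(EXPECTED.equals(write(COLUMNS, null, ROWS)), "null schema path");
  }

  public static void main(String[] args) {
    testWriteWithEngineSchema();
    testWriteWithoutEngineSchema();
    System.out.println("both paths passed");
  }
}
```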

pvary (Contributor) commented Mar 20, 2026

> Removing .engineSchema() from BaseFormatModelTests will also break older Spark and Flink versions that don't have the fallback yet. Should I wait until backports are done?

We can update all of the Spark and Flink versions in one PR, and then we can have a test where it is removed.

Also, reflecting on @huaxingao’s comment, I think we should keep the other tests as well.

github-actions bot added the data label Mar 20, 2026

joyhaldar (Contributor, Author) commented

> > Removing .engineSchema() from BaseFormatModelTests will also break older Spark and Flink versions that don't have the fallback yet. Should I wait until backports are done?
>
> We can update all of the Spark and Flink versions in one PR, and then we can have a test where it is removed.
>
> Also, reflecting on @huaxingao’s comment, I think we should keep the other tests as well.

Thank you Péter. I have applied the fallback to all Spark and Flink versions. I have also added new tests for the null engineSchema path while keeping existing tests that set it explicitly.

Comment on lines +169 to +179
    // Read back and verify
    InputFile inputFile = encryptedFile.encryptingOutputFile().toInputFile();
    List<Record> readRecords;
    try (CloseableIterable<Record> reader =
        FormatModelRegistry.readBuilder(fileFormat, Record.class, inputFile)
            .project(schema)
            .build()) {
      readRecords = ImmutableList.copyOf(reader);
    }

    DataTestHelpers.assertEquals(schema.asStruct(), genericRecords, readRecords);

Contributor commented

Could we generalize this to a method?
I think @Guosmilesmile already did this in his #15633 PR

joyhaldar (Contributor, Author) commented

I have extracted a readAndAssertGenericRecords helper, inspired by @Guosmilesmile's writeGenericRecords pattern in #15633.
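
A helper of this kind could look roughly like the sketch below, which reads every row inside a try-with-resources block and then compares the result with the expected rows. The names `CloseableRows`, `openReader`, and `readAndAssert` are hypothetical and exist only for this example; the actual readAndAssertGenericRecords helper and the Iceberg reader API are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class ReadAndAssertSketch {
  // Hypothetical stand-in for a closeable reader over rows.
  interface CloseableRows<T> extends Iterable<T>, AutoCloseable {
    @Override
    void close(); // narrowed so callers need not handle a checked exception
  }

  // Hypothetical in-memory "file": iterates over rows that were stored earlier.
  static <T> CloseableRows<T> openReader(List<T> stored) {
    return new CloseableRows<T>() {
      @Override
      public Iterator<T> iterator() {
        return stored.iterator();
      }

      @Override
      public void close() {
        // nothing to release for an in-memory list
      }
    };
  }

  // Reads all rows, closes the reader, then compares against the expected rows.
  static <T> void readAndAssert(List<T> expected, CloseableRows<T> source) {
    List<T> actual = new ArrayList<>();
    try (CloseableRows<T> reader = source) {
      for (T row : reader) {
        actual.add(row);
      }
    }
    if (!expected.equals(actual)) {
      throw new AssertionError("expected " + expected + " but read " + actual);
    }
  }

  public static void main(String[] args) {
    List<String> rows = List.of("a", "b");
    readAndAssert(rows, openReader(rows));
    System.out.println("read-back matched");
  }
}
```

Pulling the read-back and comparison into one method lets each test state only what it wrote and what it expects to read.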

pvary merged commit 2874fc4 into apache:main Mar 20, 2026
34 checks passed

pvary (Contributor) commented Mar 20, 2026

Merged to main.
Thanks @joyhaldar for the PR and @huaxingao for the review!

3 participants