Conversation

@srawat98-dev (Contributor) commented Nov 26, 2025

Summary

I extended the existing TableStatsCollectionSparkApp with the logic to populate the new openhouseTableCommitEventsPartitions table.

This new table will serve as the partition-level source of truth for commit-related metadata across all OpenHouse datasets, including:

  1. Commit ID (snapshot_id)
  2. Commit timestamp (committed_at)
  3. Commit operation (APPEND, DELETE, OVERWRITE, REPLACE)
  4. Partition data (typed column values for all partition columns)
  5. Spark App ID and Spark App Name
  6. Table identifier (database, table, cluster, location, partition spec)

This enables granular tracking of which partitions were affected by each commit, providing:

  1. Partition-level lineage - Track exactly which partitions changed in each commit
  2. Fine-grained auditing - Monitor data changes at partition granularity
  3. Optimized queries - Query only relevant partitions for specific time ranges
  4. Incremental processing - Identify changed partitions for downstream pipelines

Output

This PR populates the openhouseTableCommitEventsPartitions table by querying the Iceberg all_entries and snapshots metadata tables for all OpenHouse datasets.
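The metadata query described above can be sketched as a join between Iceberg's all_entries and snapshots metadata tables. This is a hypothetical illustration of the query shape, not the PR's actual code; the class name, join keys, and selected columns are assumptions:

```java
// Hypothetical sketch of the metadata query joining Iceberg's all_entries
// and snapshots tables for one dataset. Column names and join keys are
// assumptions based on the Iceberg metadata-table schemas, not the PR's code.
public final class CommitPartitionQueryBuilder {

  /** Builds a SQL query producing one row per (snapshot_id, partition) pair. */
  static String buildQuery(String fqtn) {
    return "SELECT s.snapshot_id, s.committed_at, s.operation, e.data_file.partition "
        + "FROM " + fqtn + ".all_entries e "
        + "JOIN " + fqtn + ".snapshots s ON e.snapshot_id = s.snapshot_id";
  }

  public static void main(String[] args) {
    System.out.println(buildQuery("openhouse.db.tbl"));
  }
}
```

In a Spark app this string would be passed to `spark.sql(...)`; building it separately keeps the query shape easy to inspect.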

Key Features:

  1. One Row Per (Commit, Partition) Pair
  • Creates one CommitEventTablePartitions record for each unique (snapshot_id, partition) combination
  • Example: 1 commit affecting 3 partitions → 3 records
  2. Parallel Execution
  • Runs simultaneously with table stats and commit events collection
  • ~2x performance improvement over sequential execution
  • Uses CompletableFuture for non-blocking parallel processing
  3. Type-Safe Partition Data
  • Partition values are stored as typed ColumnData objects:
  • LongColumnData for Integer/Long values (e.g., year=2024)
  • DoubleColumnData for Float/Double values
  • StringColumnData for String/Date/Timestamp values
  • Runtime type detection using instanceof checks
  4. Robust Error Handling
  • ✅ Unpartitioned tables return an empty list (no errors)
  • ✅ Null values are logged and skipped
  • ✅ Unknown commit operations are set to null with a warning
  • ✅ Invalid partition values are logged and skipped
  • ✅ Timestamp conversion handles both seconds and milliseconds
  5. Stateless Design
  • Processes all active (non-expired) commit-partition pairs on every job run
  • No state tracking between runs (matches existing openhouseTableCommitEvents behavior)
  • Duplicates can occur: the same commit-partition pair may appear in multiple event_timestamp partitions
  • Deduplication is handled at query time in downstream consumers (use DISTINCT or GROUP BY)
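The parallel execution described under feature 2 can be sketched with CompletableFuture. The class name and task bodies below are placeholders for the three real collection steps, not the PR's actual code:

```java
import java.util.concurrent.CompletableFuture;

// Sketch of running the three collection steps concurrently, as the
// "Parallel Execution" feature describes. The class name and task bodies
// are placeholders; the real job runs the actual collection logic.
public final class ParallelCollection {

  /** Launches the three collection steps in parallel and waits for all. */
  static String runAll() {
    CompletableFuture<String> stats =
        CompletableFuture.supplyAsync(() -> "tableStats");
    CompletableFuture<String> commits =
        CompletableFuture.supplyAsync(() -> "commitEvents");
    CompletableFuture<String> partitions =
        CompletableFuture.supplyAsync(() -> "commitEventPartitions");

    // Block until all three finish; any task failure propagates via join().
    CompletableFuture.allOf(stats, commits, partitions).join();
    return stats.join() + "," + commits.join() + "," + partitions.join();
  }

  public static void main(String[] args) {
    System.out.println(runAll());
  }
}
```

Because the tasks are independent, total runtime approaches that of the slowest step rather than the sum of all three, which is where the ~2x speedup comes from.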
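The instanceof-based type detection under feature 3 could look roughly like the following. The ColumnData class names come from the description above, but their shape here is a guess:

```java
// Sketch of the instanceof-based partition value mapping described under
// "Type-Safe Partition Data". The ColumnData hierarchy is hypothetical;
// only the class names come from the PR description.
abstract class ColumnData {}

class LongColumnData extends ColumnData {
  final long value;
  LongColumnData(long v) { value = v; }
}

class DoubleColumnData extends ColumnData {
  final double value;
  DoubleColumnData(double v) { value = v; }
}

class StringColumnData extends ColumnData {
  final String value;
  StringColumnData(String v) { value = v; }
}

final class PartitionValueMapper {
  /** Maps a raw partition value to a typed ColumnData, or null if unsupported. */
  static ColumnData toColumnData(Object raw) {
    if (raw instanceof Integer || raw instanceof Long) {
      return new LongColumnData(((Number) raw).longValue());
    }
    if (raw instanceof Float || raw instanceof Double) {
      return new DoubleColumnData(((Number) raw).doubleValue());
    }
    if (raw instanceof String) {
      return new StringColumnData((String) raw);
    }
    return null; // unsupported types are logged and skipped in the real job
  }
}
```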
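Feature 4 mentions that timestamp conversion handles both seconds and milliseconds. A common heuristic for this keys off magnitude; this is a guess at the approach, and the cutoff value is an assumption rather than the PR's actual rule:

```java
// Sketch of a seconds-vs-milliseconds normalization heuristic for commit
// timestamps. The cutoff constant is an assumption; the real job may use
// a different rule to distinguish the two units.
final class TimestampNormalizer {
  // Epoch seconds stay below 1e11 until roughly the year 5138, so any
  // value at or above this cutoff is treated as already-milliseconds.
  private static final long MILLIS_CUTOFF = 100_000_000_000L;

  /** Returns the timestamp in epoch milliseconds regardless of input unit. */
  static long toEpochMillis(long ts) {
    return ts < MILLIS_CUTOFF ? ts * 1000L : ts;
  }
}
```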

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@srawat98-dev srawat98-dev marked this pull request as ready for review November 26, 2025 07:01
@srawat98-dev srawat98-dev changed the title Add partition-level commit event collection and publishing Implementation[OpenhouseCommitEventTablePartitions]: Add partition-level commit event collection and publishing in TableStatsCollectionSparkApp Nov 26, 2025
@cbb330 (Collaborator) left a comment
Looks like the exact same structure as PR1 in this series, and the business logic is mostly the same. No concern with the delta.

LGTM!
