Conversation

@srawat98-dev (Contributor) commented Nov 26, 2025

Summary

I extended the existing TableStatsCollectionSparkApp with the logic to populate the new openhouseTableCommitEventsPartitions table.

This new table will serve as the partition-level source of truth for commit-related metadata across all OpenHouse datasets, including:

  1. Commit ID (snapshot_id)
  2. Commit timestamp (committed_at)
  3. Commit operation (APPEND, DELETE, OVERWRITE, REPLACE)
  4. Partition data (typed column values for all partition columns)
  5. Spark App ID and Spark App Name
  6. Table identifier (database, table, cluster, location, partition spec)

This enables granular tracking of which partitions were affected by each commit, providing:

  1. Partition-level lineage - Track exactly which partitions changed in each commit
  2. Fine-grained auditing - Monitor data changes at partition granularity
  3. Optimized queries - Query only relevant partitions for specific time ranges
  4. Incremental processing - Identify changed partitions for downstream pipelines

Output

This PR populates the openhouseTableCommitEventsPartitions table by querying the Iceberg all_entries and snapshots metadata tables for all OpenHouse datasets.
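The metadata query described above can be sketched as a join between Iceberg's all_entries and snapshots metadata tables. This is a hypothetical illustration of the query shape, not the PR's actual code; the class name, join keys, and selected columns are assumptions:

```java
// Hypothetical sketch of the metadata query joining Iceberg's all_entries
// and snapshots tables for one dataset. Column names and join keys are
// assumptions based on the Iceberg metadata-table schemas, not the PR's code.
public final class CommitPartitionQueryBuilder {

  /** Builds a SQL query producing one row per (snapshot_id, partition) pair. */
  static String buildQuery(String fqtn) {
    return "SELECT s.snapshot_id, s.committed_at, s.operation, e.data_file.partition "
        + "FROM " + fqtn + ".all_entries e "
        + "JOIN " + fqtn + ".snapshots s ON e.snapshot_id = s.snapshot_id";
  }

  public static void main(String[] args) {
    System.out.println(buildQuery("openhouse.db.tbl"));
  }
}
```

In a Spark app this string would be passed to `spark.sql(...)`; building it separately keeps the query shape easy to inspect.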

Key Features:

  1. One Row Per (Commit, Partition) Pair
  • Creates one CommitEventTablePartitions record for each unique (snapshot_id, partition) combination
  • Example: 1 commit affecting 3 partitions → 3 records
  2. Parallel Execution
  • Runs simultaneously with table stats and commit events collection
  • ~2x performance improvement over sequential execution
  • Uses CompletableFuture for non-blocking parallel processing
  3. Type-Safe Partition Data
  • Partition values are stored as typed ColumnData objects:
  • LongColumnData for Integer/Long values (e.g., year=2024)
  • DoubleColumnData for Float/Double values
  • StringColumnData for String/Date/Timestamp values
  • Runtime type detection using instanceof checks
  4. Robust Error Handling
  • ✅ Unpartitioned tables return an empty list (no errors)
  • ✅ Null values are logged and skipped
  • ✅ Unknown commit operations are set to null with a warning
  • ✅ Invalid partition values are logged and skipped
  • ✅ Timestamp conversion handles both seconds and milliseconds
  5. Stateless Design
  • Processes all active (non-expired) commit-partition pairs on every job run
  • No state tracking between runs (matches existing openhouseTableCommitEvents behavior)
  • Duplicates can occur: the same commit-partition pair may appear in multiple event_timestamp partitions
  • Deduplication is handled at query time in downstream consumers (use DISTINCT or GROUP BY)
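The parallel execution described under feature 2 can be sketched with CompletableFuture. The class name and task bodies below are placeholders for the three real collection steps, not the PR's actual code:

```java
import java.util.concurrent.CompletableFuture;

// Sketch of running the three collection steps concurrently, as the
// "Parallel Execution" feature describes. The class name and task bodies
// are placeholders; the real job runs the actual collection logic.
public final class ParallelCollection {

  /** Launches the three collection steps in parallel and waits for all. */
  static String runAll() {
    CompletableFuture<String> stats =
        CompletableFuture.supplyAsync(() -> "tableStats");
    CompletableFuture<String> commits =
        CompletableFuture.supplyAsync(() -> "commitEvents");
    CompletableFuture<String> partitions =
        CompletableFuture.supplyAsync(() -> "commitEventPartitions");

    // Block until all three finish; any task failure propagates via join().
    CompletableFuture.allOf(stats, commits, partitions).join();
    return stats.join() + "," + commits.join() + "," + partitions.join();
  }

  public static void main(String[] args) {
    System.out.println(runAll());
  }
}
```

Because the tasks are independent, total runtime approaches that of the slowest step rather than the sum of all three, which is where the ~2x speedup comes from.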
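The instanceof-based type detection under feature 3 could look roughly like the following. The ColumnData class names come from the description above, but their shape here is a guess:

```java
// Sketch of the instanceof-based partition value mapping described under
// "Type-Safe Partition Data". The ColumnData hierarchy is hypothetical;
// only the class names come from the PR description.
abstract class ColumnData {}

class LongColumnData extends ColumnData {
  final long value;
  LongColumnData(long v) { value = v; }
}

class DoubleColumnData extends ColumnData {
  final double value;
  DoubleColumnData(double v) { value = v; }
}

class StringColumnData extends ColumnData {
  final String value;
  StringColumnData(String v) { value = v; }
}

final class PartitionValueMapper {
  /** Maps a raw partition value to a typed ColumnData, or null if unsupported. */
  static ColumnData toColumnData(Object raw) {
    if (raw instanceof Integer || raw instanceof Long) {
      return new LongColumnData(((Number) raw).longValue());
    }
    if (raw instanceof Float || raw instanceof Double) {
      return new DoubleColumnData(((Number) raw).doubleValue());
    }
    if (raw instanceof String) {
      return new StringColumnData((String) raw);
    }
    return null; // unsupported types are logged and skipped in the real job
  }
}
```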
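Feature 4 mentions that timestamp conversion handles both seconds and milliseconds. A common heuristic for this keys off magnitude; this is a guess at the approach, and the cutoff value is an assumption rather than the PR's actual rule:

```java
// Sketch of a seconds-vs-milliseconds normalization heuristic for commit
// timestamps. The cutoff constant is an assumption; the real job may use
// a different rule to distinguish the two units.
final class TimestampNormalizer {
  // Epoch seconds stay below 1e11 until roughly the year 5138, so any
  // value at or above this cutoff is treated as already-milliseconds.
  private static final long MILLIS_CUTOFF = 100_000_000_000L;

  /** Returns the timestamp in epoch milliseconds regardless of input unit. */
  static long toEpochMillis(long ts) {
    return ts < MILLIS_CUTOFF ? ts * 1000L : ts;
  }
}
```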

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@srawat98-dev srawat98-dev marked this pull request as ready for review November 26, 2025 07:01
@srawat98-dev srawat98-dev changed the title Add partition-level commit event collection and publishing Implementation[OpenhouseCommitEventTablePartitions]: Add partition-level commit event collection and publishing in TableStatsCollectionSparkApp Nov 26, 2025
@cbb330 (Collaborator) left a comment
Looks like the exact same structure as PR1 in this series, and the business logic is mostly the same. No concern with the delta.

LGTM!
