@srawat98-dev
Contributor

@srawat98-dev srawat98-dev commented Nov 20, 2025

Summary

I extended the existing TableStatsCollectionSparkApp to populate the openhouseTableCommitEvents table.
This new table will serve as the single source of truth for commit-related metadata across all OpenHouse datasets. Each row records:

  • Commit ID
  • Commit timestamp
  • Commit operation
  • Spark App ID
  • Spark App Name

This gives consumers one consistent place to query commit events for all OpenHouse tables, instead of reading each table's snapshot metadata separately.
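To make the row shape concrete, here is a minimal sketch of how one commit event could be assembled from a snapshot's id, timestamp, operation, and summary map. The class name `CommitEventRow`, the method `fromSnapshot`, and the summary keys `spark.app.id` and `spark.app.name` are assumptions for illustration; they are not taken from this PR's model classes.

```java
import java.util.Map;

/** Hypothetical row shape for openhouseTableCommitEvents; not the PR's actual model. */
final class CommitEventRow {
    // Assumed summary keys; the keys the PR actually reads are not shown in this conversation.
    static final String APP_ID_KEY = "spark.app.id";
    static final String APP_NAME_KEY = "spark.app.name";

    final long commitId;
    final long commitTimestampMs;
    final String commitOperation;
    final String sparkAppId;   // null when the summary has no app id
    final String sparkAppName; // null when the summary has no app name

    CommitEventRow(long commitId, long commitTimestampMs, String commitOperation,
                   String sparkAppId, String sparkAppName) {
        this.commitId = commitId;
        this.commitTimestampMs = commitTimestampMs;
        this.commitOperation = commitOperation;
        this.sparkAppId = sparkAppId;
        this.sparkAppName = sparkAppName;
    }

    /** Builds one row from the fields a snapshot entry exposes. */
    static CommitEventRow fromSnapshot(long snapshotId, long timestampMs,
                                       String operation, Map<String, String> summary) {
        return new CommitEventRow(
            snapshotId,
            timestampMs,
            operation,
            summary.get(APP_ID_KEY),
            summary.get(APP_NAME_KEY));
    }
}
```

Missing summary entries are left as null rather than defaulted, so a consumer can tell "not written by Spark" apart from an empty app name.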

Output / Result

  1. This PR populates the openhouseTableCommitEvents table with commit events read from the Snapshot Metadata table of every OH dataset.
  2. It creates one row per commit across all OpenHouse tables.
  3. The table will be updated daily by the TableStatsCollection job.
  4. Every scheduled run processes all active (non-expired) commit events in the Snapshot Metadata table.
  5. Each partition therefore holds commit events for every snapshot that was non-expired at the time of the job run.
  6. As a result, the same commit appears in multiple partitions until its snapshot expires. Downstream consumers can deduplicate at query time.
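As an illustration of the query-time deduplication mentioned in the last item, the sketch below keeps one row per commit and prefers the copy from the most recent partition. It is plain Java rather than Spark code, and it assumes the partition key is an ISO `yyyy-MM-dd` date string and that a commit is identified by table name plus commit ID. `CommitEventDedup`, `PartitionedCommit`, and `latestPerCommit` are hypothetical names, not part of this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CommitEventDedup {
    /** Hypothetical row: one copy of a commit as written by one daily run. */
    static final class PartitionedCommit {
        final String partitionDate; // assumed ISO yyyy-MM-dd, so string order matches date order
        final String tableName;
        final long commitId;

        PartitionedCommit(String partitionDate, String tableName, long commitId) {
            this.partitionDate = partitionDate;
            this.tableName = tableName;
            this.commitId = commitId;
        }
    }

    /** Keeps one row per (tableName, commitId), preferring the most recent partition. */
    static List<PartitionedCommit> latestPerCommit(List<PartitionedCommit> rows) {
        Map<String, PartitionedCommit> best = new LinkedHashMap<>();
        for (PartitionedCommit row : rows) {
            String key = row.tableName + "#" + row.commitId;
            PartitionedCommit seen = best.get(key);
            // Replace the stored copy only when this one comes from a later partition.
            if (seen == null || row.partitionDate.compareTo(seen.partitionDate) > 0) {
                best.put(key, row);
            }
        }
        // Results follow the order in which each commit was first seen.
        return new ArrayList<>(best.values());
    }
}
```

In a Spark consumer the same idea would typically be expressed as a window or group-by over the commit key. Which copy to keep is a consumer choice, since the commit fields themselves should not differ between partitions.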

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

srawat98-dev and others added 30 commits November 3, 2025 11:02
Add IcebergCommitEventStats model for capturing commit events and sta…
…es with dedicated fields for long, string, and double types.
…troduce CommitMetadata class for enhanced commit tracking
…r improved clarity; update documentation to reflect naming and relationship with commit events
…ats/model/BaseEventModels.java

Co-authored-by: Stas Pak <[email protected]>
…ats/model/CommitMetadata.java

Co-authored-by: Stas Pak <[email protected]>
…ats/model/CommitEventPartitions.java

Co-authored-by: Stas Pak <[email protected]>
…ats/model/CommitEventPartitions.java

Co-authored-by: Stas Pak <[email protected]>
…ats/model/PartitionStats.java

Co-authored-by: Stas Pak <[email protected]>
…introduce interface for column statistics with specific implementations
- baseCommitEvent has-a CommitMetadata.
…ats/model/BaseEventModels.java

Co-authored-by: Sumedh Sakdeo <[email protected]>
…ats/model/CommitEvent.java

Co-authored-by: Sumedh Sakdeo <[email protected]>
…ats/model/CommitEvent.java

Co-authored-by: Sumedh Sakdeo <[email protected]>
…ats/model/CommitEventTablePartitions.java

Co-authored-by: Sumedh Sakdeo <[email protected]>
@srawat98-dev srawat98-dev requested a review from cbb330 November 20, 2025 22:56
cbb330 previously approved these changes Nov 20, 2025
srawat added 2 commits November 21, 2025 12:08
…ddingCommitJobForFreshnessinStatsCollector
…ent utility method for database name extraction
…ynchronous execution method with timing and logging
@abhisheknath2011
Member

Reference Documents

  1. OpenHouse Core Model for Tracking Dataset Commits and Metadata https://docs.google.com/document/d/1HsKGVE4kwGUydv8saiCGBNdqJGrN0aRYiIq9c9vn8IY/edit
  2. Commit Data Model Implementation – State Management https://docs.google.com/document/d/1p33IQI7EpgztyH4cya9GleNNNO72Ihlu04ll1mYNf6U/edit

As this is an OSS codebase, can we remove the internal Google Doc links from the PR?

@abhisheknath2011 abhisheknath2011 merged commit 61776c2 into linkedin:main Nov 22, 2025
1 check passed

3 participants