Skip to content

Conversation

@raymondlam12
Copy link

Summary

[Issue][(https://github.com/linkedin/openhouse/issues/#nnn)] Briefly discuss the summary of the changes made in this
pull request in 2-3 lines.](#240)

In order to implement FileIOs that are TableOperations-specific, creating a FileIO instance on a per TableOperations-basis is required.

This is similar to how other Iceberg-compliant Catalogs are implemented: apache/iceberg#10893 and implemented in the same way (taken from the HiveCatalog / GlueCatalog implementations before it got refactored into FileIOTracker)

This feature is needed to implement data access tokens.

Changes

  • Client-facing API Changes
  • [ x ] Internal API Changes
  • Bug Fixes
  • [ x ] New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

TODO

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Copy link
Collaborator

@HotSushi HotSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome to the OH community! Aligned on PR direction but left some comments


protected Map<String, String> properties;

private Cache<TableOperations, FileIO> fileIOCloser;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these look like temporary duplication of code that will be available in upstream iceberg with version bump. Can we add a todo identifiying the PR and version of iceberg that introduced the fileIOcloser and instructions on removing the code duplication later?

.snapshotApi(snapshotApi)
.cluster(cluster)
.build();
FileIO tableOperationsLocalFileIO = loadFileIO(properties);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also change the signature of loadFileIO to following:

protected FileIO loadFileIO(TableIdentifier tableIdentifier, Map<String, String> properties) {

reason:
Each table can have its own storage, table1 -> hdfs, table2 -> s3, table3 -> adls . We'd need to instantiate/load the right FileIO per table and properties.

}

@Override
public void close() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add unit tests to validate this logic?

exclude group: 'org.apache.commons', module: 'commons-lang3'
}
implementation("com.github.ben-manes.caffeine:caffeine:" + caffeineVersion)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also do output jar check to see if the newly added library is appropriately relocated? the output jar should look as follows:
image

@@ -1,7 +1,10 @@
package com.linkedin.openhouse.javaclient;

import static com.linkedin.openhouse.javaclient.OpenHouseTableOperations.*;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


protected Map<String, String> properties;

private Cache<TableOperations, FileIO> fileIOCloser;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious. How long the entries are present in the cache? Wondering if the cache will be populated with too many objects if there are more tables operation done in a short period of time. Do you think is going to consume more memory?

private Cache<TableOperations, FileIO> newFileIOCloser() {
return Caffeine.newBuilder()
.weakKeys()
.removalListener(
Copy link
Member

@abhisheknath2011 abhisheknath2011 Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the listener is invoked to perform close?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants