feat(datasets): Added the Experimental ExternalTableDataset for Databricks #885
Conversation
Force-pushed from 809522d to 46858c3.
Hey @noklam, @ankatiyar,
@MinuraPunchihewa Tests are very welcome, although we don't have a Databricks cluster to run these tests, unless there is a way to run them locally.
Force-pushed from f860c5f to 8354b00.
@noklam I just added some tests to cover the logic specific to the ExternalTableDataset.
Force-pushed from 789c57d to 31aecc8.
Is this an expected failure?
@noklam In my experience, when overwriting data to an external table in a format other than Delta, the location has to be provided. Since I am using an external location fixture here, which requires an environment variable to be set, I should have specified a different format, though. I have just made a commit with that change. Can you give me an explanation of how these tests are run? A Spark environment has to be available for them to run at all, right?
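For context, here is a minimal PySpark sketch of the behaviour described above; the table name and storage path are made up for the example, and the way the dataset itself handles the location may differ:

```python
# Illustrative only: writing a non-Delta external table in Spark typically
# requires an explicit storage location, e.g. via the "path" option.
# The table name and path below are invented for this example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])

(
    df.write.format("parquet")
    .mode("overwrite")
    .option("path", "s3://example-bucket/tables/names_and_ages")  # external location
    .saveAsTable("names_and_ages")
)
```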
Since this is an experimental dataset, the tests will go here: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental/tests. This means the tests are not going to be run on every commit in CI. If you look at the other Spark tests, we run them in local mode. With Databricks it's unclear whether it is possible to run in local mode, especially with Unity Catalog on the platform (which is different from the open-source version).
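As a rough illustration of the "local mode" approach mentioned above, a pytest fixture along these lines is how a local Spark session is typically obtained; the fixture name and configuration are assumptions, and Databricks-only features such as Unity Catalog would still need to be mocked:

```python
# Sketch of a local-mode Spark session fixture for tests; fixture name and
# configuration are illustrative. Unity Catalog / Databricks-specific calls
# are not available in a plain local session and would need mocking.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="module")
def spark_session():
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("external-table-dataset-tests")
        .getOrCreate()
    )
    yield spark
    spark.stop()
```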
Force-pushed from aa8c9e1 to 761f963.
Got it. I just moved the tests; I had to copy over the conftest.py as well.
The tests are now passing and this looks good to me.
I haven't tried this out, but the code looks good to me!
Can you please add this to the release notes, along with your name under contributors? Similar to how it looks here: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/RELEASE.md#major-features-and-improvements-1
names_and_ages@spark:
  type: databricks.ExternalTableDataset
  format: parquet
  table: names_and_ages
names_and_ages@pandas:
  type: databricks.ExternalTableDataset
  format: parquet
  table: names_and_ages
  dataframe_type: pandas
Nit: can you add some spaces between the entries? It's a bit hard to read where one catalog entry starts and ends like this.
Force-pushed from 4a87478 to 1577f9a.
@merelcht I've made the requested changes. Do let me know if I've updated the RELEASE document correctly.
I've moved it to the section for the upcoming release, because …
Oh actually @MinuraPunchihewa, I just realised this dataset wasn't added to https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml#L176 or https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml#L285. Can you update the pyproject.toml accordingly?
Also here: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/docs/source/api/kedro_datasets_experimental.rst :)
Hey guys, …
feat(datasets): Added the Experimental ExternalTableDataset for Databricks (kedro-org#885)

* added the experimental ExternalTableDataset
* fixed lint issues
* added the missing location attr to the docstring
* removed unused code from the tests for ManagedTableDataset
* added tests for ExternalTableDataset
* moved all fixtures to conftest.py
* updated the format for the test for save_overwrite to Parquet
* moved tests to the kedro_datasets_experimental pkg
* updated the release doc
* added the dataset to the documentation
* listed the dependencies for the dataset

Signed-off-by: Minura Punchihewa <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Description
This PR adds the ExternalTableDataset to support interactions with external tables in Databricks (Unity Catalog). Fixes #817
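As a rough usage sketch, the dataset can be configured in the catalog as in the YAML entries shown earlier in this thread, or constructed directly in Python; the import path and parameter names below (table, format, dataframe_type) are inferred from those catalog entries and may not match the merged code exactly:

```python
# Hedged usage sketch; the import path and constructor arguments are inferred
# from the catalog entries shown earlier and may differ from the final API.
from kedro_datasets_experimental.databricks import ExternalTableDataset

dataset = ExternalTableDataset(
    table="names_and_ages",
    format="parquet",
    dataframe_type="pandas",  # load() would return a pandas DataFrame
)

# Loading/saving requires a Databricks (or Spark + Unity Catalog) environment.
df = dataset.load()
```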
Development notes
The ExternalTableDataset has been implemented by extending the BaseTableDataset that was added here. These changes have been tested.
Checklist
* Added a description of this change in the RELEASE.md file