feat(datasets): Created `table_args` to pass to `create_table`, `create_view`, and `table` methods #909

base: main
Conversation
…o avoid breaking changes
Signed-off-by: Mark Druffel <[email protected]>
Force-pushed 47331ff to ef3712e
Just leaving initial comments; happy to review later once it's ready.
```diff
 def save(self, data: ir.Table) -> None:
     if self._table_name is None:
         raise DatasetError("Must provide `table_name` for materialization.")

     writer = getattr(self.connection, f"create_{self._materialized}")
-    writer(self._table_name, data, **self._save_args)
+    writer(self._table_name, data, **self._table_args)
```
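The `getattr`-based writer dispatch above can be sketched in isolation with a fake backend; `FakeConnection` and the argument values below are illustrative stand-ins, not the real ibis API:

```python
class FakeConnection:
    """Stand-in for an ibis backend connection; records create_* calls."""

    def __init__(self):
        self.calls = []

    def create_table(self, name, data, **kwargs):
        self.calls.append(("create_table", name, kwargs))

    def create_view(self, name, data, **kwargs):
        self.calls.append(("create_view", name, kwargs))


def save(connection, materialized, table_name, data, table_args):
    # Dispatch to create_table or create_view depending on `materialized`,
    # forwarding table_args as keyword arguments.
    writer = getattr(connection, f"create_{materialized}")
    writer(table_name, data, **table_args)


con = FakeConnection()
save(con, "view", "tracks", object(), {"database": "spotify.silver", "overwrite": True})
```

Swapping `materialized` between `"table"` and `"view"` selects the corresponding `create_*` method without any branching.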
Is this right? I think the table args should only apply to the `table` call, but I haven't looked into it deeply before commenting now.
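To illustrate the distinction being raised, here is a minimal sketch of a load path where `table_args` feeds only the `table` call; `FakeBackend` and the values are hypothetical stand-ins, not the real ibis API:

```python
class FakeBackend:
    """Stand-in for an ibis backend connection; echoes the call it receives."""

    def table(self, name, **kwargs):
        # The real ibis Backend.table returns a table expression; here we
        # return the call signature so it can be inspected.
        return ("table", name, kwargs)


con = FakeBackend()
table_args = {"database": "spotify.silver"}  # illustrative value
loaded = con.table("tracks", **table_args)
```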
@deepyaman Sorry, this is a little confusing, so just adding a bit more context.

**This PR**

The `table` method takes the `database` argument, but the `create_table` & `create_view` methods both take the `database` and `overwrite` arguments. The `overwrite` argument is already in `save_args`, but I'm assuming `save_args` will be removed from `TableDataset` in version 6. To avoid breaking changes, but also to minimize churn between this release and version 6, I just added the new parameter (`database`) to `table_args` and left the old parameters alone. To avoid breaking changes while still letting `create_table` and `create_view` arguments flow through, I combined `_save_args` and `_table_args` here.

**Version 6**

I am assuming that `save_args` & `load_args` will be dropped from `TableDataset` in version 6. In that change, I'd assume the arguments still used from `load_args` and `save_args` would be added to `table_args`. To make `TableDataset` and `FileDataset` look / feel similar, we could consider making a commensurate `file_args`. I've not used 5.1 enough yet to say with certainty, but I can't think of a reason a user would want different values in `load_args` than in `save_args` now that it's split from `TableDataset` (i.e. the `filepath`, `file_type`, `sep`, etc. would be the same for load and save)? I may be totally overlooking some things though 🤷♂️
```yaml
bronze_tracks:
  type: ibis.FileDataset  # uses `read_<file_format>` (read) & `to_<file_format>` (write)
  connection:
    backend: pyspark
  file_args:
    filepath: hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
    file_format: csv
    materialized: view
    overwrite: True
    table_name: tracks  # ibis `to_<file_format>` has no database parameter, so there's no way to write to a specific catalog / db schema atm; it just writes to whatever is active
    sep: ","

silver_tracks:
  type: ibis.TableDataset  # uses `table` (read) & `create_<materialized>` (write)
  connection:
    backend: pyspark
  table_args:
    name: tracks
    database: spotify.silver
    overwrite: True
```
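The backward-compatibility approach described above (keeping the legacy `save_args` while adding the new `database` parameter via `table_args`) amounts to a plain dict merge before calling the writer. A sketch with purely illustrative values:

```python
save_args = {"overwrite": True}              # legacy save_args, kept for compatibility
table_args = {"database": "spotify.silver"}  # new create_table / create_view kwargs

# Combine both so create_<materialized> receives every argument;
# with this ordering, table_args wins on any key collision.
combined = {**save_args, **table_args}
```

One design consequence worth noting: if the same key ever appears in both dicts, the merge silently prefers one side, which is part of why keeping the two argument sets separate long-term is attractive.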
Signed-off-by: Mark Druffel <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Mark Druffel <[email protected]>
Signed-off-by: Mark Druffel <[email protected]>
…ark-druffel/kedro-plugins into fix/datasets/ibis-TableDataset
@deepyaman I changed this to ready for review, but I'm failing a bunch of checks. I tried to follow the contribution guidelines, but I ran into problems when running the checks locally. Aside from the failing checks, I tested this version of `table_dataset.py` on a duckdb pipeline, a pyspark pipeline, and a pyspark pipeline on Databricks, and it seems to be working. My only open question relates to my musing above about the expected format of `table_args`.

@jakepenzak For visibility.
Signed-off-by: Mark Druffel <[email protected]>
Sorry, I saw this yesterday and started drafting an apology. 🙈
I will review it later today. 🤞
> On Wed, Nov 13, 2024, 6:16 AM Merel Theisen wrote:
> @merelcht requested your review on: #909 feat(datasets): Created table_args to pass to create_table, create_view, and table methods.
No worries @deepyaman, really appreciate your help! Let me know what I can do to support; I'm just trying to make sure the yaml changes I'm introducing make sense and figure out how to get through the PR process :) Regarding my issues with the checks: unfortunately, I don't think the tests will work on my personal machine, because I'm on an old processor that lacks support for what they require.

@mark-druffel Actually, putting aside the issues with local development, if you look at the CI failure on the PR, it shows what needs fixing.
Looks good on the whole, but one comment re: how `database` is handled.

Let me know if I can help with any of the technical aspects of resolving merge conflicts, adding tests, etc.!
```python
if table_args is not None:
    save_args["database"] = table_args.get("database", None)
```
This feels a bit magical to me. It's not really consistent with the docstring, either, which says that arguments will be passed to `create_{materialized}`; in reality, the user needs to know that just `database` will be passed.

I personally would recommend one of two approaches. One is to not do anything special here; the user can pass `database` in `save_args` and `database` in `table_args`, and, while it may feel duplicative, at least it's explicit. The other approach is to make an explicit `database` keyword for the dataset, and likely raise an error if `database` is specified in `save_args` and/or `table_args` while also passed explicitly.
@mark-druffel does this make sense, and do you have a preference?
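For concreteness, the second approach (an explicit `database` keyword that rejects duplicate specification) might look something like this hypothetical helper; none of these names come from the actual PR:

```python
def resolve_database(database=None, save_args=None, table_args=None):
    """Return the database to use, erroring on ambiguous duplicate specification."""
    save_args = save_args or {}
    table_args = table_args or {}
    also_in_args = "database" in save_args or "database" in table_args
    if database is not None and also_in_args:
        raise ValueError(
            "`database` was passed both explicitly and via save_args/table_args; "
            "specify it in exactly one place."
        )
    # Fall back to whichever args dict carries it, if any.
    return database or save_args.get("database") or table_args.get("database")
```

The explicit error makes the precedence question moot: there is never more than one source of truth for `database`.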
@deepyaman As discussed yesterday, I've moved `database` to the top level. I'm trying to push the changes, but I'm getting blocked by pre-commit now that I have it set up properly.

When it ran, it changed a bunch of files I never touched. I staged those as well (not sure if I should've), but my commit still failed because of Black. I've also run Black manually on the file I changed to try to lint it. Any suggestions on how I can get this working properly? 😬
Based on the screenshot, it's only reformatting one file. Maybe you can do a `git diff` to see what's changed? You can also just add that change, and I can take a look.

Also happy to help debug the workflow on a quick call, if that would help!
Description
Development notes
Checklist
RELEASE.md
file