
feat(datasets): Improved Dependency Management for Spark-based Datasets #911

Merged

Conversation

MinuraPunchihewa
Contributor

@MinuraPunchihewa MinuraPunchihewa commented Oct 26, 2024

Description

This PR adds a _utils sub-package to house modules with common utility functions that are shared across Spark-based datasets. This avoids requiring pyspark to be installed for datasets that will only run on Databricks.

Fixes #849

Development notes

The new _utils sub-package organizes the utility functions into two main modules:

  1. databricks_utils.py
  2. spark_utils.py

Additional modules can be added to this sub-package to house code that is used in multiple datasets.
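As a minimal sketch of how this split keeps pyspark optional (the kedro_datasets._utils package path and the load_spark_utils helper below are illustrative assumptions, not code from this PR):

```python
from importlib import import_module

# Layout introduced by this PR (the package path is assumed, not
# confirmed by the diff shown here):
#
#   kedro_datasets/_utils/
#       databricks_utils.py   # helpers that touch dbutils / DBFS paths
#       spark_utils.py        # helpers that require pyspark
#
# Datasets import only the module they need, so importing a
# Databricks-only dataset no longer pulls in pyspark transitively.


def load_spark_utils():
    # Deferred import: pyspark is only required once a Spark helper
    # is actually used, not when the dataset module is imported.
    return import_module("kedro_datasets._utils.spark_utils")
```

The same deferred-import idea is what lets the Databricks datasets declare a smaller dependency set.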

These changes have been tested:

  1. Manually, by running the code locally using ManagedTableDataset and ExternalTableDataset.
  2. Via the existing unit tests.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@MinuraPunchihewa MinuraPunchihewa changed the title Improved Dependency Management for Spark-based Datasets feat(datasets): Improved Dependency Management for Spark-based Datasets Oct 26, 2024
@MinuraPunchihewa MinuraPunchihewa marked this pull request as ready for review October 27, 2024 05:24
@MinuraPunchihewa
Contributor Author

Hey @noklam,
I have to test this a little more thoroughly, but can you give me your opinion on the approach taken here?

@merelcht merelcht requested a review from noklam October 31, 2024 10:18
Comment on lines 1 to 18
from __future__ import annotations

import os


def parse_glob_pattern(pattern: str) -> str:
    special = ("*", "?", "[")
    clean = []
    for part in pattern.split("/"):
        if any(char in part for char in special):
            break
        clean.append(part)
    return "/".join(clean)


def split_filepath(filepath: str | os.PathLike) -> tuple[str, str]:
    split_ = str(filepath).split("://", 1)
    if len(split_) == 2:  # noqa: PLR2004
        return split_[0] + "://", split_[1]
    return "", split_[0]
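For reference, the two helpers above behave as follows (functions reproduced from the diff so the example is self-contained; the sample paths are illustrative):

```python
from __future__ import annotations

import os


def parse_glob_pattern(pattern: str) -> str:
    # Keep leading path segments up to the first one containing a glob char.
    special = ("*", "?", "[")
    clean = []
    for part in pattern.split("/"):
        if any(char in part for char in special):
            break
        clean.append(part)
    return "/".join(clean)


def split_filepath(filepath: str | os.PathLike) -> tuple[str, str]:
    # Split "protocol://rest" into ("protocol://", "rest"); no protocol -> "".
    split_ = str(filepath).split("://", 1)
    if len(split_) == 2:
        return split_[0] + "://", split_[1]
    return "", split_[0]


print(parse_glob_pattern("data/01_raw/*.csv"))    # -> data/01_raw
print(split_filepath("s3://bucket/key.parquet"))  # -> ('s3://', 'bucket/key.parquet')
print(split_filepath("/dbfs/tmp/table"))          # -> ('', '/dbfs/tmp/table')
```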
Contributor

AFAIK, these are used for Spark only. I do not want other datasets to use them because they're more of a workaround to handle Spark/Databricks magic paths.

Most datasets should use the utilities coming from kedro instead:

from kedro.io.core import (
    Version,
    get_filepath_str,
    get_protocol_and_path,
)

Contributor Author

@MinuraPunchihewa MinuraPunchihewa Oct 31, 2024

@noklam Understood. I will move these to the spark_utils.py module.

Signed-off-by: Minura Punchihewa <[email protected]>
@MinuraPunchihewa
Contributor Author

Hey @noklam,
I've now had the opportunity to test out these changes and they seem to work fine. I've tested both ManagedTableDataset and ExternalTableDataset with the reduced dependencies without any issues.

Contributor

@noklam noklam left a comment

I left some more comments earlier but only just noticed they weren't sent properly 😅

return sorted(matched)


def get_dbutils(spark: SparkSession) -> Any:
Contributor

Is this always a SparkSession? I think it can also be a DatabricksSession.

from pyspark.sql import SparkSession


def get_spark() -> Any:
Contributor

Suggested change
def get_spark() -> Any:

I think this should be SparkSession and DatabricksSession? You should also wrap the import inside TYPE_CHECKING to avoid import errors.
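A minimal sketch of the suggested pattern (assuming pyspark is an optional dependency; the body of get_spark is illustrative, not the PR's implementation):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers, never at runtime, so pyspark
    # (and databricks-connect) stay optional at install time.
    from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Return the active session (sketch of the suggested signature)."""
    # Import lazily so this module can be imported without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.getOrCreate()
```

With `from __future__ import annotations`, the SparkSession annotation is stored as a string and is never evaluated at runtime, so the guarded import is safe.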

@MinuraPunchihewa
Contributor Author

I left some more comments earlier but only just noticed they weren't sent properly 😅

Haha no problem. I've made the suggested improvements to the type hints, including a couple more involving DBUtils.

@noklam noklam self-requested a review November 7, 2024 12:53
Contributor

@noklam noklam left a comment

Thanks for addressing all the comments and for your contribution to the Databricks datasets, great stuff!

@DimedS DimedS added the Community Issue/PR opened by the open-source community label Nov 8, 2024
Member

@merelcht merelcht left a comment

Thanks for this contribution @MinuraPunchihewa ! ⭐ Can you update the release notes and add your change + your name to contributors?

@MinuraPunchihewa
Contributor Author

Thanks for this contribution @MinuraPunchihewa ! ⭐ Can you update the release notes and add your change + your name to contributors?

Thanks, @merelcht. I've just updated the release notes.

Signed-off-by: Merel Theisen <[email protected]>
@merelcht merelcht enabled auto-merge (squash) November 14, 2024 08:53
@merelcht merelcht merged commit 07aef5a into kedro-org:main Nov 14, 2024
12 of 13 checks passed
Labels
Community Issue/PR opened by the open-source community
Development

Successfully merging this pull request may close these issues.

kedro-datasets: Improve dependency management for spark Datasets
4 participants