
feat(datasets): Improved Dependency Management for Spark-based Datasets #911

Merged

Conversation

MinuraPunchihewa
Contributor

@MinuraPunchihewa MinuraPunchihewa commented Oct 26, 2024

Description

This PR adds a _utils sub-package to house modules with common utility functions that are shared across Spark-based datasets. This avoids requiring pyspark to be installed for datasets that will only run on Databricks.

Fixes #849

Development notes

The new _utils sub-package organizes the utility functions into two main modules:

  1. databricks_utils.py
  2. spark_utils.py

Additional modules can be added to this sub-package to house code that is used in multiple datasets.
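As a minimal sketch of how this split keeps pyspark optional (the kedro_datasets._utils package path and the load_spark_utils helper below are illustrative assumptions, not code from this PR):

```python
from importlib import import_module

# Layout introduced by this PR (the package path is assumed, not
# confirmed by the diff shown here):
#
#   kedro_datasets/_utils/
#       databricks_utils.py   # helpers that touch dbutils / DBFS paths
#       spark_utils.py        # helpers that require pyspark
#
# Datasets import only the module they need, so importing a
# Databricks-only dataset no longer pulls in pyspark transitively.


def load_spark_utils():
    # Deferred import: pyspark is only required once a Spark helper
    # is actually used, not when the dataset module is imported.
    return import_module("kedro_datasets._utils.spark_utils")
```

The same deferred-import idea is what lets the Databricks datasets declare a smaller dependency set.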

These changes have been tested:

  1. Manually, by running the code locally using ManagedTableDataset and ExternalTableDataset.
  2. Via the existing unit tests.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@MinuraPunchihewa MinuraPunchihewa changed the title Improved Dependency Management for Spark-based Datasets feat(datasets): Improved Dependency Management for Spark-based Datasets Oct 26, 2024
@MinuraPunchihewa MinuraPunchihewa marked this pull request as ready for review October 27, 2024 05:24
@MinuraPunchihewa
Contributor Author

Hey @noklam,
I have to test this a little more thoroughly, but can you give me your opinion on the approach taken here?

@merelcht merelcht requested a review from noklam October 31, 2024 10:18
Comment on lines 1 to 18
from __future__ import annotations

import os


def parse_glob_pattern(pattern: str) -> str:
    special = ("*", "?", "[")
    clean = []
    for part in pattern.split("/"):
        if any(char in part for char in special):
            break
        clean.append(part)
    return "/".join(clean)


def split_filepath(filepath: str | os.PathLike) -> tuple[str, str]:
    split_ = str(filepath).split("://", 1)
    if len(split_) == 2:  # noqa: PLR2004
        return split_[0] + "://", split_[1]
    return "", split_[0]
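For reference, the two helpers above behave as follows (functions reproduced from the diff so the example is self-contained; the sample paths are illustrative):

```python
from __future__ import annotations

import os


def parse_glob_pattern(pattern: str) -> str:
    # Keep leading path segments up to the first one containing a glob char.
    special = ("*", "?", "[")
    clean = []
    for part in pattern.split("/"):
        if any(char in part for char in special):
            break
        clean.append(part)
    return "/".join(clean)


def split_filepath(filepath: str | os.PathLike) -> tuple[str, str]:
    # Split "protocol://rest" into ("protocol://", "rest"); no protocol -> "".
    split_ = str(filepath).split("://", 1)
    if len(split_) == 2:
        return split_[0] + "://", split_[1]
    return "", split_[0]


print(parse_glob_pattern("data/01_raw/*.csv"))    # -> data/01_raw
print(split_filepath("s3://bucket/key.parquet"))  # -> ('s3://', 'bucket/key.parquet')
print(split_filepath("/dbfs/tmp/table"))          # -> ('', '/dbfs/tmp/table')
```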
Contributor

AFAIK, these are used for Spark only. I do not want other datasets to use them because they're more of a workaround to handle Spark/Databricks magic paths.

Most datasets should use the utilities coming from kedro instead:

from kedro.io.core import (
    Version,
    get_filepath_str,
    get_protocol_and_path,
)

Contributor Author

@MinuraPunchihewa MinuraPunchihewa Oct 31, 2024

@noklam Understood. I will move these to the spark_utils.py module.

Signed-off-by: Minura Punchihewa <[email protected]>
@MinuraPunchihewa
Contributor Author

Hey @noklam,
I've now had the opportunity to test out these changes and they seem to work fine. I've tested both ManagedTableDataset and ExternalTableDataset with the reduced dependencies without any issues.

Contributor

@noklam noklam left a comment

I left some more comments earlier but only just noticed they weren't sent properly 😅

return sorted(matched)


def get_dbutils(spark: SparkSession) -> Any:
Contributor

Is this always a SparkSession? I think it can also be a DatabricksSession.

from pyspark.sql import SparkSession


def get_spark() -> Any:
Contributor

Suggested change
def get_spark() -> Any:

I think this should be SparkSession and DatabricksSession? You should also wrap the import inside TYPE_CHECKING to avoid import errors.
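A minimal sketch of the suggested pattern (assuming pyspark is an optional dependency; the body of get_spark is illustrative, not the PR's implementation):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers, never at runtime, so pyspark
    # (and databricks-connect) stay optional at install time.
    from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Return the active session (sketch of the suggested signature)."""
    # Import lazily so this module can be imported without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.getOrCreate()
```

With `from __future__ import annotations`, the SparkSession annotation is stored as a string and is never evaluated at runtime, so the guarded import is safe.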

@MinuraPunchihewa
Contributor Author

I left some more comments earlier but only just noticed they weren't sent properly 😅

Haha no problem. I've made the suggested improvements to the type hints, including a couple more involving DBUtils.

@noklam noklam self-requested a review November 7, 2024 12:53
Contributor

@noklam noklam left a comment

Thanks for addressing all the comments and for your contribution to the Databricks datasets, great stuff!

@DimedS DimedS added the Community Issue/PR opened by the open-source community label Nov 8, 2024
Member

@merelcht merelcht left a comment

Thanks for this contribution @MinuraPunchihewa ! ⭐ Can you update the release notes and add your change + your name to contributors?

@MinuraPunchihewa
Contributor Author

Thanks for this contribution @MinuraPunchihewa ! ⭐ Can you update the release notes and add your change + your name to contributors?

Thanks, @merelcht. I've just updated the release notes.

Signed-off-by: Merel Theisen <[email protected]>
@merelcht merelcht enabled auto-merge (squash) November 14, 2024 08:53
@merelcht merelcht merged commit 07aef5a into kedro-org:main Nov 14, 2024
12 of 13 checks passed
Labels
Community Issue/PR opened by the open-source community
Development

Successfully merging this pull request may close these issues.

kedro-datasets: Improve dependency management for spark Datasets
4 participants