[FSTORE-1186] Polars Integration into Hopsworks - Writing and Reading DataFrames #1221

Merged: 28 commits into logicalclocks:master on Mar 14, 2024

Conversation

@manu-sj (Contributor) commented on Feb 13, 2024:

This PR adds support for Polars, specifically writing Polars dataframes to and reading them from the feature store.

Changes:

  1. Write Support for Polars
  • Added support for the pyarrow types `large_list`, `large_string` and `large_binary`, since list, string and binary columns are converted into these large variants when a Polars dataframe is created from a Pandas dataframe (see the schema sketch after this list).
  • Also added support for pyarrow dictionary types whose `value_type` is any string type and whose `index_type` is any integer type. This is needed because categorical types in Polars are represented as dictionaries with a string `value_type` and an integer `index_type`.
  • Modified `convert_to_default_dataframe`, `_write_dataframe_kafka` and `parse_schema_feature_group` to provide the same functionality for Polars.
  • In `_write_dataframe_kafka`, the return values of the dataframe iterator are not converted to native Python types for Polars, since `iter_rows` in Polars already returns data as Python types (https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.iter_rows.html).
  • The `validate_with_great_expectations` function has a patch that converts the dataframe to Pandas before validating, since Great Expectations does not currently support Polars.
  2. Read Support for Polars
  • Added support for reading data from a feature group as a list, Pandas dataframe, Polars dataframe, or NumPy array using both Hive and ArrowFlight (see the read sketch after this list).
  • Modified the ArrowFlight client to support reading Polars dataframes directly.
  • Modified the AWS S3 and HopsFS storage connectors to support reading data as a list, Pandas dataframe, Polars dataframe, or NumPy array. The other storage connectors were modified to support reading data as a list, Pandas dataframe, or NumPy array; Polars support was not added for them since those connectors are not supported in Python.
  • Added support for reading training data, training-test data, and training-validation-test data as a list, Pandas dataframe, Polars dataframe, or NumPy array.
  • Modified the vector server to allow reading vectors and inference helpers as a list, Pandas dataframe, Polars dataframe, or NumPy array.
  • `TrainingDataset` is not modified since the class is deprecated.
  • The feature view `read_changes` function is not modified since it is deprecated.
  3. The Spark engine has also been modified so that its common functions have the same signatures as the Python engine.
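
To make the type handling above concrete, here is an illustrative sketch (not code from this PR) of how Polars surfaces pyarrow's large and dictionary types, and of `iter_rows` yielding native Python values; the exact printed schema may vary with the Polars version:

```python
import pandas as pd
import polars as pl

# Illustration only: Polars maps string/list/binary columns to pyarrow's
# "large" variants, and Pandas categoricals to pyarrow dictionary types.
pdf = pd.DataFrame(
    {
        "name": ["a", "b"],
        "scores": [[1, 2], [3]],
        "label": pd.Series(["x", "y"], dtype="category"),
    }
)
df = pl.from_pandas(pdf)
print(df.to_arrow().schema)
# name: large_string
# scores: large_list<item: int64>
# label: dictionary<values=large_string, indices=uint32, ordered=0>

# iter_rows already yields native Python values, so no extra conversion
# is needed before serializing rows (e.g. in _write_dataframe_kafka).
for row in df.iter_rows(named=True):
    print(row)  # e.g. {'name': 'a', 'scores': [1, 2], 'label': 'x'}
```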
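And a read-side usage sketch showing where the new `dataframe_type` parameter plugs in; the feature group/view names, versions, and the `fs` handle are hypothetical:

```python
import polars as pl

# Hypothetical usage; names and versions are made up for illustration.
# Assume `fs` is a feature-store handle obtained from a Hopsworks login.
fg = fs.get_feature_group("transactions", version=1)
df = fg.read(dataframe_type="polars")
assert isinstance(df, pl.DataFrame)

# Training data splits can also be materialized as Polars dataframes.
fv = fs.get_feature_view("transactions_view", version=1)
X_train, X_test, y_train, y_test = fv.train_test_split(
    test_size=0.2, dataframe_type="polars"
)
```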

LoadTest - https://github.com/logicalclocks/loadtest/pull/286

JIRA Issue: https://hopsworks.atlassian.net/browse/FSTORE-1186

How Has This Been Tested?

  • Unit Tests
  • Integration Tests
  • Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] Checked if all type annotations were added and/or updated appropriately

@davitbzh (Contributor) left a comment:

Don't you need any mention of Polars in feature_store.py? Also, you need to provide a tutorial for Polars in https://github.com/logicalclocks/hopsworks-tutorials

@@ -804,6 +804,7 @@ def get_batch_data(
primary_keys=False,
event_time=False,
inference_helper_columns=False,
dataframe_type: Optional[str] = "default",
This is a user-facing function; is it possible to avoid `dataframe_type="default"`, or to explain clearly what `"default"` means?

@@ -859,8 +860,15 @@ def get_batch_data(
that may not be used in training the model itself but can be used during batch or online inference
for extra information. If inference helper columns were not defined in the feature view
`inference_helper_columns=True` will not any effect. Defaults to `False`, no helper columns.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
See above; not sure what "default" means.
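
For context, `"default"` is resolved by the engine's `_return_dataframe_type` helper (mentioned in the commit messages below); the sketch that follows is an assumption about its Python-engine behavior, not the actual implementation:

```python
def _return_dataframe_type(dataframe, dataframe_type: str = "default"):
    # Assumed behavior on the Python engine: "default" acts like "pandas";
    # on the Spark engine it would return the Spark dataframe unchanged.
    if dataframe_type.lower() in ("default", "pandas"):
        return dataframe  # already a pandas.DataFrame on the Python engine
    if dataframe_type.lower() == "polars":
        import polars as pl
        return pl.from_pandas(dataframe)
    if dataframe_type.lower() == "numpy":
        return dataframe.values
    if dataframe_type.lower() == "python":
        return dataframe.values.tolist()
    raise TypeError(f"Dataframe type `{dataframe_type}` not supported.")
```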

@@ -2012,6 +2022,8 @@ def training_data(
extra information. If training helper columns were not defined in the feature view
then`training_helper_columns=True` will not have any effect. Defaults to `False`, no training helper
columns.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
same here

@@ -2070,6 +2083,7 @@ def train_test_split(
primary_keys=False,
event_time=False,
training_helper_columns=False,
dataframe_type: Optional[str] = "default",
same here

@@ -2176,6 +2190,8 @@ def train_test_split(
extra information. If training helper columns were not defined in the feature view
then`training_helper_columns=True` will not have any effect. Defaults to `False`, no training helper
columns.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
same here

manu-sj added 23 commits March 11, 2024 07:35
…created during materialization and also reading from online feature store
…type is not spark and dataframe obtained is spark in _return_dataframe_type
@davitbzh self-requested a review on March 13, 2024 14:26
dataframe, pl.dataframe.frame.DataFrame
):
warnings.warn(
"Great Expectations does not support Polars dataframes directly using Great Expectations with Polars datarames can be slow."
Suggested change
"Great Expectations does not support Polars dataframes directly using Great Expectations with Polars datarames can be slow."
"Currently Great Expectations does not support Polars dataframes. This operation will convert to Pandas dataframe that can be slow."


# Returns
`pd.DataFrame` or `List[dict]`. Defaults to `pd.DataFrame`.
`pd.DataFrame`, `polars.DataFrame` or `List[dict]`. Defaults to `pd.DataFrame`.
It says "Defaults to `pd.DataFrame`." but the function says `dataframe_type: Optional[str] = "default"`. As above, set it correctly and explain what "default" is.

@manu-sj (Contributor, Author) replied:
In the `get_inference_helpers` function, the default value previously set for `return_type` was `"pandas"`; that is why it is mentioned as `pd.DataFrame`. I corrected the `return_type` comment to say "Defaults to `"pandas"`".

@@ -118,11 +119,15 @@ def read(
options: Any additional key/value options to be passed to the connector.
path: Path to be read from within the bucket of the storage connector. Not relevant
for JDBC or database based connectors such as Snowflake, JDBC or Redshift.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -288,6 +294,8 @@ def read(
data_format: The file format of the files to be read, e.g. `csv`, `parquet`.
options: Any additional key/value options to be passed to the S3 connector.
path: Path within the bucket to be read.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -471,6 +482,8 @@ def read(
data_format: Not relevant for JDBC based connectors such as Redshift.
options: Any additional key/value options to be passed to the JDBC connector.
path: Not relevant for JDBC based connectors such as Redshift.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -611,6 +627,8 @@ def read(
options: Any additional key/value options to be passed to the ADLS connector.
path: Path within the bucket to be read. For example, path=`path` will read directly from the container specified on connector by constructing the URI as 'abfss://[container-name]@[account_name].dfs.core.windows.net/[path]'.
If no path is specified default container path will be used from connector.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -802,6 +823,8 @@ def read(
data_format: Not relevant for Snowflake connectors.
options: Any additional key/value options to be passed to the engine.
path: Not relevant for Snowflake connectors.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -880,6 +906,8 @@ def read(
data_format: Not relevant for JDBC based connectors.
options: Any additional key/value options to be passed to the JDBC connector.
path: Not relevant for JDBC based connectors.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -1300,6 +1332,8 @@ def read(
data_format: Spark data format. Defaults to `None`.
options: Spark options. Defaults to `None`.
path: GCS path. Defaults to `None`.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
add explanation of "default" here as well

@@ -1479,6 +1516,8 @@ def read(
data_format: Spark data format. Defaults to `None`.
options: Spark options. Defaults to `None`.
path: BigQuery table path. Defaults to `None`.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
add explanation of "default" here as well

@davitbzh self-requested a review on March 14, 2024 08:21
@davitbzh (Contributor) left a comment:

LGTM

@davitbzh merged commit 88e21d3 into logicalclocks:master on Mar 14, 2024
11 checks passed