[FSTORE-1186] Polars Integration into Hopsworks - Writing and Reading DataFrames #1221

Merged
davitbzh merged 28 commits into logicalclocks:master from manu-sj:FSTORE-1186 on Mar 14, 2024
Conversation

@manu-sj
Contributor

@manu-sj manu-sj commented Feb 13, 2024

This PR adds support for Polars, specifically writing Polars dataframes to a feature store and reading them back.

Changes:

  1. Write Support for Polars
  • Added support for the pyarrow types `large_list`, `large_string` and `large_binary`, since list, string and binary types are converted into the `large_list`, `large_string` and `large_binary` formats when a Polars dataframe is created from a Pandas dataframe.
  • Also added support for pyarrow dictionary types whose `value_type` can be any string type and whose `index_type` can be any integer type. This is done because category types in Polars are represented as a dictionary with a string `value_type` and an integer `index_type`.
  • Modified the functions `convert_to_default_dataframe`, `_write_dataframe_kafka` and `parse_schema_feature_group` to perform the same functionality using Polars.
  • In `_write_dataframe_kafka`, the conversion of the dataframe iterator's return values to native Python types is not performed for Polars, since `iter_rows` in Polars already returns data as Python types. (https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.iter_rows.html)
  • The `validate_with_great_expectations` function has a patch that converts the dataframe to Pandas when validating, since Great Expectations does not currently support Polars.
  2. Read Support for Polars
  • Added support for reading dataframes as lists, Pandas dataframes, Polars dataframes, and NumPy arrays from a feature group using both Hive and ArrowFlight.
  • Modified the Arrow Flight client to add support for reading Polars dataframes directly.
  • Modified the storage connectors for AWS S3 and HopsFS to add support for reading dataframes as lists, Pandas dataframes, Polars dataframes, and NumPy arrays. The other storage connectors were modified to add support for reading dataframes as lists, Pandas dataframes, and NumPy arrays; Polars dataframe support was not added since these storage connectors are not supported in Python.
  • Added support for reading training data, train-test data, and train-validation-test data as lists, Pandas dataframes, Polars dataframes, and NumPy arrays.
  • Modified the vector server to allow reading vectors and inference helpers as lists, Pandas dataframes, Polars dataframes, and NumPy arrays.
  • Training Dataset is not modified since the class is deprecated.
  • The Feature View `read_changes` function is not modified since it is deprecated.
  3. The Spark engine has also been modified so that it has the same function signatures as the Python engine for common functions.
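The pyarrow type handling described in item 1 can be sketched roughly as follows. This is a hypothetical, pure-Python sketch on type-name strings so it runs without pyarrow installed; the actual implementation works on `pyarrow.DataType` objects, and the function names here are illustrative, not the real Hopsworks code:

```python
# Sketch (not the actual implementation) of the type-mapping rules above.
# The real code inspects pyarrow.DataType objects; type-name strings are
# used here to keep the example self-contained.

# large_* types produced by Polars map to the same offline types
# as their non-large pyarrow counterparts.
_LARGE_TYPE_EQUIVALENTS = {
    "large_string": "string",
    "large_binary": "binary",
    "large_list": "list",
}

_STRING_TYPES = {"string", "large_string"}
_INTEGER_TYPES = {"int8", "int16", "int32", "int64",
                  "uint8", "uint16", "uint32", "uint64"}


def normalize_large_type(type_name: str) -> str:
    """Map a pyarrow large_* type name to its non-large equivalent."""
    return _LARGE_TYPE_EQUIVALENTS.get(type_name, type_name)


def is_supported_dictionary(value_type: str, index_type: str) -> bool:
    """A Polars categorical arrives as a pyarrow dictionary type; it is
    supported when value_type is a string type and index_type is an
    integer type."""
    return value_type in _STRING_TYPES and index_type in _INTEGER_TYPES


print(normalize_large_type("large_string"))               # string
print(is_supported_dictionary("large_string", "uint32"))  # True
print(is_supported_dictionary("float64", "int32"))        # False
```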

LoadTest - https://github.com/logicalclocks/loadtest/pull/286

JIRA Issue: https://hopsworks.atlassian.net/browse/FSTORE-1186

How Has This Been Tested?

  • Unit Tests
  • Integration Tests
  • Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] Checked if all type annotations were added and/or updated appropriately

Contributor

@davitbzh davitbzh left a comment


Don't you need to mention Polars in feature_store.py? Also, you need to provide a tutorial for Polars in https://github.com/logicalclocks/hopsworks-tutorials

primary_keys=False,
event_time=False,
inference_helper_columns=False,
dataframe_type: Optional[str] = "default",
Contributor


This is user interfacing function, is it possible to avoid dataframe_type = "default" or explain clearly what "default" means

for extra information. If inference helper columns were not defined in the feature view
`inference_helper_columns=True` will not have any effect. Defaults to `False`, no helper columns.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
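For context on the reviewer's question: `"default"` conventionally resolves to the engine's native dataframe type. The sketch below is hypothetical (the function name and exact resolution logic are illustrative, not the actual Hopsworks code):

```python
# Hypothetical sketch of how dataframe_type="default" could be resolved
# per engine; the actual Hopsworks resolution logic may differ.
def resolve_dataframe_type(dataframe_type: str, engine: str) -> str:
    """Resolve "default" to the engine's native dataframe type:
    a Spark dataframe on the Spark engine, Pandas on the Python engine.
    Explicit choices ("polars", "numpy", ...) pass through unchanged."""
    if dataframe_type != "default":
        return dataframe_type
    return "spark" if engine == "spark" else "pandas"


print(resolve_dataframe_type("default", "spark"))   # spark
print(resolve_dataframe_type("default", "python"))  # pandas
print(resolve_dataframe_type("polars", "python"))   # polars
```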
Contributor


see above, not sure what "default" means

then `training_helper_columns=True` will not have any effect. Defaults to `False`, no training helper
columns.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
Contributor


same here

primary_keys=False,
event_time=False,
training_helper_columns=False,
dataframe_type: Optional[str] = "default",
Contributor


same here

then `training_helper_columns=True` will not have any effect. Defaults to `False`, no training helper
columns.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
Contributor


same here

manu-sj added 23 commits March 11, 2024 07:35
…created during materialization and also reading from online feature store
…type is not spark and dataframe obtained is spark in _return_dataframe_type
@davitbzh davitbzh self-requested a review March 13, 2024 14:26
dataframe, pl.dataframe.frame.DataFrame
):
warnings.warn(
"Great Expectations does not support Polars dataframes directly using Great Expectations with Polars datarames can be slow."
Contributor


Suggested change
"Great Expectations does not support Polars dataframes directly using Great Expectations with Polars datarames can be slow."
"Currently Great Expectations does not support Polars dataframes. This operation will convert the dataframe to a Pandas dataframe, which can be slow."
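The patch under discussion detects a Polars dataframe and converts it to Pandas before validation. A minimal sketch of that guard is below; the real code checks `isinstance(df, pl.dataframe.frame.DataFrame)` and calls `df.to_pandas()`, while this sketch duck-types on the module name (and uses a stand-in class) so it runs without polars installed. The function name is hypothetical:

```python
import warnings

# Sketch of the validate_with_great_expectations patch described above.
# Real code: isinstance(df, pl.dataframe.frame.DataFrame) -> df.to_pandas().
# Here we duck-type on the class's module name to stay self-contained.


def ensure_pandas_for_validation(df):
    """Convert a Polars dataframe to Pandas (with a warning) before
    handing it to Great Expectations, which only supports Pandas."""
    if type(df).__module__.startswith("polars"):
        warnings.warn(
            "Currently Great Expectations does not support Polars dataframes. "
            "This operation will convert the dataframe to a Pandas dataframe, "
            "which can be slow."
        )
        return df.to_pandas()
    return df


class _FakePolarsFrame:  # stand-in so the sketch runs without polars
    def to_pandas(self):
        return "converted-to-pandas"


_FakePolarsFrame.__module__ = "polars.dataframe.frame"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(ensure_pandas_for_validation(_FakePolarsFrame()))  # converted-to-pandas
print(ensure_pandas_for_validation("already-pandas"))  # already-pandas
```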


# Returns
`pd.DataFrame` or `List[dict]`. Defaults to `pd.DataFrame`.
`pd.DataFrame`, `polars.DataFrame` or `List[dict]`. Defaults to `pd.DataFrame`.
Contributor


It says `Defaults to pd.DataFrame.` but the function says `dataframe_type: Optional[str] = "default"`. As above, set it correctly and explain what "default" is.

Contributor Author


In the `get_inference_helpers` function the default value set for `return_type` was previously `"pandas"`. That is why it is mentioned as `pd.DataFrame`. I corrected the `return_type` comment to say Defaults to `"pandas"`.

for JDBC or database based connectors such as Snowflake, JDBC or Redshift.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.

Contributor


add explanation of "default" here as well

path: Path within the bucket to be read.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.

Contributor


add explanation of "default" here as well

path: Not relevant for JDBC based connectors such as Redshift.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

Contributor


add explanation of "default" here as well

If no path is specified default container path will be used from connector.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

Contributor


add explanation of "default" here as well

path: Not relevant for Snowflake connectors.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

Contributor


add explanation of "default" here as well

path: Not relevant for JDBC based connectors.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

Contributor


add explanation of "default" here as well

options: Spark options. Defaults to `None`.
path: GCS path. Defaults to `None`.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
Contributor


add explanation of "default" here as well

options: Spark options. Defaults to `None`.
path: BigQuery table path. Defaults to `None`.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
Contributor


add explanation of "default" here as well

@davitbzh davitbzh self-requested a review March 14, 2024 08:21
Contributor

@davitbzh davitbzh left a comment


LGTM

@davitbzh davitbzh merged commit 88e21d3 into logicalclocks:master Mar 14, 2024