[FSTORE-1186] Polars Integration into Hopsworks - Writing and Reading DataFrames #1221

Merged: 28 commits into logicalclocks:master on Mar 14, 2024

Conversation

@manu-sj (Contributor) commented on Feb 13, 2024:

This PR adds support for Polars, specifically writing Polars dataframes to and reading them from the feature store.

Changes:

  1. Write Support for Polars
  • Added support for the pyarrow types `large_list`, `large_string` and `large_binary`, since list, string and binary columns are converted into these large variants when a Polars dataframe is created from a Pandas dataframe (see the schema sketch after this list).
  • Also added support for pyarrow dictionary types whose `value_type` is any string type and whose `index_type` is any integer type. This is needed because categorical types in Polars are represented as dictionaries with a string `value_type` and an integer `index_type`.
  • Modified `convert_to_default_dataframe`, `_write_dataframe_kafka` and `parse_schema_feature_group` to provide the same functionality for Polars.
  • In `_write_dataframe_kafka`, the return values of the dataframe iterator are not converted to native Python types for Polars, since `iter_rows` in Polars already returns data as Python types (https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.iter_rows.html).
  • The `validate_with_great_expectations` function has a patch that converts the dataframe to Pandas before validating, since Great Expectations does not currently support Polars.
  2. Read Support for Polars
  • Added support for reading data from a feature group as a list, Pandas dataframe, Polars dataframe, or NumPy array using both Hive and ArrowFlight (see the read sketch after this list).
  • Modified the ArrowFlight client to support reading Polars dataframes directly.
  • Modified the AWS S3 and HopsFS storage connectors to support reading data as a list, Pandas dataframe, Polars dataframe, or NumPy array. The other storage connectors were modified to support reading data as a list, Pandas dataframe, or NumPy array; Polars support was not added for them since those connectors are not supported in Python.
  • Added support for reading training data, training-test data, and training-validation-test data as a list, Pandas dataframe, Polars dataframe, or NumPy array.
  • Modified the vector server to allow reading vectors and inference helpers as a list, Pandas dataframe, Polars dataframe, or NumPy array.
  • `TrainingDataset` is not modified since the class is deprecated.
  • The feature view `read_changes` function is not modified since it is deprecated.
  3. The Spark engine has also been modified so that its common functions have the same signatures as the Python engine.
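
To make the type handling above concrete, here is an illustrative sketch (not code from this PR) of how Polars surfaces pyarrow's large and dictionary types, and of `iter_rows` yielding native Python values; the exact printed schema may vary with the Polars version:

```python
import pandas as pd
import polars as pl

# Illustration only: Polars maps string/list/binary columns to pyarrow's
# "large" variants, and Pandas categoricals to pyarrow dictionary types.
pdf = pd.DataFrame(
    {
        "name": ["a", "b"],
        "scores": [[1, 2], [3]],
        "label": pd.Series(["x", "y"], dtype="category"),
    }
)
df = pl.from_pandas(pdf)
print(df.to_arrow().schema)
# name: large_string
# scores: large_list<item: int64>
# label: dictionary<values=large_string, indices=uint32, ordered=0>

# iter_rows already yields native Python values, so no extra conversion
# is needed before serializing rows (e.g. in _write_dataframe_kafka).
for row in df.iter_rows(named=True):
    print(row)  # e.g. {'name': 'a', 'scores': [1, 2], 'label': 'x'}
```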
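And a read-side usage sketch showing where the new `dataframe_type` parameter plugs in; the feature group/view names, versions, and the `fs` handle are hypothetical:

```python
import polars as pl

# Hypothetical usage; names and versions are made up for illustration.
# Assume `fs` is a feature-store handle obtained from a Hopsworks login.
fg = fs.get_feature_group("transactions", version=1)
df = fg.read(dataframe_type="polars")
assert isinstance(df, pl.DataFrame)

# Training data splits can also be materialized as Polars dataframes.
fv = fs.get_feature_view("transactions_view", version=1)
X_train, X_test, y_train, y_test = fv.train_test_split(
    test_size=0.2, dataframe_type="polars"
)
```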

LoadTest - https://github.com/logicalclocks/loadtest/pull/286

JIRA Issue: https://hopsworks.atlassian.net/browse/FSTORE-1186

How Has This Been Tested?

  • Unit Tests
  • Integration Tests
  • Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] Checked if all type annotations were added and/or updated appropriately

@davitbzh (Contributor) left a comment:

Don't you need any mention of Polars in feature_store.py? Also, you need to provide a tutorial for Polars in https://github.com/logicalclocks/hopsworks-tutorials

@@ -804,6 +804,7 @@ def get_batch_data(
primary_keys=False,
event_time=False,
inference_helper_columns=False,
dataframe_type: Optional[str] = "default",
This is a user-facing function; is it possible to avoid `dataframe_type="default"`, or to explain clearly what `"default"` means?

@@ -859,8 +860,15 @@ def get_batch_data(
that may not be used in training the model itself but can be used during batch or online inference
for extra information. If inference helper columns were not defined in the feature view
`inference_helper_columns=True` will not any effect. Defaults to `False`, no helper columns.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
See above; not sure what "default" means.
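
For context, `"default"` is resolved by the engine's `_return_dataframe_type` helper (mentioned in the commit messages below); the sketch that follows is an assumption about its Python-engine behavior, not the actual implementation:

```python
def _return_dataframe_type(dataframe, dataframe_type: str = "default"):
    # Assumed behavior on the Python engine: "default" acts like "pandas";
    # on the Spark engine it would return the Spark dataframe unchanged.
    if dataframe_type.lower() in ("default", "pandas"):
        return dataframe  # already a pandas.DataFrame on the Python engine
    if dataframe_type.lower() == "polars":
        import polars as pl
        return pl.from_pandas(dataframe)
    if dataframe_type.lower() == "numpy":
        return dataframe.values
    if dataframe_type.lower() == "python":
        return dataframe.values.tolist()
    raise TypeError(f"Dataframe type `{dataframe_type}` not supported.")
```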

@@ -2012,6 +2022,8 @@ def training_data(
extra information. If training helper columns were not defined in the feature view
then`training_helper_columns=True` will not have any effect. Defaults to `False`, no training helper
columns.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
same here

@@ -2070,6 +2083,7 @@ def train_test_split(
primary_keys=False,
event_time=False,
training_helper_columns=False,
dataframe_type: Optional[str] = "default",
same here

@@ -2176,6 +2190,8 @@ def train_test_split(
extra information. If training helper columns were not defined in the feature view
then`training_helper_columns=True` will not have any effect. Defaults to `False`, no training helper
columns.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
same here

manu-sj added 23 commits March 11, 2024 07:35
…created during materialization and also reading from online feature store
…type is not spark and dataframe obtained is spark in _return_dataframe_type
@davitbzh self-requested a review on March 13, 2024 14:26
dataframe, pl.dataframe.frame.DataFrame
):
warnings.warn(
"Great Expectations does not support Polars dataframes directly using Great Expectations with Polars datarames can be slow."
Suggested change
"Great Expectations does not support Polars dataframes directly using Great Expectations with Polars datarames can be slow."
"Currently Great Expectations does not support Polars dataframes. This operation will convert to Pandas dataframe that can be slow."


# Returns
`pd.DataFrame` or `List[dict]`. Defaults to `pd.DataFrame`.
`pd.DataFrame`, `polars.DataFrame` or `List[dict]`. Defaults to `pd.DataFrame`.
It says "Defaults to `pd.DataFrame`." but the function says `dataframe_type: Optional[str] = "default"`. As above, set it correctly and explain what "default" is.

@manu-sj (Contributor, Author) replied:
In the `get_inference_helpers` function, the default value previously set for `return_type` was `"pandas"`; that is why it is mentioned as `pd.DataFrame`. I corrected the `return_type` comment to say "Defaults to `"pandas"`".

@@ -118,11 +119,15 @@ def read(
options: Any additional key/value options to be passed to the connector.
path: Path to be read from within the bucket of the storage connector. Not relevant
for JDBC or database based connectors such as Snowflake, JDBC or Redshift.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -288,6 +294,8 @@ def read(
data_format: The file format of the files to be read, e.g. `csv`, `parquet`.
options: Any additional key/value options to be passed to the S3 connector.
path: Path within the bucket to be read.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -471,6 +482,8 @@ def read(
data_format: Not relevant for JDBC based connectors such as Redshift.
options: Any additional key/value options to be passed to the JDBC connector.
path: Not relevant for JDBC based connectors such as Redshift.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -611,6 +627,8 @@ def read(
options: Any additional key/value options to be passed to the ADLS connector.
path: Path within the bucket to be read. For example, path=`path` will read directly from the container specified on connector by constructing the URI as 'abfss://[container-name]@[account_name].dfs.core.windows.net/[path]'.
If no path is specified default container path will be used from connector.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -802,6 +823,8 @@ def read(
data_format: Not relevant for Snowflake connectors.
options: Any additional key/value options to be passed to the engine.
path: Not relevant for Snowflake connectors.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -880,6 +906,8 @@ def read(
data_format: Not relevant for JDBC based connectors.
options: Any additional key/value options to be passed to the JDBC connector.
path: Not relevant for JDBC based connectors.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.

add explanation of "default" here as well

@@ -1300,6 +1332,8 @@ def read(
data_format: Spark data format. Defaults to `None`.
options: Spark options. Defaults to `None`.
path: GCS path. Defaults to `None`.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
add explanation of "default" here as well

@@ -1479,6 +1516,8 @@ def read(
data_format: Spark data format. Defaults to `None`.
options: Spark options. Defaults to `None`.
path: BigQuery table path. Defaults to `None`.
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
add explanation of "default" here as well

@davitbzh self-requested a review on March 14, 2024 08:21
@davitbzh (Contributor) left a comment:

LGTM

@davitbzh merged commit 88e21d3 into logicalclocks:master on Mar 14, 2024
11 checks passed