[FSTORE-1186] Polars Integration into Hopsworks - Writing and Reading DataFrames #1221
Conversation
Don't you need any mention of Polars in feature_store.py? Also, you need to provide a tutorial for Polars in https://github.com/logicalclocks/hopsworks-tutorials.
@@ -804,6 +804,7 @@ def get_batch_data(
     primary_keys=False,
     event_time=False,
     inference_helper_columns=False,
+    dataframe_type: Optional[str] = "default",
This is a user-facing function. Is it possible to avoid `dataframe_type = "default"`, or to explain clearly what `"default"` means?
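For context, `"default"` in APIs like this usually means "whatever the active engine produces natively". A minimal sketch of that idea; the names `resolve_dataframe_type`, `engine_type`, and `SUPPORTED` are illustrative assumptions, not the actual hsfs implementation:

```python
# Illustrative sketch only: how a "default" dataframe_type can resolve
# per engine. SUPPORTED and engine_type are assumptions, not hsfs code.
SUPPORTED = {"spark", "pandas", "polars", "numpy", "python"}

def resolve_dataframe_type(dataframe_type: str, engine_type: str) -> str:
    if dataframe_type == "default":
        # "default" defers to the engine: the Spark engine keeps Spark
        # dataframes, the Python client falls back to Pandas.
        return "spark" if engine_type == "spark" else "pandas"
    if dataframe_type not in SUPPORTED:
        raise ValueError(f"Unsupported dataframe_type: {dataframe_type!r}")
    return dataframe_type
```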
python/hsfs/feature_view.py (outdated)
@@ -859,8 +860,15 @@ def get_batch_data(
     that may not be used in training the model itself but can be used during batch or online inference
     for extra information. If inference helper columns were not defined in the feature view
     `inference_helper_columns=True` will not have any effect. Defaults to `False`, no helper columns.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
See above; not sure what `"default"` means.
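For reference, a hedged example of what the new parameter enables on this method; the feature view handle and the time range are made up for the sketch:

```python
# Hypothetical call: fetch batch inference data as a Polars dataframe
# instead of the engine-dependent default.
batch_df = feature_view.get_batch_data(
    start_time="2024-01-01",
    end_time="2024-02-01",
    dataframe_type="polars",
)
```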
python/hsfs/feature_view.py (outdated)
@@ -2012,6 +2022,8 @@ def training_data(
     extra information. If training helper columns were not defined in the feature view
     then `training_helper_columns=True` will not have any effect. Defaults to `False`, no training helper
     columns.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
same here
@@ -2070,6 +2083,7 @@ def train_test_split(
     primary_keys=False,
     event_time=False,
     training_helper_columns=False,
+    dataframe_type: Optional[str] = "default",
same here
python/hsfs/feature_view.py (outdated)
@@ -2176,6 +2190,8 @@ def train_test_split(
     extra information. If training helper columns were not defined in the feature view
     then `training_helper_columns=True` will not have any effect. Defaults to `False`, no training helper
     columns.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
same here
…created during materialization and also reading from online feature store
…type is not spark and dataframe obtained is spark in _return_dataframe_type
…hanges in spark engine
…make sure type check works over different polars versions
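On that last commit: a version-robust check is typically written against the public `pl.DataFrame` alias rather than an internal module path, since internal locations have moved between Polars releases. A minimal sketch, with a hypothetical helper name:

```python
import polars as pl

def _is_polars_dataframe(obj) -> bool:
    # pl.DataFrame is the stable public alias across Polars versions;
    # internal paths such as pl.dataframe.frame.DataFrame have moved
    # between releases, so avoid depending on them directly.
    return isinstance(obj, pl.DataFrame)
```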
python/hsfs/engine/python.py (outdated)
        dataframe, pl.dataframe.frame.DataFrame
    ):
        warnings.warn(
            "Great Expectations does not support Polars dataframes directly using Great Expectations with Polars datarames can be slow."
"Great Expectations does not support Polars dataframes directly using Great Expectations with Polars datarames can be slow." | |
"Currently Great Expectations does not support Polars dataframes. This operation will convert to Pandas dataframe that can be slow." |
    # Returns
-        `pd.DataFrame` or `List[dict]`. Defaults to `pd.DataFrame`.
+        `pd.DataFrame`, `polars.DataFrame` or `List[dict]`. Defaults to `pd.DataFrame`.
It says `Defaults to pd.DataFrame`, but the function says `dataframe_type: Optional[str] = "default"`. As above, set it correctly and explain what `"default"` is.
In the `get_inference_helpers` function, the default value set for `return_type` earlier was `"pandas"`; that is why it is mentioned as `pd.DataFrame`. I corrected the `return_type` comment to say it defaults to `"pandas"`.
@@ -118,11 +119,15 @@ def read(
     options: Any additional key/value options to be passed to the connector.
     path: Path to be read from within the bucket of the storage connector. Not relevant
         for JDBC or database based connectors such as Snowflake, JDBC or Redshift.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
Add an explanation of `"default"` here as well.
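For illustration, a hedged usage example of the new parameter on a connector read; the connector name `"my_s3"`, the path, and the `feature_store` handle are assumptions for the sketch:

```python
# Hypothetical usage: read through a storage connector and ask for a
# Polars dataframe explicitly instead of the engine-dependent default.
sc = feature_store.get_storage_connector("my_s3")
df = sc.read(
    data_format="parquet",
    path="datasets/transactions",
    dataframe_type="polars",
)
```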
@@ -288,6 +294,8 @@ def read(
     data_format: The file format of the files to be read, e.g. `csv`, `parquet`.
     options: Any additional key/value options to be passed to the S3 connector.
     path: Path within the bucket to be read.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"polars"`, `"numpy"` or `"python"`, defaults to `"default"`.
Add an explanation of `"default"` here as well.
@@ -471,6 +482,8 @@ def read(
     data_format: Not relevant for JDBC based connectors such as Redshift.
     options: Any additional key/value options to be passed to the JDBC connector.
     path: Not relevant for JDBC based connectors such as Redshift.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
Add an explanation of `"default"` here as well.
@@ -611,6 +627,8 @@ def read(
     options: Any additional key/value options to be passed to the ADLS connector.
     path: Path within the bucket to be read. For example, path=`path` will read directly from the container specified on connector by constructing the URI as 'abfss://[container-name]@[account_name].dfs.core.windows.net/[path]'.
         If no path is specified default container path will be used from connector.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
Add an explanation of `"default"` here as well.
@@ -802,6 +823,8 @@ def read(
     data_format: Not relevant for Snowflake connectors.
     options: Any additional key/value options to be passed to the engine.
     path: Not relevant for Snowflake connectors.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
Add an explanation of `"default"` here as well.
@@ -880,6 +906,8 @@ def read(
     data_format: Not relevant for JDBC based connectors.
     options: Any additional key/value options to be passed to the JDBC connector.
     path: Not relevant for JDBC based connectors.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
Add an explanation of `"default"` here as well.
python/hsfs/storage_connector.py (outdated)
@@ -1300,6 +1332,8 @@ def read(
     data_format: Spark data format. Defaults to `None`.
     options: Spark options. Defaults to `None`.
     path: GCS path. Defaults to `None`.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
Add an explanation of `"default"` here as well.
python/hsfs/storage_connector.py (outdated)
@@ -1479,6 +1516,8 @@ def read(
     data_format: Spark data format. Defaults to `None`.
     options: Spark options. Defaults to `None`.
     path: BigQuery table path. Defaults to `None`.
+    dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
+        `"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
Add an explanation of `"default"` here as well.
LGTM
This PR adds support for Polars, specifically writing and reading Polars dataframes to and from a feature store.
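As a quick end-to-end illustration (a hedged sketch: the login flow, feature group name, and schema below are assumptions, not part of this PR):

```python
import hopsworks
import polars as pl

project = hopsworks.login()
fs = project.get_feature_store()

df = pl.DataFrame({"id": [1, 2, 3], "amount": [10.5, 3.2, 7.8]})

# Writing: feature groups accept a Polars dataframe like a Pandas one.
fg = fs.get_or_create_feature_group(
    name="transactions", version=1, primary_key=["id"]
)
fg.insert(df)

# Reading: request a Polars dataframe back explicitly via dataframe_type.
df_back = fg.read(dataframe_type="polars")
```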
Changes:
LoadTest - https://github.com/logicalclocks/loadtest/pull/286
JIRA Issue: https://hopsworks.atlassian.net/browse/FSTORE-1186
How Has This Been Tested?
Checklist For The Assigned Reviewer: