-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FSTORE-1411] On-Demand Transformation Functions #1371
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments about the user facing API.
@@ -519,6 +629,8 @@ def get_udf(self, force_python_udf: bool = False) -> Callable: | |||
# Returns | |||
`Callable`: Pandas UDF in the spark engine otherwise returns a python function for the UDF. | |||
""" | |||
if self.udf_type is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it needed?
@@ -31,7 +32,18 @@ | |||
from hsfs.transformation_statistics import TransformationStatistics | |||
|
|||
|
|||
def udf(return_type: Union[List[type], type]) -> "HopsworksUdf": | |||
class UDFType(Enum): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually transformation type. The UDF is actually the same and can be used for both ODT and MDT. I understand this type is needed for the udf to return different column names, but imo this is better to be done in transformation_function
class and also the validation of the transformation type. But it is not critical, can be refactored later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I will move it out to transformation function class.
request_parameters: Optional[ | ||
Union[List[Dict[str, Any]], Dict[str, Any]] | ||
] = None, | ||
external: Optional[bool] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it needed? this function won't fetch features from online store right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you are right. There is no real need for a Vector server to be initlized. I think should probably refactor the code to move the functionality out of VectorServer
def transform( | ||
self, | ||
feature_vector: Union[List[Any], List[List[Any]], pd.DataFrame, pl.DataFrame], | ||
external: Optional[bool] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it needed? this function won't fetch features from online store right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are right it won't fetch features from the online store, but MDT's might require statistics of features. The statistics used while transforming depends on the training_dataset_version set in init_serving
so if init_serving
had not been done we don't know the statistics to be used for applying the transformation function. So I basically copied the code done in get_feature_vector
for initializing vector server.
I could be wrong here but I think there should be some check to see if init_serving
or init_batch_scoring
has been performed.
…d and adapting changes for save functions feature group
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (the rest of the comment will be addressed in a separate PR)
This PR adds functionality for On-demand transformation function as per the design log https://hopsworks.atlassian.net/wiki/spaces/FST/pages/461733903/On-Demand+Transformation+Functions+-+User+API+and+Implementation
Changes Done:
TransformationFunctionsAttached
from the python clientcompute_on_demand_features
transform
.JIRA Issue: https://hopsworks.atlassian.net/browse/FSTORE-1411
Priority for Review: -
Related PRs:
https://github.com/logicalclocks/hopsworks-ee/pull/1862
logicalclocks/hopsworks-chef#1148
https://github.com/logicalclocks/loadtest/pull/394
How Has This Been Tested?
Checklist For The Assigned Reviewer: