[FSTORE-1411] On-Demand Transformation Functions #1371

Merged
merged 4 commits into from
Jul 12, 2024

@manu-sj manu-sj commented Jul 10, 2024

This PR adds on-demand transformation functions, as described in the design log: https://hopsworks.atlassian.net/wiki/spaces/FST/pages/461733903/On-Demand+Transformation+Functions+-+User+API+and+Implementation

Changes Done:

  1. Allow custom UDFs to drop features after transformations. This is implemented by adding a drop parameter to the decorator:

         @hopsworks.udf(return_type=float, drop=["arg1", "arg2"])
         def custom_udf(arg1, arg2, arg3):
             ...
             return value

  2. Removed TransformationFunctionsAttached from the Python client.
  3. Added the ability to attach transformation functions to feature groups (on-demand transformations).
  4. Added to feature view the ability to recompute on-demand features using request parameters.
  5. Added to feature view the ability to manually compute on-demand features using the function compute_on_demand_features.
  6. Added to feature view the ability to manually apply model-dependent transformations to a feature vector using the function transform.
  7. Added the ability to access executable model-dependent transformation functions through a dictionary keyed by output column name. (Only the output column name is unique for a model-dependent transformation function.)
  8. Added the ability to access executable on-demand transformation functions through a dictionary keyed by on-demand feature name. (Only the output column name is unique for a model-dependent transformation function.)
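The drop behaviour in the first item can be illustrated with a minimal, self-contained sketch. This is not the hsfs implementation: the udf decorator and the apply_udf helper below are hypothetical stand-ins written in plain Python, and the output name "custom_udf" is an assumption chosen for illustration. The sketch only shows the intended semantics, namely that the features listed in drop are removed from the row once the transformation output has been computed.

```python
from typing import Any, Callable, Dict, List, Optional


def udf(return_type: type, drop: Optional[List[str]] = None) -> Callable:
    """Hypothetical stand-in for a UDF decorator; NOT the hsfs implementation."""

    def decorator(func: Callable) -> Callable:
        # Record the metadata on the function so a caller can inspect it later.
        func.return_type = return_type
        func.dropped_features = list(drop or [])
        return func

    return decorator


def apply_udf(func: Callable, row: Dict[str, Any], output_name: str) -> Dict[str, Any]:
    """Hypothetical helper: apply func to one row, then drop the listed features."""
    # Pass only the columns whose names match the function's arguments.
    arg_names = func.__code__.co_varnames[: func.__code__.co_argcount]
    value = func(*(row[name] for name in arg_names))
    # Keep every column not marked for dropping, then add the output column.
    result = {k: v for k, v in row.items() if k not in func.dropped_features}
    result[output_name] = func.return_type(value)
    return result


@udf(return_type=float, drop=["arg1", "arg2"])
def custom_udf(arg1, arg2, arg3):
    return arg1 + arg2 + arg3


row = {"arg1": 1, "arg2": 2, "arg3": 3}
transformed = apply_udf(custom_udf, row, output_name="custom_udf")
# arg1 and arg2 are gone; arg3 and the new output column remain.
```

In this sketch arg3 survives because it is not listed in drop, even though it is also an input to the function.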

JIRA Issue: https://hopsworks.atlassian.net/browse/FSTORE-1411

Priority for Review: -

Related PRs:
https://github.com/logicalclocks/hopsworks-ee/pull/1862
logicalclocks/hopsworks-chef#1148
https://github.com/logicalclocks/loadtest/pull/394

How Has This Been Tested?

  • Unit Tests
  • Integration Tests
  • Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] (Checked if all type annotations were added and/or updated appropriately)

@manu-sj manu-sj requested a review from kennethmhc July 10, 2024 07:27
Contributor

@kennethmhc kennethmhc left a comment

Comments about the user facing API.

python/hsfs/hopsworks_udf.py (outdated, resolved)
python/hsfs/hopsworks_udf.py (resolved)
python/hsfs/hopsworks_udf.py (outdated, resolved)
@@ -519,6 +629,8 @@ def get_udf(self, force_python_udf: bool = False) -> Callable:
# Returns
`Callable`: Pandas UDF in the spark engine otherwise returns a python function for the UDF.
"""
if self.udf_type is None:
Contributor

why is it needed?

@@ -31,7 +32,18 @@
from hsfs.transformation_statistics import TransformationStatistics


def udf(return_type: Union[List[type], type]) -> "HopsworksUdf":
class UDFType(Enum):
Contributor

This is actually the transformation type. The UDF itself is the same and can be used for both ODT and MDT. I understand this type is needed for the UDF to return different column names, but IMO this is better done in the transformation_function class, along with the validation of the transformation type. It is not critical, though, and can be refactored later.

Contributor Author

You are right. I will move it out to the transformation function class.
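A rough sketch of what that refactor could look like, under stated assumptions: the names TransformationType and TransformationFunction, the field names, and the column naming scheme are all hypothetical and chosen for illustration; none of them are taken from the hsfs code base. The idea is that the UDF stays a plain callable, while the wrapper owns the transformation type, validates it, and derives the output column name from it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class TransformationType(Enum):
    """Hypothetical enum; the type belongs to the transformation, not the UDF."""

    MODEL_DEPENDENT = "model_dependent"
    ON_DEMAND = "on_demand"


@dataclass
class TransformationFunction:
    """Hypothetical wrapper that pairs a plain UDF with a transformation type."""

    udf: Callable
    input_features: List[str]
    transformation_type: TransformationType

    def __post_init__(self) -> None:
        # Validation lives here, so the same UDF can be reused for ODT and MDT.
        if not isinstance(self.transformation_type, TransformationType):
            raise ValueError("transformation_type must be a TransformationType")

    @property
    def output_column_name(self) -> str:
        # Illustrative naming scheme only; the real naming rules may differ.
        if self.transformation_type is TransformationType.ON_DEMAND:
            return self.udf.__name__
        return f"{self.udf.__name__}_{'_'.join(self.input_features)}"


def add_one(x):
    return x + 1


# The same UDF is wrapped twice, once per transformation type.
odt = TransformationFunction(add_one, ["x"], TransformationType.ON_DEMAND)
mdt = TransformationFunction(add_one, ["x"], TransformationType.MODEL_DEPENDENT)
```

Keeping the type on the wrapper means add_one itself carries no state about how it is used.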

request_parameters: Optional[
Union[List[Dict[str, Any]], Dict[str, Any]]
] = None,
external: Optional[bool] = None,
Contributor

Why is it needed? This function won't fetch features from the online store, right?

Contributor Author

Yes, you are right. There is no real need for a vector server to be initialized. I think I should probably refactor the code to move this functionality out of VectorServer.

def transform(
self,
feature_vector: Union[List[Any], List[List[Any]], pd.DataFrame, pl.DataFrame],
external: Optional[bool] = None,
Contributor

Why is it needed? This function won't fetch features from the online store, right?

Contributor Author

Yes, you are right that it won't fetch features from the online store, but MDTs might require feature statistics. The statistics used during transformation depend on the training_dataset_version set in init_serving, so if init_serving has not been called we don't know which statistics to use when applying the transformation function. So I basically copied the code used in get_feature_vector for initializing the vector server.

I could be wrong here but I think there should be some check to see if init_serving or init_batch_scoring has been performed.
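The kind of check suggested here could be sketched as follows. This is a toy model, not the hsfs FeatureView: the class FeatureViewSketch, its attributes, the shape of the statistics dictionary, and the min-max scaling used as the model-dependent transformation are all assumptions made for illustration. It only shows the guard itself, which refuses to transform until serving has been initialized with a training dataset version and its statistics.

```python
from typing import Dict, Optional


class FeatureViewSketch:
    """Hypothetical stand-in for a feature view; NOT the hsfs FeatureView."""

    def __init__(self) -> None:
        self._training_dataset_version: Optional[int] = None
        self._statistics: Dict[str, Dict[str, float]] = {}

    def init_serving(
        self, training_dataset_version: int, statistics: Dict[str, Dict[str, float]]
    ) -> None:
        # In this sketch the statistics are passed in directly; a real client
        # would look them up for the given training dataset version.
        self._training_dataset_version = training_dataset_version
        self._statistics = statistics

    def transform(self, feature_vector: Dict[str, float]) -> Dict[str, float]:
        # Guard: without an initialized version the statistics are unknown.
        if self._training_dataset_version is None:
            raise RuntimeError("Serving is not initialized; call init_serving first.")
        transformed = dict(feature_vector)
        for name, stats in self._statistics.items():
            if name in transformed:
                # Min-max scaling as an example of a statistics-dependent MDT.
                span = stats["max"] - stats["min"]
                transformed[name] = (transformed[name] - stats["min"]) / span
        return transformed


fv = FeatureViewSketch()
try:
    fv.transform({"amount": 5.0})
    guard_triggered = False
except RuntimeError:
    guard_triggered = True

fv.init_serving(1, {"amount": {"min": 0.0, "max": 10.0}})
scaled = fv.transform({"amount": 5.0, "count": 3.0})
```

A similar guard could cover init_batch_scoring; that path is left out of the sketch.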

python/hsfs/core/vector_server.py (outdated, resolved)
…d and adapting changes for save functions feature group
Contributor

@kennethmhc kennethmhc left a comment

LGTM (the rest of the comments will be addressed in a separate PR)
