Add new nvtext::normalize_characters API #17818
base: branch-25.04
Conversation
/ok to test
/ok to test
LGTM. Thanks for the work!
minor nit.
std::move(tokens_view));
}

character_normalizer::~character_normalizer() {}
Since we moved to using unique_ptr, we can default this destructor.
Actually, the compiler does not like that. With a ~character_normalizer() = default; declaration, the compiler tries to generate the destructor in each including TU (such as the test and benchmark .cpp files) but complains that it does not know the size of the _impl class type and reports an error. Keeping this empty destructor defined here means the compiler will not try to generate the destructor on its own (and fail at it).
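For context, here is a minimal sketch of the pattern being discussed, using simplified, illustrative names rather than the actual libcudf sources:

```cpp
// normalize.hpp (illustrative header)
#include <memory>

struct character_normalizer {
  character_normalizer();
  // Declared here, but deliberately *defined* in the source file where the
  // impl type is complete. A "= default" in this header would force every
  // including TU to instantiate std::unique_ptr's deleter for an incomplete
  // type, which is the compile error described above.
  ~character_normalizer();

 private:
  struct character_normalizer_impl;                  // incomplete here
  std::unique_ptr<character_normalizer_impl> _impl;  // pimpl member
};

// normalize.cpp (illustrative source file)
struct character_normalizer::character_normalizer_impl {
  // normalization tables and other state live here
};

character_normalizer::character_normalizer()
  : _impl{std::make_unique<character_normalizer_impl>()}
{
}

// Defining the destructor here (an empty body, or "= default" in this file)
// keeps the unique_ptr deleter instantiation in a TU that can see the
// complete impl type.
character_normalizer::~character_normalizer() {}
```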
Description
Adds new normalizer APIs as part of the rework for the subword-tokenizer.
The new API is split into two parts. First, a normalizer object is created with the appropriate state: lower-casing and special tokens. The normalizing tables are currently hardcoded inside libcudf; future versions may load these tables from some other source. The second API is given the input strings column and the normalizer object and returns a normalized strings column. The normalizer object can be reused on all subsequent normalize_characters calls.

The current nvtext::normalize_characters loads the normalizing tables on each call, which can be significant overhead. That API will be deprecated and replaced by these two new ones. Some utility functions from its implementation have been refactored to be used by both until the old one is removed.

The first API creates the normalizer object:
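A sketch of the intended usage; the factory name create_character_normalizer and its parameter list are assumptions based on the description above, not a confirmed signature:

```cpp
#include <cudf/strings/strings_column_view.hpp>
#include <nvtext/normalize.hpp>  // assumed header for the new normalizer API

#include <memory>

// Build the normalizer once with the desired state: a lower-casing flag and
// an optional strings column of special tokens. The normalizing tables are
// hardcoded inside libcudf for now.
std::unique_ptr<nvtext::character_normalizer> build_normalizer(
  cudf::strings_column_view const& special_tokens)
{
  return nvtext::create_character_normalizer(/*do_lower_case=*/true, special_tokens);
}
```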
The second API uses the normalizer on a strings column:
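A sketch reusing the normalizer object built above; input1, input2, and normalizer are placeholder variables and the exact parameter list is assumed:

```cpp
// Normalize any number of strings columns with the same normalizer object;
// its tables are built once and reused across calls.
auto result1 = nvtext::normalize_characters(cudf::strings_column_view(input1), *normalizer);
auto result2 = nvtext::normalize_characters(cudf::strings_column_view(input2), *normalizer);
```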
Using the Python interface:
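A hypothetical sketch of the Python side; the class name, import path, constructor arguments, and method name are assumptions modeled on the C++ API described above:

```python
import cudf
# Assumed import path; not confirmed by this description.
from cudf.core.character_normalizer import CharacterNormalizer

# Build the normalizer once (lower-casing on, optional special tokens),
# then reuse it on any number of strings columns.
special_tokens = cudf.Series(["[BOS]", "[EOS]"])
normalizer = CharacterNormalizer(do_lower=True, special_tokens=special_tokens)

result1 = normalizer.normalize(cudf.Series(["Héllo Wörld", "ACCÉNTED"]))
result2 = normalizer.normalize(cudf.Series(["another Batch of Strings"]))
```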
Checklist