Add new nvtext::normalize_characters API #17818
base: branch-25.04
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test
/ok to test
@kingcrimsontianyu and @karthikeyann would you two please lead the review for this contribution?
cpp/include/nvtext/normalize.hpp (outdated)
~character_normalizer();

struct character_normalizer_impl;
character_normalizer_impl* _impl{};
Would it be better to use std::unique_ptr<character_normalizer_impl> for the impl data member?
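For context, a minimal sketch of the unique_ptr-based pimpl variant being suggested; this abbreviates the class and is not the actual cudf header:

```cpp
#include <memory>

struct character_normalizer_impl;  // incomplete here; defined in the source file

struct character_normalizer {
  character_normalizer();
  // Declared here but defined in the source file where character_normalizer_impl
  // is a complete type, so std::unique_ptr's deleter can be instantiated.
  ~character_normalizer();

  std::unique_ptr<character_normalizer_impl> _impl;
};
```

The trade-off is that the destructor (and any move operations) must be defined out of line, but the raw pointer and manual cleanup go away.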
"P^NP", | ||
"$41.07", | ||
"[a,b]", | ||
"丏丟", |
Looked it up in the dictionary as I thought this was a literary word I didn't know 😃
I don't really know what this is, but it tests the normalizer paths I want. Hopefully it is not something offensive.
T const& value,
rmm::cuda_stream_view stream)
{
auto const copy_size = std::min(static_cast<std::size_t>(std::distance(first, last)),
Just a question: Is this because some iterators in libcudf or user-defined iterators may (incorrectly) use int as the iterator's difference_type as opposed to std::ptrdiff_t? I'm curious when this usually happens.
Many of the thrust functions have 2 different code paths depending on the range of the input iterators -- 32-bit and 64-bit. Unfortunately, building with thrust pulls in both even if we only use 1 of them. This bloats the code up to 2x in some cases. We generally only require 32-bit (size_type) iterator ranges, and we are able to compile out the thrust 64-bit iterator paths with some strategic patching in our cmake -- significantly reducing our binary size. Note, the bloat issue is something the CCCL team is working on, so the patch we use is meant to be temporary.
Anyway, this means in places where we actually need a 64-bit iterator range (like for reading the large strings character vectors we currently support), we need to work around the missing 64-bit range thrust APIs. The remove_copy_safe function and the remove_safe utility below work around the limit by calling the underlying thrust functions in batches of max<int> counts.
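For illustration, a minimal sketch of that batching idea, using a hypothetical helper (not the actual remove_copy_safe code) and the default thrust::device policy rather than the stream-aware policy cudf uses:

```cpp
#include <thrust/execution_policy.h>
#include <thrust/remove.h>

#include <algorithm>
#include <cstddef>
#include <limits>

// Hypothetical helper: runs thrust::remove_copy over a potentially 64-bit range
// in chunks no larger than max<int>, so only the 32-bit iterator path is needed.
template <typename InputIterator, typename OutputIterator, typename T>
OutputIterator remove_copy_in_batches(InputIterator first,
                                      InputIterator last,
                                      OutputIterator result,
                                      T const& value)
{
  auto const max_batch = static_cast<std::size_t>(std::numeric_limits<int>::max());
  auto remaining       = static_cast<std::size_t>(std::distance(first, last));
  while (remaining > 0) {
    auto const batch = std::min(remaining, max_batch);
    // Each call stays within a 32-bit iterator distance.
    result = thrust::remove_copy(thrust::device, first, first + batch, result, value);
    first += batch;
    remaining -= batch;
  }
  return result;
}
```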
Description
Adds new normalizer APIs as part of the rework for the subword-tokenizer.
The new API is split into 2 parts. First, a normalizer object is created with the appropriate state: lower-casing and special tokens. The normalizing tables are currently hardcoded inside libcudf; future versions may load these tables from some other source. The 2nd API is given the input strings column and the normalizer object and returns a normalized strings column. The normalizer object can be reused on all subsequent normalize_characters calls.

The current nvtext::normalize_characters loads the normalizing tables on each call, which can be significant overhead. That API will be deprecated and replaced by these 2 new ones. Some utility functions from its implementation have been refactored to be used by both until the old one is removed.

The first API creates the normalizer object:
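For example, creating it might look roughly like the following; the factory name create_character_normalizer and its parameters (a lower-casing flag plus a special-tokens column) are inferred from the description above rather than quoted from the final header:

```cpp
#include <nvtext/normalize.hpp>

#include <cudf/strings/strings_column_view.hpp>

#include <memory>

// Hypothetical usage sketch; parameter names and order are illustrative only.
std::unique_ptr<nvtext::character_normalizer> make_normalizer(
  cudf::strings_column_view const& special_tokens)
{
  // The normalizer state (lower-casing and special tokens) is captured once here.
  return nvtext::create_character_normalizer(/*do_lower_case=*/true, special_tokens);
}
```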
The 2nd API uses the normalizer on a strings column:
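Continuing the sketch above, applying it might look like this; the exact overload is again an assumption, the point being that one normalizer instance is reused across calls:

```cpp
// Hypothetical usage sketch; the exact overload may differ in the final API.
std::unique_ptr<cudf::column> normalize_all(cudf::strings_column_view const& input,
                                            nvtext::character_normalizer const& normalizer)
{
  // Reusing the same normalizer avoids reloading the normalizing tables on
  // every call, unlike the soon-to-be-deprecated nvtext::normalize_characters.
  return nvtext::normalize_characters(input, normalizer);
}
```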
The new normalizer is also exposed through the Python interface.
Checklist