Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Switch cudf.Subwordtokenizer to use the vocab file directly instead of hash_vocab. #14294

Closed
VibhuJawa opened this issue Oct 17, 2023 · 5 comments
Assignees
Labels
feature request New feature or request

Comments

@VibhuJawa
Copy link
Member

Is your feature request related to a problem? Please describe.

We currently rely on the hashed vocab file using cudf.utils.hash_vocab_utils.hash_vocab, we should move to using the vocab file directly.

This will be similar to the API we added here: #13930

Describe the solution you'd like

cudf_tokenizer = cudf.core.subword_tokenizer.SubwordTokenizer(vocab_file=xyz.txt)

Instead of earlier:

from cudf.utils.hash_vocab_utils import hash_vocab
hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')
cudf_tokenizer = SubwordTokenizer('voc_hash.txt')

Additional context
This should help the switch from hugging face like tokenizer to be easier.

CC: @davidwendt

@MarcRomeijn, @cwharris, @MarkMoTrin - Given your experience with the cuDF tokenizer, we'd value any feedback or suggestions for enhancing the Subword tokenizer API and features you would like.

@VibhuJawa VibhuJawa added feature request New feature or request Needs Triage Need team to review and classify labels Oct 17, 2023
@VibhuJawa VibhuJawa removed the Needs Triage Need team to review and classify label Oct 17, 2023
@davidwendt
Copy link
Contributor

Besides accepting the vocab file directly instead of the hashed input, specifically want to ask about the parameters and return values for this function: https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.core.subword_tokenizer.subwordtokenizer.__call__/#cudf.core.subword_tokenizer.SubwordTokenizer.__call__
I believe some of these pre-date the lists-column support in cuDF and so I'm wondering if that could used as be a better result than returning the three individual arrays.

@vyasr
Copy link
Contributor

vyasr commented Oct 31, 2024

@davidwendt based on #17208 should we close this if the function is going away entirely?

@davidwendt
Copy link
Contributor

@davidwendt based on #17208 should we close this if the function is going away entirely?

Not yet. We need to get Morpheus to move off the subword-tokenizer I think.
I'd like to deprecate the API in this release to help force this a bit.

@vyasr
Copy link
Contributor

vyasr commented Oct 31, 2024

Sounds good, early deprecation sounds like the right move.

@davidwendt
Copy link
Contributor

Closing this in favor of #17507

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants