[FEA] Switch cudf.Subwordtokenizer to use the vocab file directly instead of hash_vocab. #14294

VibhuJawa · 2023-10-17T20:57:42Z

Is your feature request related to a problem? Please describe.

We currently rely on a hashed vocab file generated with cudf.utils.hash_vocab_utils.hash_vocab. We should move to using the vocab file directly.

This will be similar to the API we added here: #13930

Describe the solution you'd like

cudf_tokenizer = cudf.core.subword_tokenizer.SubwordTokenizer(vocab_file="xyz.txt")

Instead of earlier:

from cudf.core.subword_tokenizer import SubwordTokenizer
from cudf.utils.hash_vocab_utils import hash_vocab
hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')
cudf_tokenizer = SubwordTokenizer('voc_hash.txt')
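
For illustration only, here is a minimal pure-Python sketch of what reading the vocabulary directly involves, assuming the usual BERT layout of one token per line with the zero-based line index as the token id. `load_vocab` and `wordpiece` are hypothetical helpers, not cuDF functions, and the greedy longest-match loop is the textbook WordPiece scheme rather than cuDF's GPU implementation. The tiny vocabulary is made up.

```python
import io


def load_vocab(lines):
    """Map each token to its id; the id is the zero-based line index."""
    return {line.rstrip("\n"): idx for idx, line in enumerate(lines)}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocab ids."""
    ids, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matched at this position: the whole word is unknown.
            return [vocab[unk]]
        ids.append(vocab[match])
        start = end
    return ids


# Hypothetical five-entry vocabulary standing in for a real vocab.txt.
vocab = load_vocab(io.StringIO("[PAD]\n[UNK]\nplay\n##ing\n##ed\n"))
print(wordpiece("playing", vocab))
```

With a loader like this the token ids come straight from the text file, so no intermediate hashed file has to be generated or shipped alongside it.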

Additional context
This should make it easier to switch from a Hugging Face-style tokenizer.

CC: @davidwendt

@MarcRomeijn, @cwharris, @MarkMoTrin - Given your experience with the cuDF tokenizer, we'd value any feedback or suggestions for enhancing the Subword tokenizer API, as well as any features you would like to see.


davidwendt · 2023-11-17T19:20:08Z

Besides accepting the vocab file directly instead of the hashed input, I specifically want to ask about the parameters and return values for this function: https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.core.subword_tokenizer.subwordtokenizer.__call__/#cudf.core.subword_tokenizer.SubwordTokenizer.__call__
I believe some of these pre-date the lists-column support in cuDF, so I'm wondering if a lists column could be used as a better result than returning the three individual arrays.
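
To make the question concrete, here is a rough sketch in plain Python lists (no cuDF required) of how a padded `input_ids` array plus its `attention_mask` could collapse into one variable-length list per row, which is the shape a lists column can hold. The field names follow the documented return value of `__call__`; the `metadata` array is left out, `padded_to_lists` is a hypothetical helper, and the token ids are made up.

```python
def padded_to_lists(input_ids, attention_mask):
    """Keep only the ids whose mask entry is 1, one list per row."""
    return [
        [tok for tok, keep in zip(row, mask) if keep]
        for row, mask in zip(input_ids, attention_mask)
    ]


# Two rows padded to a fixed width of 5; 0 is the padding id here.
input_ids = [[101, 7592, 102, 0, 0], [101, 2088, 999, 102, 0]]
attention_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 0]]

ragged = padded_to_lists(input_ids, attention_mask)
print(ragged)
```

In the ragged form the padding and the separate mask are both implied by each row's length, which is the main appeal of a list-typed result. Callers that feed fixed-shape tensors to a model would still need to pad again.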

vyasr · 2024-10-31T00:57:29Z

@davidwendt based on #17208 should we close this if the function is going away entirely?

davidwendt · 2024-10-31T01:18:16Z

> @davidwendt based on #17208 should we close this if the function is going away entirely?

Not yet. We need to get Morpheus to move off the subword-tokenizer I think.
I'd like to deprecate the API in this release to help force this a bit.

vyasr · 2024-10-31T20:46:03Z

Sounds good, early deprecation sounds like the right move.

davidwendt · 2025-01-02T16:22:58Z

Closing this in favor of #17507

VibhuJawa added the "feature request" (New feature or request) and "Needs Triage" (Need team to review and classify) labels Oct 17, 2023

VibhuJawa assigned davidwendt Oct 17, 2023

VibhuJawa removed the "Needs Triage" (Need team to review and classify) label Oct 17, 2023

davidwendt closed this as completed Jan 2, 2025
