
[FEA] Switch Subword Tokenizer to use text file instead of hash file #17507

Open
VibhuJawa opened this issue Dec 4, 2024 · 1 comment
Labels: feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code)

VibhuJawa (Member):
Is your feature request related to a problem? Please describe.
We should move away from using a hashed vocabulary file in subword_tokenizer and instead use the vocabulary text file directly.

This will make migration of pre-trained tokenizers to cuDF easier.
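For example, the plain-text vocabulary that ships with a pre-trained model could then be passed to cuDF as-is. A minimal sketch, assuming huggingface_hub is available (the repo id and filename here are illustrative):

from huggingface_hub import hf_hub_download

# Fetch the plain-text vocabulary that ships with a pre-trained model.
# Under the proposed workflow, this path could be handed straight to
# SubwordTokenizer with no hash_vocab preprocessing step.
vocab_path = hf_hub_download(repo_id='bert-base-cased', filename='vocab.txt')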

Describe the solution you'd like

Current Workflow:

import cudf
from cudf.utils.hash_vocab_utils import hash_vocab
from cudf.core.subword_tokenizer import SubwordTokenizer

# Preprocessing step: convert the plain-text vocabulary into a hash file.
hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')

cudf_tokenizer = SubwordTokenizer(vocab_file='voc_hash.txt',
                                  do_lower_case=True)
str_series = cudf.Series(['This is the', 'best book'])
tokenizer_output = cudf_tokenizer(str_series,
                                  max_length=8,
                                  max_num_rows=len(str_series),
                                  padding='max_length',
                                  return_tensors='pt',
                                  truncation=True)
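
Per the cuDF docs, the call above returns a dict-like encoding of device tensors; this part would be unchanged by the proposal:

# tokenizer_output['input_ids']       token ids, padded/truncated to max_length
# tokenizer_output['attention_mask']  1 for real tokens, 0 for padding
# tokenizer_output['metadata']        row index and first/last non-padded token positions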

Suggested Workflow:

cudf_tokenizer = SubwordTokenizer(vocab_file='bert-base-cased-vocab.txt',
                                  do_lower_case=True)

# Everything else looks the same
str_series = cudf.Series(['This is the', 'best book'])
tokenizer_output = cudf_tokenizer(str_series,
                                  max_length=8,
                                  max_num_rows=len(str_series),
                                  padding='max_length',
                                  return_tensors='pt',
                                  truncation=True)

Additional context

There is a hashing overflow error tracked in #12403; rather than fixing that code, we should migrate away from it and use the text file directly.

VibhuJawa added the feature request label on Dec 4, 2024
davidwendt (Contributor):

I would rather it accept a strings column, like the TokenizeVocabulary one does now, so the vocab file could be in any format (and any location) and you can use cuIO APIs to load it from disk.
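
A minimal sketch of that shape, assuming the existing TokenizeVocabulary Python API in cudf; the vocabulary is loaded with cuIO's read_text and handed over as a strings column. Note that TokenizeVocabulary itself does whole-word lookup rather than subword tokenization, so this only illustrates the API shape, and exact signatures should be double-checked against the current cuDF release:

import cudf
from cudf.core.tokenize_vocabulary import TokenizeVocabulary

# Load the vocabulary (one token per line) with a cuIO API; any format or
# location cuIO can read would work equally well.
vocab = cudf.read_text('bert-base-cased-vocab.txt', delimiter='\n').str.strip()

# Build the tokenizer from a strings column rather than a hash file.
vocab_tokenizer = TokenizeVocabulary(vocab)
token_ids = vocab_tokenizer.tokenize(cudf.Series(['This is the', 'best book']))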
