
[FEA] Switch Subword Tokenizer to use text file instead of hash file #17507

Open
VibhuJawa opened this issue Dec 4, 2024 · 1 comment
Labels: feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code)

VibhuJawa (Member):
Is your feature request related to a problem? Please describe.
We should move away from using a hashed vocabulary file in subword_tokenizer and instead use the vocabulary text file directly.

This will make migration of pre-trained tokenizers to cuDF easier.
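For example, the plain-text vocabulary that ships with a pre-trained model could then be passed to cuDF as-is. A minimal sketch, assuming huggingface_hub is available (the repo id and filename here are illustrative):

from huggingface_hub import hf_hub_download

# Fetch the plain-text vocabulary that ships with a pre-trained model.
# Under the proposed workflow, this path could be handed straight to
# SubwordTokenizer with no hash_vocab preprocessing step.
vocab_path = hf_hub_download(repo_id='bert-base-cased', filename='vocab.txt')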

Describe the solution you'd like

Current Workflow:

import cudf
from cudf.utils.hash_vocab_utils import hash_vocab
from cudf.core.subword_tokenizer import SubwordTokenizer

# Preprocessing step: convert the plain-text vocabulary into a hash file.
hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')

cudf_tokenizer = SubwordTokenizer(vocab_file='voc_hash.txt',
                                  do_lower_case=True)
str_series = cudf.Series(['This is the', 'best book'])
tokenizer_output = cudf_tokenizer(str_series,
                                  max_length=8,
                                  max_num_rows=len(str_series),
                                  padding='max_length',
                                  return_tensors='pt',
                                  truncation=True)
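
Per the cuDF docs, the call above returns a dict-like encoding of device tensors; this part would be unchanged by the proposal:

# tokenizer_output['input_ids']       token ids, padded/truncated to max_length
# tokenizer_output['attention_mask']  1 for real tokens, 0 for padding
# tokenizer_output['metadata']        row index and first/last non-padded token positions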

Suggested Workflow:

cudf_tokenizer = SubwordTokenizer(vocab_file='bert-base-cased-vocab.txt',
                                  do_lower_case=True)

# Everything else looks the same
str_series = cudf.Series(['This is the', 'best book'])
tokenizer_output = cudf_tokenizer(str_series,
                                  max_length=8,
                                  max_num_rows=len(str_series),
                                  padding='max_length',
                                  return_tensors='pt',
                                  truncation=True)

Additional context

There is a hashing overflow error tracked in #12403; rather than fixing that code, we should migrate away from it and use the text file directly.

VibhuJawa added the feature request label on Dec 4, 2024
davidwendt (Contributor):

I would rather it accept a strings column, like the TokenizeVocabulary one does now, so the vocab file could be in any format (and any location) and you can use cuIO APIs to load it from disk.
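
A minimal sketch of that shape, assuming the existing TokenizeVocabulary Python API in cudf; the vocabulary is loaded with cuIO's read_text and handed over as a strings column. Note that TokenizeVocabulary itself does whole-word lookup rather than subword tokenization, so this only illustrates the API shape, and exact signatures should be double-checked against the current cuDF release:

import cudf
from cudf.core.tokenize_vocabulary import TokenizeVocabulary

# Load the vocabulary (one token per line) with a cuIO API; any format or
# location cuIO can read would work equally well.
vocab = cudf.read_text('bert-base-cased-vocab.txt', delimiter='\n').str.strip()

# Build the tokenizer from a strings column rather than a hash file.
vocab_tokenizer = TokenizeVocabulary(vocab)
token_ids = vocab_tokenizer.tokenize(cudf.Series(['This is the', 'best book']))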
