Is your feature request related to a problem? Please describe.
We should move away from using the hashed vocabulary file in subword_tokenizer and instead use the vocabulary text file directly.
This will make migrating pre-trained tokenizers to cuDF easier.
Describe the solution you'd like
Current Workflow:
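For reference, the current workflow requires pre-hashing the vocabulary text file. A minimal sketch, based on cuDF's documented hash_vocab utility (the output file name voc_hash.txt is illustrative):

from cudf.utils.hash_vocab_utils import hash_vocab
from cudf.core.subword_tokenizer import SubwordTokenizer

# Convert the raw vocabulary text file into the hashed format
# that SubwordTokenizer currently requires.
hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')
cudf_tokenizer = SubwordTokenizer('voc_hash.txt', do_lower_case=True)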
Suggested Workflow:

import cudf
from cudf.core.subword_tokenizer import SubwordTokenizer

# Build the tokenizer from the vocabulary text file directly.
cudf_tokenizer = SubwordTokenizer(vocab_file='bert-base-cased-vocab.txt',
                                  do_lower_case=True)

### Everything else looks the same
str_series = cudf.Series(['This is the', 'best book'])
tokenizer_output = cudf_tokenizer(str_series,
                                  max_length=8,
                                  max_num_rows=len(str_series),
                                  padding='max_length',
                                  return_tensors='pt',
                                  truncation=True)
Additional context
We have an issue with a hashing overflow error here: #12403. Instead of fixing that, we should just migrate away from that code and use the text file directly.
I would rather it accept a strings column like TokenizeVocabulary does now, so the vocab file could be in any format (and any location) and you could use cuIO APIs to load it from disk.
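For illustration, a rough sketch of that column-based approach, assuming a hypothetical vocabulary= parameter on SubwordTokenizer; cudf.read_text is cuDF's existing cuIO text reader, but the constructor shown here does not exist yet:

import cudf
from cudf.core.subword_tokenizer import SubwordTokenizer

# Load the raw vocab file into a strings column with cuIO,
# one vocabulary entry per line; any format or location that a
# cuIO reader supports would work here.
vocab = cudf.read_text('bert-base-cased-vocab.txt',
                       delimiter='\n', strip_delimiters=True)

# Hypothetical constructor taking the strings column directly,
# mirroring how a TokenizeVocabulary is built from a column today.
cudf_tokenizer = SubwordTokenizer(vocabulary=vocab, do_lower_case=True)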