Add fixes for tokenizer download scripts, byte fallback reports, and updated HF json vocab download script (#749)
Conversation
Pull request overview
This PR adds comprehensive byte tokenization reporting and analysis capabilities to the tokenization workflow, along with improved batch processing utilities, vocabulary management scripts, and documentation for large-file workflows. The changes enable tracking and analyzing how often byte fallback tokens are used versus regular tokens across different tokenization methods, which is valuable for evaluating tokenizer efficiency on various languages and scripts.
Changes:
- Added byte tokenization reporting infrastructure to track and report byte vs non-byte token usage statistics across tokenizers
- Introduced utility scripts for downloading, normalizing, and managing Hugging Face vocabularies with whitespace marker mapping
- Enhanced batch preparation scripts with parallel processing support and better configuration handling for large-file tokenization workflows
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| data/template/utils/plot_byte_report.py | New visualization script for generating charts from byte tokenization reports (has critical API usage bugs) |
| data/template/utils/partition_file.py | Enhanced to support decimal chunk sizes and added input validation |
| data/template/utils/large_file_prepare.sh | Extended with parallel processing and tokenizer-specific configuration support |
| data/template/utils/batch_prepare.py | Major refactor adding parallel processing, better error handling, and support for char_bpe/json_byte_fallback tokenizers |
| data/template/tokenizers.py | Added byte token reporting methods and tracking to all byte-fallback tokenizers, plus formatting fixes |
| data/template/prepare.py | Integrated byte tokenization reporting with CLI flags and report file output |
| data/template/premade_vocab_sets/json_keys_to_json_array.py | New utility to convert JSON vocab mappings to sorted arrays |
| data/template/premade_vocab_sets/get_qwen.sh | New script to download Qwen vocabularies (has URL bug) |
| data/template/premade_vocab_sets/download_vocab_set.py | New comprehensive script for downloading and normalizing HF tokenizers |
| data/template/README.md | Expanded with Char-BPE documentation and detailed large-file batch tokenization workflows |
| data/flores200-res/phoneticize.sh | Commented out deprecated Shan/Myanmar phoneticization |
| data/flores200-res/get_dataset.sh | Removed text_shn_Mymr from language list |
| data/flores200-res/byte_report.sh | New automation script for running byte reports across languages and tokenizers |
@@ -0,0 +1,3 @@
#!/bin/bash

wget -O qwen3_8b.json https://huggingface.co/Qwen/Qwen3-8B/blob/main/vocab.json
The URL uses the 'blob' path, which serves the HTML file-viewer page, not the file itself; direct downloads need the 'resolve' path. As written, wget will likely save HTML content instead of the actual JSON file.
Suggested change:
- wget -O qwen3_8b.json https://huggingface.co/Qwen/Qwen3-8B/blob/main/vocab.json
+ wget -O qwen3_8b.json https://huggingface.co/Qwen/Qwen3-8B/resolve/main/vocab.json
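The blob/resolve distinction can be captured in a small helper. This is an illustrative sketch, not code from the PR; `hf_download_url` is a hypothetical name, and the URL scheme (`https://huggingface.co/<repo>/resolve/<revision>/<file>` redirecting to raw file contents, versus `/blob/` rendering an HTML viewer page) is standard Hugging Face Hub behavior:

```python
def hf_download_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for a file in a Hugging Face repo.

    Uses the 'resolve' path, which redirects to the raw file, rather than
    'blob', which serves the HTML file-viewer page.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


print(hf_download_url("Qwen/Qwen3-8B", "vocab.json"))
# → https://huggingface.co/Qwen/Qwen3-8B/resolve/main/vocab.json
```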
    self._build_vocab()

def is_byte_token(self, token_id):
    return token_id < 256
The @staticmethod decorator should be on its own line, not on the same line as the previous method. This violates PEP 8 style guidelines which require decorators to be on their own line immediately before the function definition.
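The PEP 8 placement the comment asks for can be sketched as follows. The class and surrounding methods are illustrative, not the PR's actual code, and the signature is adapted: a true `@staticmethod` takes no `self` parameter, unlike the `def is_byte_token(self, token_id)` shown in the diff:

```python
class ByteFallbackTokenizer:
    """Illustrative class; names and vocab size are hypothetical."""

    def __init__(self):
        self._build_vocab()

    def _build_vocab(self):
        self.vocab_size = 512  # placeholder value for the sketch

    @staticmethod  # decorator on its own line, directly above the def (PEP 8)
    def is_byte_token(token_id):
        # ids below 256 are reserved for the byte-fallback range
        return token_id < 256
```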
fig.add_bar(name="byte_percentage", x=labels, y=byte_pct)
fig.add_bar(name="non_byte_percentage", x=labels, y=non_byte_pct)
Note: fig.add_bar is a valid convenience method on plotly.graph_objects.Figure in Plotly 4 and later, generated as shorthand for fig.add_trace(go.Bar(...)). On older Plotly versions it does not exist and raises AttributeError; if those must be supported, use fig.add_trace(go.Bar(...)) explicitly.
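The `byte_pct` and `non_byte_pct` series being plotted can be derived from per-tokenizer counts. A minimal sketch of that computation, with the report structure and counts assumed for illustration (the PR's actual report format is not shown here):

```python
def byte_token_percentages(byte_count, non_byte_count):
    """Return (byte_pct, non_byte_pct) for one tokenizer's counts."""
    total = byte_count + non_byte_count
    if total == 0:
        return 0.0, 0.0
    return 100.0 * byte_count / total, 100.0 * non_byte_count / total


# Hypothetical per-tokenizer (byte_count, non_byte_count) pairs:
reports = {"char_bpe": (120, 880), "json_byte_fallback": (300, 700)}

labels = list(reports)
byte_pct = [byte_token_percentages(b, n)[0] for b, n in reports.values()]
non_byte_pct = [byte_token_percentages(b, n)[1] for b, n in reports.values()]
# byte_pct → [12.0, 30.0], non_byte_pct → [88.0, 70.0]
```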
if args.output_tokenization_subdir:
    _write_byte_token_report(tokenizer, "val", output_dir)
Same issue as with the train data: the byte token report writing is conditional on args.output_tokenization_subdir rather than having its own control flag. This couples two independent features together unnecessarily.
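One way to decouple the two features is to gate the report on its own flag. The `--report_byte_tokenization` flag does exist in this PR per the description; the surrounding argparse scaffolding below is an illustrative sketch, not the PR's actual `prepare.py`:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--output_tokenization_subdir", default=None)
# Independent flag, so byte reporting is not coupled to the subdir option.
parser.add_argument("--report_byte_tokenization", action="store_true")

args = parser.parse_args(["--report_byte_tokenization"])

# Gate the report on its own flag rather than on output_tokenization_subdir:
if args.report_byte_tokenization:
    print("would write byte token report")
```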
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(tokens, f, ensure_ascii=False, indent=2)
The script constructs file paths using user input without validation. An attacker could potentially use path traversal (e.g., "../../../etc/passwd") in the --output argument to write files to arbitrary locations on the filesystem. Consider validating the output path or using os.path.abspath and checking that the resolved path is within expected boundaries.
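The suggested `os.path.abspath` check can be sketched as a small helper. This is an illustrative validator, not code from the PR; `safe_output_path` and the base-directory argument are hypothetical:

```python
import os


def safe_output_path(user_path, base_dir):
    """Resolve user_path relative to base_dir, rejecting escapes.

    Raises ValueError on traversal attempts such as '../../etc/passwd'.
    """
    base = os.path.abspath(base_dir)
    resolved = os.path.abspath(os.path.join(base, user_path))
    # commonpath collapses to a shorter prefix if resolved leaves base
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError(f"output path escapes {base}: {user_path!r}")
    return resolved
```

A path like `safe_output_path("vocab.json", "out")` resolves normally, while `safe_output_path("../../etc/passwd", "out")` raises.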
--max_parallel 4
```

The helper script also supports this (`3rd` arg = method-specific config path, `4th` arg = max parallelism, `5th` arg = chunk size in MB):
This refers to a "helper script" but doesn't clearly identify which script. While context suggests it's 'large_file_prepare.sh', being explicit would improve clarity. Consider changing "The helper script" to "The large_file_prepare.sh helper script".
Suggested change:
- The helper script also supports this (`3rd` arg = method-specific config path, `4th` arg = max parallelism, `5th` arg = chunk size in MB):
+ The `large_file_prepare.sh` helper script also supports this (`3rd` arg = method-specific config path, `4th` arg = max parallelism, `5th` arg = chunk size in MB):
os.makedirs(output_dir)

- chunk_size = chunk_size_mb * 1024 * 1024  # Convert MB to bytes
+ chunk_size = max(1, int(chunk_size_mb * 1024 * 1024))  # Convert MB to bytes
If chunk_size_mb is between 0 and 1/1048576 (approximately 0.00000095), the conversion to bytes will result in a value less than 1, which gets clamped to 1 by the max() function. However, this creates a misleading situation where the user requests a tiny chunk size but gets 1 byte chunks instead. Consider adding a more reasonable minimum threshold (e.g., 0.001 MB or 1 KB) and raising a clearer error message if the value is too small.
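The suggested minimum-threshold validation can be sketched as follows. The 0.001 MB (1 KB) floor is the comment's suggestion, not a value from the PR, and the function name is hypothetical:

```python
MIN_CHUNK_MB = 0.001  # 1 KB floor, as suggested; not the PR's actual value


def chunk_size_bytes(chunk_size_mb):
    """Convert a chunk size in MB to bytes, rejecting unusably small values.

    Raises ValueError instead of silently clamping to 1-byte chunks.
    """
    if chunk_size_mb < MIN_CHUNK_MB:
        raise ValueError(
            f"chunk size {chunk_size_mb} MB is below the {MIN_CHUNK_MB} MB minimum"
        )
    return int(chunk_size_mb * 1024 * 1024)
```

For example, `chunk_size_bytes(0.5)` returns 524288 bytes, while `chunk_size_bytes(0.0000001)` raises a clear error rather than producing 1-byte chunks.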
--max_parallel 4
```

Helper script form (`3rd` arg is method-specific config path, `4th` arg is max parallelism, `5th` arg is chunk size MB):
This repeats the parenthetical parameter description already given in the earlier char_bpe example ("3rd arg = method-specific config path, 4th arg = max parallelism, 5th arg = chunk size in MB"). The repetition in the json_byte_fallback section is redundant; either keep the description in one place or drop the repeat.
Suggested change:
- Helper script form (`3rd` arg is method-specific config path, `4th` arg is max parallelism, `5th` arg is chunk size MB):
+ Helper script form:
try:
    tokenizer_path = _download_tokenizer_file(repo_id, filename, revision, tmp_path)
except HTTPError:
    continue
The loop catches HTTPError but not other network failures such as URLError (DNS failure, connection timeout) or lower-level OSError. If one of those occurs, the exception propagates up instead of trying the next filename. Since HTTPError subclasses URLError, which in turn subclasses OSError, catching URLError (or OSError, to also cover raw socket errors) would handle HTTP status errors as well and keep the fallback loop working.
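A sketch of the broader handling, with `fetch_first_available` and its `download` callback standing in for the PR's loop and `_download_tokenizer_file` helper (both names here are hypothetical):

```python
from urllib.error import HTTPError, URLError  # HTTPError ⊂ URLError ⊂ OSError


def fetch_first_available(filenames, download):
    """Try each filename in order, skipping any that fail to download.

    Catching OSError covers HTTPError and URLError (both are OSError
    subclasses in Python 3) plus raw socket-level failures.
    """
    for filename in filenames:
        try:
            return download(filename)
        except OSError:
            continue  # 404, DNS failure, timeout, etc. -- try the next file
    return None
```

Used with a downloader that 404s on the first candidate, the loop falls through to the next filename instead of crashing.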
This pull request introduces several improvements and new features related to tokenization workflows, especially around byte token reporting, vocabulary management, and documentation for large-file processing. The most significant changes are the addition of byte tokenization reporting throughout the tokenizers and CLI, new helper scripts for working with Hugging Face vocabularies, and expanded documentation for Char-BPE and batch tokenization workflows. Some unused language data and code have also been removed for clarity.
Tokenization and reporting enhancements:
- Added methods in `tokenizers.py` to track and report the proportion of byte vs. non-byte tokens during tokenization, with CLI options to print and save reports. (`data/template/tokenizers.py`, `data/template/prepare.py`)
- Added `--report_byte_tokenization` to `prepare.py` to enable this reporting. (`data/template/prepare.py`)

New helper scripts for vocabulary management:
- Added `download_vocab_set.py`, a script to download Hugging Face tokenizer files, normalize token lists, and optionally map special whitespace markers to their literal equivalents. (`data/template/premade_vocab_sets/download_vocab_set.py`)
- Added `json_keys_to_json_array.py`, a utility to convert a JSON vocabulary (mapping) to a sorted JSON array of tokens. (`data/template/premade_vocab_sets/json_keys_to_json_array.py`)
- Added `get_qwen.sh` for downloading Qwen vocabularies. (`data/template/premade_vocab_sets/get_qwen.sh`)

Documentation and workflow updates:
- Expanded `data/template/README.md` with Char-BPE documentation and detailed large-file batch tokenization workflows.

Automation and reproducibility:
- Added `byte_report.sh` to automate byte tokenization reporting across multiple languages and tokenization methods. (`data/flores200-res/byte_report.sh`)

Cleanup and removal of unused language/data:
- Removed `text_shn_Mymr` (Shan/Myanmar) from scripts and language lists, and commented out its phoneticization command, reflecting its deprecation or removal from the workflow. (`data/flores200-res/get_dataset.sh`, `data/flores200-res/phoneticize.sh`)