
Add fixes for tokenizer download scripts; byte fallback reports; and updated HF json vocab download script #749

Open
klei22 wants to merge 9 commits into ReaLLMASIC:master from klei22:add_fixes_for_tokenizer_download_script_3

Conversation

@klei22 klei22 (Collaborator) commented Feb 17, 2026

This pull request introduces several improvements and new features related to tokenization workflows, especially around byte token reporting, vocabulary management, and documentation for large-file processing. The most significant changes are the addition of byte tokenization reporting throughout the tokenizers and CLI, new helper scripts for working with Hugging Face vocabularies, and expanded documentation for Char-BPE and batch tokenization workflows. Some unused language data and code have also been removed for clarity.

Tokenization and reporting enhancements:

  • Added byte tokenization reporting to the tokenizers: new methods and attributes in tokenizers.py allow tracking and reporting the proportion of byte vs. non-byte tokens during tokenization, with CLI options to print and save reports. (data/template/tokenizers.py, data/template/prepare.py)
  • Added a new CLI flag --report_byte_tokenization to prepare.py to enable this reporting. (data/template/prepare.py)
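As a sketch of what such a report can compute, the following assumes a byte-fallback vocabulary whose first 256 ids are the byte tokens (matching the `is_byte_token` check quoted in the review below); the function name `byte_token_report` and the report keys are illustrative, not the PR's actual API:

```python
# Illustrative sketch, not the PR's implementation: count how many token ids
# fall in the byte-fallback range (assumed to be ids 0..255) and report the
# byte vs. non-byte split as counts and percentages.

def byte_token_report(token_ids, byte_vocab_size=256):
    """Return counts and percentages of byte vs. non-byte tokens."""
    total = len(token_ids)
    byte_count = sum(1 for t in token_ids if t < byte_vocab_size)
    non_byte_count = total - byte_count
    return {
        "total_tokens": total,
        "byte_tokens": byte_count,
        "non_byte_tokens": non_byte_count,
        "byte_percentage": 100.0 * byte_count / total if total else 0.0,
        "non_byte_percentage": 100.0 * non_byte_count / total if total else 0.0,
    }
```

A high byte percentage for a given language suggests the vocabulary covers its script poorly, which is the comparison the PR's byte_report.sh automates across languages.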

New helper scripts for vocabulary management:

  • Added download_vocab_set.py, a script to download Hugging Face tokenizer files, normalize token lists, and optionally map special whitespace markers to their literal equivalents. (data/template/premade_vocab_sets/download_vocab_set.py)
  • Added json_keys_to_json_array.py, a utility to convert a JSON vocabulary (mapping) to a sorted JSON array of tokens. (data/template/premade_vocab_sets/json_keys_to_json_array.py)
  • Added get_qwen.sh for downloading Qwen vocabularies. (data/template/premade_vocab_sets/get_qwen.sh)
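The two vocabulary utilities described above can be approximated with a short stdlib-only sketch. The function names, the sort order (by token id here; the script may sort differently), and the choice of whitespace markers (`▁` for SentencePiece, `Ġ` for GPT-2-style byte-level BPE) are assumptions, not the scripts' actual interfaces:

```python
# Hypothetical sketch of the vocab utilities; names and marker set are assumed.
SPACE_MARKERS = {"\u2581": " ", "\u0120": " "}  # SentencePiece "▁", GPT-2 "Ġ"

def vocab_mapping_to_array(vocab: dict) -> list:
    """Convert a {token: id} JSON mapping to a token list sorted by id."""
    return [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

def map_space_markers(tokens):
    """Replace special whitespace markers with their literal equivalents."""
    out = []
    for tok in tokens:
        for marker, literal in SPACE_MARKERS.items():
            tok = tok.replace(marker, literal)
        out.append(tok)
    return out
```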

Documentation and workflow updates:

  • Expanded the README to document Char-BPE tokenization, including training, reuse, and detailed instructions for large-file batch tokenization using partitioning and parallel processing. (data/template/README.md)

Automation and reproducibility:

  • Added byte_report.sh to automate byte tokenization reporting across multiple languages and tokenization methods. (data/flores200-res/byte_report.sh)

Cleanup and removal of unused language/data:

  • Removed references to text_shn_Mymr (Shan/Myanmar) from scripts and language lists, and commented out its phoneticization command, reflecting its deprecation or removal from the workflow. (data/flores200-res/get_dataset.sh, data/flores200-res/phoneticize.sh)

@klei22 klei22 requested review from Copilot and gkielian February 17, 2026 00:06

Copilot AI left a comment

Pull request overview

This PR adds comprehensive byte tokenization reporting and analysis capabilities to the tokenization workflow, along with improved batch processing utilities, vocabulary management scripts, and documentation for large-file workflows. The changes enable tracking and analyzing how often byte fallback tokens are used versus regular tokens across different tokenization methods, which is valuable for evaluating tokenizer efficiency on various languages and scripts.

Changes:

  • Added byte tokenization reporting infrastructure to track and report byte vs non-byte token usage statistics across tokenizers
  • Introduced utility scripts for downloading, normalizing, and managing Hugging Face vocabularies with whitespace marker mapping
  • Enhanced batch preparation scripts with parallel processing support and better configuration handling for large-file tokenization workflows

Reviewed changes

Copilot reviewed 13 out of 15 changed files in this pull request and generated 15 comments.

Summary per file:

  • data/template/utils/plot_byte_report.py: New visualization script for generating charts from byte tokenization reports (has critical API usage bugs)
  • data/template/utils/partition_file.py: Enhanced to support decimal chunk sizes and added input validation
  • data/template/utils/large_file_prepare.sh: Extended with parallel processing and tokenizer-specific configuration support
  • data/template/utils/batch_prepare.py: Major refactor adding parallel processing, better error handling, and support for char_bpe/json_byte_fallback tokenizers
  • data/template/tokenizers.py: Added byte token reporting methods and tracking to all byte-fallback tokenizers, plus formatting fixes
  • data/template/prepare.py: Integrated byte tokenization reporting with CLI flags and report file output
  • data/template/premade_vocab_sets/json_keys_to_json_array.py: New utility to convert JSON vocab mappings to sorted arrays
  • data/template/premade_vocab_sets/get_qwen.sh: New script to download Qwen vocabularies (has URL bug)
  • data/template/premade_vocab_sets/download_vocab_set.py: New comprehensive script for downloading and normalizing HF tokenizers
  • data/template/README.md: Expanded with Char-BPE documentation and detailed large-file batch tokenization workflows
  • data/flores200-res/phoneticize.sh: Commented out deprecated Shan/Myanmar phoneticization
  • data/flores200-res/get_dataset.sh: Removed text_shn_Mymr from language list
  • data/flores200-res/byte_report.sh: New automation script for running byte reports across languages and tokenizers


@@ -0,0 +1,3 @@
#!/bin/bash

wget -O qwen3_8b.json https://huggingface.co/Qwen/Qwen3-8B/blob/main/vocab.json

Copilot AI Feb 17, 2026

The URL uses the 'blob' path which is for viewing files in the browser, but should use 'resolve' or 'raw' for downloading files directly. This will likely result in downloading HTML content instead of the actual JSON file.

Suggested change
wget -O qwen3_8b.json https://huggingface.co/Qwen/Qwen3-8B/blob/main/vocab.json
wget -O qwen3_8b.json https://huggingface.co/Qwen/Qwen3-8B/resolve/main/vocab.json

self._build_vocab()

def is_byte_token(self, token_id):
return token_id < 256

Copilot AI Feb 17, 2026

The @staticmethod decorator should be on its own line, not on the same line as the previous method. This violates PEP 8 style guidelines which require decorators to be on their own line immediately before the function definition.

Suggested change
return token_id < 256
return token_id < 256

Comment on lines +67 to +68
fig.add_bar(name="byte_percentage", x=labels, y=byte_pct)
fig.add_bar(name="non_byte_percentage", x=labels, y=non_byte_pct)

Copilot AI Feb 17, 2026

The method fig.add_bar is not a valid Plotly API call. The correct method is fig.add_trace(go.Bar(...)). This will cause a runtime error when trying to generate bar charts.

Comment on lines +249 to +250
if args.output_tokenization_subdir:
_write_byte_token_report(tokenizer, "val", output_dir)

Copilot AI Feb 17, 2026

Same issue as with the train data: the byte token report writing is conditional on args.output_tokenization_subdir rather than having its own control flag. This couples two independent features together unnecessarily.
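One way to decouple the two features, as the comment suggests, is a dedicated CLI flag. The sketch below uses argparse with an invented `--save_byte_report` flag name; it is not the PR's actual option set:

```python
import argparse

# Hypothetical sketch: give the byte report its own flag instead of piggy-
# backing on --output_tokenization_subdir. Flag names here are invented.
parser = argparse.ArgumentParser()
parser.add_argument("--output_tokenization_subdir", default=None)
parser.add_argument("--save_byte_report", action="store_true",
                    help="Write byte token reports, independent of subdir output")

args = parser.parse_args(["--save_byte_report"])
if args.save_byte_report:
    # _write_byte_token_report(tokenizer, "val", output_dir)  # PR's helper
    pass
```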

Comment on lines +46 to +47
with open(output_path, "w", encoding="utf-8") as f:
json.dump(tokens, f, ensure_ascii=False, indent=2)

Copilot AI Feb 17, 2026

The script constructs file paths using user input without validation. An attacker could potentially use path traversal (e.g., "../../../etc/passwd") in the --output argument to write files to arbitrary locations on the filesystem. Consider validating the output path or using os.path.abspath and checking that the resolved path is within expected boundaries.
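A minimal sketch of the suggested validation, assuming outputs should stay under a known base directory (the helper name is invented):

```python
import os

# Hypothetical sketch of output-path validation: resolve the user-supplied
# path against a base directory and reject anything that escapes it.
def safe_output_path(user_path: str, base_dir: str) -> str:
    """Resolve user_path under base_dir, raising if it escapes base_dir."""
    base = os.path.abspath(base_dir)
    resolved = os.path.abspath(os.path.join(base, user_path))
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError(f"Output path escapes {base_dir!r}: {user_path!r}")
    return resolved
```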

Comment on lines +94 to +95
fig.add_bar(name="byte_percentage", x=labels, y=byte_pct)
fig.add_bar(name="non_byte_percentage", x=labels, y=non_byte_pct)

Copilot AI Feb 17, 2026

The method fig.add_bar is not a valid Plotly API call. The correct method is fig.add_trace(go.Bar(...)). This will cause a runtime error when trying to generate bar charts.

--max_parallel 4
```

The helper script also supports this (`3rd` arg = method-specific config path, `4th` arg = max parallelism, `5th` arg = chunk size in MB):

Copilot AI Feb 17, 2026

This refers to a "helper script" but doesn't clearly identify which script. While context suggests it's 'large_file_prepare.sh', being explicit would improve clarity. Consider changing "The helper script" to "The large_file_prepare.sh helper script".

Suggested change
The helper script also supports this (`3rd` arg = method-specific config path, `4th` arg = max parallelism, `5th` arg = chunk size in MB):
The `large_file_prepare.sh` helper script also supports this (`3rd` arg = method-specific config path, `4th` arg = max parallelism, `5th` arg = chunk size in MB):

os.makedirs(output_dir)

chunk_size = chunk_size_mb * 1024 * 1024 # Convert MB to bytes
chunk_size = max(1, int(chunk_size_mb * 1024 * 1024)) # Convert MB to bytes

Copilot AI Feb 17, 2026

If chunk_size_mb is between 0 and 1/1048576 (approximately 0.00000095), the conversion to bytes will result in a value less than 1, which gets clamped to 1 by the max() function. However, this creates a misleading situation where the user requests a tiny chunk size but gets 1 byte chunks instead. Consider adding a more reasonable minimum threshold (e.g., 0.001 MB or 1 KB) and raising a clearer error message if the value is too small.
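A sketch of the suggested check, using the reviewer's example minimum of 0.001 MB (roughly 1 KB); the constant and function names are invented, not the script's API:

```python
# Hypothetical sketch: reject unreasonably small chunk sizes with a clear
# error instead of silently clamping the byte count to 1.
MIN_CHUNK_MB = 0.001  # ~1 KB; reviewer's suggested floor, not a project constant

def chunk_size_bytes(chunk_size_mb: float) -> int:
    """Convert a chunk size in MB to bytes, enforcing a sane minimum."""
    if chunk_size_mb < MIN_CHUNK_MB:
        raise ValueError(
            f"chunk_size_mb={chunk_size_mb} is below the {MIN_CHUNK_MB} MB minimum"
        )
    return int(chunk_size_mb * 1024 * 1024)
```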

--max_parallel 4
```

Helper script form (`3rd` arg is method-specific config path, `4th` arg is max parallelism, `5th` arg is chunk size MB):

Copilot AI Feb 17, 2026

This refers to "Helper script form" but the description in parentheses doesn't match the parameter ordering shown in the earlier example. The earlier text says "3rd arg = method-specific config path, 4th arg = max parallelism, 5th arg = chunk size in MB" but this repeats the same text for the json_byte_fallback section. The description should be consistent or removed if it's redundant.

Suggested change
Helper script form (`3rd` arg is method-specific config path, `4th` arg is max parallelism, `5th` arg is chunk size MB):
Helper script form:

Comment on lines +118 to +121
try:
tokenizer_path = _download_tokenizer_file(repo_id, filename, revision, tmp_path)
except HTTPError:
continue

Copilot AI Feb 17, 2026

The function catches HTTPError but doesn't catch other potential exceptions like URLError (for network issues), OSError, or IOError. If a network error occurs (e.g., DNS failure, connection timeout), the exception will propagate up rather than trying the next filename. Consider catching a broader set of exceptions or at least URLError in addition to HTTPError.
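A sketch of the broader handling the comment suggests; `first_available` and its injected downloader are illustrative stand-ins for the PR's `_download_tokenizer_file` loop. Note that in Python 3, `URLError` subclasses `OSError` and `HTTPError` subclasses `URLError`, so catching `OSError` alone would already cover all three:

```python
from urllib.error import HTTPError, URLError

# Hypothetical sketch: try each candidate filename, skipping network and
# filesystem errors instead of letting them abort the whole fallback loop.
def first_available(filenames, download):
    """Return the first successful download result, or None if all fail."""
    for name in filenames:
        try:
            return download(name)
        except (HTTPError, URLError, OSError):  # OSError subsumes both; listed for clarity
            continue
    return None
```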
