Add fixes for tokenizer download scripts, byte fallback reports, and updated HF json vocab download script (#749)
Conversation
Pull request overview
This PR adds comprehensive byte tokenization reporting and analysis capabilities to the tokenization workflow, along with improved batch processing utilities, vocabulary management scripts, and documentation for large-file workflows. The changes enable tracking and analyzing how often byte fallback tokens are used versus regular tokens across different tokenization methods, which is valuable for evaluating tokenizer efficiency on various languages and scripts.
Changes:
- Added byte tokenization reporting infrastructure to track and report byte vs non-byte token usage statistics across tokenizers
- Introduced utility scripts for downloading, normalizing, and managing Hugging Face vocabularies with whitespace marker mapping
- Enhanced batch preparation scripts with parallel processing support and better configuration handling for large-file tokenization workflows
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| data/template/utils/plot_byte_report.py | New visualization script for generating charts from byte tokenization reports (has critical API usage bugs) |
| data/template/utils/partition_file.py | Enhanced to support decimal chunk sizes and added input validation |
| data/template/utils/large_file_prepare.sh | Extended with parallel processing and tokenizer-specific configuration support |
| data/template/utils/batch_prepare.py | Major refactor adding parallel processing, better error handling, and support for char_bpe/json_byte_fallback tokenizers |
| data/template/tokenizers.py | Added byte token reporting methods and tracking to all byte-fallback tokenizers, plus formatting fixes |
| data/template/prepare.py | Integrated byte tokenization reporting with CLI flags and report file output |
| data/template/premade_vocab_sets/json_keys_to_json_array.py | New utility to convert JSON vocab mappings to sorted arrays |
| data/template/premade_vocab_sets/get_qwen.sh | New script to download Qwen vocabularies (has URL bug) |
| data/template/premade_vocab_sets/download_vocab_set.py | New comprehensive script for downloading and normalizing HF tokenizers |
| data/template/README.md | Expanded with Char-BPE documentation and detailed large-file batch tokenization workflows |
| data/flores200-res/phoneticize.sh | Commented out deprecated Shan/Myanmar phoneticization |
| data/flores200-res/get_dataset.sh | Removed text_shn_Mymr from language list |
| data/flores200-res/byte_report.sh | New automation script for running byte reports across languages and tokenizers |
@@ -0,0 +1,3 @@
#!/bin/bash

wget -O qwen3_8b.json https://huggingface.co/Qwen/Qwen3-8B/blob/main/vocab.json
The URL uses the 'blob' path, which serves the HTML file-viewer page, not the file itself; direct downloads need the 'resolve' path. As written, wget will likely save HTML content instead of the actual JSON file.
Suggested change:
- wget -O qwen3_8b.json https://huggingface.co/Qwen/Qwen3-8B/blob/main/vocab.json
+ wget -O qwen3_8b.json https://huggingface.co/Qwen/Qwen3-8B/resolve/main/vocab.json
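The blob/resolve distinction can be captured in a small helper. This is an illustrative sketch, not code from the PR; `hf_download_url` is a hypothetical name, and the URL scheme (`https://huggingface.co/<repo>/resolve/<revision>/<file>` redirecting to raw file contents, versus `/blob/` rendering an HTML viewer page) is standard Hugging Face Hub behavior:

```python
def hf_download_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for a file in a Hugging Face repo.

    Uses the 'resolve' path, which redirects to the raw file, rather than
    'blob', which serves the HTML file-viewer page.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


print(hf_download_url("Qwen/Qwen3-8B", "vocab.json"))
# → https://huggingface.co/Qwen/Qwen3-8B/resolve/main/vocab.json
```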
    self._build_vocab()

def is_byte_token(self, token_id):
    return token_id < 256
The @staticmethod decorator should be on its own line, not on the same line as the previous method. This violates PEP 8 style guidelines which require decorators to be on their own line immediately before the function definition.
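The PEP 8 placement the comment asks for can be sketched as follows. The class and surrounding methods are illustrative, not the PR's actual code, and the signature is adapted: a true `@staticmethod` takes no `self` parameter, unlike the `def is_byte_token(self, token_id)` shown in the diff:

```python
class ByteFallbackTokenizer:
    """Illustrative class; names and vocab size are hypothetical."""

    def __init__(self):
        self._build_vocab()

    def _build_vocab(self):
        self.vocab_size = 512  # placeholder value for the sketch

    @staticmethod  # decorator on its own line, directly above the def (PEP 8)
    def is_byte_token(token_id):
        # ids below 256 are reserved for the byte-fallback range
        return token_id < 256
```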
fig.add_bar(name="byte_percentage", x=labels, y=byte_pct)
fig.add_bar(name="non_byte_percentage", x=labels, y=non_byte_pct)
Note: fig.add_bar is a valid convenience method on plotly.graph_objects.Figure in Plotly 4 and later, generated as shorthand for fig.add_trace(go.Bar(...)). On older Plotly versions it does not exist and raises AttributeError; if those must be supported, use fig.add_trace(go.Bar(...)) explicitly.
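The `byte_pct` and `non_byte_pct` series being plotted can be derived from per-tokenizer counts. A minimal sketch of that computation, with the report structure and counts assumed for illustration (the PR's actual report format is not shown here):

```python
def byte_token_percentages(byte_count, non_byte_count):
    """Return (byte_pct, non_byte_pct) for one tokenizer's counts."""
    total = byte_count + non_byte_count
    if total == 0:
        return 0.0, 0.0
    return 100.0 * byte_count / total, 100.0 * non_byte_count / total


# Hypothetical per-tokenizer (byte_count, non_byte_count) pairs:
reports = {"char_bpe": (120, 880), "json_byte_fallback": (300, 700)}

labels = list(reports)
byte_pct = [byte_token_percentages(b, n)[0] for b, n in reports.values()]
non_byte_pct = [byte_token_percentages(b, n)[1] for b, n in reports.values()]
# byte_pct → [12.0, 30.0], non_byte_pct → [88.0, 70.0]
```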
if args.output_tokenization_subdir:
    _write_byte_token_report(tokenizer, "val", output_dir)
Same issue as with the train data: the byte token report writing is conditional on args.output_tokenization_subdir rather than having its own control flag. This couples two independent features together unnecessarily.
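One way to decouple the two features is to gate the report on its own flag. The `--report_byte_tokenization` flag does exist in this PR per the description; the surrounding argparse scaffolding below is an illustrative sketch, not the PR's actual `prepare.py`:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--output_tokenization_subdir", default=None)
# Independent flag, so byte reporting is not coupled to the subdir option.
parser.add_argument("--report_byte_tokenization", action="store_true")

args = parser.parse_args(["--report_byte_tokenization"])

# Gate the report on its own flag rather than on output_tokenization_subdir:
if args.report_byte_tokenization:
    print("would write byte token report")
```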
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(tokens, f, ensure_ascii=False, indent=2)
The script constructs file paths using user input without validation. An attacker could potentially use path traversal (e.g., "../../../etc/passwd") in the --output argument to write files to arbitrary locations on the filesystem. Consider validating the output path or using os.path.abspath and checking that the resolved path is within expected boundaries.
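The suggested `os.path.abspath` check can be sketched as a small helper. This is an illustrative validator, not code from the PR; `safe_output_path` and the base-directory argument are hypothetical:

```python
import os


def safe_output_path(user_path, base_dir):
    """Resolve user_path relative to base_dir, rejecting escapes.

    Raises ValueError on traversal attempts such as '../../etc/passwd'.
    """
    base = os.path.abspath(base_dir)
    resolved = os.path.abspath(os.path.join(base, user_path))
    # commonpath collapses to a shorter prefix if resolved leaves base
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError(f"output path escapes {base}: {user_path!r}")
    return resolved
```

A path like `safe_output_path("vocab.json", "out")` resolves normally, while `safe_output_path("../../etc/passwd", "out")` raises.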
--max_parallel 4
```

The helper script also supports this (`3rd` arg = method-specific config path, `4th` arg = max parallelism, `5th` arg = chunk size in MB):
This refers to a "helper script" but doesn't clearly identify which script. While context suggests it's 'large_file_prepare.sh', being explicit would improve clarity. Consider changing "The helper script" to "The large_file_prepare.sh helper script".
Suggested change:
- The helper script also supports this (`3rd` arg = method-specific config path, `4th` arg = max parallelism, `5th` arg = chunk size in MB):
+ The `large_file_prepare.sh` helper script also supports this (`3rd` arg = method-specific config path, `4th` arg = max parallelism, `5th` arg = chunk size in MB):
os.makedirs(output_dir)

- chunk_size = chunk_size_mb * 1024 * 1024  # Convert MB to bytes
+ chunk_size = max(1, int(chunk_size_mb * 1024 * 1024))  # Convert MB to bytes
If chunk_size_mb is between 0 and 1/1048576 (approximately 0.00000095), the conversion to bytes will result in a value less than 1, which gets clamped to 1 by the max() function. However, this creates a misleading situation where the user requests a tiny chunk size but gets 1 byte chunks instead. Consider adding a more reasonable minimum threshold (e.g., 0.001 MB or 1 KB) and raising a clearer error message if the value is too small.
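The suggested minimum-threshold validation can be sketched as follows. The 0.001 MB (1 KB) floor is the comment's suggestion, not a value from the PR, and the function name is hypothetical:

```python
MIN_CHUNK_MB = 0.001  # 1 KB floor, as suggested; not the PR's actual value


def chunk_size_bytes(chunk_size_mb):
    """Convert a chunk size in MB to bytes, rejecting unusably small values.

    Raises ValueError instead of silently clamping to 1-byte chunks.
    """
    if chunk_size_mb < MIN_CHUNK_MB:
        raise ValueError(
            f"chunk size {chunk_size_mb} MB is below the {MIN_CHUNK_MB} MB minimum"
        )
    return int(chunk_size_mb * 1024 * 1024)
```

For example, `chunk_size_bytes(0.5)` returns 524288 bytes, while `chunk_size_bytes(0.0000001)` raises a clear error rather than producing 1-byte chunks.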
--max_parallel 4
```

Helper script form (`3rd` arg is method-specific config path, `4th` arg is max parallelism, `5th` arg is chunk size MB):
This repeats the parenthetical parameter description already given in the earlier char_bpe example ("3rd arg = method-specific config path, 4th arg = max parallelism, 5th arg = chunk size in MB"). The repetition in the json_byte_fallback section is redundant; either keep the description in one place or drop the repeat.
Suggested change:
- Helper script form (`3rd` arg is method-specific config path, `4th` arg is max parallelism, `5th` arg is chunk size MB):
+ Helper script form:
try:
    tokenizer_path = _download_tokenizer_file(repo_id, filename, revision, tmp_path)
except HTTPError:
    continue
The loop catches HTTPError but not other network failures such as URLError (DNS failure, connection timeout) or lower-level OSError. If one of those occurs, the exception propagates up instead of trying the next filename. Since HTTPError subclasses URLError, which in turn subclasses OSError, catching URLError (or OSError, to also cover raw socket errors) would handle HTTP status errors as well and keep the fallback loop working.
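A sketch of the broader handling, with `fetch_first_available` and its `download` callback standing in for the PR's loop and `_download_tokenizer_file` helper (both names here are hypothetical):

```python
from urllib.error import HTTPError, URLError  # HTTPError ⊂ URLError ⊂ OSError


def fetch_first_available(filenames, download):
    """Try each filename in order, skipping any that fail to download.

    Catching OSError covers HTTPError and URLError (both are OSError
    subclasses in Python 3) plus raw socket-level failures.
    """
    for filename in filenames:
        try:
            return download(filename)
        except OSError:
            continue  # 404, DNS failure, timeout, etc. -- try the next file
    return None
```

Used with a downloader that 404s on the first candidate, the loop falls through to the next filename instead of crashing.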
This pull request introduces several improvements and new features related to tokenization workflows, especially around byte token reporting, vocabulary management, and documentation for large-file processing. The most significant changes are the addition of byte tokenization reporting throughout the tokenizers and CLI, new helper scripts for working with Hugging Face vocabularies, and expanded documentation for Char-BPE and batch tokenization workflows. Some unused language data and code have also been removed for clarity.
Tokenization and reporting enhancements:
- Added methods in `tokenizers.py` to track and report the proportion of byte vs. non-byte tokens during tokenization, with CLI options to print and save reports. (`data/template/tokenizers.py`, `data/template/prepare.py`)
- Added `--report_byte_tokenization` to `prepare.py` to enable this reporting. (`data/template/prepare.py`)

New helper scripts for vocabulary management:
- Added `download_vocab_set.py`, a script to download Hugging Face tokenizer files, normalize token lists, and optionally map special whitespace markers to their literal equivalents. (`data/template/premade_vocab_sets/download_vocab_set.py`)
- Added `json_keys_to_json_array.py`, a utility to convert a JSON vocabulary (mapping) to a sorted JSON array of tokens. (`data/template/premade_vocab_sets/json_keys_to_json_array.py`)
- Added `get_qwen.sh` for downloading Qwen vocabularies. (`data/template/premade_vocab_sets/get_qwen.sh`)

Documentation and workflow updates:
- Expanded `data/template/README.md` with Char-BPE documentation and detailed large-file batch tokenization workflows.

Automation and reproducibility:
- Added `byte_report.sh` to automate byte tokenization reporting across multiple languages and tokenization methods. (`data/flores200-res/byte_report.sh`)

Cleanup and removal of unused language/data:
- Removed `text_shn_Mymr` (Shan/Myanmar) from scripts and language lists, and commented out its phoneticization command, reflecting its deprecation or removal from the workflow. (`data/flores200-res/get_dataset.sh`, `data/flores200-res/phoneticize.sh`)