Tokenizer to be used for prepare_memmap_dataset.py #790

Closed
WenJett opened this issue Jan 24, 2025 · 2 comments
WenJett commented Jan 24, 2025

Hi,

Appreciate your work done so far.

With the new release of OLMo 2, the tokenizer used seems to be allenai_dolma2.json, but in prepare_memmap_dataset.py the tokenizer is allenai/eleuther-ai-gpt-neox-20b-pii-special.

I understand that the above Python script has been deprecated, so I have also tried the Dolma tokenizer CLI with the example below.

dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32
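
As a sanity check on the flag values, here is a minimal sketch that reads the eos/pad ids from the tokenizer itself (assuming allenai/dolma2-tokenizer resolves from the Hugging Face Hub and that transformers is installed):

```python
from transformers import AutoTokenizer

# Pull the tokenizer the CLI is pointed at and print the ids that the
# --tokenizer.eos_token_id / --tokenizer.pad_token_id flags should match.
tok = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")
print("vocab size:", tok.vocab_size)
print("eos token:", tok.eos_token, "->", tok.eos_token_id)
print("pad token:", tok.pad_token, "->", tok.pad_token_id)  # may be None if no pad token is set
```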

Although a .npy file is generated, when I run it with official-1124/OLMo2-7B-stage2-seed42.yaml (modifying the data paths at the bottom), I get the error "unable to mmap an empty file".
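
A quick way to narrow down the "unable to mmap an empty file" error is to check whether the generated file actually contains tokens before pointing the config at it. A minimal sketch, assuming the output is a flat array of uint32 token ids and that part-00000.npy stands in for whatever file name was actually produced:

```python
import os
import numpy as np

path = "./part-00000.npy"  # placeholder for the file dolma tokens wrote

# A 0-byte file is exactly what triggers "unable to mmap an empty file".
print("size on disk:", os.path.getsize(path), "bytes")

# Assuming the file is a raw array of token ids written with --dtype uint32,
# np.memmap reads it back directly.
tokens = np.memmap(path, dtype=np.uint32, mode="r")
print("number of tokens:", tokens.shape[0])
print("first tokens:", tokens[:16])
```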

Hence, I was wondering:

  1. Is the correct tokenizer to use allenai/dolma2-tokenizer or allenai/dolma2-tokenizer-sigdig?
  2. Is there anything else I should include in the flags for the CLI?
  3. My data in data.json.gz does contain a 'text' field, which I assume is the bare minimum requirement? (A sketch of how this file is built is included below.)
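
For reference, here is a minimal sketch of how data.json.gz could be produced. The assumption is that each line of the gzipped file is a standalone JSON object and that 'id' and 'text' are the fields the tokenizer reads; any extra fields should simply be ignored.

```python
import gzip
import json

# Minimal documents file for `dolma tokens`: one JSON object per line,
# gzip-compressed. "id" and "text" are assumed to be the required fields.
docs = [
    {"id": "doc-0", "text": "First training document."},
    {"id": "doc-1", "text": "Second training document."},
]

with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```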

I hope you can provide some guidance for this matter.

Thank you.

WenJett added the type/question label Jan 24, 2025

aman-17 commented Jan 28, 2025

Hey @WenJett, we used the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer. This should fix the issue. Feel free to reopen the issue if you face any trouble.

Update: tokenizers/allenai_dolma2.json is used for OLMo2 (I was wrong earlier).

aman-17 closed this as completed Jan 28, 2025

WenJett commented Jan 29, 2025

Hi Aman,

Thanks for the guidance. I have tried your advice but am still facing difficulties.

First, I tried the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer to generate the .npy files using the dolma tokens CLI; however, it resulted in the following error when starting OLMo 2 stage 2 training.

CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero.

I also noted that in the OLMo2-7B-stage2-seed42.yaml script, the tokenizer is configured as follows:

tokenizer:
  identifier: tokenizers/allenai_dolma2.json
  truncate_direction: right

I also tried changing this to allenai_gpt-neox-olmo-dolma-v1_5.json, but it resulted in the error:
OLMoConfigurationError: vocab size mismatch between config and tokenizer
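
To see where the mismatch comes from, here is a rough sketch that compares the vocab size in the tokenizer JSON against the value in the training config (assuming the config keeps it under model.vocab_size and that both paths below match your checkout):

```python
import yaml
from tokenizers import Tokenizer

# Vocab size as defined by the tokenizer file referenced in the YAML.
tok = Tokenizer.from_file("tokenizers/allenai_dolma2.json")
print("tokenizer vocab size:", tok.get_vocab_size())

# Vocab size the model config expects (key name assumed to be model.vocab_size).
with open("official-1124/OLMo2-7B-stage2-seed42.yaml") as f:
    cfg = yaml.safe_load(f)
print("config vocab size:", cfg["model"].get("vocab_size"))
```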

I believe the tokenizer used should be consistent, and it seems the dolma2-tokenizer is what the OLMo 2 models use. May I get some clarification on this?

I also downloaded the source datasets (which are already in .npy format) listed in the OLMo2-7B-stage2-seed42.yaml script, and those seem to be tokenized with the dolma2-tokenizer. However, it does not work when I generate the data from my own dataset.
