
Tokenizer to be used for generation of data to .npy files #791

Closed
@WenJett


❓ The question

Hi,

I was unable to reopen the previous issue (#790), so I am creating a new issue and copying my response below.

Hi Aman,

Thanks for the guidance. I have tried your advice but am still facing difficulties.

Firstly, I tried using the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer to generate the .npy files with the dolma tokens CLI; however, it resulted in the following error when OLMo2 stage 2 training starts:

CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero.

I also noted that in the OLMo2-7B-stage2-seed42.yaml script, the tokenizer is configured as follows:
tokenizer:
  identifier: tokenizers/allenai_dolma2.json
  truncate_direction: right

I also tried changing this to allenai_gpt-neox-olmo-dolma-v1_5.json, but it resulted in the error:
OLMoConfigurationError: vocab size mismatch between config and tokenizer
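To clarify what I mean by the mismatch, here is a minimal sketch of the kind of check that seems to be failing at startup (the function name and the exact vocab sizes are my assumptions for illustration, not OLMo's actual code):

```python
# Illustrative sketch of a config/tokenizer consistency check.
# The vocab sizes below are my assumptions:
#   gpt-neox-olmo-dolma-v1_5 ~ 50280, dolma2-tokenizer ~ 100278.

def check_vocab(config_vocab_size: int, tokenizer_vocab_size: int) -> None:
    """Raise if the model config and the tokenizer disagree on vocab size."""
    if config_vocab_size != tokenizer_vocab_size:
        raise ValueError(
            "vocab size mismatch between config and tokenizer: "
            f"{config_vocab_size} != {tokenizer_vocab_size}"
        )

DOLMA2_VOCAB = 100278      # dolma2-tokenizer (assumed)
NEOX_DOLMA_VOCAB = 50280   # gpt-neox-olmo-dolma-v1_5 (assumed)

check_vocab(DOLMA2_VOCAB, DOLMA2_VOCAB)  # consistent pairing: passes

try:
    # Mixing an OLMo2 config with the gpt-neox tokenizer fails the check.
    check_vocab(DOLMA2_VOCAB, NEOX_DOLMA_VOCAB)
except ValueError as e:
    print(e)
```

So swapping only the tokenizer JSON in the YAML, without a matching model config, would trigger exactly this kind of error.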

I believe the tokenizer used should be consistent, and it seems the OLMo2 models use the dolma2-tokenizer. May I get some clarification on this?

I also downloaded the source dataset (already in .npy format) listed in the OLMo2-7B-stage2-seed42.yaml script, and those files seem to be tokenized with the dolma2-tokenizer. However, training does not work when I generate the data from my own dataset.
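For reference, this is roughly how I sanity-checked which tokenizer produced a given .npy file (the flat uint32 token-ID layout is my assumption about the file format, and the sample IDs below are fabricated):

```python
import os
import tempfile

import numpy as np

# Heuristic sketch: the largest token ID in a data file hints at which
# tokenizer produced it, since gpt-neox-olmo-dolma-v1_5 IDs stay below
# ~50280 while dolma2-tokenizer IDs range up to ~100278 (assumed sizes).

fake_ids = np.array([100256, 15, 42, 100257], dtype=np.uint32)  # fabricated dolma2-style IDs

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "part-0-00000.npy")
    # Assumption: the files are raw memmapped token IDs, not np.save output.
    fake_ids.tofile(path)

    loaded = np.memmap(path, dtype=np.uint32, mode="r")
    max_id = int(loaded.max())
    print(max_id)  # an ID near 100k cannot come from gpt-neox-olmo-dolma-v1_5
```

A max ID well above 50280 is what led me to conclude the published files were tokenized with the dolma2-tokenizer.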

Hope to get some direction on this issue. Thanks so much!

Metadata

Labels: type/question