
Tokenizer to be used for generation of data to .npy files #791

Closed
@WenJett


❓ The question

Hi,

I was unable to reopen the previous issue (#790), so I am creating a new issue and copying my response below.

Hi Aman,

Thanks for the guidance. I have tried your advice but am still facing difficulties.

Firstly, I tried using the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer to generate the .npy files with the dolma tokens CLI; however, it resulted in the following error when OLMo2 stage 2 training starts:

CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero.

I also noted that in the OLMo2-7B-stage2-seed42.yaml script, the tokenizer is configured as follows:
tokenizer:
  identifier: tokenizers/allenai_dolma2.json
  truncate_direction: right

I also tried changing this to allenai_gpt-neox-olmo-dolma-v1_5.json, but it resulted in the error:
OLMoConfigurationError: vocab size mismatch between config and tokenizer
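To clarify what I mean by the mismatch, here is a minimal sketch of the kind of check that seems to be failing at startup (the function name and the exact vocab sizes are my assumptions for illustration, not OLMo's actual code):

```python
# Illustrative sketch of a config/tokenizer consistency check.
# The vocab sizes below are my assumptions:
#   gpt-neox-olmo-dolma-v1_5 ~ 50280, dolma2-tokenizer ~ 100278.

def check_vocab(config_vocab_size: int, tokenizer_vocab_size: int) -> None:
    """Raise if the model config and the tokenizer disagree on vocab size."""
    if config_vocab_size != tokenizer_vocab_size:
        raise ValueError(
            "vocab size mismatch between config and tokenizer: "
            f"{config_vocab_size} != {tokenizer_vocab_size}"
        )

DOLMA2_VOCAB = 100278      # dolma2-tokenizer (assumed)
NEOX_DOLMA_VOCAB = 50280   # gpt-neox-olmo-dolma-v1_5 (assumed)

check_vocab(DOLMA2_VOCAB, DOLMA2_VOCAB)  # consistent pairing: passes

try:
    # Mixing an OLMo2 config with the gpt-neox tokenizer fails the check.
    check_vocab(DOLMA2_VOCAB, NEOX_DOLMA_VOCAB)
except ValueError as e:
    print(e)
```

So swapping only the tokenizer JSON in the YAML, without a matching model config, would trigger exactly this kind of error.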

I believe the tokenizer used should be consistent, and it seems the OLMo2 models use the dolma2-tokenizer. May I get some clarification on this?

I also downloaded the source dataset (already in .npy format) listed in the OLMo2-7B-stage2-seed42.yaml script, and those files seem to be tokenized with the dolma2-tokenizer. However, training does not work when I generate the data from my own dataset.
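For reference, this is roughly how I sanity-checked which tokenizer produced a given .npy file (the flat uint32 token-ID layout is my assumption about the file format, and the sample IDs below are fabricated):

```python
import os
import tempfile

import numpy as np

# Heuristic sketch: the largest token ID in a data file hints at which
# tokenizer produced it, since gpt-neox-olmo-dolma-v1_5 IDs stay below
# ~50280 while dolma2-tokenizer IDs range up to ~100278 (assumed sizes).

fake_ids = np.array([100256, 15, 42, 100257], dtype=np.uint32)  # fabricated dolma2-style IDs

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "part-0-00000.npy")
    # Assumption: the files are raw memmapped token IDs, not np.save output.
    fake_ids.tofile(path)

    loaded = np.memmap(path, dtype=np.uint32, mode="r")
    max_id = int(loaded.max())
    print(max_id)  # an ID near 100k cannot come from gpt-neox-olmo-dolma-v1_5
```

A max ID well above 50280 is what led me to conclude the published files were tokenized with the dolma2-tokenizer.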

Hope to get some direction on this issue. Thanks so much!

Metadata

Labels: type/question