Tokenizer to be used for prepare_memmap_dataset.py #790

Closed
WenJett opened this issue Jan 24, 2025 · 2 comments
WenJett commented Jan 24, 2025

Hi,

Appreciate your work done so far.

With the new release of OLMo 2, the tokenizer used seems to be allenai_dolma2.json, but in prepare_memmap_dataset.py the tokenizer is allenai/eleuther-ai-gpt-neox-20b-pii-special.

I understand that the above Python script has been deprecated, so I have also tried the Dolma tokenizer CLI with the example below.

dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32
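
As a sanity check on the flag values, here is a minimal sketch that reads the eos/pad ids from the tokenizer itself (assuming allenai/dolma2-tokenizer resolves from the Hugging Face Hub and that transformers is installed):

```python
from transformers import AutoTokenizer

# Pull the tokenizer the CLI is pointed at and print the ids that the
# --tokenizer.eos_token_id / --tokenizer.pad_token_id flags should match.
tok = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")
print("vocab size:", tok.vocab_size)
print("eos token:", tok.eos_token, "->", tok.eos_token_id)
print("pad token:", tok.pad_token, "->", tok.pad_token_id)  # may be None if no pad token is set
```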

Although a .npy file is generated, when I run it with official-1124/OLMo2-7B-stage2-seed42.yaml (modifying the data paths at the bottom), I get the error "unable to mmap an empty file".
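
A quick way to narrow down the "unable to mmap an empty file" error is to check whether the generated file actually contains tokens before pointing the config at it. A minimal sketch, assuming the output is a flat array of uint32 token ids and that part-00000.npy stands in for whatever file name was actually produced:

```python
import os
import numpy as np

path = "./part-00000.npy"  # placeholder for the file dolma tokens wrote

# A 0-byte file is exactly what triggers "unable to mmap an empty file".
print("size on disk:", os.path.getsize(path), "bytes")

# Assuming the file is a raw array of token ids written with --dtype uint32,
# np.memmap reads it back directly.
tokens = np.memmap(path, dtype=np.uint32, mode="r")
print("number of tokens:", tokens.shape[0])
print("first tokens:", tokens[:16])
```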

Hence, I was wondering:

  1. Is the correct tokenizer to use allenai/dolma2-tokenizer or allenai/dolma2-tokenizer-sigdig?
  2. Is there anything else I should include in the flags for the CLI?
  3. My data in data.json.gz does contain a 'text' field, which I assume is the bare minimum requirement? (A sketch of how this file is built is included below.)
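
For reference, here is a minimal sketch of how data.json.gz could be produced. The assumption is that each line of the gzipped file is a standalone JSON object and that 'id' and 'text' are the fields the tokenizer reads; any extra fields should simply be ignored.

```python
import gzip
import json

# Minimal documents file for `dolma tokens`: one JSON object per line,
# gzip-compressed. "id" and "text" are assumed to be the required fields.
docs = [
    {"id": "doc-0", "text": "First training document."},
    {"id": "doc-1", "text": "Second training document."},
]

with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```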

I hope you can provide some guidance for this matter.

Thank you.

WenJett added the type/question label Jan 24, 2025

aman-17 commented Jan 28, 2025

Hey @WenJett, we used the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer. This should fix the issue. Feel free to reopen the issue if you face any trouble.

Update: tokenizers/allenai_dolma2.json is used for OLMo2 (I was wrong earlier).

aman-17 closed this as completed Jan 28, 2025

WenJett commented Jan 29, 2025

Hi Aman,

Thanks for the guidance. I have tried your advice but am still facing difficulties.

First, I tried the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer to generate the .npy files using the dolma tokens CLI; however, it resulted in the following error when starting OLMo 2 stage 2 training.

CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero.

I also noted that in the OLMo2-7B-stage2-seed42.yaml script, the tokenizer is configured as follows:

tokenizer:
  identifier: tokenizers/allenai_dolma2.json
  truncate_direction: right

I also tried changing this to allenai_gpt-neox-olmo-dolma-v1_5.json, but it resulted in the error:
OLMoConfigurationError: vocab size mismatch between config and tokenizer
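
To see where the mismatch comes from, here is a rough sketch that compares the vocab size in the tokenizer JSON against the value in the training config (assuming the config keeps it under model.vocab_size and that both paths below match your checkout):

```python
import yaml
from tokenizers import Tokenizer

# Vocab size as defined by the tokenizer file referenced in the YAML.
tok = Tokenizer.from_file("tokenizers/allenai_dolma2.json")
print("tokenizer vocab size:", tok.get_vocab_size())

# Vocab size the model config expects (key name assumed to be model.vocab_size).
with open("official-1124/OLMo2-7B-stage2-seed42.yaml") as f:
    cfg = yaml.safe_load(f)
print("config vocab size:", cfg["model"].get("vocab_size"))
```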

I believe the tokenizer used should be consistent, and it seems the dolma2-tokenizer is what the OLMo 2 models use. May I get some clarification on this?

I also downloaded the source datasets (which are already in .npy format) listed in the OLMo2-7B-stage2-seed42.yaml script, and those seem to be tokenized with the dolma2-tokenizer. However, it does not work when I generate the data from my own dataset.
