Hi,

Appreciate your work done so far.

With the new release of OLMo 2, the tokenizer used seems to be allenai_dolma2.json, but in prepare_memmap_dataset.py the tokenizer is allenai/eleuther-ai-gpt-neox-20b-pii-special. I understand that this Python script has been deprecated, so I have also tried the Dolma tokenizer CLI with the example below:

```
dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32
```

Although a .npy file is generated, when I run training on the generated .npy file with official-1124/OLMo2-7B-stage2-seed42.yaml (modifying the data paths at the bottom), I get an "unable to mmap an empty file" error.
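As a sanity check (my own debugging sketch, not anything from the OLMo repo; the file name is a placeholder for whatever dolma wrote out), I inspected the generated file directly:

```python
import os

import numpy as np

path = "part-00000.npy"  # placeholder for the shard dolma generated
print(os.path.getsize(path), "bytes on disk")

# OLMo reads these shards as raw memmaps of the dtype passed to dolma;
# np.memmap itself raises a similar "cannot mmap an empty file" error
# when the file has zero bytes.
tokens = np.memmap(path, dtype=np.uint32, mode="r")
print(tokens.size, "tokens; first 10:", tokens[:10])
```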
Hence, I was wondering:

1. Is the correct tokenizer to use allenai/dolma2-tokenizer or allenai/dolma2-tokenizer-sigdig?
2. Is there anything else I should include in the flags for the CLI?
3. Is a 'text' field the bare minimum requirement for my data in data.json.gz? Mine does contain one (see the sample record after this list).
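For context, each line of my data.json.gz looks roughly like the record below (the id and source values are just illustrative; I am assuming that text plus these identifiers is what the Dolma document format expects):

```json
{"id": "doc-0001", "text": "Raw document text to be tokenized.", "source": "my-dataset"}
```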
I hope you can provide some guidance on this matter.
Thank you.
Thanks for the guidance; I have tried your advice but am still facing difficulties.

Firstly, I tried the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer to generate the .npy files using the dolma tokens CLI, but it resulted in the following error when training of OLMo 2 stage 2 starts:

```
CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero.
```
I also noted that in the OLMo2-7B-stage2-seed42.yaml script, the tokenizer is configured as follows:

```yaml
tokenizer:
  identifier: tokenizers/allenai_dolma2.json
  truncate_direction: right
```
I also tried changing this to allenai_gpt-neox-olmo-dolma-v1_5.json, but it resulted in the error: OLMoConfigurationError: vocab size mismatch between config and tokenizer.
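For what it's worth, the size difference between the two tokenizers is easy to confirm outside the trainer (a quick sketch, assuming both are pulled from Hugging Face):

```python
from transformers import AutoTokenizer

# The model's embedding size in the YAML must cover the tokenizer's
# vocab, which is presumably what the configuration check enforces.
for name in ["allenai/dolma2-tokenizer", "allenai/gpt-neox-olmo-dolma-v1_5"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "vocab size:", len(tok))
```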
I believe the tokenizer used should be consistent, and the OLMo 2 models seem to use the dolma2-tokenizer. May I get some clarification on this?

I also downloaded the source dataset (already in .npy format) listed in the OLMo2-7B-stage2-seed42.yaml script, and those files seem to be tokenized with the dolma2-tokenizer. However, training does not work when I generate the data from my own dataset.
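To compare the two, I looked at some basic statistics for an official shard versus one of mine (again my own sketch; the paths are placeholders):

```python
import numpy as np

# Placeholder paths: one official pre-tokenized shard vs. my dolma output.
for name, path in [("official", "official_shard.npy"), ("mine", "my_shard.npy")]:
    arr = np.memmap(path, dtype=np.uint32, mode="r")
    # Token IDs produced by the dolma2 tokenizer should all sit below its
    # vocab size; a very different max suggests a dtype or tokenizer mix-up.
    print(name, "tokens:", arr.size, "max id:", int(arr.max()))
```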