Tokenizer to be used for generation of data to .npy files #791
Comments
Hey @WenJett,
Hi @aman-17, I have also asked about the same issue on the dolma GitHub (allenai/dolma#225), to which @soldni has kindly responded, but we can't identify the problem. I have also uploaded the json.gz file I used with the dolma tokens CLI: data.json.gz. From my understanding, the data requires an "id" field and a "text" field. I am not sure whether there are any additional requirements or steps necessary before running the dolma tokens CLI.

Edited to include the error message I received when running stage 2 of the OLMo script:

```
[2025-01-31 03:26:51] INFO [train:335, rank=0] Checkpoint successfully loaded
[2025-01-31 03:26:51] CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero
[rank0]:[W131 03:26:55.738691064 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
```
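For reference, here is a minimal sketch of how I understand the expected input format (one JSON object per line with "id" and "text" fields, gzipped); the filename and contents are illustrative, not my actual data:

```python
import gzip
import json

# Minimal documents in the format I understand dolma tokens to expect:
# one JSON object per line, each with an "id" and a "text" field.
docs = [
    {"id": "doc-0", "text": "Hello world."},
    {"id": "doc-1", "text": "Another short document."},
]

path = "data.json.gz"  # illustrative filename
with gzip.open(path, "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Read it back to confirm every line parses and carries both fields.
with gzip.open(path, "rt", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]

assert all("id" in d and "text" in d for d in parsed)
print(f"wrote {len(parsed)} documents to {path}")
```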
❓ The question
Hi,
I was unable to reopen the previous issue (#790), so I'm opening a new issue and copying my response below.
Hi Aman,
Thanks for the guidance; I have tried your advice but am still facing difficulties.
Firstly, I tried the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer to generate the .npy files using the dolma tokens CLI; however, it resulted in the following error when starting stage 2 training of OLMo2:

```
CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero
```
I also noted that in the OLMo2-7B-stage2-seed42.yaml script the tokenizer is configured as follows:

```yaml
tokenizer:
  identifier: tokenizers/allenai_dolma2.json
  truncate_direction: right
```
I also tried changing this to allenai_gpt-neox-olmo-dolma-v1_5.json, but that resulted in the error:

```
OLMoConfigurationError: vocab size mismatch between config and tokenizer
```
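For context, my understanding of what that check does, sketched with toy stand-ins (the file contents below are hypothetical; the `model.vocab` layout follows the HuggingFace `tokenizers` JSON format, and the config field name is an assumption, not the exact OLMo internals):

```python
# Hypothetical sketch of the check behind "vocab size mismatch between
# config and tokenizer": compare the model config's vocab size against
# the number of entries in the tokenizer file's vocabulary.

def tokenizer_vocab_size(tokenizer_json: dict) -> int:
    # HuggingFace `tokenizers` JSON files keep the vocabulary
    # under model.vocab as a token -> id mapping.
    return len(tokenizer_json["model"]["vocab"])

# Toy stand-ins for the real config and tokenizer files:
config = {"model": {"vocab_size": 4}}
tokenizer_file = {"model": {"vocab": {"a": 0, "b": 1, "c": 2, "d": 3}}}

if config["model"]["vocab_size"] != tokenizer_vocab_size(tokenizer_file):
    raise ValueError("vocab size mismatch between config and tokenizer")
print("vocab sizes match:", tokenizer_vocab_size(tokenizer_file))
```

So switching only the tokenizer JSON without also changing the config's vocab size would trip this check, which matches what I saw.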
I believe the tokenizer should be consistent throughout, and it seems the OLMo2 models use the dolma2-tokenizer. May I get some clarification on this?
I also downloaded the source dataset (already in .npy format) listed in the OLMo2-7B-stage2-seed42.yaml script, and those files seem to be tokenized with the dolma2-tokenizer. However, generating the data from my own dataset does not work.
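As a sanity check on my own tokenized files, I compared the token IDs against the vocab size. This is only a sketch: it uses np.save on a synthetic array to stay self-contained, whereas the real files may be raw memmaps rather than standard .npy files, and the dtype and vocab size here are illustrative stand-ins:

```python
import numpy as np

# Sketch of a sanity check on a tokenized file: every token ID
# should be below the model config's vocab size. VOCAB_SIZE and the
# token values are illustrative, not the real OLMo2 numbers.
VOCAB_SIZE = 50280  # stand-in; use the vocab size from your config

# Synthetic stand-in for a real tokenized file:
tokens = np.array([5, 17, 50279, 3], dtype=np.uint32)
np.save("part-0.npy", tokens)

loaded = np.load("part-0.npy")
max_id = int(loaded.max())
assert max_id < VOCAB_SIZE, f"token id {max_id} exceeds vocab size {VOCAB_SIZE}"
print(f"{loaded.size} tokens, max id {max_id} < {VOCAB_SIZE}")
```

If the real files are header-less memmaps, np.memmap with the matching dtype would be needed instead of np.load.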
Hope to get some direction on this issue. Thanks so much!