
Tokenizer to be used for generation of data to .npy files #791

Open
WenJett opened this issue Jan 29, 2025 · 2 comments
Labels
type/question An issue that's a question

Comments

WenJett commented Jan 29, 2025

❓ The question

Hi,

I was unable to reopen the previous issue (#790), so I am creating a new one and copying my response below.

Hi Aman,

Thanks for the guidance. I have tried your advice but am still facing difficulties.

First, I tried the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer to generate the .npy files with the dolma tokens CLI, but it resulted in the following error when starting training of OLMo2 stage 2:

CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero.

I also noted that in the OLMo2-7B-stage2-seed42.yaml script the tokenizer is configured as follows:
tokenizer:
  identifier: tokenizers/allenai_dolma2.json
  truncate_direction: right

I also tried changing this to allenai_gpt-neox-olmo-dolma-v1_5.json, but it results in the error:
OLMoConfigurationError: vocab size mismatch between config and tokenizer
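
For reference, one way to see the mismatch behind this error is to compare the tokenizer's vocabulary size with the vocab_size set in the training YAML. A minimal sketch, assuming the Hugging Face tokenizers package and a repo-relative tokenizer path (both illustrative):

# Minimal sketch (illustrative path): compare the tokenizer's vocab size
# with the vocab_size set under `model:` in the training YAML.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json")
print(tok.get_vocab_size())  # should be compatible with model.vocab_size in the config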

I believe the tokenizer used for data preparation and training should be consistent, and the OLMo2 models appear to use the dolma2 tokenizer. May I get some clarification on this?

I also downloaded the source datasets (already in .npy format) listed in the OLMo2-7B-stage2-seed42.yaml script, and those appear to be tokenized with the dolma2 tokenizer. However, when I generate the data from my own dataset, it does not work.

Hope to get some direction on this issue. Thanks so much!

WenJett added the type/question label on Jan 29, 2025
WenJett changed the title from "Tokenizer to be used for prepare_memmap_dataset.py" to "Tokenizer to be used generation of data to .npy files" on Jan 29, 2025
WenJett changed the title from "Tokenizer to be used generation of data to .npy files" to "Tokenizer to be used for generation of data to .npy files" on Jan 29, 2025
aman-17 (Member) commented Jan 30, 2025

Hey @WenJett,

  1. prepare_memmap_dataset.py is deprecated.
  2. The correct tokenizer to use is tokenizers/allenai_dolma2.json (I was wrong before); a sketch of a matching dolma tokens invocation follows after this list.
  3. Can you provide more details of the dataset you're trying to tokenize? I'm not sure why it is producing an empty file.
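
A minimal sketch of such an invocation, assuming gzipped JSON Lines documents under documents/. The flag names follow the dolma tokens CLI; the Hugging Face tokenizer name and the eos/pad token IDs below are assumptions that should be double-checked against tokenizers/allenai_dolma2.json and the dolma documentation:

dolma tokens \
    --documents "documents/*.json.gz" \
    --destination "output/" \
    --tokenizer.name_or_path "allenai/dolma2-tokenizer" \
    --tokenizer.eos_token_id 100257 \
    --tokenizer.pad_token_id 100277 \
    --processes 8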

WenJett (Author) commented Jan 31, 2025

Hi @aman-17,

I have also asked about the same issue on the dolma GitHub (allenai/dolma#225), where @soldni kindly responded, but we could not identify the problem.

I have also uploaded the json.gz file I used with the dolma tokens CLI: data.json.gz

From my understanding, the data requires an "id" field and a "text" field. I am not sure whether there are any additional requirements or steps needed before running the dolma tokens CLI.
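
For illustration, a minimal sketch of a documents file in that shape (the file name and contents are placeholders; only the "id" and "text" fields come from the description above):

# Minimal sketch: write a gzipped JSON Lines file with one document per line,
# each carrying an "id" and a "text" field (placeholder contents).
import gzip, json

docs = [
    {"id": "doc-0", "text": "First training document ..."},
    {"id": "doc-1", "text": "Second training document ..."},
]
with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")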

Edited to include the error message I received when running stage 2 of the OLMo script.

[2025-01-31 03:26:51] INFO [train:335, rank=0] Checkpoint successfully loaded
[2025-01-31 03:26:51] INFO [train:351, rank=0] Starting training...
[2025-01-31 03:26:51] INFO [olmo.train:967, rank=0] Pre-train system metrics
System/Peak GPU Memory (MB)=43,795
[2025-01-31 03:26:51] CRITICAL [olmo.util:168, rank=0] Uncaught ZeroDivisionError: division by zero

╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
│ /home/q3team/_Q4/OLMo/scripts/train.py:389 in <module> │
│ │
│ 386 │ │ raise OLMoCliError(f"Usage: {sys.argv[0]} [CONFIG_PATH] [OPTIONS]") │
│ 387 │ │
│ 388 │ cfg = TrainConfig.load(yaml_path, [clean_opt(s) for s in args_list]) │
│ ❱ 389 │ main(cfg) │
│ 390 │
│ /home/q3team/_Q4/OLMo/scripts/train.py:352 in main │
│ 349 │ │ │
│ 350 │ │ if not cfg.dry_run: │
│ 351 │ │ │ log.info("Starting training...") │
│ ❱ 352 │ │ │ trainer.fit() │
│ 353 │ │ │ log.info("Training complete") │
│ 354 │ │ │ else: │
│ 355 │ │ │ log.info("Dry run complete") │
│ │
│ /home/q3team/_Q4/OLMo/olmo/train.py:1185 in fit │
│ 1182 │ │ save_checkpoints: bool = True │
│ 1183 │ │ │
│ 1184 │ │ with torch_profiler as p: │
│ ❱ 1185 │ │ │ for epoch in range(self.epoch or 0, self.max_epochs): │
│ 1186 │ │ │ │ for batch in self.train_loader: │
│ 1187 │ │ │ │ │ # Bookkeeping. │
│ 1188 │ │ │ │ │ # NOTE: To track the global batch size / number of tokens per batch w │
│ │
│ /home/q3team/_Q4/OLMo/olmo/train.py:258 in max_epochs │
│ 255 │ │
│ 256 │ @property │
│ 257 │ def max_epochs(self) -> int: │
│ ❱ 258 │ │ return math.ceil(self.max_steps / self.batches_per_epoch) │
│ 259 │ │
│ 260 │ @property │
│ 261 │ def max_steps(self) -> int: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ZeroDivisionError: division by zero

[2025-01-31 03:26:51] CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero

(identical traceback repeated for rank 1)

[rank0]:[W131 03:26:55.738691064 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[W0131 03:26:56.310000 28522 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 28615 closing signal SIGTERM
[E0131 03:26:56.425000 28522 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 28614) of binary: /usr/local/bin/python3.12
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Thanks for your help!
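
A note on the traceback above: the ZeroDivisionError comes from max_steps / batches_per_epoch, so batches_per_epoch is zero, which usually means the paths under data.paths in the config resolved to fewer tokens than a single global batch. A minimal diagnostic sketch (the output directory is illustrative, and the element width is treated as an assumption, so both uint16 and uint32 interpretations are printed):

# Diagnostic sketch (illustrative path): check how many tokens the generated
# .npy files contain. batches_per_epoch == 0 in the trainer implies the data
# loader saw fewer tokens than one global batch.
import os
from glob import glob

total_bytes = 0
for path in sorted(glob("output/*.npy")):   # destination passed to `dolma tokens`
    size = os.path.getsize(path)
    total_bytes += size
    print(f"{path}: {size:,} bytes")

# Approximate token counts under either element width (assumption: the files
# are flat arrays of token IDs, as OLMo's memmap dataset reads them).
print("tokens if uint16:", total_bytes // 2)
print("tokens if uint32:", total_bytes // 4)
# Compare against global_train_batch_size * max_sequence_length in the YAML.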
