
Trying to use iterable dataset #14308

@somay-jalan


Describe the bug

I am trying to use the mosaicml/streaming library for my data pipeline. I have created a dataset class around it, but because it is a PyTorch iterable dataset I am unable to use the Megatron data sampler, and I am running into the following error:
[rank1]: File "/home/somay/coom/main.py", line 15, in
[rank1]: trainer.train()
[rank1]: File "/home/somay/coom/coom/train.py", line 253, in train
[rank1]: self.start_training()
[rank1]: File "/home/somay/coom/coom/train.py", line 237, in start_training
[rank1]: llm.train(
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/collections/llm/api.py", line 127, in train
[rank1]: trainer.fit(model, data)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
[rank1]: results = self._run_stage()
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
[rank1]: self.fit_loop.run()
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
[rank1]: self.advance()
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
[rank1]: self.epoch_loop.run(self._data_fetcher)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
[rank1]: self.advance(data_fetcher)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/pytorch/trainer.py", line 47, in advance
[rank1]: super().advance(data_fetcher)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
[rank1]: batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 183, in run
[rank1]: closure()
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in call
[rank1]: self._result = self.closure(*args, **kwargs)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
[rank1]: step_output = self._step_fn()
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
[rank1]: training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
[rank1]: output = fn(*args, **kwargs)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 695, in training_step
[rank1]: out = self.model.training_step(dataloader_iter, *args, **kwargs)
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/megatron_parallel.py", line 389, in training_step
[rank1]: return self._step(
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/megatron_parallel.py", line 501, in _step
[rank1]: return self.forward(
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/megatron_parallel.py", line 351, in forward
[rank1]: microbatch_outputs = step()
[rank1]: File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/megatron_parallel.py", line 1276, in call
[rank1]: raise ValueError("num_microbatches is not set")
[rank1]: ValueError: num_microbatches is not set
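
For context, here is a simplified sketch of the kind of wrapper I mean (the class and parameter names below are illustrative placeholders, not my exact code). It just wraps a MosaicML `StreamingDataset`, which is a PyTorch `IterableDataset`, in a plain `LightningDataModule` without a `MegatronDataSampler`, since (as far as I can tell) the sampler cannot wrap an iterable dataset:

```python
# Illustrative sketch only -- placeholder names, not my exact code.
from lightning.pytorch import LightningDataModule
from streaming import StreamingDataset  # mosaicml/streaming
from torch.utils.data import DataLoader


class StreamingPretrainDataModule(LightningDataModule):
    def __init__(self, local_dir: str, remote_dir: str, micro_batch_size: int = 1):
        super().__init__()
        self.local_dir = local_dir
        self.remote_dir = remote_dir
        self.micro_batch_size = micro_batch_size
        # Note: no MegatronDataSampler is attached here, because
        # StreamingDataset is an IterableDataset rather than a
        # map-style dataset.

    def train_dataloader(self):
        dataset = StreamingDataset(
            local=self.local_dir,
            remote=self.remote_dir,
            shuffle=True,
            batch_size=self.micro_batch_size,
        )
        return DataLoader(dataset, batch_size=self.micro_batch_size, num_workers=2)
```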
I tried setting up the microbatch calculator myself via setup_microbatch_calculator(), but I still end up with the same error.
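
Roughly what I tried is sketched below, assuming the calculator now lives in megatron-core (`init_num_microbatches_calculator`, which as far as I understand replaced the older apex `setup_microbatch_calculator`); the batch sizes and rank below are illustrative:

```python
# Sketch of my manual attempt -- assumes megatron-core's calculator;
# the batch-size / rank values are illustrative only.
import torch.distributed as dist
from megatron.core.num_microbatches_calculator import init_num_microbatches_calculator

init_num_microbatches_calculator(
    rank=dist.get_rank() if dist.is_initialized() else 0,
    rampup_batch_size=None,
    global_batch_size=8,
    micro_batch_size=1,
    data_parallel_size=2,
)
```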
