Describe the bug
I am trying to use the mosaicml/streaming library with NeMo. I have created a dataset class, but since `StreamingDataset` is a PyTorch iterable-style dataset, I am unable to use the Megatron data sampler, and I am running into the following error:
```
[rank1]:   File "/home/somay/coom/main.py", line 15, in <module>
[rank1]:     trainer.train()
[rank1]:   File "/home/somay/coom/coom/train.py", line 253, in train
[rank1]:     self.start_training()
[rank1]:   File "/home/somay/coom/coom/train.py", line 237, in start_training
[rank1]:     llm.train(
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/collections/llm/api.py", line 127, in train
[rank1]:     trainer.fit(model, data)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank1]:     call._call_and_handle_interrupt(
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]:     return function(*args, **kwargs)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank1]:     self._run(model, ckpt_path=ckpt_path)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
[rank1]:     results = self._run_stage()
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
[rank1]:     self.fit_loop.run()
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
[rank1]:     self.advance()
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
[rank1]:     self.epoch_loop.run(self._data_fetcher)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
[rank1]:     self.advance(data_fetcher)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/pytorch/trainer.py", line 47, in advance
[rank1]:     super().advance(data_fetcher)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
[rank1]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 183, in run
[rank1]:     closure()
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
[rank1]:     self._result = self.closure(*args, **kwargs)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
[rank1]:     step_output = self._step_fn()
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
[rank1]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
[rank1]:     output = fn(*args, **kwargs)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 695, in training_step
[rank1]:     out = self.model.training_step(dataloader_iter, *args, **kwargs)
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/megatron_parallel.py", line 389, in training_step
[rank1]:     return self._step(
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/megatron_parallel.py", line 501, in _step
[rank1]:     return self.forward(
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/megatron_parallel.py", line 351, in forward
[rank1]:     microbatch_outputs = step()
[rank1]:   File "/home/somay/eka_env/lib/python3.10/site-packages/nemo/lightning/megatron_parallel.py", line 1276, in __call__
[rank1]:     raise ValueError("num_microbatches is not set")
[rank1]: ValueError: num_microbatches is not set
```
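For context, my dataset setup looks roughly like this (a minimal sketch with illustrative names and paths, not my exact code):

```python
# Minimal sketch of the setup (illustrative names/paths, not my exact code).
# streaming.StreamingDataset is iterable-style, so it does not work with
# MegatronDataSampler, which expects a map-style dataset.
from streaming import StreamingDataset
from torch.utils.data import DataLoader


class MyStreamingDataset(StreamingDataset):
    def __init__(self, local: str, remote: str, batch_size: int):
        super().__init__(local=local, remote=remote, batch_size=batch_size, shuffle=True)

    def __getitem__(self, index: int):
        sample = super().__getitem__(index)
        # ... tokenize / collate into the tensors the model expects ...
        return sample


dataset = MyStreamingDataset(
    local="/tmp/cache",                 # illustrative cache dir
    remote="s3://bucket/dataset",       # illustrative remote shards
    batch_size=8,
)
loader = DataLoader(dataset, batch_size=8, num_workers=4)
```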
I tried calling setup_microbatch_calculator() myself, but I still end up with the same error.
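This is roughly what I attempted (a sketch assuming megatron-core's `num_microbatches_calculator` module; the exact entry point differs across versions, and older stacks expose `setup_microbatch_calculator` via Apex with the same arguments):

```python
# Sketch of the manual initialization I attempted (values are examples only).
from megatron.core.num_microbatches_calculator import (
    init_num_microbatches_calculator,
    get_num_microbatches,
)

init_num_microbatches_calculator(
    rank=0,                  # global rank of this process
    rampup_batch_size=None,  # no batch-size ramp-up
    global_batch_size=128,
    micro_batch_size=8,
    data_parallel_size=2,    # global_batch_size must be divisible by
                             # micro_batch_size * data_parallel_size
)
print(get_num_microbatches())  # expected: 128 / (8 * 2) = 8
```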