
CUDA Memory Errors at first epoch at default batch size #203

@cvKDean


Good day! Do you have any idea why I am running into CUDA out-of-memory errors during training? The error occurs at the end of the first epoch (epoch 0). For reference, I am trying to reproduce the results in REPRODUCE_RESULTS.md using the smaller dataset (annotation-small.json).

My configuration is:
OS: Windows 10 (Anaconda Prompt)
GPU: GeForce GTX 1070Ti (single)
torch version: 1.0.1

The error stack is as follows:

2019-03-22 14-23-05 steps >>> epoch 0 average batch time: 0:00:00.7
2019-03-22 14-23-06 steps >>> epoch 0 batch 411 sum:     1.74406
2019-03-22 14-23-07 steps >>> epoch 0 batch 412 sum:     2.26457
2019-03-22 14-23-07 steps >>> epoch 0 batch 413 sum:     1.95351
2019-03-22 14-23-08 steps >>> epoch 0 batch 414 sum:     2.39538
2019-03-22 14-23-09 steps >>> epoch 0 batch 415 sum:     1.83759
2019-03-22 14-23-10 steps >>> epoch 0 batch 416 sum:     1.92264
2019-03-22 14-23-10 steps >>> epoch 0 batch 417 sum:     1.71246
2019-03-22 14-23-11 steps >>> epoch 0 batch 418 sum:     2.32141
2019-03-22 14-23-11 steps >>> epoch 0 sum:     2.18943
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
  [the previous three lines repeated 7 more times]
B:\ML Models\src\callbacks.py:168: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  X = Variable(X, volatile=True).cuda()
Traceback (most recent call last):
  File "main.py", line 93, in <module>
    main()
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 697, in main
    rv = self.invoke(ctx)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 31, in train
    pipeline_manager.train(pipeline_name, dev_mode)
  File "B:\ML Models\src\pipeline_manager.py", line 32, in train
    train(pipeline_name, dev_mode, self.logger, self.params, self.seed)
  File "B:\ML Models\src\pipeline_manager.py", line 116, in train
    pipeline.fit_transform(data)
  File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  [Previous line repeated 4 more times]
  File "B:\ML Models\src\steps\base.py", line 112, in fit_transform
    return self._cached_fit_transform(step_inputs)
  File "B:\ML Models\src\steps\base.py", line 123, in _cached_fit_transform
    step_output_data = self.transformer.fit_transform(**step_inputs)
  File "B:\ML Models\src\steps\base.py", line 262, in fit_transform
    self.fit(*args, **kwargs)
  File "B:\ML Models\src\models.py", line 82, in fit
    self.callbacks.on_epoch_end()
  File "B:\ML Models\src\steps\pytorch\callbacks.py", line 92, in on_epoch_end
    callback.on_epoch_end(*args, **kwargs)
  File "B:\ML Models\src\steps\pytorch\callbacks.py", line 163, in on_epoch_end
    val_loss = self.get_validation_loss()
  File "B:\ML Models\src\callbacks.py", line 132, in get_validation_loss
    return self._get_validation_loss()
  File "B:\ML Models\src\callbacks.py", line 138, in _get_validation_loss
    outputs = self._transform()
  File "B:\ML Models\src\callbacks.py", line 172, in _transform
    outputs_batch = self.model(X)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\parallel\data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "B:\ML Models\src\unet_models.py", line 387, in forward
    conv2 = self.conv2(conv1)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\container.py", line 92, in forward
    input = module(input)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torchvision\models\resnet.py", line 88, in forward
    out = self.bn3(out)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\batchnorm.py", line 76, in forward
    exponential_average_factor, self.eps)
  File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\functional.py", line 1623, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 8.00 GiB total capacity; 6.18 GiB already allocated; 56.00 MiB free; 48.95 MiB cached)
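
One thing I noticed in the log: the warning from B:\ML Models\src\callbacks.py:168 says `volatile` was removed and now has no effect, and the traceback shows the OOM happens inside `get_validation_loss`. On torch 1.0.1, `Variable(X, volatile=True)` no longer disables autograd, so the validation forward pass builds and keeps the full computation graph on the GPU on top of the training allocations. A minimal sketch of how that line could be updated (the `model` and `X` here are stand-ins for `self.model` and the validation batch in `_transform`, just to make the snippet runnable):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2).cuda()   # stand-in for self.model in callbacks.py
X = torch.randn(4, 8)            # stand-in for a validation batch

# Old (src/callbacks.py:168): X = Variable(X, volatile=True).cuda()
# volatile=True is a no-op on torch >= 0.4, so the validation forward pass
# keeps the full autograd graph alive for every batch.

# torch >= 0.4 / 1.0 equivalent: run validation under no_grad, so no graph
# (and none of the intermediate activations it pins) is retained on the GPU:
with torch.no_grad():
    outputs_batch = model(X.cuda())
```

On an 8 GB card this alone may be the difference between fitting and OOM once validation kicks in at the end of epoch 0.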

Lowering the batch size from the default 20 to 10 decreased GPU memory usage during training from ~6 GB to ~4 GB, but at the end of epoch 0 usage still jumped to ~6 GB. Subsequent epochs have continued training at ~6 GB.
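
To confirm which phase drives that jump, the peak counters in `torch.cuda` can be logged after each phase. `log_gpu_memory` below is just a hypothetical helper I'd drop into the training loop, not something from this repo (the `max_memory_cached` name matches torch 1.0.x; later releases rename it to `max_memory_reserved`):

```python
import torch

def log_gpu_memory(tag):
    # Peak figures since the start of the process, in MiB (single GPU assumed).
    allocated = torch.cuda.max_memory_allocated() / 1024 ** 2
    cached = torch.cuda.max_memory_cached() / 1024 ** 2
    print('{}: peak allocated {:.0f} MiB, peak cached {:.0f} MiB'.format(
        tag, allocated, cached))

# e.g. once after the training batches of epoch 0 and once after validation,
# to see whether the ~4 GB -> ~6 GB jump comes from the validation pass:
# log_gpu_memory('after epoch 0 training')
# log_gpu_memory('after epoch 0 validation')
```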

Is this behavior expected? I read somewhere that you also used GTX 1070 GPUs for training, so I thought I would be able to train at the default batch size. Also, is it normal for GPU memory usage to increase between epochs 0 and 1? Thank you!
