
"OutOfMemoryError: CUDA out of memory." in GPU mode #372

Open
asmlgkj opened this issue Jun 20, 2024 · 7 comments
Open

"OutOfMemoryError: CUDA out of memory." in GPU mode #372

asmlgkj opened this issue Jun 20, 2024 · 7 comments
Labels
question Further information is requested

Comments


asmlgkj commented Jun 20, 2024

Thanks a lot. Here are the install steps:
export PYTHONNOUSERSITE="aaaaa"
conda create -y -n cell2location_cuda118_torch22 python=3.10
conda activate cell2location_cuda118_torch22

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
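
A quick sanity check (a sketch, not part of the original steps) to confirm the cu118 build of PyTorch actually sees the GPU and how much memory it reports:

# Verify the CUDA build and the free/total memory on device 0.
import torch

print(torch.__version__, torch.version.cuda)   # should show a +cu118 build
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()    # bytes free / total on the current device
    print(torch.cuda.get_device_name(0), f"{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
else:
    print("CUDA not available")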

When running mod.train(max_epochs=250), I get:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/aa/miniconda3/envs/cell2location/lib/python3.10/site-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a val_dataloader but have no validation_step. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/aa/miniconda3/envs/cell2location/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=23` in the `DataLoader` to improve performance.
Epoch 1/250: 0%| | 0/250 [00:00<?, ?it/s]

OutOfMemoryError Traceback (most recent call last)
OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU
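
Since the failing allocation here is tiny (98 MiB), the GPU is essentially full before training starts. Two generic things worth checking before mod.train() (a sketch, not advice from the maintainers): that no other process is holding the card, and optionally the PyTorch caching-allocator setting, which has to be set before the first CUDA allocation:

# Optional allocator tweak; whether it helps this particular workload is an assumption.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
torch.cuda.empty_cache()  # drop cached blocks held by this process
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated by PyTorch right now")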

asmlgkj added the bug (Something isn't working) label on Jun 20, 2024
@avpatel18

I am getting the same error; is there any solution to this issue? Thanks!

@vitkl
Contributor

vitkl commented Jul 15, 2024 via email

Are you referring to Regression model or Cell2location model? Regression model should not have any issues with this. You can check availability of GPU memory with nvidia-smi command.

@avpatel18

I am getting 'OutOfMemoryError: CUDA out of memory. Tried to allocate 1.85 GiB. GPU' with the Cell2location model.
And it's not a system-memory issue, because I am assigning a lot more HPC resources than it needs.

It's actually very weird, because it works fine with the same object where I applied 'median_abs_deviation' filtering on 'log1p_total_counts' for each sample before concatenating them. There is only a difference of about 1700 spots between the two objects. Do you know why some (outlier) spots would cause this error? Thanks @vitkl for your help!
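
For reference, a minimal sketch of the kind of per-sample MAD filtering described above (not the exact code used here; the n_mads=5 cutoff is purely illustrative, and it assumes sc.pp.calculate_qc_metrics() has already populated adata.obs['log1p_total_counts']):

# Flag spots whose log1p_total_counts deviate from the sample median by more than n_mads MADs.
import numpy as np
from scipy.stats import median_abs_deviation

def flag_outliers(adata, metric="log1p_total_counts", n_mads=5):
    x = adata.obs[metric].to_numpy()
    return np.abs(x - np.median(x)) > n_mads * median_abs_deviation(x)

# applied to each sample before concatenation, e.g.:
# adata = adata[~flag_outliers(adata)].copy()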

@LiuXintongPKU

I got the same problem... I found a large amount of GPU memory being used by "/usr/lib/rstudio-server/bin/rsession"; after ending this process with "kill -9 PID", the memory was released. But after running mod.train(max_epochs=30000, batch_size=None, train_size=1), a similar process popped up again and took up ~7000 MiB again! I repeated this several times, which left me confused...

> Are you referring to Regression model or Cell2location model? Regression model should not have any issues with this. You can check availability of GPU memory with nvidia-smi command.

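For the rsession situation above, a small sketch (an assumption, not code from this thread) that lists which processes currently hold GPU memory from inside Python, equivalent to running nvidia-smi in a shell:

# List compute processes and their GPU memory via nvidia-smi.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)  # e.g. shows rsession / python PIDs and how much memory each holds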

@LiuXintongPKU

I met the problem while running the Cell2location model.

mod = cell2location.models.Cell2location(adata_vis, cell_state_df=inf_aver, N_cells_per_location=10, detection_alpha=20)
mod.train(max_epochs=30000, batch_size=None, train_size=1)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a val_dataloader but have no validation_step. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=103` in the `DataLoader` to improve performance.
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 1/30000: 0%| | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "", line 1, in
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/cell2location/models/_cell2location_model.py", line 209, in train
super().train(**kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/model/base/_pyromixin.py", line 191, in train
return runner()
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/train/_trainrunner.py", line 98, in call
self.trainer.fit(self.training_plan, self.data_splitter)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/train/_trainer.py", line 220, in fit
super().fit(*args, **kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
self.fit_loop.run()
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
self.advance()
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run
self.advance(data_fetcher)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 242, in advance
batch_output = self.manual_optimization.run(kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/manual.py", line 92, in run
self.advance(kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/manual.py", line 112, in advance
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/strategies/strategy.py", line 382, in training_step
return self.lightning_module.training_step(*args, **kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/train/_trainingplans.py", line 1048, in training_step
loss = torch.Tensor([self.svi.step(*args, **kwargs)])
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/svi.py", line 145, in step
loss = self.loss_and_grads(self.model, self.guide, *args, **kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/trace_elbo.py", line 140, in loss_and_grads
for model_trace, guide_trace in self._get_traces(model, guide, args, kwargs):
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/elbo.py", line 237, in _get_traces
yield self._get_trace(model, guide, args, kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/trace_elbo.py", line 57, in _get_trace
model_trace, guide_trace = get_importance_trace(
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/enum.py", line 75, in get_importance_trace
model_trace.compute_log_prob()
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/poutine/trace_struct.py", line 264, in compute_log_prob
log_p = site["fn"].log_prob(
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/distributions/conjugate.py", line 283, in log_prob
-log_beta(self.concentration, value + 1)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/ops/special.py", line 68, in log_beta
return x.lgamma() + y.lgamma() - (x + y).lgamma()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 548.00 MiB. GPU

> Are you referring to Regression model or Cell2location model? Regression model should not have any issues with this. You can check availability of GPU memory with nvidia-smi command.

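One generic way to lower peak GPU memory that is not shown above is to train on mini-batches of locations instead of the full-batch call (batch_size=None puts all spots on the GPU at once). batch_size is a standard argument of Cell2location.train(), but whether mini-batch training is acceptable for a given analysis is a separate modelling question, so treat this only as a sketch:

# Sketch: mini-batch training to reduce peak GPU memory.
# The batch size of 2500 is an arbitrary illustrative value, not a recommendation.
mod = cell2location.models.Cell2location(
    adata_vis, cell_state_df=inf_aver, N_cells_per_location=10, detection_alpha=20
)
mod.train(max_epochs=30000, batch_size=2500, train_size=1)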

@vitkl
Contributor

vitkl commented Jul 25, 2024

@avpatel18 You probably need to look into GPU memory settings rather than RAM settings on your cluster.

vitkl added the question (Further information is requested) label and removed the bug (Something isn't working) label on Jul 25, 2024
@vitkl
Contributor

vitkl commented Jul 25, 2024

The cell2location.models.Cell2location model needs a large amount of GPU memory. For example, with 80 GB of GPU memory you can fit a dataset with n_obs ~ 60k and n_vars ~ 18k; 7 GB would be enough for only 1-2 Visium sections.
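
As a rough back-of-the-envelope check before training (a heuristic scaled linearly from the numbers above, not an official formula):

# Compare a linear scaling of the ~60k obs x ~18k vars ≈ 80 GB reference point with free GPU memory.
import torch

n_obs, n_vars = adata_vis.shape
needed_gb = 80 * (n_obs * n_vars) / (60_000 * 18_000)
free_gb = torch.cuda.mem_get_info()[0] / 1e9 if torch.cuda.is_available() else 0.0
print(f"~{needed_gb:.0f} GB estimated vs {free_gb:.1f} GB free")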
