
"OutOfMemoryError: CUDA out of memory." in GPU mode #372

Open
asmlgkj opened this issue Jun 20, 2024 · 7 comments
Open

"OutOfMemoryError: CUDA out of memory." in GPU mode #372

asmlgkj opened this issue Jun 20, 2024 · 7 comments
Labels
question Further information is requested

Comments


asmlgkj commented Jun 20, 2024

Thanks a lot. Here are the install steps:
export PYTHONNOUSERSITE="aaaaa"
conda create -y -n cell2location_cuda118_torch22 python=3.10
conda activate cell2location_cuda118_torch22

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
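
A quick sanity check (a sketch, not part of the original steps) to confirm the cu118 build of PyTorch actually sees the GPU and how much memory it reports:

# Verify the CUDA build and the free/total memory on device 0.
import torch

print(torch.__version__, torch.version.cuda)   # should show a +cu118 build
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()    # bytes free / total on the current device
    print(torch.cuda.get_device_name(0), f"{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
else:
    print("CUDA not available")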

When running mod.train(max_epochs=250), I get:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/aa/miniconda3/envs/cell2location/lib/python3.10/site-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a val_dataloader but have no validation_step. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/aa/miniconda3/envs/cell2location/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=23` in the `DataLoader` to improve performance.
Epoch 1/250: 0%| | 0/250 [00:00<?, ?it/s]

OutOfMemoryError Traceback (most recent call last)
OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU
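
Since the failing allocation here is tiny (98 MiB), the GPU is essentially full before training starts. Two generic things worth checking before mod.train() (a sketch, not advice from the maintainers): that no other process is holding the card, and optionally the PyTorch caching-allocator setting, which has to be set before the first CUDA allocation:

# Optional allocator tweak; whether it helps this particular workload is an assumption.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
torch.cuda.empty_cache()  # drop cached blocks held by this process
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated by PyTorch right now")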

asmlgkj added the bug (Something isn't working) label on Jun 20, 2024
@avpatel18

I am getting the same error; is there any solution to this issue? Thanks!

@vitkl
Contributor

vitkl commented Jul 15, 2024 via email

Are you referring to Regression model or Cell2location model? Regression model should not have any issues with this. You can check availability of GPU memory with nvidia-smi command.

@avpatel18

I am getting 'OutOfMemoryError: CUDA out of memory. Tried to allocate 1.85 GiB. GPU' with the Cell2location model.
And it's not a system-memory issue, because I am assigning a lot more HPC resources than it needs.

It's actually very weird, because it works fine with the same object where I applied 'median_abs_deviation' filtering on 'log1p_total_counts' for each sample before concatenating them. There is only a difference of about 1700 spots between the two objects. Do you know why some (outlier) spots would cause this error? Thanks @vitkl for your help!
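
For reference, a minimal sketch of the kind of per-sample MAD filtering described above (not the exact code used here; the n_mads=5 cutoff is purely illustrative, and it assumes sc.pp.calculate_qc_metrics() has already populated adata.obs['log1p_total_counts']):

# Flag spots whose log1p_total_counts deviate from the sample median by more than n_mads MADs.
import numpy as np
from scipy.stats import median_abs_deviation

def flag_outliers(adata, metric="log1p_total_counts", n_mads=5):
    x = adata.obs[metric].to_numpy()
    return np.abs(x - np.median(x)) > n_mads * median_abs_deviation(x)

# applied to each sample before concatenation, e.g.:
# adata = adata[~flag_outliers(adata)].copy()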

@LiuXintongPKU

I got the same problem... I found a large amount of GPU memory being used by "/usr/lib/rstudio-server/bin/rsession"; after ending this process with "kill -9 PID", the memory was released. But after running mod.train(max_epochs=30000, batch_size=None, train_size=1), a similar process popped up again and took up ~7000 MiB again! I repeated this several times, which left me confused...

> Are you referring to Regression model or Cell2location model? Regression model should not have any issues with this. You can check availability of GPU memory with nvidia-smi command.

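For the rsession situation above, a small sketch (an assumption, not code from this thread) that lists which processes currently hold GPU memory from inside Python, equivalent to running nvidia-smi in a shell:

# List compute processes and their GPU memory via nvidia-smi.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)  # e.g. shows rsession / python PIDs and how much memory each holds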

@LiuXintongPKU

I met the problem while running the Cell2location model.

mod = cell2location.models.Cell2location(adata_vis, cell_state_df=inf_aver, N_cells_per_location=10, detection_alpha=20)
mod.train(max_epochs=30000, batch_size=None, train_size=1)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a val_dataloader but have no validation_step. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=103` in the `DataLoader` to improve performance.
/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 1/30000: 0%| | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "", line 1, in
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/cell2location/models/_cell2location_model.py", line 209, in train
super().train(**kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/model/base/_pyromixin.py", line 191, in train
return runner()
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/train/_trainrunner.py", line 98, in call
self.trainer.fit(self.training_plan, self.data_splitter)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/train/_trainer.py", line 220, in fit
super().fit(*args, **kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
self.fit_loop.run()
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
self.advance()
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run
self.advance(data_fetcher)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 242, in advance
batch_output = self.manual_optimization.run(kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/manual.py", line 92, in run
self.advance(kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/manual.py", line 112, in advance
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/lightning/pytorch/strategies/strategy.py", line 382, in training_step
return self.lightning_module.training_step(*args, **kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/scvi/train/_trainingplans.py", line 1048, in training_step
loss = torch.Tensor([self.svi.step(*args, **kwargs)])
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/svi.py", line 145, in step
loss = self.loss_and_grads(self.model, self.guide, *args, **kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/trace_elbo.py", line 140, in loss_and_grads
for model_trace, guide_trace in self._get_traces(model, guide, args, kwargs):
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/elbo.py", line 237, in _get_traces
yield self._get_trace(model, guide, args, kwargs)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/trace_elbo.py", line 57, in _get_trace
model_trace, guide_trace = get_importance_trace(
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/infer/enum.py", line 75, in get_importance_trace
model_trace.compute_log_prob()
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/poutine/trace_struct.py", line 264, in compute_log_prob
log_p = site["fn"].log_prob(
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/distributions/conjugate.py", line 283, in log_prob
-log_beta(self.concentration, value + 1)
File "/home/lxt/miniconda3/envs/cell2loc_env/lib/python3.9/site-packages/pyro/ops/special.py", line 68, in log_beta
return x.lgamma() + y.lgamma() - (x + y).lgamma()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 548.00 MiB. GPU

> Are you referring to Regression model or Cell2location model? Regression model should not have any issues with this. You can check availability of GPU memory with nvidia-smi command.

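One generic way to lower peak GPU memory that is not shown above is to train on mini-batches of locations instead of the full-batch call (batch_size=None puts all spots on the GPU at once). batch_size is a standard argument of Cell2location.train(), but whether mini-batch training is acceptable for a given analysis is a separate modelling question, so treat this only as a sketch:

# Sketch: mini-batch training to reduce peak GPU memory.
# The batch size of 2500 is an arbitrary illustrative value, not a recommendation.
mod = cell2location.models.Cell2location(
    adata_vis, cell_state_df=inf_aver, N_cells_per_location=10, detection_alpha=20
)
mod.train(max_epochs=30000, batch_size=2500, train_size=1)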

@vitkl
Contributor

vitkl commented Jul 25, 2024

@avpatel18 You probably need to look into GPU memory settings rather than RAM settings on your cluster.

vitkl added the question (Further information is requested) label and removed the bug (Something isn't working) label on Jul 25, 2024
@vitkl
Contributor

vitkl commented Jul 25, 2024

The cell2location.models.Cell2location model needs a large amount of GPU memory. For example, with 80 GB of GPU memory you can fit a dataset with n_obs ~ 60k and n_vars ~ 18k; 7 GB would be enough for only 1-2 Visium sections.
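
As a rough back-of-the-envelope check before training (a heuristic scaled linearly from the numbers above, not an official formula):

# Compare a linear scaling of the ~60k obs x ~18k vars ≈ 80 GB reference point with free GPU memory.
import torch

n_obs, n_vars = adata_vis.shape
needed_gb = 80 * (n_obs * n_vars) / (60_000 * 18_000)
free_gb = torch.cuda.mem_get_info()[0] / 1e9 if torch.cuda.is_available() else 0.0
print(f"~{needed_gb:.0f} GB estimated vs {free_gb:.1f} GB free")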
