RuntimeError: CUDA out of memory with 6M nodes, 8M edges on A100 GPU #370

@chi2liu

Description

🐛 Bug

|-------------------------------------------------------------------------------------------------------|
    *** Running (`tmp_data.pt`, `unsup_graphsage`, `node_classification_dw`, `unsup_graphsage_mw`)
|-------------------------------------------------------------------------------------------------------|
Model Parameters: 1568
  0%|                                                                                | 0/500 [00:00<?, ?it/s]OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
  0%|                                                                                | 0/500 [00:47<?, ?it/s]
Traceback (most recent call last):
  File "generate_emb.py", line 12, in <module>
    outputs = generator(edge_index, x=x)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/pipelines.py", line 204, in __call__
    model = train(self.args)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/experiments.py", line 216, in train
    result = trainer.run(model_wrapper, dataset_wrapper)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/trainer/trainer.py", line 188, in run
    self.train(self.devices[0], model_w, dataset_w)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/trainer/trainer.py", line 334, in train
    training_loss = self.train_step(model_w, train_loader, optimizers, lr_schedulers, rank, scaler)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/trainer/trainer.py", line 468, in train_step
    loss = model_w.on_train_step(batch)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/wrappers/model_wrapper/base_model_wrapper.py", line 73, in on_train_step
    return self.train_step(*args, **kwargs)
  File "/home/chiliu/miniconda3/envs/cogdl/lib/python3.7/site-packages/cogdl/wrappers/model_wrapper/node_classification/unsup_graphsage_mw.py", line 43, in train_step
    neg_loss = -torch.log(torch.sigmoid(-torch.sum(x.unsqueeze(1).repeat(1, self.num_negative_samples, 1) * x[self.negative_samples], dim=-1))).mean()
RuntimeError: CUDA out of memory. Tried to allocate 11.02 GiB (GPU 0; 39.45 GiB total capacity; 29.23 GiB already allocated; 8.01 GiB free; 30.03 GiB reserved in total by PyTorch)
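The failing line materializes `x.unsqueeze(1).repeat(1, self.num_negative_samples, 1)`, a dense `(num_nodes, num_negative_samples, hidden)` tensor, before the element-wise product, which at 6M nodes is gigabytes on its own. A possible workaround (a sketch, not CogDL's implementation; `neg_loss_chunked` and its arguments are hypothetical names) is to compute the same negative-sampling loss in chunks and let broadcasting replace the `repeat`:

```python
import torch

def neg_loss_chunked(x, negative_samples, chunk_size=100_000):
    """Chunked equivalent of
    -log(sigmoid(-sum(x.unsqueeze(1).repeat(1, k, 1) * x[neg], dim=-1))).mean()
    that never materializes the full (N, k, hidden) repeated tensor."""
    total, count = 0.0, 0
    for start in range(0, x.size(0), chunk_size):
        end = start + chunk_size
        xc = x[start:end].unsqueeze(1)           # (c, 1, hidden); broadcasts over k
        neg = x[negative_samples[start:end]]     # (c, k, hidden)
        score = (xc * neg).sum(dim=-1)           # (c, k) dot products
        # logsigmoid(-s) == log(sigmoid(-s)), but numerically stabler
        loss = torch.nn.functional.logsigmoid(-score)
        total += -loss.sum()
        count += loss.numel()
    return total / count
```

Only one `chunk_size`-sized slice of the repeated tensor lives on the GPU at a time, so peak memory drops by roughly `N / chunk_size` for this term.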

To Reproduce

Steps to reproduce the behavior:

import numpy as np
import pandas as pd
from cogdl import pipeline

# Load the weighted edge list and the precomputed node features
graph = pd.read_csv("G1.weighted.edgelist", header=None, sep=' ')
edge_index = graph[[0, 1]].to_numpy()
edge_weight = graph[[2]].to_numpy(dtype=np.float16)
e = pd.read_csv("vertex_embeddings.csv", header=None, sep=' ')
x = e.iloc[:, :32].to_numpy(dtype=np.float16)

# Build a pipeline for generating embeddings with an unsupervised GNN;
# pass the model name, num_features, and its hyper-parameters to this API
generator = pipeline("generate-emb", model="unsup_graphsage", no_test=True, num_features=32, hidden_size=16, walk_length=2, sample_size=[4, 2], is_large=True)
outputs = generator(edge_index, x=x)
pd.DataFrame(outputs).to_csv("embeddings.csv", header=None, index=False)

The graph has 6M nodes and 8M edges; the GPU is a single A100 with 40 GB of memory.
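A back-of-envelope estimate of the repeated tensor's size matches the reported allocation. The report does not state `num_negative_samples`, so the value below is an assumption chosen only to illustrate the scale:

```python
# Rough size of the tensor built by the failing line:
# (num_nodes, num_negative_samples, hidden) float32 values.
num_nodes = 6_000_000   # from this report
hidden = 16             # hidden_size passed to the pipeline
num_neg = 30            # ASSUMPTION: not shown in the report
bytes_needed = num_nodes * num_neg * hidden * 4  # float32
print(f"{bytes_needed / 2**30:.1f} GiB")         # ~10.7 GiB at these settings
```

At these settings the single intermediate is on the order of the 11.02 GiB allocation the error reports, on top of the ~29 GiB already in use.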

Expected behavior

Environment

  • CogDL version: 0.5.3
  • OS (e.g., Linux): Ubuntu
  • Python version: 3.7
  • PyTorch version: 1.9.1.post3
  • CUDA/cuDNN version (if applicable): 11.7
  • Any other relevant information:

Additional context
