Unable to allocate enough memory #32

Open · jcgrenier opened this issue May 12, 2025 · 8 comments

jcgrenier commented May 12, 2025

Hello! I've been trying to run neural-admixture in train mode on a big dataset containing almost 500,000 samples and 161k markers, but I am not able to make it run in GPU mode. It looks like it tries to send everything to GPU memory at once. Do you have any idea how to handle such cases?

Quick note: I was able to generate the PCA with CPUs only, using more than 1.3 TB of RAM to do so.

I also tried reducing the batch size, but I'm still having the same issue.

Here's the trace:
neural-admixture train --num_cpus 12 --num_gpus 1 --k 2 --name neuralAdmixture --data_path dataset.bed --save_dir neural_admixture_gpus --pca_path neural_admixture_gpus/neuralAdmixture_pca.pt --batch_size 400


    Input format is BED.
Mapping files:   0%|          | 0/3 [00:00<?, ?it/s]
~/neural-admixture/nadmenv/lib/python3.11/site-packages/neural_admixture/src/snp_reader.py:61: FutureWarning: The 'delim_whitespace' keyword in pd.read_csv is deprecated and will be removed in a future version. Use ``sep='\s+'`` instead
  _, _, G = read_plink(str(Path(file).with_suffix("")))
Mapping files:  33%|███▎      | 1/3 [00:00<00:01,  1.24it/s]
~/neural-admixture/nadmenv/lib/python3.11/site-packages/neural_admixture/src/snp_reader.py:61: FutureWarning: The 'delim_whitespace' keyword in pd.read_csv is deprecated and will be removed in a future version. Use ``sep='\s+'`` instead
  _, _, G = read_plink(str(Path(file).with_suffix("")))
Mapping files: 100%|██████████| 3/3 [00:18<00:00,  6.15s/it]
    Data contains missing values. Will perform mean-imputation.
    Data contains 487929 samples and 161240 SNPs.
    Bringing data into memory...


    Unexpected error
Traceback (most recent call last):
  File "~/neural-admixture/nadmenv/bin/neural-admixture", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "~/neural-admixture/nadmenv/lib/python3.11/site-packages/neural_admixture/entry.py", line 64, in main
    sys.exit(train.main(0, arg_list[2:], num_gpus))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/neural-admixture/nadmenv/lib/python3.11/site-packages/neural_admixture/src/train.py", line 139, in main
    raise e
  File "~/neural-admixture/nadmenv/lib/python3.11/site-packages/neural_admixture/src/train.py", line 114, in main
    fit_model(args, trX, device, num_gpus, tr_pops, master)
  File "~/neural-admixture/nadmenv/lib/python3.11/site-packages/neural_admixture/src/train.py", line 30, in fit_model
    data, y = utils.initialize_data(master, trX, tr_pops)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/neural-admixture/nadmenv/lib/python3.11/site-packages/neural_admixture/src/utils.py", line 113, in initialize_data
    data = trX.compute()
           ^^^^^^^^^^^^^
  File "~/neural-admixture/nadmenv/lib/python3.11/site-packages/dask/base.py", line 379, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/neural-admixture/nadmenv/lib/python3.11/site-packages/dask/base.py", line 667, in compute
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/neural-admixture/nadmenv/lib/python3.11/site-packages/dask/base.py", line 667, in <listcomp>
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
                   ^^^^^^^^
  File "~/neural-admixture/nadmenv/lib/python3.11/site-packages/dask/array/core.py", line 1282, in finalize
    return concatenate3(results)
           ^^^^^^^^^^^^^^^^^^^^^
  File "~/neural-admixture/nadmenv/lib/python3.11/site-packages/dask/array/core.py", line 5313, in concatenate3
    result = np.empty(shape=shape, dtype=dtype(deepfirst(arrays)))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 293. GiB for an array with shape (487929, 161240) and data type float32
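
For scale, that request is exactly the full genotype matrix held as float32 (4 bytes per genotype), i.e. the whole dataset brought into host memory at once:

python -c "print(487929 * 161240 * 4 / 2**30)"   # ~293.08, matching the 293 GiB in the error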

Thanks for your help!

JC

joansaurina self-assigned this May 13, 2025

joansaurina (Collaborator) commented

Hi JC,

Thank you for the detailed message and the traceback — this isn’t an error on your side.

Neural-Admixture is indeed built to handle large-scale datasets, including biobank-level data. So your dataset — nearly 500,000 samples and 161k markers — is well within the expected range.

That said, the memory issue you’re encountering stems from a known bug in the data loading pipeline.

The good news: we’re releasing an update later this week that resolves this issue. The new version significantly reduces GPU memory usage during training, especially with large datasets like yours.

I’ll follow up here as soon as the update is live!

Best regards,
Joan

jcgrenier (Author) commented

Hello @joansaurina,

That's very good news! I will wait for the update and hope it resolves our issue.
Thanks a lot for getting back to me so quickly!

Best,

JC

jcgrenier (Author) commented

Hello @joansaurina, any updates on the new release?

Thanks a lot again!
JC

joansaurina (Collaborator) commented May 22, 2025

It's ready and will come out soon, this week or early next week.

Joan

joansaurina (Collaborator) commented May 27, 2025

Hey @jcgrenier — the new version v1.6.1 is now available!

Make sure to reinstall, and let us know how it goes. :)
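
For example, assuming you installed from PyPI:

pip install --upgrade neural-admixture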

Joan

jcgrenier (Author) commented

Thanks for letting me know! Are there new requirements for this new version? Do we need another Python version?
Because we are working on an HPC, we are required to use virtual environments instead of conda. I was not able to find the new version with my previous environment, and I tried to create a new one, but without success.

Furthermore, when I try to install it from the git repo, I have multiple issues with some dependencies, particularly with numpy: version 2.2.5 seems to be needed, but later on during the installation some other dependency requires an earlier version. Is that normal?

  Downloading scikit-learn-1.4.2.tar.gz (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 18.0 MB/s eta 0:00:00
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [10 lines of output]
      Looking in links: /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo2023/x86-64-v3, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo2023/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo2023/x86-64-v3, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo2023/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
      Collecting setuptools
        Obtaining dependency information for setuptools from https://files.pythonhosted.org/packages/a3/dc/17031897dae0efacfea57dfd3a82fdd2a2aeb58e0ff71b77b87e44edc772/setuptools-80.9.0-py3-none-any.whl.metadata
        Using cached setuptools-80.9.0-py3-none-any.whl.metadata (6.6 kB)
      Processing /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic/wheel-0.45.1+computecanada-py3-none-any.whl
      Collecting Cython>=3.0.8
        Obtaining dependency information for Cython>=3.0.8 from https://files.pythonhosted.org/packages/a7/97/8e8637e67afc09f1b51a617b15a0d1caf0b5159b0f79d47ab101e620e491/cython-3.1.1-py3-none-any.whl.metadata
        Using cached cython-3.1.1-py3-none-any.whl.metadata (3.2 kB)
      ERROR: Could not find a version that satisfies the requirement numpy==2.0.0rc1 (from versions: 1.23.2+computecanada, 1.24.4+computecanada, 1.25.2+computecanada, 1.26.4+computecanada, 2.1.1+computecanada, 2.2.2+computecanada)
      ERROR: No matching distribution found for numpy==2.0.0rc1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

Thanks for your help!

joansaurina (Collaborator) commented

Hey @jcgrenier,

That's strange; we tested with virtualenv and were able to install the new version v1.6.3 successfully.

Could you try again with a fresh Python 3.12 environment?
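
Something along these lines should work (a sketch assuming python3.12 is on your PATH and you are installing the PyPI release; adjust names and paths to your cluster):

python3.12 -m venv nadmenv          # fresh virtual environment
source nadmenv/bin/activate
pip install --upgrade pip
pip install neural-admixture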

Feel free to reach out at [[email protected]] to schedule a call if you're still having trouble.

Joan

jcgrenier (Author) commented

Hello @joansaurina,
I was finally able to install it, but I needed to clone the repo and change some requirement versions.
I also had some limitations in the wheels available on my system for Python 3.12, so I tried it with Python 3.11.5.

It started with an error regarding numpy while trying to install it from the git repo:

ERROR: Could not find a version that satisfies the requirement numpy>=2.2.5 (from neural-admixture) 
ERROR: No matching distribution found for numpy>=2.2.5

So I changed the setup.cfg file so it could work with numpy>=1.21.0,<2.0.0.
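
Concretely, I did something like the following in the cloned repo before installing (the sed pattern is an approximation of the manual edit):

sed -i 's/numpy>=2.2.5/numpy>=1.21.0,<2.0.0/' setup.cfg   # relax the numpy pin
pip install .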

But then torch also had an issue while running the training:

AttributeError: module 'torch.nn' has no attribute 'RMSNorm'

So I extended the requirements so it can take torch 2.4.1 (because 2.4.0 was not available in the wheels on our system).
It looks like it works now.
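
For reference, torch.nn.RMSNorm was only added in PyTorch 2.4, so any older build raises that AttributeError. A quick way to check a given environment:

python -c "import torch; print(torch.__version__, hasattr(torch.nn, 'RMSNorm'))"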

Hope these changes won't create any issues, though.

Thanks.
JC
