
GPU DPS build #5

Open
wildintellect opened this issue Nov 19, 2024 · 6 comments
wildintellect commented Nov 19, 2024

Building a GPU-enabled image on the regular MAAP build infrastructure (CPU) should be possible.
We might need to insert some environment variables to ensure conda solves to the GPU versions of packages.
https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html#overriding-detected-packages

Set the `CONDA_OVERRIDE_CUDA=12` environment variable, or potentially specify the CUDA build of a conda package with `packagename=*=*cuda*`.

Note that isce3 could be pinned as one of the following, which might avoid the need to set the env variable above:

- isce3-cuda=0.24.1
- isce3>=0.24.1=*cuda120*
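Putting the two options together, a build-time sketch might look like the following (the environment-file name and env name `isce3` are illustrative assumptions, not fixed by this thread):

```shell
# Option 1: tell conda's solver that a CUDA 12 driver will be present at
# runtime, even though the CPU-only build host reports none.
CONDA_OVERRIDE_CUDA=12 conda env create -f environment.yml

# Option 2: pin the CUDA build variant of the package directly,
# so no override variable is needed.
conda install -n isce3 "isce3>=0.24.1=*cuda120*"
```

Option 2 is more explicit but ties the pin to a specific CUDA build string; Option 1 keeps the environment file portable.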

Note maap-py is only in the dev yml right now, does that need to be moved to the regular one @nemo794 ?

@wildintellect

@nemo794 can you ask the ISCE3 team if it safely falls back on CPU if no GPU is available so we only have to worry about building 1 image/algorithm?


wildintellect commented Nov 19, 2024

`isce3=0.24.1  # R4.0.4 delivery`
is good for maap-py, but it needs the isce3 change noted above to make sure it pulls the GPU-compatible build.

@wildintellect

Looks like this is handled; the correct deps are included:

```
cuda-cudart                12.6.77       h5888daf_0                   21.9 KiB   conda  cuda-cudart-12.6.77-h5888daf_0.conda
cuda-cudart_linux-64       12.6.77       h3f2d84a_0                   184.2 KiB  conda  cuda-cudart_linux-64-12.6.77-h3f2d84a_0.conda
cuda-version               12.6          h7480c83_3                   20.4 KiB   conda  cuda-version-12.6-h7480c83_3.conda
```

Now I just need to write a test DPS algorithm that checks the GPU status, based on this private ticket: https://github.com/NASA-IMPACT/veda-analytics/issues/130. That ticket uses TensorFlow for the test, since the ISCE3 test is not good: https://github.com/isce-framework/isce3/blob/release-v0.23/tests/python/packages/isce3/core/gpu_check.py

@chuckwondo if you have ideas of another way to test let me know.

@chuckwondo

@wildintellect, in VEDA, I launched a TF instance, created a custom conda env with only isce3 installed w/ CUDA (`isce3>=0.24.1=*cuda120*`), activated the custom env, and simply ran the following, showing that 1 GPU is available:

```
$ python -c "import isce3; print(isce3.cuda.core.get_device_count())"
1
```

Doing the same in a non-NVIDIA instance, produced this error, as expected:

```
$ python -c "import isce3; print(isce3.cuda.core.get_device_count())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Error in file /home/conda/feedstock_root/build_artifacts/isce3_1731982147617/work/cxx/isce3/cuda/core/Device.cu, line 41, function int isce3::cuda::core::getDeviceCount(): cudaError 35 (CUDA driver version is insufficient for CUDA runtime version)
```

In the MAAP ADE, I followed the same steps and also produced the RuntimeError.

I think in DPS, an algorithm run bash script with the following line should suffice for determining the availability of a GPU:

```shell
# Adjust conda env name, if necessary
conda run -n isce3 python -c "import isce3; print(isce3.cuda.core.get_device_count())"
```

If the job fails, then no GPU is available. If it succeeds, you have at least 1 GPU available.


wildintellect commented Nov 21, 2024

@chuckwondo
Do we really want the job to fail, or should we wrap that in a try so that it passes either way and log the results? We could also pass a true/false argument to the algorithm to control whether a failure at this step should abort the run.

Since this is an ISCE3 specific way, should it go in this repo as alternate algorithm?

@chuckwondo

> Do we really want the job to fail, or should we wrap that in a try so that it passes either way, and log the results? Could also pass a T/F arg to the alg on if a fail at this step should abort the run.

We could make the script succeed no matter what, if we want to avoid failing the job. All we need to do is append || true to the line in the bash script:

```shell
conda run -n isce3 python -c "import isce3; print(isce3.cuda.core.get_device_count())" || true
```

That should ensure that we always get an exit code of 0, but if there's no GPU, we'll see the error message captured in the _stderr.txt file.

Of course, there are alternatives, including your suggestion for exposing a boolean input. We could even do as you said and wrap things in a try/except block and avoid generating a traceback altogether, and simply output 0 as the number of GPUs in that case.
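As a minimal sketch of that try/except variant (the env and function names come from earlier in this thread; treating a missing or CPU-only isce3 build the same as "no GPU" is my assumption):

```python
def gpu_count() -> int:
    """Return the number of visible CUDA GPUs, or 0 if the check fails."""
    try:
        import isce3
        return isce3.cuda.core.get_device_count()
    except Exception:
        # Covers a CPU-only isce3 build, a missing or too-old CUDA driver,
        # or no GPU attached -- the failure modes seen above in this thread.
        return 0

print(f"GPUs available: {gpu_count()}")
```

This always exits 0 and produces no traceback, so the count can simply be grepped out of the job's stdout log.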

Up to you, really. What are you wanting to achieve more generally? Do you want users to be able to use this as a means of checking that a particular queue does or does not launch GPU instances?

> Since this is an ISCE3 specific way, should it go in this repo as alternate algorithm?

This is isce3-specific only because that's how this all began, but if we want to avoid isce3, then perhaps we can find a more general library for testing GPU availability.

Either way, I'd lean toward not adding this as an alternate algorithm in this repo, particularly if you want to use it more generally as a GPU test. The logic isn't tied to this repo in any way, other than currently using isce3 to perform the GPU check, and that's not a necessity.
