
GPU DPS build #5

Open
wildintellect opened this issue Nov 19, 2024 · 6 comments
wildintellect commented Nov 19, 2024

Building a GPU-enabled image on the regular MAAP build infrastructure (CPU) should be possible.
We might need to insert some environment variables to ensure conda solves to the GPU versions of packages.
https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html#overriding-detected-packages

Set the `CONDA_OVERRIDE_CUDA=12` environment variable, or potentially specify the CUDA build of a conda package with `packagename=*=*cuda*`.

Note that isce3 could be pinned as one of the following, which might avoid the need to set the env variable above:

- isce3-cuda=0.24.1
- isce3>=0.24.1=*cuda120*
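Putting the two options together, a build-time sketch might look like the following (the environment-file name and env name `isce3` are illustrative assumptions, not fixed by this thread):

```shell
# Option 1: tell conda's solver that a CUDA 12 driver will be present at
# runtime, even though the CPU-only build host reports none.
CONDA_OVERRIDE_CUDA=12 conda env create -f environment.yml

# Option 2: pin the CUDA build variant of the package directly,
# so no override variable is needed.
conda install -n isce3 "isce3>=0.24.1=*cuda120*"
```

Option 2 is more explicit but ties the pin to a specific CUDA build string; Option 1 keeps the environment file portable.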

Note maap-py is only in the dev yml right now, does that need to be moved to the regular one @nemo794 ?

@wildintellect

@nemo794 can you ask the ISCE3 team if it safely falls back on CPU if no GPU is available so we only have to worry about building 1 image/algorithm?


wildintellect commented Nov 19, 2024

`isce3=0.24.1  # R4.0.4 delivery`
is good for maap-py, but it needs the isce3 change noted above to make sure it pulls the GPU-compatible build.

@wildintellect

Looks like this is handled; the correct deps are included:

```
cuda-cudart                12.6.77       h5888daf_0                   21.9 KiB   conda  cuda-cudart-12.6.77-h5888daf_0.conda
cuda-cudart_linux-64       12.6.77       h3f2d84a_0                   184.2 KiB  conda  cuda-cudart_linux-64-12.6.77-h3f2d84a_0.conda
cuda-version               12.6          h7480c83_3                   20.4 KiB   conda  cuda-version-12.6-h7480c83_3.conda
```

Now I just need to write a test DPS algorithm that checks the GPU status, based on this private ticket: https://github.com/NASA-IMPACT/veda-analytics/issues/130. That ticket uses TensorFlow for the test, since the ISCE3 test is not good: https://github.com/isce-framework/isce3/blob/release-v0.23/tests/python/packages/isce3/core/gpu_check.py

@chuckwondo if you have ideas of another way to test let me know.

@chuckwondo

@wildintellect, in VEDA, I launched a TF instance, created a custom conda env with only isce3 installed w/ CUDA (`isce3>=0.24.1=*cuda120*`), activated the custom env, and simply ran the following, showing that 1 GPU is available:

```
$ python -c "import isce3; print(isce3.cuda.core.get_device_count())"
1
```

Doing the same in a non-NVIDIA instance, produced this error, as expected:

```
$ python -c "import isce3; print(isce3.cuda.core.get_device_count())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Error in file /home/conda/feedstock_root/build_artifacts/isce3_1731982147617/work/cxx/isce3/cuda/core/Device.cu, line 41, function int isce3::cuda::core::getDeviceCount(): cudaError 35 (CUDA driver version is insufficient for CUDA runtime version)
```

In the MAAP ADE, I followed the same steps and also produced the RuntimeError.

I think in DPS, an algorithm run bash script with the following line should suffice for determining the availability of a GPU:

```shell
# Adjust conda env name, if necessary
conda run -n isce3 python -c "import isce3; print(isce3.cuda.core.get_device_count())"
```

If the job fails, then no GPU is available. If it succeeds, you have at least 1 GPU available.


wildintellect commented Nov 21, 2024

@chuckwondo
Do we really want the job to fail, or should we wrap that in a try so that it passes either way and log the results? We could also pass a true/false argument to the algorithm to control whether a failure at this step should abort the run.

Since this is an ISCE3 specific way, should it go in this repo as alternate algorithm?

@chuckwondo

> Do we really want the job to fail, or should we wrap that in a try so that it passes either way, and log the results? Could also pass a T/F arg to the alg on if a fail at this step should abort the run.

We could make the script succeed no matter what, if we want to avoid failing the job. All we need to do is append || true to the line in the bash script:

```shell
conda run -n isce3 python -c "import isce3; print(isce3.cuda.core.get_device_count())" || true
```

That should ensure that we always get an exit code of 0, but if there's no GPU, we'll see the error message captured in the _stderr.txt file.

Of course, there are alternatives, including your suggestion for exposing a boolean input. We could even do as you said and wrap things in a try/except block and avoid generating a traceback altogether, and simply output 0 as the number of GPUs in that case.
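As a minimal sketch of that try/except variant (the env and function names come from earlier in this thread; treating a missing or CPU-only isce3 build the same as "no GPU" is my assumption):

```python
def gpu_count() -> int:
    """Return the number of visible CUDA GPUs, or 0 if the check fails."""
    try:
        import isce3
        return isce3.cuda.core.get_device_count()
    except Exception:
        # Covers a CPU-only isce3 build, a missing or too-old CUDA driver,
        # or no GPU attached -- the failure modes seen above in this thread.
        return 0

print(f"GPUs available: {gpu_count()}")
```

This always exits 0 and produces no traceback, so the count can simply be grepped out of the job's stdout log.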

Up to you, really. What are you wanting to achieve more generally? Do you want users to be able to use this as a means of checking that a particular queue does or does not launch GPU instances?

> Since this is an ISCE3 specific way, should it go in this repo as alternate algorithm?

This is isce3-specific only because that's how this all began, but if we want to avoid isce3, then perhaps we can find a more general library for testing GPU availability.

Either way, I'd lean toward not adding this as an alternate algorithm in this repo, particularly if you want to use it more generally as a GPU test. The logic isn't tied to this repo in any way, other than currently using isce3 to perform the GPU check, and that's not a necessity.
