[BUG] GPU Acceleration not functioning with CUDA #3138

justicel · 2024-10-22T17:57:42Z

What is the bug?
Attempting to follow the sparse instructions for GPU acceleration using CUDA, here: https://opensearch.org/docs/latest/ml-commons-plugin/gpu-acceleration/

The instructions mostly cover Neuron, but they suggest that CUDA definitely works too. I built a custom Docker image from the Dockerfile included below. It contains OpenSearch, CUDA, and PyTorch.

However, when I attempt to load a pre-built model, nvidia-smi never shows it using the GPU:

opensearch@opensearch-cluster-ml-1:~$ nvidia-smi
Tue Oct 22 17:40:55 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:4E:00.0 Off |                   On |
| N/A   38C    P0             76W /  700W |                  N/A   |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0   10   0   0  |              13MiB /  9984MiB    | 16      0 |  1   0    1    0    1 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The "No running processes found" line definitely means nothing was loaded onto the GPU.

I'm not sure whether something is simply missing or whether this is expected.
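
For a check that is easier to script than reading the table, nvidia-smi can print the compute process list as CSV. The sketch below is only an illustration: the helper name `gpu_process_count` is hypothetical, and it counts lines piped in from `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader`. How that query behaves with MIG enabled or inside a container is not something this thread confirms, so treat the result as a hint rather than proof.

```shell
#!/bin/bash
# Sketch: count GPU compute processes from nvidia-smi CSV output on stdin.
# gpu_process_count is a hypothetical helper name, not part of any tool.
gpu_process_count() {
  # Each non-blank line is one compute process; grep -c prints 0 when
  # nothing matches (and exits non-zero, hence the "|| true").
  grep -c '[^[:space:]]' || true
}
```

Usage, assuming nvidia-smi is on the PATH: `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader | gpu_process_count`.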

How can one reproduce the bug?
Steps to reproduce the behavior:

1. Build the Docker image using the included Dockerfile.
2. Run a cluster with docker-compose or docker, using the built image and ML-type nodes.
3. Attempt to load a model:

#!/bin/bash -x

MODEL_GROUP_ID=$(curl -s -u 'admin:admin' -H 'Content-Type: application/json' -X POST https://<endpoint>/_plugins/_ml/model_groups/_register -d '
{
  "name": "test_model_group",
  "description": "Default ML model group for opensearch pretrained models"
}' | jq .model_group_id)

curl -u 'admin:admin' -X POST -H 'Content-Type: application/json' https://<endpoint>/_plugins/_ml/models/_register -d "$(cat <<EOF
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_group_id": $MODEL_GROUP_ID,
  "model_format": "TORCH_SCRIPT"
}
EOF
)"
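
Note that the script above only registers the model. In ML Commons, deploying a registered model is a separate API call, so a deploy step may be needed before anything shows up on the GPU. The following is a minimal sketch of that follow-up, assuming the same `admin:admin` credentials and `<endpoint>` placeholder as above. The helper names (`task_url`, `deploy_url`, `deploy_model`) are hypothetical. The paths are the ML Commons task and deploy endpoints.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical. ENDPOINT keeps the <endpoint>
# placeholder from the register script and must be filled in by the reader.
ENDPOINT='https://<endpoint>'

# URL for polling a register task; its response carries the model_id
# once registration has finished.
task_url() { printf '%s/_plugins/_ml/tasks/%s' "$ENDPOINT" "$1"; }

# URL that deploys a registered model onto the ML nodes.
deploy_url() { printf '%s/_plugins/_ml/models/%s/_deploy' "$ENDPOINT" "$1"; }

# Send the deploy request for the given model id.
deploy_model() {
  curl -s -u 'admin:admin' -X POST "$(deploy_url "$1")"
}
```

Usage: take the `task_id` returned by the register call, fetch `$(task_url "$TASK_ID")` with curl to read the `model_id`, run `deploy_model "$MODEL_ID"`, and then check nvidia-smi again.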

What is the expected behavior?
The model should be loaded and installed in GPU memory.

What is your host/environment?

OS: Ubuntu
Version: 20.04

Dockerfile:

ARG VERSION=2.17.1

FROM public.ecr.aws/opensearchproject/opensearch:$VERSION AS source

FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel

ARG UID=1000
ARG GID=1000
ARG OPENSEARCH_HOME=/usr/share/opensearch

RUN addgroup --gid $GID opensearch && \
    adduser --uid $UID --gid $GID --home $OPENSEARCH_HOME opensearch

COPY --from=source --chown=$UID:$GID $OPENSEARCH_HOME $OPENSEARCH_HOME
WORKDIR $OPENSEARCH_HOME

RUN echo "export JAVA_HOME=$OPENSEARCH_HOME/jdk" >> /etc/profile.d/java_home.sh && \
    echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile.d/java_home.sh && \
    ls -l $OPENSEARCH_HOME

ENV JAVA_HOME=$OPENSEARCH_HOME/jdk
ENV PATH=$PATH:$JAVA_HOME/bin:$OPENSEARCH_HOME/bin

# Add k-NN lib directory to library loading path variable
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$OPENSEARCH_HOME/plugins/opensearch-knn/lib"

# Change user
USER $UID

# Setup OpenSearch
# Disable security demo installation during image build, and allow user to disable during startup of the container
# Enable security plugin during image build, and allow user to disable during startup of the container
ARG DISABLE_INSTALL_DEMO_CONFIG=true
ARG DISABLE_SECURITY_PLUGIN=false
RUN ./opensearch-onetime-setup.sh

# Expose ports for the opensearch service (9200 for HTTP and 9300 for internal transport) and performance analyzer (9600 for the agent and 9650 for the root cause analysis component)
EXPOSE 9200 9300 9600 9650

ARG VERSION
ARG BUILD_DATE
ARG NOTES

# CMD to run
ENTRYPOINT ["./opensearch-docker-entrypoint.sh"]
CMD ["opensearch"]

Additional note: Hopefully a newer version of pytorch and cuda are actually supported?

justicel · 2024-10-22T18:00:24Z

Oh also the logs from the pod when the model was loaded:

Downloading: 100% |========================================| model_meta_list.json
Downloading: 100% |========================================| config.json
[2024-10-22T17:40:17,148][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,205][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,210][INFO ][o.o.m.e.i.MLIndicesHandler] [opensearch-cluster-ml-1] create index:.plugins-ml-model
[2024-10-22T17:40:17,217][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,240][INFO ][o.o.m.m.MLModelManager   ] [opensearch-cluster-ml-1] create new model meta doc UmxQtZIBu7HrnioNsKS8 for register model task K5NQtZIBOJotuW9ArPuc
[2024-10-22T17:40:17,339][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
Downloading: 100% |========================================| all-MiniLM-L12-v2.zip
[2024-10-22T17:40:33,836][INFO ][o.o.m.m.MLModelManager   ] [opensearch-cluster-ml-1] Model registered successfully, model id: UmxQtZIBu7HrnioNsKS8, task id: K5NQtZIBOJotuW9ArPuc

justicel · 2024-10-23T03:44:14Z

So I finally got it working. The documentation for CUDA is incomplete and outdated. Given that DJL 0.28.0 is now being used, the documented CUDA 11.6 and PyTorch 1.13.1 versions are incorrect. The versions should instead be taken from DJL's supported list:

https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/README.md#supported-pytorch-versions

I was able to get it launched with PyTorch 2.2.2 and CUDA 12.1. There is also quite a lot missing from the docs for building a working Docker image. Here's what finally worked for me:

ARG VERSION=2.17.1

# Supported pytorch versions from here: https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/README.md#supported-pytorch-versions
# Note: There is also a cross-section with supported cuda. You can most easily figure it out from the file archive here: https://publish.djl.ai/pytorch/2.2.2/files.txt
ARG PYTORCH_VERSION=2.2.2

FROM public.ecr.aws/opensearchproject/opensearch:$VERSION AS source

FROM pytorch/pytorch:$PYTORCH_VERSION-cuda12.1-cudnn8-devel

ARG UID=1000
ARG GID=1000
ARG OPENSEARCH_HOME=/usr/share/opensearch

RUN addgroup --gid $GID opensearch \
  && adduser --uid $UID --gid $GID --home $OPENSEARCH_HOME opensearch

# Install pytorch components

RUN pip install transformers

COPY --from=source --chown=$UID:$GID $OPENSEARCH_HOME $OPENSEARCH_HOME
WORKDIR $OPENSEARCH_HOME

RUN echo "export JAVA_HOME=$OPENSEARCH_HOME/jdk" >> /etc/profile.d/java_home.sh && \
    echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile.d/java_home.sh && \
    ls -l $OPENSEARCH_HOME

ENV JAVA_HOME=$OPENSEARCH_HOME/jdk
ENV PATH=$PATH:$JAVA_HOME/bin:$OPENSEARCH_HOME/bin
# Re-declare the pre-FROM build arg so its default is in scope in this stage
ARG PYTORCH_VERSION
ENV PYTORCH_VERSION=$PYTORCH_VERSION

# Add k-NN lib directory to library loading path variable
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$OPENSEARCH_HOME/plugins/opensearch-knn/lib"

# Change user
USER $UID

# Setup OpenSearch
# Disable security demo installation during image build, and allow user to disable during startup of the container
# Enable security plugin during image build, and allow user to disable during startup of the container
ARG DISABLE_INSTALL_DEMO_CONFIG=true
ARG DISABLE_SECURITY_PLUGIN=false
RUN ./opensearch-onetime-setup.sh

# Expose ports for the opensearch service (9200 for HTTP and 9300 for internal transport) and performance analyzer (9600 for the agent and 9650 for the root cause analysis component)
EXPOSE 9200 9300 9600 9650

ARG VERSION
ARG BUILD_DATE
ARG NOTES

# CMD to run
ENTRYPOINT ["./opensearch-docker-entrypoint.sh"]
CMD ["opensearch"]
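
The Dockerfile comments above point at the DJL file archive for working out which CUDA builds exist for a given PyTorch version. A minimal sketch of that lookup follows. It assumes the listing names CUDA flavors with tokens such as `cu121`, which is an assumption about the file's layout, and `list_cuda_flavors` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print the distinct CUDA flavor tokens (cu118, cu121, ...) found in
# a DJL files.txt listing read from stdin. The "cuNNN" naming is assumed.
list_cuda_flavors() {
  grep -oE 'cu[0-9]+' | sort -u
}
```

Usage: `curl -s https://publish.djl.ai/pytorch/2.2.2/files.txt | list_cuda_flavors`, then pick a `pytorch/pytorch` base image tag whose CUDA version matches one of the printed flavors.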

mingshl · 2024-11-05T19:20:34Z

@justicel glad that it's working, can you close this issue if it's resolved?

justicel added the bug and untriaged labels Oct 22, 2024

dhrubo-os added this to ml-commons projects Oct 28, 2024

dhrubo-os moved this to Untriaged in ml-commons projects Oct 28, 2024

mingshl removed the untriaged label Nov 5, 2024

mingshl moved this from Untriaged to In Progress in ml-commons projects Nov 5, 2024
