
[BUG] GPU Acceleration not functioning with CUDA #3138

Open
justicel opened this issue Oct 22, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@justicel

What is the bug?
Attempting to follow the sparse instructions for GPU acceleration using CUDA, found here: https://opensearch.org/docs/latest/ml-commons-plugin/gpu-acceleration/

The instructions are written mostly for Neuron but suggest that CUDA is also supported. I built a custom Docker image using the attached Dockerfile; it contains OpenSearch, CUDA, and PyTorch.

However, after loading a pre-built model, nothing ever shows as utilizing the GPU (per nvidia-smi output):

opensearch@opensearch-cluster-ml-1:~$ nvidia-smi
Tue Oct 22 17:40:55 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:4E:00.0 Off |                   On |
| N/A   38C    P0             76W /  700W |                  N/A   |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0   10   0   0  |              13MiB /  9984MiB    | 16      0 |  1   0    1    0    1 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The 'No running processes found' line definitely means nothing was loaded into the GPU.

I'm uncertain whether something is simply missing, or whether this is expected.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Build a Docker image using the included Dockerfile
  2. Run a cluster with docker-compose or docker, using the built image and ML-type nodes
  3. Attempt to load a model:
#!/bin/bash -x

MODEL_GROUP_ID=$(curl -s -u 'admin:admin' -H 'Content-Type: application/json' -X POST https://<endpoint>/_plugins/_ml/model_groups/_register -d '
{
  "name": "test_model_group",
  "description": "Default ML model group for opensearch pretrained models"
}' | jq .model_group_id)

curl -u 'admin:admin' -X POST -H 'Content-Type: application/json' https://<endpoint>/_plugins/_ml/models/_register -d "$(cat <<EOF
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_group_id": $MODEL_GROUP_ID,
  "model_format": "TORCH_SCRIPT"
}
EOF
)"
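One note on the script above (a sketch; `<endpoint>` and the model id are placeholders): registering a model only downloads and stores it. It is the separate `_deploy` call that loads the model into memory, so nvidia-smi will not show anything until the model has been deployed:

```shell
#!/bin/bash
# Sketch: deploy the registered model so it is actually loaded into (GPU) memory.
# MODEL_ID is a placeholder; the register task reports it once it completes
# (it can be polled via GET _plugins/_ml/tasks/<task_id>).
MODEL_ID="<model_id from the register task>"
curl -s -u 'admin:admin' -X POST "https://<endpoint>/_plugins/_ml/models/${MODEL_ID}/_deploy"
```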

What is the expected behavior?
The model should be loaded into GPU memory.

What is your host/environment?

  • OS: Ubuntu
  • Version: 20.04

Dockerfile:

ARG VERSION=2.17.1

FROM public.ecr.aws/opensearchproject/opensearch:$VERSION AS source

FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel

ARG UID=1000
ARG GID=1000
ARG OPENSEARCH_HOME=/usr/share/opensearch

RUN addgroup --gid $GID opensearch && \
    adduser --uid $UID --gid $GID --home $OPENSEARCH_HOME opensearch

COPY --from=source --chown=$UID:$GID $OPENSEARCH_HOME $OPENSEARCH_HOME
WORKDIR $OPENSEARCH_HOME

RUN echo "export JAVA_HOME=$OPENSEARCH_HOME/jdk" >> /etc/profile.d/java_home.sh && \
    echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile.d/java_home.sh && \
    ls -l $OPENSEARCH_HOME

ENV JAVA_HOME=$OPENSEARCH_HOME/jdk
ENV PATH=$PATH:$JAVA_HOME/bin:$OPENSEARCH_HOME/bin

# Add k-NN lib directory to library loading path variable
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$OPENSEARCH_HOME/plugins/opensearch-knn/lib"

# Change user
USER $UID

# Setup OpenSearch
# Disable security demo installation during image build, and allow user to disable during startup of the container
# Enable security plugin during image build, and allow user to disable during startup of the container
ARG DISABLE_INSTALL_DEMO_CONFIG=true
ARG DISABLE_SECURITY_PLUGIN=false
RUN ./opensearch-onetime-setup.sh

# Expose ports for the opensearch service (9200 for HTTP and 9300 for internal transport) and performance analyzer (9600 for the agent and 9650 for the root cause analysis component)
EXPOSE 9200 9300 9600 9650

ARG VERSION
ARG BUILD_DATE
ARG NOTES

# CMD to run
ENTRYPOINT ["./opensearch-docker-entrypoint.sh"]
CMD ["opensearch"]

Additional note: hopefully a newer version of PyTorch and CUDA is actually supported?

@justicel justicel added bug Something isn't working untriaged labels Oct 22, 2024
@justicel
Author

Also, here are the logs from the pod when the model was loaded:

Downloading: 100% |========================================| model_meta_list.jsonl
Downloading: 100% |========================================| config.json
[2024-10-22T17:40:17,148][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,205][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,210][INFO ][o.o.m.e.i.MLIndicesHandler] [opensearch-cluster-ml-1] create index:.plugins-ml-model
[2024-10-22T17:40:17,217][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
[2024-10-22T17:40:17,240][INFO ][o.o.m.m.MLModelManager   ] [opensearch-cluster-ml-1] create new model meta doc UmxQtZIBu7HrnioNsKS8 for register model task K5NQtZIBOJotuW9ArPuc
[2024-10-22T17:40:17,339][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-cluster-ml-1] Cancelling the migration process.
Downloading: 100% |========================================| all-MiniLM-L12-v2.zip
[2024-10-22T17:40:33,836][INFO ][o.o.m.m.MLModelManager   ] [opensearch-cluster-ml-1] Model registered successfully, model id: UmxQtZIBu7HrnioNsKS8, task id: K5NQtZIBOJotuW9ArPuc

@justicel
Author

justicel commented Oct 23, 2024

So I finally got it working. The documentation for CUDA is incomplete and outdated. Given that DJL 0.28.0 is now being used, the documented combination of CUDA 11.6 and PyTorch 1.13.1 is incorrect. The versions should instead come from the supported list here:

https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/README.md#supported-pytorch-versions

I was able to get it launched using PyTorch 2.2.2 and CUDA 12.1. There's also quite a lot missing from the documented Docker image. Here's what finally worked for me:

ARG VERSION=2.17.1

# Supported pytorch versions from here: https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/README.md#supported-pytorch-versions
# Note: There is also a cross-section with supported cuda. You can most easily figure it out from the file archive here: https://publish.djl.ai/pytorch/2.2.2/files.txt
ARG PYTORCH_VERSION=2.2.2

FROM public.ecr.aws/opensearchproject/opensearch:$VERSION AS source

FROM pytorch/pytorch:$PYTORCH_VERSION-cuda12.1-cudnn8-devel

ARG UID=1000
ARG GID=1000
ARG OPENSEARCH_HOME=/usr/share/opensearch

RUN addgroup --gid $GID opensearch \
  && adduser --uid $UID --gid $GID --home $OPENSEARCH_HOME opensearch

# Install the transformers Python package (used by the sentence-transformer models)

RUN pip install transformers

COPY --from=source --chown=$UID:$GID $OPENSEARCH_HOME $OPENSEARCH_HOME
WORKDIR $OPENSEARCH_HOME

RUN echo "export JAVA_HOME=$OPENSEARCH_HOME/jdk" >> /etc/profile.d/java_home.sh && \
    echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile.d/java_home.sh && \
    ls -l $OPENSEARCH_HOME

ENV JAVA_HOME=$OPENSEARCH_HOME/jdk
ENV PATH=$PATH:$JAVA_HOME/bin:$OPENSEARCH_HOME/bin
ENV PYTORCH_VERSION=$PYTORCH_VERSION

# Add k-NN lib directory to library loading path variable
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$OPENSEARCH_HOME/plugins/opensearch-knn/lib"

# Change user
USER $UID

# Setup OpenSearch
# Disable security demo installation during image build, and allow user to disable during startup of the container
# Enable security plugin during image build, and allow user to disable during startup of the container
ARG DISABLE_INSTALL_DEMO_CONFIG=true
ARG DISABLE_SECURITY_PLUGIN=false
RUN ./opensearch-onetime-setup.sh

# Expose ports for the opensearch service (9200 for HTTP and 9300 for internal transport) and performance analyzer (9600 for the agent and 9650 for the root cause analysis component)
EXPOSE 9200 9300 9600 9650

ARG VERSION
ARG BUILD_DATE
ARG NOTES

# CMD to run
ENTRYPOINT ["./opensearch-docker-entrypoint.sh"]
CMD ["opensearch"]
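To pick a compatible base image, the PyTorch/CUDA cross-section mentioned in the Dockerfile comments can be checked from the shell. A sketch (the URL pattern is the one from the comment above; the grep filter is my own assumption about the file layout):

```shell
# Sketch: list the CUDA variants DJL publishes for a given PyTorch version before
# choosing a pytorch/pytorch base image. Lines in files.txt start with a variant
# prefix such as cpu/ or cu121/; extract and dedupe those prefixes.
PYTORCH_VERSION=2.2.2
curl -s "https://publish.djl.ai/pytorch/${PYTORCH_VERSION}/files.txt" \
  | grep -oE 'cu[0-9]+' | sort -u
```

Any `cuNNN` variant listed there should line up with an available `pytorch/pytorch:<version>-cudaNN.N-cudnn8-devel` tag.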

@mingshl
Collaborator

mingshl commented Nov 5, 2024

@justicel Glad that it's working. Can you close this issue if it's resolved?

@mingshl mingshl removed the untriaged label Nov 5, 2024
@mingshl mingshl moved this from Untriaged to In Progress in ml-commons projects Nov 5, 2024