
[Bug] Regression: "opea/vllm-gaudi:latest" container in crash loop #1038

Open

eero-t opened this issue Dec 16, 2024 · 4 comments
Labels
bug Something isn't working

Comments

eero-t (Contributor) commented Dec 16, 2024

Priority

Undecided

OS type

Ubuntu

Hardware type

Gaudi2

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source

Deploy method

  • Docker compose
  • Docker
  • Kubernetes
  • Helm

Running nodes

Single Node

What's the version?

https://hub.docker.com/layers/opea/vllm-gaudi/latest/images/sha256-d2c0b0aa88cd26ae2084990663d8d789728f658bacacd8a49cc5b81a6a022c58

Description

The vllm-gaudi:latest container does not find any devices and is in a crash loop.

If I change the latest tag to 1.1, it works fine, i.e. this is a regression.

Reproduce steps

Apply: opea-project/GenAIInfra#610

Then run ChatQnA from GenAIInfra:
$ helm install chatqna chatqna/ --skip-tests --values chatqna/gaudi-vllm-values.yaml ...
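(As a sanity check on the deployment side, rendering the chart locally shows what accelerator resource the vLLM pod actually requests; the habana.ai/gaudi resource name below is the one the Habana device plugin normally registers, assumed here rather than read from the chart.)

$ # Render the chart and inspect the accelerator resource requested for the vLLM pod
$ helm template chatqna chatqna/ --values chatqna/gaudi-vllm-values.yaml | grep -B 2 -A 2 "habana.ai/gaudi"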

Raw log

$ kubectl logs chatqna-vllm-75dfb59d66-wp4vs
...
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 132, in current_device
    init()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 71, in init
    _hpu_C.init()
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
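A minimal reproduction of the same device-acquisition path, without going through vLLM, would be something like the following (assuming the image ships the usual torch.hpu-style helpers; since the pod is crash-looping, this may need to be run from a debug copy of the pod instead):

$ kubectl exec chatqna-vllm-75dfb59d66-wp4vs -- \
    python3 -c 'import habana_frameworks.torch.hpu as hpu; print(hpu.is_available(), hpu.device_count())'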
eero-t added the bug label on Dec 16, 2024
eero-t (Contributor, Author) commented Dec 16, 2024

The only change in OPEA Git since v1.1 is dropping of the eager option, and the v1.1 image works fine with that change:
https://github.com/opea-project/GenAIComps/commits/main/comps/llms/text-generation/vllm/langchain/dependency/

However, comparing the latest image layers to those of the earlier v1.1 image:
https://hub.docker.com/layers/opea/vllm-gaudi/1.1/images/sha256-c75d22e05ff23e4c0745e9c0a56ec74763f85c7fecf23b7f62e0da74175ddae7

shows quite a few differences, including in the sizes of the installed layers.
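A rough way to see where the two tags diverge, assuming both are pullable locally:

$ docker pull opea/vllm-gaudi:1.1 && docker pull opea/vllm-gaudi:latest
$ # Compare the build steps recorded in each image's layer history
$ diff <(docker history --no-trunc --format '{{.CreatedBy}}' opea/vllm-gaudi:1.1) \
       <(docker history --no-trunc --format '{{.CreatedBy}}' opea/vllm-gaudi:latest)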

=> I think the problem is on the Habana repo side.

Recent Gaudi vLLM dependency changes are one possibility: https://github.com/HabanaAI/vllm-fork/commits/habana_main/requirements-hpu.txt

Maybe the new HPU deps do not correctly handle the pod's Gaudi plugin device request, which allows vLLM (write) access to only one of the 8 devices in the node?
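If so, it should be visible from inside the pod, since the device plugin only mounts the allocated accelerator and sets the visibility env vars for it. A rough check (the env var and /dev/accel names below are the usual Habana conventions and may differ between driver versions):

$ kubectl exec chatqna-vllm-75dfb59d66-wp4vs -- env | grep -i -e habana -e hl_
$ kubectl exec chatqna-vllm-75dfb59d66-wp4vs -- ls -l /dev/accel/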

xiguiw (Collaborator) commented Dec 19, 2024

I'm not sure how the Docker images are built. I found one build command, but did not find Dockerfile.hpu.

Here both v1.1 and latest are built from the same commit ID (git checkout 3c39626), because Dockerfile.hpu is missing. And judging from the behavior, it should be a Gaudi vLLM service issue, not an OPEA-level one.

@ashahba
Could you let us know the build command for the Docker image on Docker Hub?

eero-t (Contributor, Author) commented Dec 19, 2024

> I'm not sure how the Docker images are built. I found one build command, but did not find Dockerfile.hpu.

@xiguiw As you can see from the OPEA script, it git clones the Habana repo [1], cds into the repo's vllm-fork dir, and builds Dockerfile.hpu from there.

[1] It would be faster to clone just the specific commit instead of first cloning the whole repo and only then checking out that commit.
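For example (sketch; the commit is whatever the build script pins, and GitHub only allows fetch-by-SHA for commits reachable from a branch):

$ git init vllm-fork && cd vllm-fork
$ git remote add origin https://github.com/HabanaAI/vllm-fork.git
$ git fetch --depth 1 origin <commit-sha>
$ git checkout FETCH_HEAD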

xiguiw (Collaborator) commented Dec 24, 2024

> > I'm not sure how the Docker images are built. I found one build command, but did not find Dockerfile.hpu.
>
> @xiguiw As you can see from the OPEA script, it git clones the Habana repo [1], cds into the repo's vllm-fork dir, and builds Dockerfile.hpu from there.

@eero-t
I did not find the exact build command.

My point is that if the vllm-gaudi image is built from the same commit for both v1.1 and latest, it does not make sense that v1.1 works but latest fails.

If the v1.1 and latest Docker images are built from different commit IDs, that is possible.
We can try building the vllm-gaudi Docker image independently (without OPEA), then verify whether the vllm-gaudi service works.
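A rough sketch of such an independent build plus a device sanity check (the image tag, --runtime=habana and HABANA_VISIBLE_DEVICES settings follow the usual Habana container conventions and are assumptions here, not taken from the OPEA build script):

$ git clone https://github.com/HabanaAI/vllm-fork.git && cd vllm-fork
$ git checkout 3c39626
$ docker build -f Dockerfile.hpu -t vllm-gaudi:local .
$ # Quick check that the container can acquire a device at all
$ docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES=all --entrypoint python3 \
    vllm-gaudi:local -c 'import habana_frameworks.torch.hpu as hpu; print(hpu.is_available())'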
