Distributed inference on Hugging Face Spaces #9142
rgerganov started this conversation in Show and tell
I have been playing with HF Spaces and decided to try running `rpc-server` there. I had partial success, so I will share my experience here.
Their Docker Spaces allow you to expose a TCP port to the outside world, so my first thought was to simply run `rpc-server` in a Docker container and expose its port. Unfortunately, this didn't work, as they support only HTTP(2)-based traffic while `rpc-server` uses a custom binary protocol. I tried to work around this with tunneling solutions like `wstunnel`, with no success.
Then I found that they have SSH support for PRO users and decided to try it out. The idea is to set up an SSH tunnel between two Spaces and forward traffic to `rpc-server`. The steps to do this are:
1. In Space 1, set up an SSH tunnel to Space 2 (`id_ed25519` is a disposable SSH key); see the sketch after this list.
2. Start `rpc-server` in Space 2 on its default port (50052), which is now forwarded to Space 1.
3. Run `llama-cli` in Space 1, offloading to `localhost:50052`.
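To make the steps concrete, here is roughly what they look like on the command line. The SSH user and host are placeholders for whatever endpoint your Space exposes, so this is a sketch rather than the exact invocation:

```sh
# Space 1: open a tunnel so that localhost:50052 points at Space 2
# (<user> and <space2-ssh-host> are placeholders for Space 2's SSH endpoint)
ssh -i id_ed25519 -N -L 50052:localhost:50052 <user>@<space2-ssh-host> &

# Space 2: start the RPC backend on its default port
rpc-server -p 50052

# Space 1: run llama-cli and offload the model to the tunneled rpc-server
llama-cli -m Meta-Llama-3-70B-Instruct-IQ2_M.gguf \
    --rpc localhost:50052 -ngl 99 -p "Hello"
```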
I gave this a try with Meta-Llama-3-70B-Instruct-IQ2_M.gguf (22.46 GiB) and two Spaces, each running on an NVIDIA T4 small (16 GB VRAM), and here is the result:
I believe the performance would be much better if there were a direct network connection between the two Spaces, but there is no way to verify this. It would be great if Hugging Face added support for private networks between their Docker Spaces.
If anyone wants to play with this, here is the Space I've been using.