Support for batch processing #192
Conversation
Does this provide a different output file for each line? That's what I would need.
Yes. You can call it with `--text "Text 1" "Text 2" "Text 3" "Text 4"` and it will generate all 4 texts in parallel. On a 3090 you might be able to generate 64 texts in parallel.
LGTM
Thanks! I have been using this patch for two weeks now, but there is one limitation to be aware of: the patch does not support the Hybrid model (haven't tested it!) and it seems slower for single samples when compiling. I still need to investigate whether anything can be done about this.
This is awesome! Any plans to combine this with the other PR for streaming? :)
@sayanb The batching patch aims to help you if you have a lot of samples to generate at the same time. The streaming patch helps you if you want low-latency responses. Technically, these two could be combined to answer multiple low-latency streams in chunks, but it wouldn't make much sense for typical use cases, I think. Do you want to use batch processing but get just one big file back?
@coezbek the use case I had in mind was a real-time streaming service offered as an API. If this API receives too many concurrent hits, it might struggle to scale unless a significant number of GPUs are made available, either in one server or across multiple servers behind a load balancer. By combining batching and streaming, the concurrent requests could be grouped, processed as a batch on one GPU, and the outputs streamed back to the respective clients. This way, the service could scale without requiring an excessive number of GPUs.
@sayanb This is more of a commercial use case. I am sure any freelancer can whip that up based on both patches. If you need support you can find me on LinkedIn.
@sayanb I'm working on batched realtime Triton inference this week. Not sure though if it makes sense to open-source it, since it gets quite specific to our Triton deployment and ensembles, but we'll try to share the Zonos parts at least.
@coezbek I initially modified the batch script to load multiple speaker embeddings and attempted to pass them along with the corresponding texts, which did not work directly. As a workaround, I've successfully implemented a "Group by Speaker" strategy: texts are grouped by their speaker embedding, and each group is generated as its own batch.
This workaround functions correctly and generates the audio for all speakers. However, I've observed a significant performance difference compared to the original single-speaker batch script.
This performance difference is understandable, as my workaround processes speaker groups sequentially, potentially leading to smaller average batch sizes. My question is: is there a more efficient, built-in way within the Zonos (v0.1-transformer) framework to handle batch generation with heterogeneous speaker embeddings?
Any insights or alternative strategies you could suggest would be greatly appreciated! I'm happy to provide snippets of the modified script if it helps clarify the workaround. Thanks again for your work!
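A rough sketch of the "Group by Speaker" idea described above (a minimal sketch, assuming a hypothetical `generate_batch(texts, speaker)` wrapper around the patched script; the helper names are illustrative, not the actual modified code):

```python
from collections import defaultdict

def group_by_speaker(requests):
    """Group (text, speaker) requests so each speaker becomes one batch.

    `requests` is a list of (text, speaker_key) pairs, where speaker_key is
    something hashable such as a reference-audio path or speaker id
    (an assumption made for this sketch).
    """
    groups = defaultdict(list)
    for text, speaker in requests:
        groups[speaker].append(text)
    return groups

def synthesize_all(requests, generate_batch):
    """Run one batched generation per speaker group, sequentially.

    `generate_batch(texts, speaker)` is a hypothetical callback around the
    patched batch script; it is assumed to return one audio clip per text.
    """
    outputs = {}
    for speaker, texts in group_by_speaker(requests).items():
        outputs[speaker] = generate_batch(texts, speaker)
    return outputs
```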
Hey @derektan5, generating multiple distinct speakers at the same time is possible with a small change to `make_cond_dict`: adjust the `make_cond_dict` function signature and replace its final processing loop.
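A minimal sketch of that idea (assuming each speaker embedding has a leading batch dimension of 1; the helper below is illustrative, not the exact snippet from the comment): accept one embedding per text and stack them along the batch dimension so each sample is conditioned on its own speaker.

```python
import torch

def stack_speakers(speakers: list[torch.Tensor]) -> torch.Tensor:
    """Stack per-sample speaker embeddings along the batch dimension.

    Assumes each embedding has a leading batch dimension of 1 (e.g. [1, D]);
    the result is [B, D], so sample i in the batch is conditioned on
    speakers[i] rather than on one speaker broadcast to the whole batch.
    """
    return torch.cat(speakers, dim=0)

# Hypothetical usage (parameter names are assumptions):
# cond_dict = make_cond_dict(text=texts, speaker=stack_speakers(speaker_list), language="en-us")
```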
Hey guys, I just jumped in yesterday to allow multiple speaker references to be added, because I really liked your approach. I extended it to also accept lists of languages, emotions, etc., to dynamically adjust each spoken sentence in a batch. In my eyes that's absolutely important to bring the model to production. I am also working on a server that acts as a batch collector: if multiple requests come in within a certain period (e.g. 200 ms), they are processed together as a batch but returned as individual responses. Thanks for your work @coezbek, that was some great work you did! I will share my work when I feel it is done :)
@deguodedongxi Very nice to hear! What kind of API would you implement to interface with a batch server? |
To keep it simple, I want to play around with a FastAPI server first.
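Not part of this PR, but a minimal sketch of what such a batch collector could look like on top of FastAPI (the 200 ms window comes from the comment above; `generate_batch` is a dummy placeholder for the batched Zonos call, and the endpoint name and payload are assumptions):

```python
import asyncio
from fastapi import FastAPI, Response
from pydantic import BaseModel

app = FastAPI()
queue = None                 # created on startup, once the event loop exists
BATCH_WINDOW_S = 0.2         # collect requests for up to 200 ms
MAX_BATCH = 32

class TTSRequest(BaseModel):
    text: str

def generate_batch(texts):
    """Placeholder for the batched Zonos generation from this PR."""
    return [f"audio for: {t}".encode() for t in texts]

async def batch_worker():
    while True:
        item = await queue.get()
        batch = [item]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + BATCH_WINDOW_S
        # Keep collecting until the window closes or the batch is full.
        while len(batch) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        texts = [text for text, _ in batch]
        # Run the (blocking) generation in a thread so the event loop stays responsive.
        audios = await asyncio.to_thread(generate_batch, texts)
        for (_, fut), audio in zip(batch, audios):
            fut.set_result(audio)

@app.on_event("startup")
async def start_worker():
    global queue
    queue = asyncio.Queue()
    asyncio.create_task(batch_worker())

@app.post("/tts")
async def tts(req: TTSRequest):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((req.text, fut))
    audio = await fut        # resolved by batch_worker once its batch is done
    return Response(content=audio, media_type="audio/wav")
```

Run with e.g. `uvicorn batch_server:app` (module name assumed); each client gets its own response even though generation happened in one batch.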
It only occurred to me now that max_new_tokens can be reduced if you already know that you are going to generate less content. This reduces the KV cache size and allows you to generate more samples at the same time. A first test of mine shows that an RTX 3090 starts being capped at 100% GPU usage at 200 concurrent samples, which are generated at roughly 16 it/s for a total throughput of 3200 it/s. When I run a single sample I get 90 it/s, so batching gives a 35x speed-up in such a case. An RTX 4090 might go even higher, but would need sliding-window attention or other techniques to reduce KV cache usage.
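For example, one rough way to budget max_new_tokens from the expected speech duration (a sketch; the ~86 tokens-per-second code rate, the words-per-second estimate, and the generate keyword are assumptions to check against the model defaults):

```python
TOKENS_PER_SECOND = 86     # assumed code rate of the Zonos autoencoder
WORDS_PER_SECOND = 2.5     # rough speaking-rate estimate, tune for your texts

def estimate_max_new_tokens(text: str, margin: float = 1.3) -> int:
    """Estimate a max_new_tokens budget from the text length.

    A smaller budget shrinks the per-sample KV cache, which is what lets
    more samples fit into the same batch.
    """
    seconds = len(text.split()) / WORDS_PER_SECOND
    return int(seconds * TOKENS_PER_SECOND * margin)

# Hypothetical usage with the batched generate from this PR:
# codes = model.generate(conditioning, max_new_tokens=max(estimate_max_new_tokens(t) for t in texts))
```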
README.md
Outdated
uv run zonos_batch_cli.py --input_file sample_input.txt --output output/output.wav --text_repeat 2
```

```python
This line messes up the markdown for the following Features & Installation sections. FYI.
Fixed, thanks.
I added support for processing multiple texts at the same time. On my RTX 4060 Ti I can process up to 12 lines of text simultaneously for a 6.7x speed-up. Processing more slows down due to limited memory.
/usr/bin/time -v uv run zonos_batch_cli.py --text "Text 1 with ......." "Text 2 ....." "Text 3 ...."
Timing: GPU mem via `watch -n 1 nvidia-smi`
There are some constraints though:
- The interface of `model.generate()` changes, because I am returning a list of code tensors (`[[1, 9, T]]`) rather than the batch-sized tensor `[B, 9, T]`. If there is another way, I would be curious.
- Audio generation is performed as a list due to this.
- There is a bug with the audio prefix: all audio files except the first one have noise at the beginning of the audio. I think this is a bug in the DAC or Zonos. I haven't been able to fix this. (fixed in b9da552)

@darkacorn @gabrielclark3330 Maybe you have ideas on the first and last item.
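For context, a sketch of how a caller would consume the list-shaped return value (the decode and save calls follow the upstream Zonos README; `codes_list` and the patched return type are assumptions based on the description above):

```python
import torchaudio
from zonos.model import Zonos

# Model loaded as in the upstream README.
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# `codes_list` is assumed to be what the patched model.generate() returns:
# one [1, 9, T_i] code tensor per input text instead of a single [B, 9, T].
codes_list = []  # e.g. the return of the patched model.generate(...) for a batch of texts

for i, codes in enumerate(codes_list):
    wavs = model.autoencoder.decode(codes).cpu()
    torchaudio.save(f"output/output_{i}.wav", wavs[0], model.autoencoder.sampling_rate)
```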