Support for batch processing #192

Open · wants to merge 31 commits into main

Conversation

@coezbek (Contributor) commented Mar 14, 2025

I added support for processing multiple texts at the same time. On my RTX 4060 Ti I can process up to 12 lines of text simultaneously for a 6.7x speed-up. Processing more than that slows down again due to limited memory.

/usr/bin/time -v uv run zonos_batch_cli.py --text "Text 1 with ......." "Text 2 ....." "Text 3 ...."

Timing via /usr/bin/time -v; GPU memory observed via watch -n 1 nvidia-smi:

  • 1 line of text => 45.67s, 5.5 GB GPU RAM
  • 4 lines of text => 61.95s, 5.9 GB GPU RAM, speed-up 2.9x
  • 9 lines of text => 67.7s, 7.2 GB GPU RAM, speed-up 6.1x
  • 12 lines of text => 81.2s, 7.9 GB GPU RAM, speed-up 6.7x
  • 15 lines of text => 146.1s, 7.9 GB GPU RAM, speed-up 4.7x

There are some constraints though:

  • Because the resulting audio clips have variable lengths, the codes can't be returned as a single tensor. This breaks existing callers of model.generate(), because I return a list of per-item code tensors (each of shape [1, 9, T]) rather than one batch-sized tensor [B, 9, T]. If there is another way, I would be curious. Audio decoding is therefore also performed per item (see the sketch below).
  • I have only implemented passing multiple texts at the same time. It should be easy to also let the other conditioning parameters vary per item. I am less sure about the generate parameters such as the sampling settings.
  • The torch seed can't be set individually for each item of a batch.
  • There is a bug with the audio prefix: all audio files except the first one have noise at the beginning. I suspect a bug in DAC or Zonos and haven't been able to track it down. (Update: fixed in b9da552.)

@darkacorn @gabrielclark3330 Maybe you have ideas on the first and last item.
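
For illustration, a minimal caller-side sketch of consuming the per-item list of codes. It assumes the standard Zonos decode API and that the patched make_cond_dict accepts a list of texts; file names and texts are placeholders.

import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

texts = ["Text 1 ...", "Text 2 ...", "Text 3 ..."]
cond_dict = make_cond_dict(text=texts, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)
codes_list = model.generate(conditioning)  # list of [1, 9, T_i] code tensors

for i, codes in enumerate(codes_list):
    # Each item can have a different length, so decode and save one at a time.
    wav = model.autoencoder.decode(codes).cpu()
    torchaudio.save(f"output_{i}.wav", wav[0], model.autoencoder.sampling_rate)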

@AlexiousITA

Does this provide a different output file for each line? That's what I would need.

@coezbek (Contributor, Author) commented Mar 31, 2025

Yes. You can call it with `--text "Text 1" "Text 2" "Text 3" "Text 4"` and it will generate the audio for all 4 in parallel. On a 3090 you might be able to generate 64 texts in parallel.

@darkacorn (Contributor) left a comment:

LGTM

@coezbek (Contributor, Author) commented Apr 2, 2025

Thanks! I have been using this patch for two weeks now, but there are some limitations to be aware of: the patch does not support the Hybrid model (I haven't tested it!), and it seems slower for single samples when compiling. I still need to investigate whether anything can be done about this.

@sayanb commented Apr 10, 2025

This is awesome! Any plans to combine this with the other PR for streaming? :)

@coezbek (Contributor, Author) commented Apr 10, 2025

@sayanb The batching patch aims to help you if you have a lot of samples to generate at the same time. The streaming patch helps you if you want low-latency responses.

Technically, these two could be combined to answer multiple low-latency streams in chunks, but it wouldn't make much sense for typical use cases I think.

Do you want to use batch processing but get just one big file back?

@sayanb commented Apr 10, 2025

@coezbek the use case I had in mind was a real-time streaming service offered as an API. If this API receives too many concurrent hits, it might struggle to scale unless a significant number of GPUs are made available, either in one server or across multiple servers behind a load balancer. By combining batching and streaming, the concurrent requests could be grouped, processed as a batch in one GPU, and the outputs streamed back to the respective clients. This way, the service could scale without requiring an excessive number of GPUs.

@coezbek (Contributor, Author) commented Apr 10, 2025

@sayanb This is more of a commercial use case. I am sure any freelancer could whip that up based on the two patches. If you need support, you can find me on LinkedIn.

@mrdrprofuroboros

@sayanb I'm working on batched real-time Triton inference this week. I'm not sure whether it makes sense to open-source it, since it is quite specific to our Triton deployment and ensembles, but we'll try to share the Zonos parts at least.

@derektan5

@coezbek
Firstly, thanks for the great work on the Zonos model!
I've been exploring batch generation using the provided CLI examples (like zonos_batch_cli.py) as a starting point. My goal is to efficiently process a list of text prompts where each prompt can potentially have a different speaker reference, maximizing GPU utilization through batching.

I initially modified the batch script to load multiple speaker embeddings and attempted to pass them along with the corresponding texts to make_cond_dict and subsequently model.prepare_conditioning / model.generate. However, this approach ran into errors, first within make_cond_dict (e.g., ValueError: only one element tensors can be converted...) and later within model.prepare_conditioning (e.g., TypeError: list indices must be integers or slices, not str), suggesting these functions might expect uniform conditioning parameters (like the speaker embedding) across the items within a single batch.

As a workaround, I've successfully implemented a "Group by Speaker" strategy (a sketch follows the list below):

  1. Load all unique speaker embeddings.
  2. Group the input tasks by their assigned speaker reference.
  3. Iterate through each speaker group sequentially.
  4. Within each speaker group, use batching (up to max_per_batch) by calling make_cond_dict, prepare_conditioning, and model.generate with the list of texts for that speaker and their single, shared speaker embedding.
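
A minimal sketch of this grouping strategy, assuming tasks are dicts with "text" and "speaker_path" keys (illustrative), the standard Zonos speaker-embedding API, and the list-of-texts batching from this PR:

from collections import defaultdict

import torchaudio
from zonos.conditioning import make_cond_dict

def generate_grouped(model, tasks, max_per_batch=12):
    # 1. Load each unique speaker reference once.
    embeddings = {}
    for task in tasks:
        path = task["speaker_path"]
        if path not in embeddings:
            wav, sr = torchaudio.load(path)
            embeddings[path] = model.make_speaker_embedding(wav, sr)

    # 2. Group the input tasks by their speaker reference.
    groups = defaultdict(list)
    for task in tasks:
        groups[task["speaker_path"]].append(task["text"])

    # 3./4. Process each speaker group sequentially, batching within the group.
    results = []
    for path, texts in groups.items():
        for start in range(0, len(texts), max_per_batch):
            chunk = texts[start:start + max_per_batch]
            cond_dict = make_cond_dict(text=chunk, speaker=embeddings[path], language="en-us")
            conditioning = model.prepare_conditioning(cond_dict)
            results.extend(model.generate(conditioning))  # list of per-item code tensors
    return results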

This workaround functions correctly and generates the audio for all speakers. However, I've observed a significant performance difference compared to the original single-speaker batch script (zonos_batch_cli.py).

  1. Running the modified multi-speaker (grouping) script takes roughly ~132 seconds for my test set.
  2. Running the original zonos_batch_cli.py (which processes a batch for only one speaker) takes roughly ~42 seconds for a similar number of generations.

This performance difference is understandable, as my workaround processes speaker groups sequentially, potentially leading to smaller average batch sizes per model.generate call and losing some parallelism compared to a large, single-speaker batch.

My question is:

Is there a more efficient, built-in way within the Zonos (v0.1-transformer) framework to handle batch generation with heterogeneous speaker embeddings?

  • Can model.prepare_conditioning and model.generate actually handle batches where the speaker conditioning varies per item?

  • If so, how should the conditioning data be structured or passed (e.g., modifications to make_cond_dict or how its output is used)?

  • Are there any plans to support this kind of mixed-speaker batching more directly in future versions?

Any insights or alternative strategies you could suggest would be greatly appreciated! I'm happy to provide snippets of the modified script if it helps clarify the workaround.

Thanks again for your work!

@coezbek (Contributor, Author) commented Apr 30, 2025

Hey @derektan5, generating multiple distinct speakers at the same time is possible with a small change to make_cond_dict.

In the make_cond_dict function signature, change the speaker parameter to:

speaker: Union[torch.Tensor, list[torch.Tensor], None] = None,

Replace the final processing loop in make_cond_dict:

for k in unconditional_keys:
    cond_dict.pop(k, None)

# Process items, handling speaker specifically
for k, v in list(cond_dict.items()): # Iterate over a copy
    if k == "speaker":
        if isinstance(v, list): # List of tensors requires stacking
            cond_dict[k] = torch.stack(v, dim=0).unsqueeze(1).to(device)
        elif isinstance(v, torch.Tensor): # Single tensor uses view
            cond_dict[k] = v.view(1, 1, -1).to(device)
        # If v is None, keep it as None for unconditional generation or specific handling
    elif isinstance(v, (float, int, list)): # Convert basic types/lists to tensors
        v_tensor = torch.tensor(v)
        cond_dict[k] = v_tensor.view(1, 1, -1).to(device)
        if k == "emotion": # Normalize emotion vector
            cond_dict[k] /= cond_dict[k].sum(dim=-1)
    elif isinstance(v, torch.Tensor): # Handle any other tensors passed directly
        cond_dict[k] = v.view(1, 1, -1).to(device)
    # Non-tensor types like the 'espeak' tuple are kept as is

return cond_dict
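
For illustration, a hypothetical call with this change, assuming the model is loaded as in the README and combined with the list-of-texts batching from this PR (the reference audio paths are placeholders):

import torchaudio

wav_a, sr_a = torchaudio.load("speaker_a.wav")
wav_b, sr_b = torchaudio.load("speaker_b.wav")
speakers = [model.make_speaker_embedding(wav_a, sr_a),
            model.make_speaker_embedding(wav_b, sr_b)]  # one embedding per text

cond_dict = make_cond_dict(text=["First prompt.", "Second prompt."],
                           speaker=speakers, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)
codes_list = model.generate(conditioning)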

@deguodedongxi

Hey guys, I just jumped in yesterday to allow multiple speaker references to be added, because I really liked your approach. I extended it to accept lists of languages, emotions, etc., so each spoken sentence in a batch can be adjusted dynamically. In my eyes that's absolutely essential for bringing the model to production.

I am also working on a server that acts as a batch collector: if multiple requests come in within a certain window (e.g. 200 ms), they are processed together as one batch but returned as individual responses.

Thanks for your work @coezbek. That was some great work you did! I will share my work when I feel it is done :)

@coezbek (Contributor, Author) commented Apr 30, 2025

@deguodedongxi Very nice to hear! What kind of API would you implement to interface with a batch server?

@deguodedongxi

To keep it simple, I want to play around with a FastAPI server first.
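
A bare-bones sketch of such a batch collector with FastAPI and asyncio; synthesize_batch() is a placeholder for the batched Zonos call from this PR, and the window and batch sizes are arbitrary:

import asyncio

from fastapi import FastAPI, Response

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue()

def synthesize_batch(texts: list[str]) -> list[bytes]:
    # Placeholder: run the batched Zonos generation and return WAV bytes per text.
    raise NotImplementedError

async def batch_worker(window_s: float = 0.2, max_batch: int = 12):
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        deadline = asyncio.get_running_loop().time() + window_s
        # Collect more requests until the window closes or the batch is full.
        while len(batch) < max_batch:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        wavs = await asyncio.to_thread(synthesize_batch, [text for text, _ in batch])
        for (_, fut), wav in zip(batch, wavs):
            fut.set_result(wav)  # each caller gets its own result

@app.on_event("startup")
async def start_worker():
    asyncio.create_task(batch_worker())

@app.post("/tts")
async def tts(text: str):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return Response(content=await fut, media_type="audio/wav")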

@coezbek (Contributor, Author) commented May 17, 2025

It only occurred to me now that max_new_tokens can be reduced if you already know that you are going to generate less content. This reduces the KV cache size and allows you to generate more samples at the same time.

A first test of mine shows that an RTX 3090 starts being capped at 100% GPU usage with 200 concurrent samples, each generated at roughly 16 it/s, for a total throughput of 3200 it/s. When I run a single sample I get 90 it/s, so batching gives a ~35x speed-up in that case.

An RTX 4090 might go even higher, but would need sliding-window attention or other techniques to reduce KV cache usage.
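
For illustration, a rough sketch of capping max_new_tokens from an expected duration, assuming generate() exposes max_new_tokens as in the upstream signature and that the codes run at roughly 86 frames per second of audio; the speaking rate and margin are guesses to tune per voice:

FRAMES_PER_SECOND = 86  # assumed codebook frame rate

def max_tokens_for(text: str, chars_per_second: float = 15.0, margin: float = 1.3) -> int:
    # Rough duration estimate from text length, padded by a safety margin.
    expected_seconds = len(text) / chars_per_second
    return int(expected_seconds * margin * FRAMES_PER_SECOND)

# All items in a batch share one cache length, so use the longest estimate.
codes_list = model.generate(conditioning, max_new_tokens=max(max_tokens_for(t) for t in texts))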

README.md (outdated)

    uv run zonos_batch_cli.py --input_file sample_input.txt --output output/output.wav --text_repeat 2
    ```

    ```python

Review comment on the diff above: This line messes up the markdown for the following Features & Installation sections. FYI.

@coezbek (Contributor, Author) replied: Fixed, thanks.
