Support for batch processing #192

Open · wants to merge 31 commits into main

Conversation

@coezbek (Contributor) commented Mar 14, 2025

I added support for processing multiple texts at the same time. On my RTX 4060 Ti I can process up to 12 lines of text simultaneously for a 6.7x speed-up. Processing more than that slows down again due to limited memory.

/usr/bin/time -v uv run zonos_batch_cli.py --text "Text 1 with ......." "Text 2 ....." "Text 3 ...."

Timing via /usr/bin/time -v; GPU memory observed via watch -n 1 nvidia-smi:

  • 1 line of text => 45.67s, 5.5 GB GPU RAM
  • 4 lines of text => 61.95s, 5.9 GB GPU RAM, speed-up 2.9x
  • 9 lines of text => 67.7s, 7.2 GB GPU RAM, speed-up 6.1x
  • 12 lines of text => 81.2s, 7.9 GB GPU RAM, speed-up 6.7x
  • 15 lines of text => 146.1s, 7.9 GB GPU RAM, speed-up 4.7x

There are some constraints though:

  • Because the resulting audio clips have variable lengths, the codes can't be returned as a single tensor. This breaks existing callers of model.generate(), because I return a list of per-item code tensors (each of shape [1, 9, T]) rather than one batch-sized tensor [B, 9, T]. If there is another way, I would be curious. Audio decoding is therefore also performed per item (see the sketch below).
  • I have only implemented passing multiple texts at the same time. It should be easy to also let the other conditioning parameters vary per item. I am less sure about the generate parameters such as the sampling settings.
  • The torch seed can't be set individually for each item of a batch.
  • There is a bug with the audio prefix: all audio files except the first one have noise at the beginning. I suspect a bug in DAC or Zonos and haven't been able to track it down. (Update: fixed in b9da552.)

@darkacorn @gabrielclark3330 Maybe you have ideas on the first and last item.
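
For illustration, a minimal caller-side sketch of consuming the per-item list of codes. It assumes the standard Zonos decode API and that the patched make_cond_dict accepts a list of texts; file names and texts are placeholders.

import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

texts = ["Text 1 ...", "Text 2 ...", "Text 3 ..."]
cond_dict = make_cond_dict(text=texts, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)
codes_list = model.generate(conditioning)  # list of [1, 9, T_i] code tensors

for i, codes in enumerate(codes_list):
    # Each item can have a different length, so decode and save one at a time.
    wav = model.autoencoder.decode(codes).cpu()
    torchaudio.save(f"output_{i}.wav", wav[0], model.autoencoder.sampling_rate)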

@AlexiousITA

Does this provide a different output file for each line? That's what I would need.

@coezbek (Contributor, Author) commented Mar 31, 2025

Yes. You can call it with `--text "Text 1" "Text 2" "Text 3" "Text 4"` and it will generate the audio for all 4 in parallel. On a 3090 you might be able to generate 64 texts in parallel.

@darkacorn (Contributor) left a comment:

LGTM

@coezbek (Contributor, Author) commented Apr 2, 2025

Thanks! I have been using this patch for two weeks now, but there are some limitations to be aware of: the patch does not support the Hybrid model (I haven't tested it!), and it seems slower for single samples when compiling. I still need to investigate whether anything can be done about this.

@sayanb commented Apr 10, 2025

This is awesome! Any plans to combine this with the other PR for streaming? :)

@coezbek (Contributor, Author) commented Apr 10, 2025

@sayanb The batching patch aims to help you if you have a lot of samples to generate at the same time. The streaming patch helps you if you want low-latency responses.

Technically, these two could be combined to answer multiple low-latency streams in chunks, but it wouldn't make much sense for typical use cases I think.

Do you want to use batch processing but get just one big file back?

@sayanb commented Apr 10, 2025

@coezbek the use case I had in mind was a real-time streaming service offered as an API. If this API receives too many concurrent hits, it might struggle to scale unless a significant number of GPUs are made available, either in one server or across multiple servers behind a load balancer. By combining batching and streaming, the concurrent requests could be grouped, processed as a batch in one GPU, and the outputs streamed back to the respective clients. This way, the service could scale without requiring an excessive number of GPUs.

@coezbek (Contributor, Author) commented Apr 10, 2025

@sayanb This is more of a commercial use case. I am sure any freelancer could whip that up based on the two patches. If you need support, you can find me on LinkedIn.

@mrdrprofuroboros

@sayanb I'm working on batched real-time Triton inference this week. I'm not sure whether it makes sense to open-source it, since it is quite specific to our Triton deployment and ensembles, but we'll try to share the Zonos parts at least.

@derektan5

@coezbek
Firstly, thanks for the great work on the Zonos model!
I've been exploring batch generation using the provided CLI examples (like zonos_batch_cli.py) as a starting point. My goal is to efficiently process a list of text prompts where each prompt can potentially have a different speaker reference, maximizing GPU utilization through batching.

I initially modified the batch script to load multiple speaker embeddings and attempted to pass them along with the corresponding texts to make_cond_dict and subsequently model.prepare_conditioning / model.generate. However, this approach ran into errors, first within make_cond_dict (e.g., ValueError: only one element tensors can be converted...) and later within model.prepare_conditioning (e.g., TypeError: list indices must be integers or slices, not str), suggesting these functions might expect uniform conditioning parameters (like the speaker embedding) across the items within a single batch.

As a workaround, I've successfully implemented a "Group by Speaker" strategy (a sketch follows the list below):

  1. Load all unique speaker embeddings.
  2. Group the input tasks by their assigned speaker reference.
  3. Iterate through each speaker group sequentially.
  4. Within each speaker group, use batching (up to max_per_batch) by calling make_cond_dict, prepare_conditioning, and model.generate with the list of texts for that speaker and their single, shared speaker embedding.
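
A minimal sketch of this grouping strategy, assuming tasks are dicts with "text" and "speaker_path" keys (illustrative), the standard Zonos speaker-embedding API, and the list-of-texts batching from this PR:

from collections import defaultdict

import torchaudio
from zonos.conditioning import make_cond_dict

def generate_grouped(model, tasks, max_per_batch=12):
    # 1. Load each unique speaker reference once.
    embeddings = {}
    for task in tasks:
        path = task["speaker_path"]
        if path not in embeddings:
            wav, sr = torchaudio.load(path)
            embeddings[path] = model.make_speaker_embedding(wav, sr)

    # 2. Group the input tasks by their speaker reference.
    groups = defaultdict(list)
    for task in tasks:
        groups[task["speaker_path"]].append(task["text"])

    # 3./4. Process each speaker group sequentially, batching within the group.
    results = []
    for path, texts in groups.items():
        for start in range(0, len(texts), max_per_batch):
            chunk = texts[start:start + max_per_batch]
            cond_dict = make_cond_dict(text=chunk, speaker=embeddings[path], language="en-us")
            conditioning = model.prepare_conditioning(cond_dict)
            results.extend(model.generate(conditioning))  # list of per-item code tensors
    return results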

This workaround functions correctly and generates the audio for all speakers. However, I've observed a significant performance difference compared to the original single-speaker batch script (zonos_batch_cli.py).

  1. Running the modified multi-speaker (grouping) script takes roughly ~132 seconds for my test set.
  2. Running the original zonos_batch_cli.py (which processes a batch for only one speaker) takes roughly ~42 seconds for a similar number of generations.

This performance difference is understandable, as my workaround processes speaker groups sequentially, potentially leading to smaller average batch sizes per model.generate call and losing some parallelism compared to a large, single-speaker batch.

My question is:

Is there a more efficient, built-in way within the Zonos (v0.1-transformer) framework to handle batch generation with heterogeneous speaker embeddings?

  • Can model.prepare_conditioning and model.generate actually handle batches where the speaker conditioning varies per item?

  • If so, how should the conditioning data be structured or passed (e.g., modifications to make_cond_dict or how its output is used)?

  • Are there any plans to support this kind of mixed-speaker batching more directly in future versions?

Any insights or alternative strategies you could suggest would be greatly appreciated! I'm happy to provide snippets of the modified script if it helps clarify the workaround.

Thanks again for your work!

@coezbek (Contributor, Author) commented Apr 30, 2025

Hey @derektan5, generating multiple distinct speakers at the same time is possible with a small change to make_cond_dict.

In the make_cond_dict function signature, change the speaker parameter to:

speaker: Union[torch.Tensor, list[torch.Tensor], None] = None,

Replace the final processing loop in make_cond_dict:

for k in unconditional_keys:
    cond_dict.pop(k, None)

# Process items, handling speaker specifically
for k, v in list(cond_dict.items()): # Iterate over a copy
    if k == "speaker":
        if isinstance(v, list): # List of tensors requires stacking
            cond_dict[k] = torch.stack(v, dim=0).unsqueeze(1).to(device)
        elif isinstance(v, torch.Tensor): # Single tensor uses view
            cond_dict[k] = v.view(1, 1, -1).to(device)
        # If v is None, keep it as None for unconditional generation or specific handling
    elif isinstance(v, (float, int, list)): # Convert basic types/lists to tensors
        v_tensor = torch.tensor(v)
        cond_dict[k] = v_tensor.view(1, 1, -1).to(device)
        if k == "emotion": # Normalize emotion vector
            cond_dict[k] /= cond_dict[k].sum(dim=-1)
    elif isinstance(v, torch.Tensor): # Handle any other tensors passed directly
        cond_dict[k] = v.view(1, 1, -1).to(device)
    # Non-tensor types like the 'espeak' tuple are kept as is

return cond_dict
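
For illustration, a hypothetical call with this change, assuming the model is loaded as in the README and combined with the list-of-texts batching from this PR (the reference audio paths are placeholders):

import torchaudio

wav_a, sr_a = torchaudio.load("speaker_a.wav")
wav_b, sr_b = torchaudio.load("speaker_b.wav")
speakers = [model.make_speaker_embedding(wav_a, sr_a),
            model.make_speaker_embedding(wav_b, sr_b)]  # one embedding per text

cond_dict = make_cond_dict(text=["First prompt.", "Second prompt."],
                           speaker=speakers, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)
codes_list = model.generate(conditioning)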

@deguodedongxi

Hey guys, I just jumped in yesterday to allow multiple speaker references to be added, because I really liked your approach. I extended it to accept lists of languages, emotions, etc., so each spoken sentence in a batch can be adjusted dynamically. In my eyes that's absolutely essential for bringing the model to production.

I am also working on a server that acts as a batch collector: if multiple requests come in within a certain window (e.g. 200 ms), they are processed together as one batch but returned as individual responses.

Thanks for your work @coezbek. That was some great work you did! I will share my work when I feel it is done :)

@coezbek (Contributor, Author) commented Apr 30, 2025

@deguodedongxi Very nice to hear! What kind of API would you implement to interface with a batch server?

@deguodedongxi

To keep it simple, I want to play around with a FastAPI server first.
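
A bare-bones sketch of such a batch collector with FastAPI and asyncio; synthesize_batch() is a placeholder for the batched Zonos call from this PR, and the window and batch sizes are arbitrary:

import asyncio

from fastapi import FastAPI, Response

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue()

def synthesize_batch(texts: list[str]) -> list[bytes]:
    # Placeholder: run the batched Zonos generation and return WAV bytes per text.
    raise NotImplementedError

async def batch_worker(window_s: float = 0.2, max_batch: int = 12):
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        deadline = asyncio.get_running_loop().time() + window_s
        # Collect more requests until the window closes or the batch is full.
        while len(batch) < max_batch:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        wavs = await asyncio.to_thread(synthesize_batch, [text for text, _ in batch])
        for (_, fut), wav in zip(batch, wavs):
            fut.set_result(wav)  # each caller gets its own result

@app.on_event("startup")
async def start_worker():
    asyncio.create_task(batch_worker())

@app.post("/tts")
async def tts(text: str):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return Response(content=await fut, media_type="audio/wav")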

@coezbek (Contributor, Author) commented May 17, 2025

It only occurred to me now that max_new_tokens can be reduced if you already know that you are going to generate less content. This reduces the KV cache size and allows you to generate more samples at the same time.

A first test of mine shows that an RTX 3090 starts being capped at 100% GPU usage with 200 concurrent samples, each generated at roughly 16 it/s, for a total throughput of 3200 it/s. When I run a single sample I get 90 it/s, so batching gives a ~35x speed-up in that case.

An RTX 4090 might go even higher, but would need sliding-window attention or other techniques to reduce KV cache usage.
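
For illustration, a rough sketch of capping max_new_tokens from an expected duration, assuming generate() exposes max_new_tokens as in the upstream signature and that the codes run at roughly 86 frames per second of audio; the speaking rate and margin are guesses to tune per voice:

FRAMES_PER_SECOND = 86  # assumed codebook frame rate

def max_tokens_for(text: str, chars_per_second: float = 15.0, margin: float = 1.3) -> int:
    # Rough duration estimate from text length, padded by a safety margin.
    expected_seconds = len(text) / chars_per_second
    return int(expected_seconds * margin * FRAMES_PER_SECOND)

# All items in a batch share one cache length, so use the longest estimate.
codes_list = model.generate(conditioning, max_new_tokens=max(max_tokens_for(t) for t in texts))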

README.md (outdated)

    uv run zonos_batch_cli.py --input_file sample_input.txt --output output/output.wav --text_repeat 2
    ```

    ```python

Review comment on the diff above: This line messes up the markdown for the following Features & Installation sections. FYI.

@coezbek (Contributor, Author) replied: Fixed, thanks.
