
Attaining arbitrarily long audio generation using chunked generation and latent space interpolation #101


Open · wants to merge 5 commits into main

Conversation

InconsolableCellist

Attaining arbitrarily long audio generation using chunked generation and latent space interpolation

Overview

This PR introduces chunked generation with latent space interpolation, intended for use with voice cloning and the transformer model variant (not the hybrid). The implementation uses overlapping windows in the latent space to maintain coherence across chunk boundaries.

Important Usage Notes

  • Best Used With:
    • Voice cloning (speaker audio provided)
    • Transformer model variant only
    • Longer multi-sentence texts
  • Not Recommended For:
    • Hybrid model variant
    • Generation without voice cloning

Key Changes

Core Generation

  • Added sentence-based chunking for long-form text processing (see the sketch after this list)
  • Implemented NLTK-based sentence tokenization
  • Added cosine-based crossfade between chunks
  • Raised the maximum generation length from 30 seconds to 120 seconds (the length the chunked latent approach can generate is unbounded)
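
A minimal sketch of the chunking step, assuming NLTK's punkt sentence tokenizer (how many sentences the PR groups per chunk is an assumption here):

import nltk

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

def split_into_chunks(text: str, sentences_per_chunk: int = 1) -> list[str]:
    # Tokenize long-form text into sentences, then group them into chunks
    sentences = nltk.sent_tokenize(text)
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]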

Gradio Interface

  • New toggle in Advanced Parameters

Technical Implementation

  1. Text is split into sentences using NLTK
  2. Each sentence is processed independently with the same seed
  3. Overlap regions are analyzed for best transition points
  4. Cosine crossfade is applied at chunk boundaries (sketched below)
  5. Results are concatenated with smooth transitions
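
A minimal sketch of the crossfade in step 4, using a raised-cosine window over the overlap region; the tensor shapes and the exact window shape used by the PR are assumptions:

import math
import torch

def cosine_crossfade(prev: torch.Tensor, nxt: torch.Tensor, overlap: int) -> torch.Tensor:
    # prev and nxt are [channels, time]; blend the last/first `overlap` frames
    t = torch.linspace(0.0, 1.0, overlap)
    fade_out = 0.5 * (1.0 + torch.cos(math.pi * t))  # ramps 1 -> 0
    fade_in = 1.0 - fade_out                         # ramps 0 -> 1
    blended = prev[:, -overlap:] * fade_out + nxt[:, :overlap] * fade_in
    return torch.cat([prev[:, :-overlap], blended, nxt[:, overlap:]], dim=1)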

Misc.

  • Added a cache directory mounted via the Docker Compose file, to avoid repeatedly downloading the models during development

Limitations/Improvements Needed

  • Audio Artifacting
    • Chunking introduces discontinuities in the latent space, which cause small audio artifacts; these sometimes manifest as the sound of a microphone being adjusted and at other times aren't perceptible
    • Sentences are sometimes separated by a second-long pause

Examples

With Latent Windowing (123 seconds)

latent.windowing.4.mp4

Without Latent Windowing (46 seconds)

regular_3.mp4

@FurkanGozukara

amazing improvement

@darkacorn
Contributor

interesting approach with latents .. i spoke to the team 1-2 days ago about this ..

internally they just splice it, gen the pieces individually, and stitch them together on the production api (that was the information i got)

a proposed option they recommended to make the transition smooth would be to prefix the last 2-3 words as prefix audio and cut that out of gen 2 (the text has to be prefixed too) - that could maybe allow infinite length in theory

but you would eventually need some asr to prefix the text chunks too

the ideal solution is probably somewhere in the middle - thanks for that approach
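
A minimal sketch of the text half of that prefix idea; build_prefixed_text is an illustrative helper (not part of this PR or the Zonos API), and the audio half - feeding the matching prefix audio into generation and later cutting it back out at an ASR-derived timestamp - is not shown:

def build_prefixed_text(prev_text: str, next_text: str, n_words: int = 3) -> tuple[str, str]:
    # Carry the last few words of the previous chunk into the next chunk's text,
    # so the next generation continues from matching prefix audio; the returned
    # prefix is what later has to be located (e.g. via whisper word timestamps)
    # and trimmed out of the next chunk's generated audio.
    prefix = " ".join(prev_text.split()[-n_words:])
    return prefix, f"{prefix} {next_text}"

prefix, chunk2_text = build_prefixed_text("and she returned home.", "The forest was quiet.")
# prefix == "she returned home."
# chunk2_text == "she returned home. The forest was quiet."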

@InconsolableCellist
Author

I haven't had a chance to try the playground version. Does it have better performance doing it that way, and is it consistent?

What's ASR in that context?

@darkacorn
Contributor

asr - whisper - pretty much stt, as otherwise it'd be hard to know when to cut off and what to feed back in. the text has to be prefixed just the way prefixes work .. - the playground has a few differences from what we have in oss - namely that they seem to use different samplers (internally), albeit the model being inferenced is the transformer

@InconsolableCellist
Author

Ah whisper, gotcha.

Do you think the performance of my solution won't be enough to make it into upstream? Or do you want to take that other approach eventually and not use this?

@darkacorn
Contributor

darkacorn commented Feb 16, 2025

no man .. i think your approach is super interesting, and something i would not have thought of. i was merely relaying the conversations i had with the team to find out how they do it and what ideas they have

ideally someone would find something that works for arbitrary length and mamba too - but this is a very cool approach already!

@Ph0rk0z

Ph0rk0z commented Feb 17, 2025

This is quite important and basically has to be done for every TTS. Otherwise we have a hard limit on length.

@Ph0rk0z mentioned this pull request on Feb 17, 2025
@InconsolableCellist
Author

I merged the upstream changes in for the sampler but it creates dramatically worse results for me now, not sure why yet.

@@ -10,7 +10,10 @@ services:
     network_mode: "host"
     stdin_open: true
     tty: true
-    command: ["python3", "gradio_interface.py"]
+    command: ["bash", "-c", "pip install nltk && python3 -c 'import nltk; nltk.download(\"punkt\"); nltk.download(\"punkt_tab\")' && python3 gradio_interface.py"]

Why not install nltk with the rest of the Python packages? Furthermore, don't you already download punkt on line 99 in gradio_interface.py?


# Sample next token for each codebook
for j in range(self.config.n_codebooks):
    next_token = sample_token(


sample_token seems to be missing?

@Malone678

Seeking Help with Repeated Words and Volume Inconsistencies in Chunked TTS Generation (Using PR#101)

I've been working on generating a ~106.5-second audio narration for a short story using the Zonos model, and I'm running into two persistent issues: repeated words at chunk boundaries and inconsistent volume across sections. I'm using a chunking approach inspired by PR#101's guidance on handling longer audio, but I would like to refine it further if possible. Here's the setup and what I'm aiming for; any suggestions or insights would be hugely appreciated!

What I'm Trying to Do

I want to convert a four-paragraph story into a seamless audio file, using my voice prompt for the speaker embedding. The model has a ~30-second generation limit (as noted), so I'm splitting the text into smaller chunks, generating audio for each, and stitching them together with crossfades. The code sample below is the closest I've gotten: it produces most of the story, but it still has issues.

I'm running in Google Colab; the relevant parts are below:

import os
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict
from zonos.utils import DEFAULT_DEVICE as device
from IPython.display import Audio
import nltk

# Split into chunks (2 sentences each)
all_chunks = []
for para_idx, para in enumerate(paragraphs, 1):
    sentences = nltk.sent_tokenize(para)
    chunk_size = 2
    for i in range(0, len(sentences), chunk_size):
        chunk_text = " ".join(sentences[i:i + chunk_size])
        all_chunks.append({"text": chunk_text, "para": para_idx})

# Add overlap (last 3 words from previous chunk)
for i in range(1, len(all_chunks)):
    prev_chunk = all_chunks[i - 1]["text"]
    words = prev_chunk.split()
    prefix = " ".join(words[-3:]) + " " if len(words) >= 3 else prev_chunk + " "
    all_chunks[i]["text"] = prefix + all_chunks[i]["text"]

# Generate audio
wavs_list = []
speaking_rate = 8.5
for i, chunk in enumerate(all_chunks, 1):
    print(f"Generating Chunk {i} (Para {chunk['para']})...")
    cond_dict = make_cond_dict(text=chunk["text"], speaker=speaker, language="en-us", speaking_rate=speaking_rate)
    conditioning = model.prepare_conditioning(cond_dict)
    codes = model.generate(conditioning)
    wav = model.autoencoder.decode(codes).cpu()
    if wav.dim() == 3:
        wav = wav[0]
    # Trim leading silence, then flip to trim trailing silence with the same VAD
    wav = torchaudio.functional.vad(wav, TARGET_SR, trigger_level=5.0)
    wav = torchaudio.functional.vad(wav.flip(1), TARGET_SR, trigger_level=5.0).flip(1)
    if i > 1:  # Trim prefix audio
        prefix_samples = int(1.0 * TARGET_SR)  # ~1s for prefix
        if prefix_samples < wav.shape[1]:
            wav = wav[:, prefix_samples:]
    wavs_list.append(wav)
    print(f"Generated Chunk {i}: {wav.shape[1] / TARGET_SR:.2f}s")
    torch.cuda.empty_cache()

# Combine with 200ms crossfade
overlap_seconds = 0.2
overlap_samples = int(overlap_seconds * TARGET_SR)
combined_wav = wavs_list[0]
for i in range(1, len(wavs_list)):
    curr_wav = wavs_list[i]
    prev_end = combined_wav[:, -overlap_samples:]
    curr_start = curr_wav[:, :overlap_samples]
    fade_out = torch.linspace(1, 0, overlap_samples)
    fade_in = torch.linspace(0, 1, overlap_samples)
    crossfade = prev_end * fade_out + curr_start * fade_in
    combined_wav = torch.cat([combined_wav[:, :-overlap_samples], crossfade, curr_wav[:, overlap_samples:]], dim=1)

# Final processing
combined_wav = torchaudio.functional.vad(combined_wav.flip(1), TARGET_SR, trigger_level=5.0).flip(1)
total_duration = combined_wav.shape[1] / TARGET_SR
print(f"Total Duration: {total_duration:.2f}s")
rms = torch.sqrt(torch.mean(combined_wav ** 2))
if rms > 0:
    combined_wav = combined_wav * (0.1 / rms)
combined_wav = torch.clamp(combined_wav, -0.9, 0.9)

Pull Requests Used

PR#101: I'm leveraging this PR, which extends generation to support up to 120 seconds of audio by "splicing and generating individually, then stitching together." My approach splits the story into chunks of 2 sentences each, generates audio for each chunk, and stitches them with a 200ms crossfade. However, I suspect the 30-second limit might still apply in practice, as some chunks don't fully generate.

What It Produces

The code aims to produce a ~106.5-second WAV file containing the full narration of the four paragraphs. It splits each paragraph into chunks of 2 sentences (e.g., 6-8 chunks total), generates audio for each using my voice prompt, trims silence with VAD (trigger 5.0), and combines them with a 200ms crossfade for smooth transitions. The final audio is normalized to an RMS of 0.1.

Issues I'm Facing

Repeated Words: At chunk boundaries, words from my text like "mysterious clearing" or "she returned home" repeat into the next sentence. I prefix each chunk (except the first) with the last 3 words of the previous chunk and trim ~1 second from the audio start, but this isn't fully preventing overlap during the crossfade. Is there a better way to handle prefixes or trim more precisely?

Inconsistent Volume: The volume varies across sections; some parts are louder, others quieter. I normalize the final combined audio to 0.1 RMS, but per-chunk variations persist. Should I normalize each chunk individually before combining, or is there a known fix for consistent loudness?

Has anyone else seen words repeating at chunk boundaries when using PR#101's splicing method? Any tips on trimming the prefix audio more accurately (e.g., dynamic timing instead of a fixed 1s)?

Is there a known issue with volume fluctuations in the model's output? Could this be tied to the DAC or generation process? Any recommended preprocessing or postprocessing tricks?

Even with PR#101, I sometimes get incomplete generations (e.g., progress stops at 48%). Is the 30-second limit still a factor, or am I hitting a different constraint?

It seems not to be generating the chunks as intended:

Generating Chunk 1 (Para 1)...
Generating: 80%|████████ | 2077/2588 [00:22<00:05, 91.73it/s]
Generated Chunk 1: 23.32s
Generating Chunk 2 (Para 2)...
Generating: 81%|████████▏ | 2109/2588 [00:22<00:05, 93.15it/s]
Generated Chunk 2: 22.74s
Generating Chunk 3 (Para 3)...
Generating: 64%|██████▍ | 1666/2588 [00:18<00:09, 92.53it/s]
Generated Chunk 3: 17.20s
Generating Chunk 4 (Para 3)...
Generating: 32%|███▏ | 834/2588 [00:09<00:19, 92.22it/s]
Generated Chunk 4: 7.99s
Generating Chunk 5 (Para 4)...
Generating: 56%|█████▌ | 1441/2588 [00:15<00:12, 92.85it/s]
Generated Chunk 5: 14.39s
Generating Chunk 6 (Para 4)...
Generating: 55%|█████▌ | 1433/2588 [00:15<00:12, 92.77it/s]
Generated Chunk 6: 14.94s
Total Duration: 99.28s

@darkacorn
Contributor

the % you see is the % of the max tokens - yes, the 30 sec is very much a constraint. you could go over, but it will get weird
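
For reference, the progress percentage maps to audio length roughly as follows; the tokens-per-second rate below is estimated from chunk 1 in the log above and is only an approximation:

MAX_NEW_TOKENS = 2588               # the token budget shown as 100% in the progress bar
TOKENS_PER_SECOND = 2077 / 23.32    # chunk 1: 2077 tokens decoded to 23.32 s, ~89 tok/s

max_seconds = MAX_NEW_TOKENS / TOKENS_PER_SECOND
print(f"token budget corresponds to roughly {max_seconds:.0f} s of audio")  # ~29 s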
