
Attaining arbitrarily long audio generation using chunked generation and latent space interpolation #101


Open · wants to merge 5 commits into main

Conversation

InconsolableCellist

Attaining arbitrarily long audio generation using chunked generation and latent space interpolation

Overview

This PR introduces chunked generation with latent space interpolation, intended for use with voice cloning and the transformer model variant (not the hybrid). The implementation uses overlapping windows in the latent space to maintain coherence across chunk boundaries.

Important Usage Notes

  • Best Used With:
    • Voice cloning (speaker audio provided)
    • Transformer model variant only
    • Longer multi-sentence texts
  • Not Recommended For:
    • Hybrid model variant
    • Generation without voice cloning

Key Changes

Core Generation

  • Added sentence-based chunking for long-form text processing (see the sketch after this list)
  • Implemented NLTK-based sentence tokenization
  • Added cosine-based crossfade between chunks
  • Raised the maximum generation length from 30 seconds to 120 seconds (the length the chunked latent approach can generate is unbounded)
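
A minimal sketch of the chunking step, assuming NLTK's punkt sentence tokenizer (how many sentences the PR groups per chunk is an assumption here):

import nltk

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

def split_into_chunks(text: str, sentences_per_chunk: int = 1) -> list[str]:
    # Tokenize long-form text into sentences, then group them into chunks
    sentences = nltk.sent_tokenize(text)
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]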

Gradio Interface

  • New toggle in Advanced Parameters

Technical Implementation

  1. Text is split into sentences using NLTK
  2. Each sentence is processed independently with the same seed
  3. Overlap regions are analyzed for best transition points
  4. Cosine crossfade is applied at chunk boundaries (sketched below)
  5. Results are concatenated with smooth transitions
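
A minimal sketch of the crossfade in step 4, using a raised-cosine window over the overlap region; the tensor shapes and the exact window shape used by the PR are assumptions:

import math
import torch

def cosine_crossfade(prev: torch.Tensor, nxt: torch.Tensor, overlap: int) -> torch.Tensor:
    # prev and nxt are [channels, time]; blend the last/first `overlap` frames
    t = torch.linspace(0.0, 1.0, overlap)
    fade_out = 0.5 * (1.0 + torch.cos(math.pi * t))  # ramps 1 -> 0
    fade_in = 1.0 - fade_out                         # ramps 0 -> 1
    blended = prev[:, -overlap:] * fade_out + nxt[:, :overlap] * fade_in
    return torch.cat([prev[:, :-overlap], blended, nxt[:, overlap:]], dim=1)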

Misc.

  • Added a cache directory mounted via the Docker Compose file, to avoid repeatedly downloading the models during development

Limitations/Improvements Needed

  • Audio Artifacting
    • Chunking introduces discontinuities in the latent space, which cause small audio artifacts; these sometimes manifest as the sound of a microphone being adjusted and at other times aren't perceptible
    • Sentences are sometimes separated by a second-long pause

Examples

With Latent Windowing (123 seconds)

latent.windowing.4.mp4

Without Latent Windowing (46 seconds)

regular_3.mp4

@FurkanGozukara

amazing improvement

@darkacorn
Contributor

interesting approach with latents .. i spoke to the team 1-2 days ago about this ..

internally they just splice it, gen the pieces individually, and stitch them together on the production api (that was the information i got)

a proposed option they recommended to make the transition smooth would be to prefix the last 2-3 words as prefix audio and cut that out of gen 2 (the text has to be prefixed too) - that could maybe allow infinite length in theory

but you would eventually need some asr to prefix the text chunks too

the ideal solution is probably somewhere in the middle - thanks for that approach
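
A minimal sketch of the text half of that prefix idea; build_prefixed_text is an illustrative helper (not part of this PR or the Zonos API), and the audio half - feeding the matching prefix audio into generation and later cutting it back out at an ASR-derived timestamp - is not shown:

def build_prefixed_text(prev_text: str, next_text: str, n_words: int = 3) -> tuple[str, str]:
    # Carry the last few words of the previous chunk into the next chunk's text,
    # so the next generation continues from matching prefix audio; the returned
    # prefix is what later has to be located (e.g. via whisper word timestamps)
    # and trimmed out of the next chunk's generated audio.
    prefix = " ".join(prev_text.split()[-n_words:])
    return prefix, f"{prefix} {next_text}"

prefix, chunk2_text = build_prefixed_text("and she returned home.", "The forest was quiet.")
# prefix == "she returned home."
# chunk2_text == "she returned home. The forest was quiet."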

@InconsolableCellist
Author

I haven't had a chance to try the playground version. Does it have better performance doing it that way, and is it consistent?

What's ASR in that context?

@darkacorn
Contributor

asr - whisper - pretty much stt, as otherwise it'd be hard to know when to cut off and what to feed back in. the text has to be prefixed just the way prefixes work .. - the playground has a few differences from what we have in oss - namely that they seem to use different samplers (internally), albeit the model being inferenced is the transformer

@InconsolableCellist
Author

Ah whisper, gotcha.

Do you think the performance of my solution won't be enough to make it into upstream? Or do you want to take that other approach eventually and not use this?

@darkacorn
Contributor

darkacorn commented Feb 16, 2025

no man .. i think your approach is super interesting, and something i would not have thought of. i was merely relaying the conversations i had with the team to find out how they do it and what ideas they have

ideally someone would find something that works for arbitrary length and mamba too - but this is a very cool approach already!

@Ph0rk0z

Ph0rk0z commented Feb 17, 2025

This is quite important and basically has to be done for every TTS. Otherwise we have a hard limit on length.

@Ph0rk0z mentioned this pull request on Feb 17, 2025
@InconsolableCellist
Author

I merged the upstream changes in for the sampler but it creates dramatically worse results for me now, not sure why yet.

@@ -10,7 +10,10 @@ services:
     network_mode: "host"
     stdin_open: true
     tty: true
-    command: ["python3", "gradio_interface.py"]
+    command: ["bash", "-c", "pip install nltk && python3 -c 'import nltk; nltk.download(\"punkt\"); nltk.download(\"punkt_tab\")' && python3 gradio_interface.py"]

Why not install nltk with the rest of the Python packages? Furthermore, don't you already download punkt on line 99 in gradio_interface.py?


# Sample next token for each codebook
for j in range(self.config.n_codebooks):
    next_token = sample_token(


sample_token seems to be missing?

@Malone678

Seeking Help with Repeated Words and Volume Inconsistencies in Chunked TTS Generation (Using PR#101)

I've been working on generating a ~106.5-second audio narration for a short story using the Zonos model, and I'm running into two persistent issues: repeated words at chunk boundaries and inconsistent volume across sections. I'm using a chunking approach inspired by PR#101's guidance on handling longer audio, but I would like to refine it further if possible. Here's the setup and what I'm aiming for; any suggestions or insights would be hugely appreciated!

What I'm Trying to Do

I want to convert a four-paragraph story into a seamless audio file, using my voice prompt for the speaker embedding. The model has a ~30-second generation limit (as noted), so I'm splitting the text into smaller chunks, generating audio for each, and stitching them together with crossfades. The code sample below is the closest I've gotten: it produces most of the story, but it still has issues.

I'm running in Google Colab; the relevant parts are below:

import os
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict
from zonos.utils import DEFAULT_DEVICE as device
from IPython.display import Audio
import nltk

# Split into chunks (2 sentences each)
all_chunks = []
for para_idx, para in enumerate(paragraphs, 1):
    sentences = nltk.sent_tokenize(para)
    chunk_size = 2
    for i in range(0, len(sentences), chunk_size):
        chunk_text = " ".join(sentences[i:i + chunk_size])
        all_chunks.append({"text": chunk_text, "para": para_idx})

# Add overlap (last 3 words from previous chunk)
for i in range(1, len(all_chunks)):
    prev_chunk = all_chunks[i - 1]["text"]
    words = prev_chunk.split()
    prefix = " ".join(words[-3:]) + " " if len(words) >= 3 else prev_chunk + " "
    all_chunks[i]["text"] = prefix + all_chunks[i]["text"]

# Generate audio
wavs_list = []
speaking_rate = 8.5
for i, chunk in enumerate(all_chunks, 1):
    print(f"Generating Chunk {i} (Para {chunk['para']})...")
    cond_dict = make_cond_dict(text=chunk["text"], speaker=speaker, language="en-us", speaking_rate=speaking_rate)
    conditioning = model.prepare_conditioning(cond_dict)
    codes = model.generate(conditioning)
    wav = model.autoencoder.decode(codes).cpu()
    if wav.dim() == 3:
        wav = wav[0]
    # Trim leading silence, then flip to trim trailing silence with the same VAD
    wav = torchaudio.functional.vad(wav, TARGET_SR, trigger_level=5.0)
    wav = torchaudio.functional.vad(wav.flip(1), TARGET_SR, trigger_level=5.0).flip(1)
    if i > 1:  # Trim prefix audio
        prefix_samples = int(1.0 * TARGET_SR)  # ~1s for prefix
        if prefix_samples < wav.shape[1]:
            wav = wav[:, prefix_samples:]
    wavs_list.append(wav)
    print(f"Generated Chunk {i}: {wav.shape[1] / TARGET_SR:.2f}s")
    torch.cuda.empty_cache()

# Combine with 200ms crossfade
overlap_seconds = 0.2
overlap_samples = int(overlap_seconds * TARGET_SR)
combined_wav = wavs_list[0]
for i in range(1, len(wavs_list)):
    curr_wav = wavs_list[i]
    prev_end = combined_wav[:, -overlap_samples:]
    curr_start = curr_wav[:, :overlap_samples]
    fade_out = torch.linspace(1, 0, overlap_samples)
    fade_in = torch.linspace(0, 1, overlap_samples)
    crossfade = prev_end * fade_out + curr_start * fade_in
    combined_wav = torch.cat([combined_wav[:, :-overlap_samples], crossfade, curr_wav[:, overlap_samples:]], dim=1)

# Final processing
combined_wav = torchaudio.functional.vad(combined_wav.flip(1), TARGET_SR, trigger_level=5.0).flip(1)
total_duration = combined_wav.shape[1] / TARGET_SR
print(f"Total Duration: {total_duration:.2f}s")
rms = torch.sqrt(torch.mean(combined_wav ** 2))
if rms > 0:
    combined_wav = combined_wav * (0.1 / rms)
combined_wav = torch.clamp(combined_wav, -0.9, 0.9)

Pull Requests Used

PR#101: I'm leveraging this PR, which extends generation to support up to 120 seconds of audio by "splicing and generating individually, then stitching together." My approach splits the story into chunks of 2 sentences each, generates audio for each chunk, and stitches them with a 200ms crossfade. However, I suspect the 30-second limit might still apply in practice, as some chunks don't fully generate.

What It Produces

The code aims to produce a ~106.5-second WAV file containing the full narration of the four paragraphs. It splits each paragraph into chunks of 2 sentences (e.g., 6-8 chunks total), generates audio for each using my voice prompt, trims silence with VAD (trigger 5.0), and combines them with a 200ms crossfade for smooth transitions. The final audio is normalized to an RMS of 0.1.

Issues I'm Facing

Repeated Words: At chunk boundaries, words from my text like "mysterious clearing" or "she returned home" repeat into the next sentence. I prefix each chunk (except the first) with the last 3 words of the previous chunk and trim ~1 second from the audio start, but this isn't fully preventing overlap during the crossfade. Is there a better way to handle prefixes or trim more precisely?

Inconsistent Volume: The volume varies across sections; some parts are louder, others quieter. I normalize the final combined audio to 0.1 RMS, but per-chunk variations persist. Should I normalize each chunk individually before combining, or is there a known fix for consistent loudness?

Has anyone else seen words repeating at chunk boundaries when using PR#101's splicing method? Any tips on trimming the prefix audio more accurately (e.g., dynamic timing instead of a fixed 1s)?

Is there a known issue with volume fluctuations in the model's output? Could this be tied to the DAC or generation process? Any recommended preprocessing or postprocessing tricks?

Even with PR#101, I sometimes get incomplete generations (e.g., progress stops at 48%). Is the 30-second limit still a factor, or am I hitting a different constraint?

It seems not to be generating the chunks as intended:

Generating Chunk 1 (Para 1)...
Generating: 80%|████████ | 2077/2588 [00:22<00:05, 91.73it/s]
Generated Chunk 1: 23.32s
Generating Chunk 2 (Para 2)...
Generating: 81%|████████▏ | 2109/2588 [00:22<00:05, 93.15it/s]
Generated Chunk 2: 22.74s
Generating Chunk 3 (Para 3)...
Generating: 64%|██████▍ | 1666/2588 [00:18<00:09, 92.53it/s]
Generated Chunk 3: 17.20s
Generating Chunk 4 (Para 3)...
Generating: 32%|███▏ | 834/2588 [00:09<00:19, 92.22it/s]
Generated Chunk 4: 7.99s
Generating Chunk 5 (Para 4)...
Generating: 56%|█████▌ | 1441/2588 [00:15<00:12, 92.85it/s]
Generated Chunk 5: 14.39s
Generating Chunk 6 (Para 4)...
Generating: 55%|█████▌ | 1433/2588 [00:15<00:12, 92.77it/s]
Generated Chunk 6: 14.94s
Total Duration: 99.28s

@darkacorn
Contributor

the % you see is the % of the max tokens - yes, the 30 sec is very much a constraint. you could go over, but it will get weird
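
For reference, the progress percentage maps to audio length roughly as follows; the tokens-per-second rate below is estimated from chunk 1 in the log above and is only an approximation:

MAX_NEW_TOKENS = 2588               # the token budget shown as 100% in the progress bar
TOKENS_PER_SECOND = 2077 / 23.32    # chunk 1: 2077 tokens decoded to 23.32 s, ~89 tok/s

max_seconds = MAX_NEW_TOKENS / TOKENS_PER_SECOND
print(f"token budget corresponds to roughly {max_seconds:.0f} s of audio")  # ~29 s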
