Attaining arbitrarily long audio generation using chunked generation and latent space interpolation #101
Conversation
* Hardcoded window size and overlap
* Changed from linear interpolation to sinusoidal
* Moved UI around, added text
amazing improvement
Interesting approach with latents. I spoke to the team 1-2 days ago about this: internally, on the production API, they just split the text, generate the chunks individually, and stitch them together (that was the information I got). To make the transition smooth, a proposed option they recommended would be to use the last 2-3 words as prefix audio and cut that span out of the second generation (the text has to be prefixed too). In theory that would allow infinite length, but you would eventually need some ASR to prefix the text chunks too. The ideal solution is probably somewhere in the middle. Thanks for this approach!
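The prefix idea above can be sketched purely at the text level. This is a minimal, hypothetical helper (not anything from the PR or the production API): it splits sentences into chunks and prepends the last few words of the previous chunk, so the caller could condition generation on that prefix and trim the matching audio span out of the next chunk.

```python
def chunk_with_prefix(sentences, chunk_size=2, prefix_words=3):
    """Split sentences into chunks; each chunk after the first is prefixed
    with the last `prefix_words` words of the previous chunk.

    Returns a list of (text, n_prefix_words) pairs; the caller would
    generate each chunk with the prefix as prefix audio/text, then cut the
    prefix span out of the generated audio before stitching.
    """
    chunks = [" ".join(sentences[i:i + chunk_size])
              for i in range(0, len(sentences), chunk_size)]
    out = [(chunks[0], 0)]
    for prev, cur in zip(chunks, chunks[1:]):
        prefix = prev.split()[-prefix_words:]
        out.append((" ".join(prefix + cur.split()), len(prefix)))
    return out
```

As the comment above notes, once the input text runs out, keeping this going indefinitely would need ASR (e.g. Whisper) to recover the text to prefix from the already-generated audio.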
I haven't had a chance to try the playground version. Does it have better performance doing it that way, and is it consistent? What's ASR in that context?
ASR - Whisper - pretty much STT, as otherwise it would be hard to know when to cut off and what to feed back in. The text has to be prefixed; that's just the way prefixes work. The playground has a few differences from what we have in OSS, namely that they seem to use different samplers (internally), albeit the model being inferenced is the transformer.
Ah, Whisper, gotcha. Do you think the performance of my solution won't be enough to make it into upstream? Or do you want to do that other approach eventually and won't use this?
No man, I think your approach is super interesting, and something I would not have thought about. I was merely relaying the conversations I had with the team to find out how they do it and what ideas they have. Ideally someone would find something that works for arbitrary length and Mamba too, but this is a very cool approach already!
This is quite important and basically has to be done for every TTS; otherwise we have a hard limit on length.
I merged the upstream changes in for the sampler but it creates dramatically worse results for me now, not sure why yet. |
```diff
@@ -10,7 +10,10 @@ services:
     network_mode: "host"
     stdin_open: true
     tty: true
-    command: ["python3", "gradio_interface.py"]
+    command: ["bash", "-c", "pip install nltk && python3 -c 'import nltk; nltk.download(\"punkt\"); nltk.download(\"punkt_tab\")' && python3 gradio_interface.py"]
```
Why not install nltk with the rest of the Python packages? Furthermore, don't you already download punkt on line 99 in gradio_interface.py?
```python
# Sample next token for each codebook
for j in range(self.config.n_codebooks):
    next_token = sample_token(
```
sample_token seems to be missing?
Seeking Help with Repeated Words and Volume Inconsistencies in Chunked TTS Generation (Using PR #101)

I've been working on generating a ~106.5-second audio narration for a short story using the Zonos model, running in Google Colab. Relevant parts below:

```python
import os

# Split into chunks (2 sentences each)
all_chunks = []

# Add overlap (last 3 words from previous chunk)
for i in range(1, len(all_chunks)):

# Generate audio
wavs_list = []

# Combine with 200ms crossfade
overlap_seconds = 0.2

# Final processing
combined_wav = torchaudio.functional.vad(combined_wav.flip(1), TARGET_SR, trigger_level=5.0).flip(1)
```

Pull Requests Used

What It Produces

- Inconsistent volume: the volume varies across sections; some parts are louder, others quieter.
- Repeated words: has anyone else seen words repeating at chunk boundaries when using PR #101's splicing method?
- Is there a known issue with volume fluctuations in the model's output?

It seems to not be generating the chunks as intended:

Generating Chunk 1 (Para 1)...
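For reference, the 200 ms crossfade step above usually amounts to something like the following equal-power crossfade (a minimal, hypothetical helper, not the exact code used in the question):

```python
import numpy as np

def crossfade(a, b, overlap):
    """Join two mono waveforms with an equal-power crossfade over
    `overlap` samples: `a` fades out (cosine) while `b` fades in (sine)."""
    t = np.linspace(0.0, np.pi / 2, overlap)
    fade_out, fade_in = np.cos(t), np.sin(t)
    mixed = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])
```

If the overlapping regions contain the same words (because the text overlap was not trimmed from the second chunk's audio), a crossfade like this will produce exactly the repeated-word artifact described above.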
The % you see is the % of the max tokens. Yes, the 30-second limit is very much a constraint; you could go over it, but it will get weird.
Attaining arbitrarily long audio generation using chunked generation and latent space interpolation
Overview
This PR introduces chunked generation support with latent space interpolation, intended for use with voice cloning on the transformer model variant (not the hybrid). The implementation uses overlapping windows in the latent space to maintain coherence across chunk boundaries.
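The windowed interpolation can be sketched as follows: per-chunk latent sequences are stitched by blending the overlapping region with sinusoidal (raised-cosine) weights, so the previous chunk fades out while the next fades in. This is an illustrative sketch of the idea, with made-up names and shapes, not the PR's actual implementation.

```python
import numpy as np

def blend_latents(chunks, overlap):
    """Stitch per-chunk latent sequences (each shaped [T, D]) by blending
    the `overlap` frames at each boundary with sinusoidal weights."""
    out = chunks[0]
    t = np.linspace(0.0, np.pi, overlap)
    w = (0.5 * (1.0 + np.cos(t)))[:, None]   # weight goes 1 -> 0 over the overlap
    for nxt in chunks[1:]:
        mixed = out[-overlap:] * w + nxt[:overlap] * (1.0 - w)
        out = np.concatenate([out[:-overlap], mixed, nxt[overlap:]], axis=0)
    return out
```

Compared with linear weights, the sinusoidal ramp has zero slope at both ends of the overlap, which avoids an audible kink at the window boundaries (the motivation for the linear-to-sinusoidal change noted in the commit list above).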
Important Usage Notes
Key Changes
Core Generation
Gradio Interface
Technical Implementation
Misc.
Limitations/Improvements Needed
Examples
With Latent Windowing (123 seconds)
latent.windowing.4.mp4
Without Latent Windowing (46 seconds)
regular_3.mp4