Infinite streaming #208
base: main
Conversation
Cross fade streaming
minor improvements from open PRs in Zonos repo
@mrdrprofuroboros Very interesting! I will give this a spin! Couple of questions:
What we also really need to state is that we're just duct-taping until 1.5/2; there is only so much we can get out of a beta release. They are working on it, though.
@darkacorn
Okay, @coezbek's suggestion about putting the first segment at the start eliminates the degradation and works way better, practically enabling infinite generation:

```python
import torch
from tqdm.auto import tqdm
from IPython.display import Audio, display

from zonos.conditioning import make_cond_dict

# `model` and `speaker` are assumed to be loaded already.
texts = [
    "The old clock tower hadn't chimed in living memory.",
    "Its stone face, weathered and stained, watched over the perpetually drowsy town.",
    "Elara, however, felt a strange pull towards it.",
    "She often sketched its silhouette in her worn notebook.",
    "One moonless night, a faint, melodic hum vibrated through the cobblestones beneath her feet.",
    "It seemed to emanate from the silent tower.",
    "Driven by a curiosity stronger than fear, she crept towards the heavy oak door.",
    "Surprisingly, it swung open at her touch, revealing a spiral staircase choked with dust.",
    "The air inside was thick with the scent of ozone and something ancient.",
    "She ascended, each step echoing in the profound stillness.",
    "Higher and higher she climbed, the humming growing louder, resonating within her chest.",
    "Finally, she reached the belfry.",
    "Instead of bells, intricate crystalline structures pulsed with soft, blue light.",
    "They hung suspended, rotating slowly, emitting the enchanting melody.",
    "In the center hovered a sphere of swirling energy.",
    "As Elara approached, the humming intensified, the light brightening.",
    "Tendrils of energy reached out from the sphere, brushing against her fingertips.",
    "A flood of images poured into her mind: star charts, forgotten equations, galaxies blooming and dying.",
    "She wasn't just in a clock tower; she was inside a celestial resonator.",
    "It was a device left by travelers from a distant star, waiting for someone attuned to its frequency.",
    "Elara realized the tower hadn't been silent, just waiting.",
    "She raised her hands, not in fear, but in acceptance.",
    "The energy flowed into her, cool and invigorating.",
    "Suddenly, with a resonant *gong*, the tower chimed, a sound unheard for centuries.",
    "Its song wasn't marking time, but awakening possibilities across the cosmos.",
]

prefixing = True
first_text = ""
first_codes = None
all_segments = []
whitespace = " "

torch.manual_seed(777)
for text in tqdm(texts):
    cond_dict = make_cond_dict(
        text=first_text + text + whitespace,
        language="en-us",
        speaker=speaker,
        pitch_std=120,
    )
    conditioning = model.prepare_conditioning(cond_dict)
    codes = model.generate(conditioning, first_codes, progress_bar=False)
    if prefixing:
        if first_codes is None:
            # Keep the first sentence as a fixed prefix for every later generation.
            first_codes = codes
            first_text = text + whitespace
        else:
            # Cut the prefix back out of the generated codes.
            codes = codes[:, :, first_codes.shape[-1]:]
    wavs = model.autoencoder.decode(codes).cpu()
    all_segments.append(wavs[0])

audio = torch.cat(all_segments, dim=-1)
display(Audio(data=audio, rate=44100))
```

prefixing = False: no-prefix.mp4
prefixing = True: first-as-prefix.mp4
Pushed the updated version and also added a longer log fade at the ends of sentences. Here's an example of a 25-sentence generation: streaming-first-as-prefix.mp4
Okidoke, one more cool update: I reduced initial response latency to 135ms on an RTX 3090. Here's the thing: the model takes some time/tokens to warm up. We can preallocate open streams and feed some warm-up text to them, so once we have real queries, we'd be ready to process them faster. Here's what I came up with:
So it took 885ms to warm up, but then from the point I got the first real sentence to the first response chunk of audio it took only 1017 - 885 = 132ms. Giving it a warmup of "And I say OK" doesn't change the prosody/style of the next sentence from what I saw. At the same time it also produces a pretty natural full-stop pause without actually stopping for a long time. You can experiment with other warmup prefills.
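The original warm-up snippet isn't preserved in this thread, but the pattern can be sketched generically. Everything here is hypothetical and stdlib-only: `WarmStream` is an illustrative name, and a plain `generate` callable stands in for Zonos's actual preallocated streams; only the warm-up text and the rough timings come from the discussion above.

```python
import time
from typing import Callable

class WarmStream:
    """Illustrative sketch (not the Zonos API): pay the one-time cost
    (torch.compile, CUDA-graph capture, etc.) at startup by running a
    throwaway generation, so the first *real* request is fast."""

    def __init__(self, generate: Callable[[str], object],
                 warmup_text: str = "And I say OK."):
        self._generate = generate
        # First call is slow (e.g. ~885 ms on an RTX 3090 per the thread above).
        self.warmup_time = self._timed(warmup_text)

    def _timed(self, text: str) -> float:
        # Run one generation and return elapsed wall-clock seconds.
        t0 = time.perf_counter()
        self._generate(text)
        return time.perf_counter() - t0

    def request(self, text: str) -> float:
        """Generate for a real query and return its latency in seconds
        (e.g. ~132 ms to the first audio chunk after warm-up)."""
        return self._timed(text)
```

The point of the pattern is simply that `warmup_time` is paid once, before any user is waiting, so every `request` afterwards sees only the steady-state latency.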
The warmup is mostly torch compile and building out the CUDA graphs. If you hop on Discord, communication is easier. [coezbek] (oezi on Discord) is in there too.
*We continued the discussion in Discord, but just for the record: when I said "initial response latency" I actually meant TTFB.
I made some minor improvements (fixed warnings) in my branch: https://github.com/coezbek/Zonos/tree/infinite-streaming
Pulled them in, thank you!
…ensors must match except in dimension 0. Expected size 29 but got size 28 for tensor number 1 in the list.
@gabrielclark3330 thoughts? Can this be merged?
Closing #187 in favor of this
So basically this is a long version of seamless streaming; examples below:
Imagine we have something like this to say:
A naive approach before was something like this:
The result is quite anticlimactic, something like
https://github.com/user-attachments/assets/b4756b0a-91d1-4ef5-87e3-6cb94b476946
This PR brings:
Which gives:
streaming.mp4
Here are some stats for an RTX 3090:
This PR incorporates:
The main idea is super simple and straightforward: we split the text into sentences and use the FULL previous sentence's audio codes and text as a prefix for the next one. Thus we know exactly where to cut it and how to stitch it. We also decode tokens in small chunks as they appear during inference and apply a cosine cross-fade to stitch them together, eliminating most audible clicks.
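The cross-fade step can be sketched like this. This is an illustrative, NumPy-based reconstruction, not code from the PR; `cosine_crossfade` and its `overlap` parameter are hypothetical names.

```python
import numpy as np

def cosine_crossfade(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Stitch chunk `b` onto chunk `a`, blending `overlap` samples with a
    cosine ramp so the seam has no discontinuity (no audible click)."""
    t = np.linspace(0.0, np.pi, overlap)
    fade_out = 0.5 * (1.0 + np.cos(t))  # ramps 1 -> 0 over the overlap
    fade_in = 1.0 - fade_out            # ramps 0 -> 1, complementary
    seam = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return np.concatenate([a[:-overlap], seam, b[overlap:]])
```

Because `fade_in + fade_out == 1` at every sample, cross-fading two chunks of an identical signal reproduces that signal exactly; with real audio the blend hides the small mismatch at the chunk boundary.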
Unfortunately it's not truly infinite, since it accumulates the error of the previous generations. It's still usable for roughly <30-40 sec generations, but now they can be streamed with minimal latency.
PS: there's also the ability to load models from the filesystem (both Zonos weights and the embedder).