Kyutai #85
base: next
Conversation
…k + FINAL_TRANSCRIPT_DELAY (s) a new utterance is started
kyutai/README.md
Outdated
To build and run the **CUDA** version, first set the `MOSHI_SERVER` environment variable, then run the compose command:

```bash
export MOSHI_SERVER=moshi-server-cuda
docker compose --profile cuda up --build
```
Easy to run, but it needs CUDA 12.9 (maybe add a note).
kyutai/stt/processing/utils.py
Outdated
```python
sr = wav.rate
if sr != SAMPLE_RATE:
    gcd = np.gcd(sr, SAMPLE_RATE)
    data = resample_poly(data, SAMPLE_RATE // gcd, sr // gcd)
```
Note that in the Moshi paper, they mention using AudioSR for upsampling their training data from 8kHz to 24kHz:
https://pypi.org/project/audiosr/
It seems to be overkill here (it's a neural-network-based solution), but maybe there is a better option than polyphase filtering for upsampling.
Also, this resampling code is copy-pasted in several places. It should be factored out into a helper function here (in utils.py) that conforms any audio signal when needed; see the sketch below.
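A minimal sketch of such a helper, assuming `scipy.signal.resample_poly` remains the resampling backend and `SAMPLE_RATE` is the module-level target rate (the name `conform_audio` is hypothetical):

```python
import numpy as np
from scipy.signal import resample_poly

def conform_audio(data: np.ndarray, sr: int) -> np.ndarray:
    """Resample `data` from rate `sr` to SAMPLE_RATE (polyphase filtering)."""
    if sr == SAMPLE_RATE:
        return data
    # Reduce the up/down factors by their GCD to keep the polyphase filter small.
    gcd = np.gcd(sr, SAMPLE_RATE)
    return resample_poly(data, SAMPLE_RATE // gcd, sr // gcd)
```

Callers would then replace each inlined block with something like `data = conform_audio(data, wav.rate)`.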
Fixed by 12363ab
…backend with 2 seconds of blank audio while discarding the first second of the incoming audio stream (well, fixes nothing...)
kyutai/stt/processing/streaming.py
Outdated
```diff
 # Warm-up phase: send a single silent packet to start the stream,
 # then wait for 2 seconds while discarding any incoming audio.
 await ws_server.send(
-    msgpack.packb({"type": "Audio", "pcm": [0.0] * SAMPLE_RATE}, use_single_float=True)
+    msgpack.packb({"type": "Audio", "pcm": [0.0]}, use_single_float=True)
 )
 warmup_end_time = asyncio.get_event_loop().time() + 2.0
 while asyncio.get_event_loop().time() < warmup_end_time:
     try:
         # Discard incoming audio until warmup complete.
         _ = await asyncio.wait_for(ws_client.recv(), timeout=0.05)
     except asyncio.TimeoutError:
         # No message received, just wait
         await asyncio.sleep(0.05)
```
@Jeronymous
I’ve noticed that the moshi-server occasionally returns partially broken transcriptions, and I haven’t figured out the trigger yet. When it happens, the overall accuracy drops and all punctuation disappears, which breaks the “final transcript” utterance segmentation.
After the 1 second of blank audio required to warm up the model (per the moshi docs), I tried skipping the first second of incoming audio (adding a longer initial delay), but I’m not sure it makes any difference.
Do you have any thoughts or ideas on how to diagnose or mitigate this?
To investigate: I haven’t figured out how to properly send final/partial transcripts. The current behavior drifts from expectations (no finals are emitted). I tried using a timeout and detecting utterance ends from punctuation, but that seems wrong; a rough sketch of the timeout approach is below.
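For reference, a minimal sketch of punctuation-free segmentation: buffer incoming words and flush them as a final once no new word has arrived for `FINAL_TRANSCRIPT_DELAY` seconds. It assumes the server emits msgpack word messages shaped like `{"type": "Word", "text": ...}`; `emit_partial` and `emit_final` are hypothetical callbacks.

```python
import asyncio
import msgpack

FINAL_TRANSCRIPT_DELAY = 1.0  # seconds without a new word before closing an utterance

async def segment_utterances(ws_client, emit_partial, emit_final):
    """Emit a final once no new word arrives for FINAL_TRANSCRIPT_DELAY seconds,
    instead of relying on punctuation (which can disappear)."""
    words = []
    while True:
        try:
            msg = await asyncio.wait_for(ws_client.recv(), timeout=FINAL_TRANSCRIPT_DELAY)
        except asyncio.TimeoutError:
            if words:
                # Long enough without a new word: flush the utterance as a final.
                await emit_final(" ".join(words))
                words = []
            continue
        data = msgpack.unpackb(msg)
        if data.get("type") == "Word":  # assumed message shape
            words.append(data["text"])
            await emit_partial(" ".join(words))
```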