Kyutai #85


Open
wants to merge 14 commits into next

Conversation

damienlaine
Member

  • Recipe to run Kyutai STT with Docker
  • Protocol wrapper LinTO <-> Kyutai's Moshi, with a recipe to run it and a Dockerfile
  • Basic in-browser mic testing webpages

To investigate: unable to figure out how to properly send final / partial transcripts. The current behavior drifts from expectations (no finals are emitted). I tried using a timeout and detecting the utterance end with punctuation, but the approach seems wrong.
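For illustration, a minimal sketch of the timeout + punctuation heuristic described above; the timeout value, the punctuation set, and the `emit_partial` / `emit_final` callbacks are hypothetical and not part of this PR:

```python
import time

# Hypothetical end-of-utterance punctuation and silence timeout (illustration only).
END_PUNCTUATION = (".", "!", "?")
SILENCE_TIMEOUT = 1.5  # seconds without a new word before forcing a final

class UtteranceSegmenter:
    """Accumulates streamed words and decides when to emit partial vs. final transcripts."""

    def __init__(self, emit_partial, emit_final):
        self.emit_partial = emit_partial
        self.emit_final = emit_final
        self.words = []
        self.last_word_time = time.monotonic()

    def on_word(self, word: str) -> None:
        self.words.append(word)
        self.last_word_time = time.monotonic()
        text = " ".join(self.words)
        if word.endswith(END_PUNCTUATION):
            # Punctuation marks a "semantic" end of utterance -> final transcript.
            self.emit_final(text)
            self.words = []
        else:
            self.emit_partial(text)

    def on_tick(self) -> None:
        # Call periodically: force a final if no word arrived for SILENCE_TIMEOUT seconds.
        if self.words and time.monotonic() - self.last_word_time > SILENCE_TIMEOUT:
            self.emit_final(" ".join(self.words))
            self.words = []
```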

@damienlaine changed the base branch from master to next on July 7, 2025, 00:17
@damienlaine requested a review from Houpert on July 8, 2025, 21:34
@damienlaine
Member Author

  • Implemented Jenkins build files and semver versioning for the LinTO <-> Moshi wrapper
  • Implemented a semantic utterance stop to mimic LinTO's ASR behavior (partial / final transcripts)
    @Houpert: can you please verify the Jenkins CI?
    Once done, we wait for @Jeronymous's go to merge into next and deploy on LinTO preprod.

kyutai/README.md Outdated
To build and run the **CUDA** version, first set the `MOSHI_SERVER` environment variable, then run the compose command:
```bash
export MOSHI_SERVER=moshi-server-cuda
docker compose --profile cuda up --build
```
Member


Easy to run, but it needs CUDA 12.9 (maybe add a note).

sr = wav.rate
if sr != SAMPLE_RATE:
    gcd = np.gcd(sr, SAMPLE_RATE)
    data = resample_poly(data, SAMPLE_RATE // gcd, sr // gcd)
Member


Note that in the Moshi paper, they mention using AudioSR to upsample their training data from 8 kHz to 24 kHz:
https://pypi.org/project/audiosr/
It seems overkill here (it is a neural-network-based solution), but maybe there is a better option than polyphase resampling (`resample_poly`) for upsampling.

Also, this resampling code is copy-pasted in several places. It should be factorized into a helper function here (in utils.py), so that every audio signal can be brought to the expected sample rate when needed.
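A minimal sketch of such a helper, assuming the wrapper's target rate is Moshi's 24 kHz; the function name and placement are suggestions, not part of this PR:

```python
import numpy as np
from scipy.signal import resample_poly

SAMPLE_RATE = 24000  # assumed target rate (Moshi expects 24 kHz audio)

def to_target_rate(data: np.ndarray, sr: int, target_sr: int = SAMPLE_RATE) -> np.ndarray:
    """Resample a 1-D audio signal from sr to target_sr using polyphase filtering."""
    if sr == target_sr:
        return data
    gcd = np.gcd(sr, target_sr)
    return resample_poly(data, target_sr // gcd, sr // gcd)
```

Call sites would then replace the copy-pasted block with something like `data = to_target_rate(data, wav.rate)`.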

Member Author


Fixed by 12363ab

…backend with 2 seconds of blank audio while discarding 1st second of incoming audio stream (well, fixes nothing...)
Comment on lines 34 to 47
# Warm-up phase: send a single silent packet to start the stream,
# then wait for 2 seconds while discarding any incoming audio.
await ws_server.send(
    msgpack.packb({"type": "Audio", "pcm": [0.0] * SAMPLE_RATE}, use_single_float=True)
    msgpack.packb({"type": "Audio", "pcm": [0.0]}, use_single_float=True)
)
warmup_end_time = asyncio.get_event_loop().time() + 2.0
while asyncio.get_event_loop().time() < warmup_end_time:
    try:
        # Discard incoming audio until warmup complete.
        _ = await asyncio.wait_for(ws_client.recv(), timeout=0.05)
    except asyncio.TimeoutError:
        # No message received, just wait
        await asyncio.sleep(0.05)

Member Author


@Jeronymous
I’ve noticed that the moshi-server occasionally returns partially broken transcriptions, and I haven’t figured out the trigger yet. When it happens:

The overall accuracy drops, and all punctuation disappears, which breaks the “final transcript” utterance segmentation.

After sending the 1 second of blank audio suggested in the Moshi docs to warm up the model, I tried skipping the first second of incoming audio (adding a longer initial delay), but I’m not sure it makes any difference.
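For reference, a parameterized sketch of that warm-up, assuming the same `ws_server` / `ws_client` websocket pair and msgpack framing as in the snippet above; the helper name and default durations are illustrative only:

```python
import asyncio
import msgpack

async def warmup(ws_server, ws_client, sample_rate: int,
                 silence_seconds: float = 1.0, discard_seconds: float = 2.0) -> None:
    """Send `silence_seconds` of blank audio to the model, then discard
    everything the server returns for `discard_seconds`."""
    silent_pcm = [0.0] * int(sample_rate * silence_seconds)
    await ws_server.send(
        msgpack.packb({"type": "Audio", "pcm": silent_pcm}, use_single_float=True)
    )
    deadline = asyncio.get_event_loop().time() + discard_seconds
    while asyncio.get_event_loop().time() < deadline:
        try:
            # Drop any transcription produced during warm-up.
            await asyncio.wait_for(ws_client.recv(), timeout=0.05)
        except asyncio.TimeoutError:
            await asyncio.sleep(0.05)
```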

Do you have any thoughts or ideas to diagnose or mitigate this?

@damienlaine mentioned this pull request on Jul 14, 2025