Kyutai #85
base: next
Conversation
…k + FINAL_TRANSCRIPT_DELAY (s) a new utterance is started
kyutai/README.md
Outdated
To build and run the **CUDA** version, first set the `MOSHI_SERVER` environment variable, then run the compose command:

```bash
export MOSHI_SERVER=moshi-server-cuda
docker compose --profile cuda up --build
```
Easy to run, but it needs CUDA 12.9 (maybe add a note).
kyutai/stt/processing/utils.py
Outdated
```python
sr = wav.rate
if sr != SAMPLE_RATE:
    gcd = np.gcd(sr, SAMPLE_RATE)
    data = resample_poly(data, SAMPLE_RATE // gcd, sr // gcd)
```
Note that in the Moshi paper, they mention using AudioSR for upsampling their training data from 8kHz to 24kHz:
https://pypi.org/project/audiosr/
It seems to be overkill here (it's a neural-network-based solution), but maybe there is a better option than polyphase filtering for upsampling.
Also, this resampling code is copy-pasted in several places. It should be factored out into a helper function here (in utils.py) that conforms any audio signal when needed; see the sketch below.
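A minimal sketch of such a helper, assuming `scipy.signal.resample_poly` remains the resampling backend and `SAMPLE_RATE` is the module-level target rate (the name `conform_audio` is hypothetical):

```python
import numpy as np
from scipy.signal import resample_poly

def conform_audio(data: np.ndarray, sr: int) -> np.ndarray:
    """Resample `data` from rate `sr` to SAMPLE_RATE (polyphase filtering)."""
    if sr == SAMPLE_RATE:
        return data
    # Reduce the up/down factors by their GCD to keep the polyphase filter small.
    gcd = np.gcd(sr, SAMPLE_RATE)
    return resample_poly(data, SAMPLE_RATE // gcd, sr // gcd)
```

Callers would then replace each inlined block with something like `data = conform_audio(data, wav.rate)`.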
Fixed by 12363ab
…backend with 2 seconds of blank audio while discarding the first second of the incoming audio stream (well, fixes nothing...)
kyutai/stt/processing/streaming.py
Outdated
```diff
 # Warm-up phase: send a single silent packet to start the stream,
 # then wait for 2 seconds while discarding any incoming audio.
 await ws_server.send(
-    msgpack.packb({"type": "Audio", "pcm": [0.0] * SAMPLE_RATE}, use_single_float=True)
+    msgpack.packb({"type": "Audio", "pcm": [0.0]}, use_single_float=True)
 )
 warmup_end_time = asyncio.get_event_loop().time() + 2.0
 while asyncio.get_event_loop().time() < warmup_end_time:
     try:
         # Discard incoming audio until warmup complete.
         _ = await asyncio.wait_for(ws_client.recv(), timeout=0.05)
     except asyncio.TimeoutError:
         # No message received, just wait
         await asyncio.sleep(0.05)
```
@Jeronymous
I’ve noticed that the moshi-server occasionally returns partially broken transcriptions, and I haven’t figured out the trigger yet. When it happens, the overall accuracy drops and all punctuation disappears, which breaks the “final transcript” utterance segmentation.
After the 1 second of blank audio required to warm up the model (per the moshi docs), I tried skipping the first second of incoming audio (adding a longer initial delay), but I’m not sure it makes any difference.
Do you have any thoughts or ideas on how to diagnose or mitigate this?
To investigate: I haven’t figured out how to properly send final/partial transcripts. The current behavior drifts from expectations (no finals are emitted). I tried using a timeout and detecting utterance ends from punctuation, but that seems wrong; a rough sketch of the timeout approach is below.
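For reference, a minimal sketch of punctuation-free segmentation: buffer incoming words and flush them as a final once no new word has arrived for `FINAL_TRANSCRIPT_DELAY` seconds. It assumes the server emits msgpack word messages shaped like `{"type": "Word", "text": ...}`; `emit_partial` and `emit_final` are hypothetical callbacks.

```python
import asyncio
import msgpack

FINAL_TRANSCRIPT_DELAY = 1.0  # seconds without a new word before closing an utterance

async def segment_utterances(ws_client, emit_partial, emit_final):
    """Emit a final once no new word arrives for FINAL_TRANSCRIPT_DELAY seconds,
    instead of relying on punctuation (which can disappear)."""
    words = []
    while True:
        try:
            msg = await asyncio.wait_for(ws_client.recv(), timeout=FINAL_TRANSCRIPT_DELAY)
        except asyncio.TimeoutError:
            if words:
                # Long enough without a new word: flush the utterance as a final.
                await emit_final(" ".join(words))
                words = []
            continue
        data = msgpack.unpackb(msg)
        if data.get("type") == "Word":  # assumed message shape
            words.append(data["text"])
            await emit_partial(" ".join(words))
```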