Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: strip leading/trailing silence from tts output #420

Merged
merged 1 commit into from
Jan 24, 2025

Conversation

ErikBjare
Copy link
Owner

@ErikBjare ErikBjare commented Jan 24, 2025

Kokoro generates ~300ms of silence at the start and end of every generation.

This strips that silence so that there aren't such long breaks between sentences.


Important

Adds strip_silence() to remove silence from TTS output in scripts/tts_server.py.

  • Behavior:
    • Adds strip_silence() function in scripts/tts_server.py to remove leading/trailing silence from audio data.
    • Integrates strip_silence() in text_to_speech() to process generated audio.
  • Functionality:
    • strip_silence() uses amplitude threshold and minimum silence duration to identify and remove silence.
    • Handles cases where no non-silent audio is found by returning original audio data.

This description was created by Ellipsis for d79606a. It will automatically update as commits are pushed.

@ErikBjare ErikBjare changed the title fix: added stripping of leading/trailing silence from tts output fix: strip leading/trailing silence from tts output Jan 24, 2025
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to d79606a in 1 minute and 15 seconds

More details
  • Looked at 61 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. scripts/tts_server.py:145
  • Draft comment:
    The min_silence_duration parameter is incorrectly used. It should not be subtracted from the start or added to the end index, as this could trim non-silent audio. Consider revising the logic to ensure only leading and trailing silence is removed.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The comment suggests this is incorrect, but I disagree. The function's purpose appears to be to trim silence while preserving some minimal amount of silence/context around the actual speech. Subtracting/adding min_silence_duration achieves this goal by keeping a buffer zone. The max(0, ...) and min(len(audio_data), ...) ensure we don't go out of bounds.
    I could be wrong about the intended behavior - maybe it should only trim exact silence. The parameter name "min_silence_duration" is a bit confusing if it's meant to be a buffer size.
    While the parameter name could be clearer, the implementation is sound and achieves a reasonable goal of preserving some context around the speech. The bounds checking prevents any issues with array access.
    The comment should be deleted as it incorrectly identifies an issue where there isn't one. The code is working as intended to preserve some silence around the speech content.

Workflow ID: wflow_O1ApIeEWPOLtre9u


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@codecov-commenter
Copy link

codecov-commenter commented Jan 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.62%. Comparing base (6ff8b94) to head (d79606a).

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #420      +/-   ##
==========================================
- Coverage   70.74%   70.62%   -0.12%     
==========================================
  Files          70       70              
  Lines        5862     5862              
==========================================
- Hits         4147     4140       -7     
- Misses       1715     1722       +7     
Flag Coverage Δ
anthropic/claude-3-haiku-20240307 68.83% <ø> (+0.01%) ⬆️
deepseek/deepseek-chat 64.41% <ø> (+0.10%) ⬆️
openai/gpt-4o-mini 67.80% <ø> (-0.23%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ErikBjare ErikBjare merged commit 96c54c4 into master Jan 24, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants