Feat: Implement Custom Pause Tags and Automatic Newline Pauses #283

mylukin · 2025-04-07T06:48:59Z

Description:

This PR introduces two significant enhancements to the TTS audio generation process:

- Custom Pause Tag Support: Users can now insert pauses of specific durations directly within the input text using the format [pause:Xs], where X is the duration in seconds (e.g., [pause:0.5s], [pause:2s]).
- Automatic Newline Pauses: A default pause of 0.5 seconds is automatically inserted after each newline character (\n) encountered in the input text, typically signifying paragraph breaks.

These features provide greater control over the rhythm and pacing of the generated speech.
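
For illustration only, here is a minimal sketch of how the two rules could be applied to an input string. The `split_pauses` helper, the `PAUSE_TAG` regex, and the simplified `(text, pause)` tuple shape are assumptions made for this example; the PR's actual PAUSE_TAG_PATTERN and smart_split are not reproduced here and also deal with token limits and phoneme markers, which this sketch ignores.

```python
import re
from typing import Iterator, Optional, Tuple

# Hypothetical pattern (not the PR's PAUSE_TAG_PATTERN): matches [pause:0.5s], [pause:2s], ...
PAUSE_TAG = re.compile(r"\[pause:(\d+(?:\.\d+)?)s\]")
NEWLINE_PAUSE_S = 0.5  # default pause after each newline, as described above

Chunk = Tuple[str, Optional[float]]


def _split_newlines(segment: str) -> Iterator[Chunk]:
    """Yield each non-empty line as text, plus a default pause per trailing newline."""
    for line in segment.splitlines(keepends=True):
        stripped = line.strip()
        if stripped:
            yield stripped, None
        if line.endswith("\n"):
            yield "", NEWLINE_PAUSE_S


def split_pauses(text: str) -> Iterator[Chunk]:
    """Yield (text, None) for speech chunks and ("", seconds) for pause chunks."""
    pos = 0
    for match in PAUSE_TAG.finditer(text):
        # Text before the tag may itself contain newlines.
        yield from _split_newlines(text[pos:match.start()])
        yield "", float(match.group(1))
        pos = match.end()
    yield from _split_newlines(text[pos:])


chunks = list(split_pauses("Hello [pause:1.5s] world.\nNext line."))
```

In this sketch every newline produces its own pause chunk, so a blank line between paragraphs yields two consecutive 0.5-second pauses; whether the real implementation merges them is not stated in this description.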

Key Changes:

api/src/services/text_processing/text_processor.py:
- Modified smart_split to recognize and parse [pause:Xs] tags using regex (PAUSE_TAG_PATTERN).
- Updated smart_split to yield tuples whose Optional[float] element is None for a text chunk and holds the duration in seconds for a pause chunk.
- Modified get_sentence_info and smart_split's chunking logic to preserve trailing newline characters (\n) from the original text segments, allowing the service layer to detect them.
- Improved handling of custom phoneme markers ([word](/ipa/)) during normalization and splitting, ensuring they are correctly restored before yielding text chunks.
- Fixed logic errors in smart_split related to handling oversized sentences and clauses, ensuring correct chunking around token limits.
api/src/services/tts_service.py:
- Modified generate_audio_stream to consume the new output format from smart_split.
- Added logic to generate silent audio chunks of the specified duration when a pause chunk is received.
- Added logic to check for trailing newlines (\n) in yielded text chunks and insert an additional 0.5-second silent chunk if found.
- Ensured accurate accumulation of the current_offset for timestamps, accounting for both generated speech and inserted silence.
- Refactored _get_voices_path for more robust parsing of combined voices with weights (e.g., voice1(0.7)+voice2(0.3)) using regex, improved weight normalization handling based on settings, and ensured correct device placement for combined tensors.
- Fixed NameError in generate_audio by removing the incorrect check for output_format.
- Corrected async handling and normalization logic within generate_from_phonemes.
- Improved error logging (using logger.exception) and added resource cleanup (writer.close()) in generate_audio_stream.
- Removed duplicated/incorrect logic blocks.
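
The silence insertion and offset bookkeeping described for generate_audio_stream can be sketched roughly as follows. This is a simplified stand-in, not the service code: `silence_pcm`, `duration_of`, `assemble`, the 24 kHz sample rate, and the 16-bit mono PCM format are all assumptions for the example, and `synthesize` is a placeholder callable standing in for the real TTS backend.

```python
from typing import Callable, Iterable, List, Optional, Tuple

SAMPLE_RATE = 24_000  # assumed output rate in Hz; not stated in this description
BYTES_PER_SAMPLE = 2  # assumed 16-bit mono PCM


def silence_pcm(duration_s: float, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Return duration_s seconds of silence as all-zero 16-bit samples."""
    n_samples = round(duration_s * sample_rate)
    return bytes(n_samples * BYTES_PER_SAMPLE)


def duration_of(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration in seconds of a 16-bit mono PCM buffer."""
    return len(pcm) / (BYTES_PER_SAMPLE * sample_rate)


def assemble(
    chunks: Iterable[Tuple[str, Optional[float]]],
    synthesize: Callable[[str], bytes],
) -> Tuple[bytes, List[Tuple[str, float, float]]]:
    """Concatenate speech and silence, tracking (label, start, end) in seconds."""
    audio = bytearray()
    spans: List[Tuple[str, float, float]] = []
    current_offset = 0.0
    for text, pause in chunks:
        # A pause chunk becomes silence; a text chunk goes to the backend.
        pcm = silence_pcm(pause) if pause is not None else synthesize(text)
        length = duration_of(pcm)
        spans.append((text or "<pause>", current_offset, current_offset + length))
        current_offset += length  # silence advances the offset just like speech
        audio.extend(pcm)
    return bytes(audio), spans
```

The point of the sketch is that inserted silence must advance the running offset in the same way as generated speech; otherwise every timestamp after a pause would be early by the pause duration.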

How to Test:

- Generate speech with inputs containing [pause:Xs] tags (e.g., "Hello [pause:1.5s] world."). Verify the pause duration in the output audio.
- Generate speech with inputs containing newline characters (e.g., "First paragraph.\n\nSecond paragraph."). Verify the 0.5s pauses between paragraphs.
- Test inputs combining both pause tags and newlines.
- Test edge cases like pauses at the beginning/end of text or multiple consecutive newlines.
- Test with various base and combined voices (including weighted combinations).
- Test with inputs containing custom phoneme markers alongside pauses or newlines.
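
One possible way to check the pause duration without listening to the file is to measure the longest silent run in a WAV response. The helper below is a sketch under stated assumptions: `longest_silence_s` is a hypothetical name, the input is assumed to be a 16-bit mono PCM WAV with a correct frame count in its header (streamed output may not have one), and real speech output may need a non-zero threshold.

```python
import array
import io
import sys
import wave


def longest_silence_s(wav_bytes: bytes, threshold: int = 0) -> float:
    """Seconds covered by the longest run of samples with |amplitude| <= threshold."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV samples are little-endian
    longest = run = 0
    for sample in samples:
        run = run + 1 if abs(sample) <= threshold else 0
        longest = max(longest, run)
    return longest / rate
```

For "Hello [pause:1.5s] world." the result would be expected to be at least 1.5, since natural silence around the words can extend the run.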

tanhv90 · 2025-04-07T11:56:41Z

@mylukin Could you please check the case of calling the dev/captioned_speech API with stream = false?

2025-04-07 18:54:14 kokoro-tts-1  | 11:54:14 AM | ERROR    | audio:198 | Error converting audio stream to mp3: I/O operation on closed file.
2025-04-07 18:54:14 kokoro-tts-1  | 11:54:14 AM | WARNING  | development:360 | Invalid request: Failed to convert audio stream to mp3: I/O operation on closed file.
2025-04-07 18:54:14 kokoro-tts-1  | INFO:     172.24.0.1:46290 - "POST /dev/captioned_speech HTTP/1.1" 400 Bad Request

Sample request body:

{
        "model": "kokoro",
        "input": "Hello [pause:5s] World",
        "voice": "af_alloy",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": false
}

mylukin · 2025-04-07T12:10:21Z

> @mylukin Could you please help check case calling API dev/captioned_speech with stream = false?

This fork solves this problem: https://github.com/EasyMetaAu/Kokoro-FastAPI

fireblade2534 · 2025-04-07T13:45:35Z

@mylukin Please fix the test issues

RBEmerson970 · 2025-04-07T13:55:13Z

@fireblade2534 - Is there a way to contact you off-list?

fireblade2534 · 2025-04-07T14:04:01Z

off list uhh I'm on the official kokoro discord under the same name

RBEmerson970 · 2025-04-07T14:08:59Z

> off list uhh I'm on the official kokoro discord under the same name

Got it

mylukin · 2025-04-07T15:27:00Z

All of this code is generated by Gemini 2.5 Pro 😂

fireblade2534 · 2025-04-07T15:33:37Z

I suspected as much based on your description

fireblade2534 · 2025-04-07T15:37:30Z

It's worse now @mylukin

fireblade2534 · 2025-04-07T16:21:11Z

@mylukin Still doesn't work

fireblade2534 · 2025-04-07T16:22:40Z

Please read CONTRIBUTING.md

fireblade2534 · 2025-04-07T16:43:38Z

@mylukin The tests still fail. Again please read CONTRIBUTING.md

…sertions in test_get_sentence_info_phenomoes to verify placeholder presence and token counts. Modified smart_split tests to unpack additional values and ensure proper handling of text and tokens. Improved clarity in test assertions for punctuation preservation.

mylukin · 2025-04-07T16:50:42Z

> @mylukin The tests still fail. Again please read CONTRIBUTING.md

Finally fixed it, but still can't rely entirely on AI, I have to write the code myself! 😂

fireblade2534 · 2025-04-07T17:21:38Z

Ok so um it doesn't work now I tried "Hello [pause:5s] world." with this code:

import requests

text = """Hello [pause:5s] world."""

# Output format requested from the API; also used as the file extension.
response_format = "wav"

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "model": "kokoro",
        "input": text,
        "voice": "af_heart+af_sky",
        "speed": 1.0,
        "response_format": response_format,
        "stream": False,
    },
)
# Fail loudly on HTTP errors instead of saving an error body as audio.
response.raise_for_status()

with open(f"output.{response_format}", "wb") as f:
    f.write(response.content)

It pronounces "pause 5s"

…ing. Updated filename regex to allow additional characters, enhanced silence chunk creation for AudioService, and ensured final audio output is consistently in int16 format. Removed premature writer closure in the finalization process, delegating responsibility to the caller.

mylukin · 2025-04-08T01:41:12Z

> Ok so um it doesn't work now I tried "Hello [pause:5s] world." with this code:
>
> It pronounces "pause 5s"

I fixed this.

curl -X 'POST' \
  'http://127.0.0.1:8880/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "kokoro",
  "input": "Hello [pause:5s] world.",
  "voice": "af_heart+af_sky",
  "response_format": "wav",
  "speed": 1,
  "stream": false
}'
Warning: Binary output can mess up your terminal. Use "--output -" to tell curl to output it to your terminal anyway, or consider "--output <FILE>" to save to a file.

…audio generation process. Integrate voice processing and normalization options, enhance error handling, and improve logging for better traceability. Update parameter validation and ensure proper handling of audio streaming with the new StreamingAudioWriter.

…ety. Removed unnecessary comments, adjusted text processing for legacy backends, and enhanced error handling during audio stream generation. Updated filename regex to restrict allowed characters for safer filenames.

…dling. Updated voice parsing to support combined voices with weights, enhanced normalization handling, and streamlined audio generation process. Improved logging for better debugging and removed unnecessary comments for clarity.

…d phoneme restoration. Enhanced regex for sentence splitting, added detailed docstring for clarity, and improved handling of trailing newlines and whitespace-only sentences. Updated tokenization logic to ensure robust error handling during processing.

…udio generation paths. Enhance error handling and logging during TTS service initialization and audio processing. Introduce normalization options and ensure proper handling of audio data encoding for both modes.

Copilot

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

Files not reviewed (1)

run-tests.sh: Language not supported

Comments suppressed due to low confidence (2)

api/tests/test_text_processor.py:53

The function name 'test_get_sentence_info_phenomoes' contains a spelling error. Consider renaming it to 'test_get_sentence_info_phonemes' for clarity.

def test_get_sentence_info_phenomoes():

api/src/services/text_processing/text_processor.py:93

The parameter name 'custom_phenomes_list' appears to be misspelled. Consider renaming it to 'custom_phonemes_list' to maintain consistency and clarity.

def get_sentence_info(text: str, custom_phenomes_list: Dict[str, str]) -> List[Tuple[str, List[int], int]]:

Introduced a new configuration option to enable custom phoneme IDs in the Settings class. Updated the TTS service to include a TODO comment regarding potential future restoration of custom phonemes. Enhanced logging to ensure clarity in audio chunk generation warnings. Adjusted smart_split function to allow independent control of ID replacement based on the new configuration.

…emes and normalization. Enhanced logging for better clarity during tokenization and sentence processing. Updated `get_sentence_info` and `smart_split` to ensure compatibility with both custom phoneme IDs and original tags, streamlining text processing logic for improved performance.

mylukin added 5 commits April 7, 2025 13:21

Enhance TTS service to handle pauses and trailing newlines in text processing. Updated smart_split to preserve newlines and added logic for generating silence chunks during pauses. Improved error handling and logging for audio processing.

b31f79d

Refactor TTS service and text processing to enhance handling of pauses, newlines, and custom phonemes. Updated smart_split to manage pause tags and improved error logging. Adjusted audio generation logic for better performance and clarity.

c0da571

add file for runpod

401315e

Refactor TTS service to improve pause handling and enhance audio generation logic. Updated smart_split for better newline management and refined error logging for clarity.

aaf22a7

Merge pull request #1 from mylukin/dev_20250407_add_runpod

869758a

Dev 20250407 add runpod

mylukin force-pushed the dev_20250407_add_pause branch 2 times, most recently from b56d85c to c0da571 April 7, 2025 08:36

mylukin added 5 commits April 7, 2025 16:54

Refactor TTS service to enhance filename safety, improve silence chunk handling, and ensure audio consistency. Updated normalization logic for combined audio output and refined error handling for writer closure.

2b2e24b

Enhance StreamingAudioWriter to disable ID3v2 tags for MP3 encoding, correct parameter name for sample rate, and conditionally set bit rate for applicable codecs. Improved error handling by using self.format in exceptions.

903f03d

Enhance StreamingAudioWriter to disable both ID3v2 tags and Xing VBR header for MP3 encoding. Added explicit flushing of the container to ensure all data is written to the buffer before closing.

54abafe

Refactor StreamingAudioWriter to remove explicit flush method, as closing the container now handles finalization. Updated logging to reflect changes in packet muxing.

6558647

Update runpod dependency to version 1.7.8 and correct file reference in Dockerfile to use pyproject-runpod.toml.

41cf641

Add test_input.json to Dockerfile for runpod setup

4bc9057

mylukin force-pushed the dev_20250407_add_pause branch from 2bd4265 to c0da571 April 7, 2025 16:29

mylukin added 5 commits April 8, 2025 10:08

Refactor tests in text_processor and tts_service to improve assertions and mock behaviors. Enhanced test coverage for sentence processing and voice path retrieval, ensuring proper handling of edge cases and expected outputs.

f4d5bec

Refactor StreamingAudioWriter to remove ID3v2 tag disabling from MP3 encoding comments and update logging for Xing VBR header. Enhance test assertions for MP3 header validation to include common MPEG frame sync pattern.

afba4fa

Enhance StreamingAudioWriter to support MP3 encoding without Xing VBR header and conditionally set bit rate for applicable formats. Improved error handling by using self.format in exceptions.

88b9349

Add stream property to test_input.json for audio configuration

e262de3

fireblade2534 requested changes Apr 8, 2025

mylukin added 6 commits April 8, 2025 10:51

Enhance TTS service initialization in RunPod handler by ensuring backend warmup and improving error handling. Introduce checks for backend readiness post-initialization and refine logging for better traceability during audio generation.

667c9c7

Refactor TTS service to simplify language code determination and improve logging. Removed unnecessary comments and streamlined voice path handling for clarity.

0d1dd66

Refactor smart_split function to enhance handling of custom phonemes and normalization. Improved logging for clarity and error handling, ensuring compatibility with both ID and original tag formats. Streamlined text processing logic for better performance and maintainability.

3e23fb0

mylukin requested a review from fireblade2534 April 8, 2025 04:03

mylukin added 2 commits April 8, 2025 13:32

Implement Base64 encoding for audio chunks in the RunPod handler to ensure proper data transmission. This change updates the audio_data field to yield encoded audio bytes instead of raw output.

e7de73d

remsky requested a review from Copilot April 8, 2025 14:14

Copilot AI reviewed Apr 8, 2025

fireblade2534 reviewed Apr 8, 2025

api/src/services/text_processing/text_processor.py (outdated, resolved)

api/src/services/tts_service.py (resolved)

mylukin added 2 commits April 11, 2025 22:55

Update docker-compose.yml to ensure container restarts automatically and maintain user permissions for UID 1001.

69b9bc7

mylukin requested a review from fireblade2534 April 30, 2025 04:27

mylukin added 2 commits May 3, 2025 20:05

Merge branch 'master' into dev_20250407_add_pause

8ee7fcd

mylukin closed this May 30, 2025
