Feat: Implement Custom Pause Tags and Automatic Newline Pauses #283

Closed
wants to merge 30 commits into from

Conversation

mylukin
Contributor

@mylukin mylukin commented Apr 7, 2025

Description:

This PR introduces two significant enhancements to the TTS audio generation process:

  1. Custom Pause Tag Support: Users can now insert pauses of specific durations directly within the input text using the format [pause:Xs], where X is the duration in seconds (e.g., [pause:0.5s], [pause:2s]).
  2. Automatic Newline Pauses: A default pause of 0.5 seconds is automatically inserted after each newline character (\n) encountered in the input text, typically signifying paragraph breaks.

These features provide greater control over the rhythm and pacing of the generated speech.
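As a rough illustration of the tag format described above, the following sketch splits input text into speech and pause chunks. The regex and the `split_on_pauses` helper are assumptions made for this example; the PR's actual `PAUSE_TAG_PATTERN` and the exact tuple shape yielded by `smart_split` are not shown in this thread and may differ.

```python
import re
from typing import Iterator, Optional, Tuple

# Hypothetical pattern for tags such as [pause:0.5s] or [pause:2s].
# The PR's real PAUSE_TAG_PATTERN may be defined differently.
PAUSE_TAG = re.compile(r"\[pause:(\d+(?:\.\d+)?)s\]", re.IGNORECASE)


def split_on_pauses(text: str) -> Iterator[Tuple[str, Optional[float]]]:
    """Yield (text, None) for speech and ("", seconds) for pauses."""
    position = 0
    for match in PAUSE_TAG.finditer(text):
        speech = text[position:match.start()]
        if speech.strip():
            # Only emit speech chunks that contain something to say.
            yield speech, None
        yield "", float(match.group(1))
        position = match.end()
    tail = text[position:]
    if tail.strip():
        yield tail, None


chunks = list(split_on_pauses("Hello [pause:1.5s] world."))
# chunks == [("Hello ", None), ("", 1.5), (" world.", None)]
```

Keeping the pause duration in a separate tuple slot means the tag text never reaches the synthesizer, which is the failure mode reported later in this thread (the tag being read aloud).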

Key Changes:

  • api/src/services/text_processing/text_processor.py:
    • Modified smart_split to recognize and parse [pause:Xs] tags using regex (PAUSE_TAG_PATTERN).
    • Updated smart_split to yield tuples indicating whether a chunk is text (Optional[float] is None) or a pause (Optional[float] contains duration).
    • Modified get_sentence_info and smart_split's chunking logic to preserve trailing newline characters (\n) from the original text segments, allowing the service layer to detect them.
    • Improved handling of custom phoneme markers ([word](/ipa/)) during normalization and splitting, ensuring they are correctly restored before yielding text chunks.
    • Fixed logic errors in smart_split related to handling oversized sentences and clauses, ensuring correct chunking around token limits.
  • api/src/services/tts_service.py:
    • Modified generate_audio_stream to consume the new output format from smart_split.
    • Added logic to generate silent audio chunks of the specified duration when a pause chunk is received.
    • Added logic to check for trailing newlines (\n) in yielded text chunks and insert an additional 0.5-second silent chunk if found.
    • Ensured accurate accumulation of the current_offset for timestamps, accounting for both generated speech and inserted silence.
    • Refactored _get_voices_path for more robust parsing of combined voices with weights (e.g., voice1(0.7)+voice2(0.3)) using regex, improved weight normalization handling based on settings, and ensured correct device placement for combined tensors.
    • Fixed NameError in generate_audio by removing the incorrect check for output_format.
    • Corrected async handling and normalization logic within generate_from_phonemes.
    • Improved error logging (using logger.exception) and added resource cleanup (writer.close()) in generate_audio_stream.
    • Removed duplicated/incorrect logic blocks.
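The service-layer changes listed above (silent chunks for pauses, an extra 0.5-second pause after a trailing newline, and offset accumulation) can be sketched roughly as follows. This is a minimal stand-alone illustration, not the PR's code: the 24 kHz sample rate, the 16-bit mono PCM layout, the `(text, pause)` chunk shape, and the `fake_synthesize` stand-in for the model are all assumptions.

```python
from typing import Iterable, List, Optional, Tuple

SAMPLE_RATE = 24_000    # assumed output sample rate in Hz
BYTES_PER_SAMPLE = 2    # assumed 16-bit mono PCM
NEWLINE_PAUSE_S = 0.5   # default newline pause from the PR description


def make_silence(seconds: float) -> bytes:
    """Return `seconds` of 16-bit mono PCM silence."""
    return b"\x00" * (round(seconds * SAMPLE_RATE) * BYTES_PER_SAMPLE)


def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for the TTS model: 0.1 s of audio per word."""
    return make_silence(0.1 * len(text.split()))


def assemble(
    chunks: Iterable[Tuple[str, Optional[float]]],
) -> Tuple[bytes, List[Tuple[str, float, float]]]:
    """Concatenate speech and silence, tracking start/end offsets in seconds."""
    audio = bytearray()
    timeline: List[Tuple[str, float, float]] = []
    offset = 0.0

    def append(label: str, pcm: bytes) -> None:
        nonlocal offset
        duration = len(pcm) / (SAMPLE_RATE * BYTES_PER_SAMPLE)
        timeline.append((label, offset, offset + duration))
        audio.extend(pcm)
        offset += duration  # silence advances the clock just like speech

    for text, pause in chunks:
        if pause is not None:
            append("pause", make_silence(pause))
            continue
        append("speech", fake_synthesize(text))
        if text.endswith("\n"):
            append("newline", make_silence(NEWLINE_PAUSE_S))
    return bytes(audio), timeline


audio, timeline = assemble([("Hello\n", None), ("", 1.0), ("big world", None)])
```

Because every appended segment, spoken or silent, moves `offset` forward, any timestamps computed for later speech stay aligned with the final audio.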

How to Test:

  1. Generate speech with inputs containing [pause:Xs] tags (e.g., "Hello [pause:1.5s] world."). Verify the pause duration in the output audio.
  2. Generate speech with inputs containing newline characters (e.g., "First paragraph.\n\nSecond paragraph."). Verify the 0.5s pauses between paragraphs.
  3. Test inputs combining both pause tags and newlines.
  4. Test edge cases like pauses at the beginning/end of text or multiple consecutive newlines.
  5. Test with various base and combined voices (including weighted combinations).
  6. Test with inputs containing custom phoneme markers alongside pauses or newlines.
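For the duration checks in the steps above, one possible approach is to compare the length of a WAV generated with a pause tag against one generated without it. The sketch below uses only the standard-library `wave` module; `make_test_wav` is a hypothetical helper that fabricates silent WAV data so the example runs without a server, and the 24 kHz rate is an assumption.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of in-memory WAV data, based on its frame count."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_test_wav(seconds: float, sample_rate: int = 24_000) -> bytes:
    """Hypothetical helper: build a silent 16-bit mono WAV of a given length."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * round(seconds * sample_rate))
    return buffer.getvalue()


# Stand-ins for two server responses: the same text without and with a tag.
baseline = wav_duration_seconds(make_test_wav(1.0))
with_pause = wav_duration_seconds(make_test_wav(2.5))
added_pause = with_pause - baseline  # should be close to the requested 1.5 s
```

In a real check, the two byte strings would come from the speech endpoint rather than `make_test_wav`. If the server streams WAV data with a placeholder length in the header, the frame count may be unreliable, so a non-streamed response is the safer input here.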

mylukin added 5 commits April 7, 2025 13:21
…ocessing. Updated smart_split to preserve newlines and added logic for generating silence chunks during pauses. Improved error handling and logging for audio processing.
…s, newlines, and custom phonemes. Updated smart_split to manage pause tags and improved error logging. Adjusted audio generation logic for better performance and clarity.
…ration logic. Updated smart_split for better newline management and refined error logging for clarity.
@mylukin mylukin force-pushed the dev_20250407_add_pause branch 2 times, most recently from b56d85c to c0da571 Compare April 7, 2025 08:36
mylukin added 5 commits April 7, 2025 16:54
…k handling, and ensure audio consistency. Updated normalization logic for combined audio output and refined error handling for writer closure.
…correct parameter name for sample rate, and conditionally set bit rate for applicable codecs. Improved error handling by using self.format in exceptions.
…header for MP3 encoding. Added explicit flushing of the container to ensure all data is written to the buffer before closing.
…sing the container now handles finalization. Updated logging to reflect changes in packet muxing.
@tanhv90

tanhv90 commented Apr 7, 2025

@mylukin Could you please check the case of calling the dev/captioned_speech API with stream = false?

2025-04-07 18:54:14 kokoro-tts-1  | 11:54:14 AM | ERROR    | audio:198 | Error converting audio stream to mp3: I/O operation on closed file.
2025-04-07 18:54:14 kokoro-tts-1  | 11:54:14 AM | WARNING  | development:360 | Invalid request: Failed to convert audio stream to mp3: I/O operation on closed file.
2025-04-07 18:54:14 kokoro-tts-1  | INFO:     172.24.0.1:46290 - "POST /dev/captioned_speech HTTP/1.1" 400 Bad Request

Sample request body:

{
        "model": "kokoro",
        "input": "Hello [pause:5s] World",
        "voice": "af_alloy",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": false
}
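The log above does not show the root cause, but the message "I/O operation on closed file." is what Python raises when a buffer is used after it has been closed. The sketch below reproduces that general failure mode with a plain `io.BytesIO`; it is an illustration under that assumption, not a trace of the service's actual writer.

```python
import io

# Failing order: close the buffer first, then try to read the encoded bytes.
buffer = io.BytesIO()
buffer.write(b"encoded audio bytes")
buffer.close()

try:
    buffer.getvalue()
except ValueError as error:
    message = str(error)  # the same wording as in the server log

# Working order: read everything out of the buffer before closing it.
buffer = io.BytesIO()
buffer.write(b"encoded audio bytes")
data = buffer.getvalue()
buffer.close()
```

If the non-streaming path finalizes the encoder and then asks it for remaining bytes, reordering those two steps (or leaving the close to the caller) is the usual remedy.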

@mylukin
Contributor Author

mylukin commented Apr 7, 2025

This fork solves this problem: https://github.com/EasyMetaAu/Kokoro-FastAPI

@fireblade2534
Collaborator

@mylukin Please fix the test issues

@RBEmerson970

@fireblade2534 - Is there a way to contact you off-list?

@fireblade2534
Collaborator

Off-list? Uh, I'm on the official Kokoro Discord under the same name.

@RBEmerson970

Off-list? Uh, I'm on the official Kokoro Discord under the same name.

Got it.

@mylukin
Contributor Author

mylukin commented Apr 7, 2025

All of this code is generated by Gemini 2.5 Pro 😂

@fireblade2534
Collaborator

I suspected as much based on your description

@fireblade2534
Collaborator

It's worse now, @mylukin.

@fireblade2534
Collaborator

@mylukin Still doesn't work

@fireblade2534
Collaborator

Please read CONTRIBUTING.md

@mylukin mylukin force-pushed the dev_20250407_add_pause branch from 2bd4265 to c0da571 Compare April 7, 2025 16:29
@fireblade2534
Collaborator

@mylukin The tests still fail. Again, please read CONTRIBUTING.md.

…sertions in test_get_sentence_info_phenomoes to verify placeholder presence and token counts. Modified smart_split tests to unpack additional values and ensure proper handling of text and tokens. Improved clarity in test assertions for punctuation preservation.
@mylukin
Contributor Author

mylukin commented Apr 7, 2025

@mylukin The tests still fail. Again, please read CONTRIBUTING.md.

Finally fixed it. I still can't rely entirely on AI, though; I have to write the code myself! 😂

@fireblade2534
Collaborator

OK, so it doesn't work now. I tried "Hello [pause:5s] world." with this code:

import requests

text = "Hello [pause:5s] world."
response_format = "wav"

# Request non-streamed audio from the local server.
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "model": "kokoro",
        "input": text,
        "voice": "af_heart+af_sky",
        "speed": 1.0,
        "response_format": response_format,
        "stream": False,
    },
)

# Save the returned audio so the pause can be checked by ear.
with open(f"output.{response_format}", "wb") as f:
    f.write(response.content)

It pronounces "pause 5s" instead of pausing.

…ing. Updated filename regex to allow additional characters, enhanced silence chunk creation for AudioService, and ensured final audio output is consistently in int16 format. Removed premature writer closure in the finalization process, delegating responsibility to the caller.
@mylukin
Contributor Author

mylukin commented Apr 8, 2025

I fixed this.

curl -X 'POST' \
  'http://127.0.0.1:8880/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "kokoro",
  "input": "Hello [pause:5s] world.",
  "voice": "af_heart+af_sky",
  "response_format": "wav",
  "speed": 1,
  "stream": false
}'
Warning: Binary output can mess up your terminal. Use "--output -" to tell curl to output it to your terminal anyway, or consider "--output <FILE>" to save to a file.

mylukin added 5 commits April 8, 2025 10:08
…s and mock behaviors. Enhanced test coverage for sentence processing and voice path retrieval, ensuring proper handling of edge cases and expected outputs.
…encoding comments and update logging for Xing VBR header. Enhance test assertions for MP3 header validation to include common MPEG frame sync pattern.
… header and conditionally set bit rate for applicable formats. Improved error handling by using self.format in exceptions.
…audio generation process. Integrate voice processing and normalization options, enhance error handling, and improve logging for better traceability. Update parameter validation and ensure proper handling of audio streaming with the new StreamingAudioWriter.
mylukin added 6 commits April 8, 2025 10:51
…end warmup and improving error handling. Introduce checks for backend readiness post-initialization and refine logging for better traceability during audio generation.
…ety. Removed unnecessary comments, adjusted text processing for legacy backends, and enhanced error handling during audio stream generation. Updated filename regex to restrict allowed characters for safer filenames.
…dling. Updated voice parsing to support combined voices with weights, enhanced normalization handling, and streamlined audio generation process. Improved logging for better debugging and removed unnecessary comments for clarity.
…ove logging. Removed unnecessary comments and streamlined voice path handling for clarity.
…d phoneme restoration. Enhanced regex for sentence splitting, added detailed docstring for clarity, and improved handling of trailing newlines and whitespace-only sentences. Updated tokenization logic to ensure robust error handling during processing.
…and normalization. Improved logging for clarity and error handling, ensuring compatibility with both ID and original tag formats. Streamlined text processing logic for better performance and maintainability.
@mylukin mylukin requested a review from fireblade2534 April 8, 2025 04:03
mylukin added 2 commits April 8, 2025 13:32
…nsure proper data transmission. This change updates the audio_data field to yield encoded audio bytes instead of raw output.
…udio generation paths. Enhance error handling and logging during TTS service initialization and audio processing. Introduce normalization options and ensure proper handling of audio data encoding for both modes.
@remsky remsky requested a review from Copilot April 8, 2025 14:14
Contributor

@Copilot Copilot AI left a comment

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

Files not reviewed (1)
  • run-tests.sh: Language not supported
Comments suppressed due to low confidence (2)

api/tests/test_text_processor.py:53

  • The function name 'test_get_sentence_info_phenomoes' contains a spelling error. Consider renaming it to 'test_get_sentence_info_phonemes' for clarity.
def test_get_sentence_info_phenomoes():

api/src/services/text_processing/text_processor.py:93

  • The parameter name 'custom_phenomes_list' appears to be misspelled. Consider renaming it to 'custom_phonemes_list' to maintain consistency and clarity.
def get_sentence_info(text: str, custom_phenomes_list: Dict[str, str]) -> List[Tuple[str, List[int], int]]:

mylukin added 2 commits April 11, 2025 22:55
Introduced a new configuration option to enable custom phoneme IDs in the Settings class. Updated the TTS service to include a TODO comment regarding potential future restoration of custom phonemes. Enhanced logging to ensure clarity in audio chunk generation warnings. Adjusted smart_split function to allow independent control of ID replacement based on the new configuration.
@mylukin mylukin requested a review from fireblade2534 April 30, 2025 04:27
mylukin added 2 commits May 3, 2025 20:05
…emes and normalization. Enhanced logging for better clarity during tokenization and sentence processing. Updated `get_sentence_info` and `smart_split` to ensure compatibility with both custom phoneme IDs and original tags, streamlining text processing logic for improved performance.
@mylukin mylukin closed this May 30, 2025
4 participants