
[Feature request] upgrade TTS Python package #4169

Open
hros opened this issue Mar 13, 2025 · 8 comments
Labels
feature request feature requests for making TTS better.

Comments

@hros

hros commented Mar 13, 2025

🚀 Feature Description
The TTS package does not install when the Python version is greater than 3.11.
This is problematic considering that the current Python version is 3.13, and 3.14 is around the corner.

Solution
Support Python version 3.13 (at least)

Alternative Solutions
If upgrading is not hassle-free, fork the project to provide a version that is compatible with newer Python versions.

Additional context

@hros hros added the feature request feature requests for making TTS better. label Mar 13, 2025
@eginhard
Contributor

We do maintain a fork at https://github.com/idiap/coqui-ai-TTS (available via pip install coqui-tts).

Python 3.12 is already supported. For 3.13 we have to wait for all dependencies to update, but you can subscribe to idiap#108 for updates.

@hros
Author

hros commented Mar 14, 2025

Thanks @eginhard
I switched to Python 3.12 and am waiting for support for version 3.13.

I searched the docs but did not find the Python API documentation for text-to-speech.
Where can I find the full docs? Specifically, the API functions related to voice and model selection, and voice synthesis.

@eginhard
Contributor

@hros
Author

hros commented Mar 14, 2025

Thanks @eginhard. I saw that page with the helpful examples, but I'm looking for API documentation.
For example, the function TTS.tts takes the following keyword arguments:

speaker, language, speaker_wav, emotion, split_sentences

Where can I find out what the emotion and split_sentences arguments do?

I am looking for this kind of detail for all relevant functions.

By the way, the tts function produces a wav file. Can I produce mp3 files? Or better yet, in-memory objects with mp3 audio data?

@eginhard
Contributor

You can use Python's built-in help() function in the REPL to display the docstring of any Python function:

>>> from TTS.api import TTS
>>> xtts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
>>> help(xtts.tts)
Help on method tts in module TTS.api:

tts(text: str, speaker: str | None = None, language: str | None = None, speaker_wav: str | None = None, emotion: str | None = None, split_sentences: bool = True, **kwargs) method of TTS.api.TTS instance
    Convert text to speech.
    
    Args:
        text (str):
            Input text to synthesize.
        speaker (str, optional):
            Speaker name for multi-speaker. You can check whether loaded model is multi-speaker by
            `tts.is_multi_speaker` and list speakers by `tts.speakers`. Defaults to None.
        language (str): Language of the text. If None, the default language of the speaker is used. Language is only
            supported by `XTTS` model.
        speaker_wav (str, optional):
            Path to a reference wav file to use for voice cloning with supporting models like YourTTS.
            Defaults to None.
        emotion (str, optional):
            Emotion to use for 🐸Coqui Studio models. If None, Studio models use "Neutral". Defaults to None.
        split_sentences (bool, optional):
            Split text into sentences, synthesize them separately and concatenate the file audio.
            Setting it False uses more VRAM and possibly hit model specific text length or VRAM limits. Only
            applicable to the 🐸TTS models. Defaults to True.
        kwargs (dict, optional):
            Additional arguments for the model.

Note that Coqui Studio doesn't exist anymore, so the emotion argument doesn't do anything.

By the way, the tts function produces a wav file.

No, tts() returns a NumPy array with the raw audio data, which you can save in any format. Using torchaudio.save() you could save to MP3, for example. The tts_to_file() function saves to a WAV file.
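To illustrate the idea of working with the raw samples in memory: assuming the synthesizer returns float samples in [-1, 1] and you know the output sample rate, you can pack them into an in-memory WAV using only the standard library (samples_to_wav_bytes is a hypothetical helper, and the sine tone below just stands in for real tts() output). For MP3 you would instead hand a tensor to torchaudio.save() as mentioned above.

```python
import io
import math
import struct
import wave

def samples_to_wav_bytes(samples, sample_rate):
    """Pack float samples in [-1, 1] into an in-memory 16-bit mono WAV."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit PCM
        wav.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wav.writeframes(frames)
    return buf.getvalue()

# Stand-in for real TTS output: a 10 ms 440 Hz tone.
rate = 22050
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 100)]
wav_bytes = samples_to_wav_bytes(tone, rate)
```

The resulting bytes object can be served directly (e.g. from a web endpoint) without touching the filesystem.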

@hros
Author

hros commented Mar 17, 2025

How can I retrieve the sample rate used by the speech synthesizer using the tts.tts(text, voice, lang) function?

UPDATE: I found it... tts.synthesizer.output_sample_rate

@Tortoise17

@eginhard Is there any default integrated script that can be used with a flag to generate emotion while cloning? And is there a list of emotions available inside the model? Please also guide me on whether I can increase the input text limit from 200 to 512 or so.

@eginhard
Copy link
Contributor

@Tortoise17 Your questions are completely unrelated to this issue. In the future, please open a new discussion or issue instead.

  • Documentation is available at https://coqui-tts.readthedocs.io/en/latest/
  • XTTS doesn't have built-in emotions, you can only specify a reference file with a certain emotion
  • You could change it in the code, but the model wasn't trained on longer inputs, so the quality will be bad then. Better split your input into sentences and process them separately.
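The last suggestion above can be sketched with a naive splitter; this is only an illustration (split_sentences here is a hypothetical helper, not the library's internal logic), breaking on sentence-ending punctuation so each chunk stays under the model's length limit:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = split_sentences("First sentence. Second one! A third?")
# Each chunk could then be passed to tts() separately and the
# resulting audio arrays concatenated.
```

A real splitter would also need to handle abbreviations and decimals, which is presumably why the library ships its own split_sentences option.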
