
[Feature request] upgrade TTS Python package #4169

Open
hros opened this issue Mar 13, 2025 · 8 comments
Labels
feature request feature requests for making TTS better.

Comments

@hros

hros commented Mar 13, 2025

🚀 Feature Description
The TTS package does not install when the Python version is greater than 3.11.
This is problematic considering that the current Python version is 3.13, and 3.14 is around the corner.

Solution
Support Python version 3.13 (at least)

Alternative Solutions
If upgrading is not hassle-free, fork the project to provide a version that is compatible with newer Python versions.

Additional context

@hros hros added the feature request feature requests for making TTS better. label Mar 13, 2025
@eginhard
Contributor

We do maintain a fork at https://github.com/idiap/coqui-ai-TTS (available via pip install coqui-tts).

Python 3.12 is already supported. For 3.13 we have to wait for all dependencies to update, but you can subscribe to idiap#108 for updates.

@hros
Author

hros commented Mar 14, 2025

Thanks @eginhard
I switched to Python 3.12 and am waiting for support for version 3.13.

I searched the docs but did not find the Python API documentation for text-to-speech.
Where can I find the full docs? Specifically, the API functions related to voice and model selection, and voice synthesis.

@eginhard
Contributor

@hros
Author

hros commented Mar 14, 2025

Thanks @eginhard. I saw that page with the helpful examples, but I'm looking for API documentation.
For example, the function TTS.tts takes the following keyword arguments:

speaker, language, speaker_wav, emotion, split_sentences

Where can I find out what the emotion and split_sentences arguments do?

I am looking for this kind of detail for all relevant functions.

By the way, the tts function produces a wav file. Can I produce mp3 files? Or better yet, in-memory objects with mp3 audio data?

@eginhard
Contributor

You can use Python's built-in help() function in the REPL to display the docstring of any Python function:

>>> from TTS.api import TTS
>>> xtts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
>>> help(xtts.tts)
Help on method tts in module TTS.api:

tts(text: str, speaker: str | None = None, language: str | None = None, speaker_wav: str | None = None, emotion: str | None = None, split_sentences: bool = True, **kwargs) method of TTS.api.TTS instance
    Convert text to speech.
    
    Args:
        text (str):
            Input text to synthesize.
        speaker (str, optional):
            Speaker name for multi-speaker. You can check whether loaded model is multi-speaker by
            `tts.is_multi_speaker` and list speakers by `tts.speakers`. Defaults to None.
        language (str): Language of the text. If None, the default language of the speaker is used. Language is only
            supported by `XTTS` model.
        speaker_wav (str, optional):
            Path to a reference wav file to use for voice cloning with supporting models like YourTTS.
            Defaults to None.
        emotion (str, optional):
            Emotion to use for 🐸Coqui Studio models. If None, Studio models use "Neutral". Defaults to None.
        split_sentences (bool, optional):
            Split text into sentences, synthesize them separately and concatenate the file audio.
            Setting it False uses more VRAM and possibly hit model specific text length or VRAM limits. Only
            applicable to the 🐸TTS models. Defaults to True.
        kwargs (dict, optional):
            Additional arguments for the model.

Note that Coqui Studio doesn't exist anymore, so the emotion argument doesn't do anything.

By the way, the tts function produces a wav file.

No, tts() returns a NumPy array with the raw audio data, which you can save in any format. Using torchaudio.save() you could save to MP3, for example. The tts_to_file() function saves to a WAV file.
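To illustrate the idea of working with the raw samples in memory: assuming the synthesizer returns float samples in [-1, 1] and you know the output sample rate, you can pack them into an in-memory WAV using only the standard library (samples_to_wav_bytes is a hypothetical helper, and the sine tone below just stands in for real tts() output). For MP3 you would instead hand a tensor to torchaudio.save() as mentioned above.

```python
import io
import math
import struct
import wave

def samples_to_wav_bytes(samples, sample_rate):
    """Pack float samples in [-1, 1] into an in-memory 16-bit mono WAV."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit PCM
        wav.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wav.writeframes(frames)
    return buf.getvalue()

# Stand-in for real TTS output: a 10 ms 440 Hz tone.
rate = 22050
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 100)]
wav_bytes = samples_to_wav_bytes(tone, rate)
```

The resulting bytes object can be served directly (e.g. from a web endpoint) without touching the filesystem.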

@hros
Author

hros commented Mar 17, 2025

How can I retrieve the sample rate used by the speech synthesizer using the tts.tts(text, voice, lang) function?

UPDATE: I found it... tts.synthesizer.output_sample_rate

@Tortoise17

@eginhard Is there any default integrated script that can be used with a flag to generate emotion while cloning? And is there a list of emotions available inside the model? Please also guide me on whether I can increase the input text limit from 200 to 512 or so.

@eginhard
Copy link
Contributor

@Tortoise17 Your questions are completely unrelated to this issue. In the future, please open a new discussion or issue instead.

  • Documentation is available at https://coqui-tts.readthedocs.io/en/latest/
  • XTTS doesn't have built-in emotions, you can only specify a reference file with a certain emotion
  • You could change it in the code, but the model wasn't trained on longer inputs, so the quality will be bad then. Better split your input into sentences and process them separately.
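The last suggestion above can be sketched with a naive splitter; this is only an illustration (split_sentences here is a hypothetical helper, not the library's internal logic), breaking on sentence-ending punctuation so each chunk stays under the model's length limit:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = split_sentences("First sentence. Second one! A third?")
# Each chunk could then be passed to tts() separately and the
# resulting audio arrays concatenated.
```

A real splitter would also need to handle abbreviations and decimals, which is presumably why the library ships its own split_sentences option.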
