(Try listening to this part; this is `male1.wav`!)

[Demo video: readme1.webm]
This is a humanized fork of Spark-TTS that adds an OpenAI-compatible text-to-speech API. This means you can actually use the amazing capability of Spark-TTS in real programs, instead of just a tech demo.

This fork can process an unlimited amount of text at once. It uses intelligent semantic text splitting to overcome the limitations of the original repo, and its quality-of-life improvements make it suitable as a hands-off TTS provider: it trims out excessive silence and retries failed segments.

Generally, this smooths over all the bumps usually encountered with local text-to-speech models.

These improvements allow Spark-TTS to be used as a seamless drop-in replacement for OpenAI's TTS API, which means it works with programs like SillyTavern and others. It can easily be used to read stories, roleplay, chat, and so on.

It takes about 8.5 GB of VRAM, small enough to fit on an RTX 3060, where it runs at about 2x realtime. Anything larger will have an easy time running this model at much faster than realtime.
This was built on the work of the original model authors, and especially on the foundational inference script by @AcTePuKc, without which this fork would not have been possible.

Right now, it is designed to be used locally by one user, processing one request at a time. Multi-user (concurrent request) generation works, but it is not as stable as it could be.

The program is a work in progress, but right now, it's pretty great!
There are two ways to get text-to-speech:
- Connect it to a program like SillyTavern and have it read what you want on demand.
- Have it read a `.txt` file and give you a `.wav` audio file to listen to or share.
Planned features:
- Add the ability to have multiple voice cloning sources loaded and selectable via the API.
- Switch to a more production-stable HTTP API hosting solution.
- Always split segments by paragraph.
- Add more diverse voice presets.
- Improve logic to reduce or entirely eliminate bad segments passed to the API when using unfamiliar voices.
- Reduce VRAM usage by using Spark-TTS-Fast.
- Better handling of apostrophes and quotes (fancy Unicode quotes break the model).
Clone the repo:

```bash
git clone https://github.com/dogarrowtype/Spark-TTS-cli-api
cd Spark-TTS-cli-api
```
- Needs `python3.11` or `python3.12`
Install with conda or pip:

Conda install method:
- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html

```bash
conda create -n sparktts -y python=3.12
conda activate sparktts
pip install -r requirements.txt
# If you are in mainland China, you can set the mirror as follows:
#pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```
Alternative pip install method:

```bash
python3.12 -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt
# If you are in mainland China, you can set the mirror as follows:
#pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```
Model Download
Download via huggingface-cli command:
```bash
huggingface-cli download SparkAudio/Spark-TTS-0.5B --local-dir "pretrained_models/Spark-TTS-0.5B"
```
`ffmpeg` is an optional requirement. It's nice to have, but not critical; if you can't get it, don't worry. It just converts the output WAV to MP3 for slightly less bandwidth usage.
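If you do want the MP3 conversion, a quick way to confirm `ffmpeg` is installed and on your PATH:

```bash
ffmpeg -version
```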
Step 1: Get miniconda
Install Miniconda (see https://docs.conda.io/en/latest/miniconda.html).
Step 2: Get the files
You have two options to get the files:
Option 1 (Recommended for Windows): Download ZIP manually
- Click "Code" > "Download ZIP" at the top of the page, then extract it to the same folder you've chosen to work in.
- You should have a "Spark-TTS-cli-api" folder inside your chosen install folder.
Option 2: Use Git (Optional, allows for easy updates)
- If you prefer using Git, install Git.
- Then open a Command Prompt (cmd) in the folder you would like to install in, and run:
```bash
git clone https://github.com/dogarrowtype/Spark-TTS-cli-api
cd Spark-TTS-cli-api
```
Step 3: Create a miniconda environment
Open Command Prompt (cmd) in your Spark-TTS-cli-api folder, and run:
```bash
conda create -n sparktts python=3.12 -y
conda activate sparktts
```
This creates and activates a Python 3.12 environment for Spark-TTS.
Step 4: Install dependencies
```bash
pip install -r requirements.txt
# Install the CUDA 12.4 build of PyTorch:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
```
Step 5: Get the model
Download via huggingface-cli command:
```bash
huggingface-cli download SparkAudio/Spark-TTS-0.5B --local-dir "pretrained_models/Spark-TTS-0.5B"
```
Step 6: Ready to start
You should be good to go. Run the command from "Basic Usage" below that fits your needs.
If you installed using conda/miniconda, remember to activate it first:
```bash
conda activate sparktts
```
To start the API server on port 9991:
```bash
python ./tts_server.py --prompt_audio "voice_samples/female2.wav"
```
Since there is no streaming version yet, it needs to generate the entire output before sending the result over the API. With SillyTavern, this isn't as much of a problem, since requests can be sent a paragraph at a time.
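Once the server is running, any OpenAI-style client should be able to request speech from it. As a quick smoke test, here is a minimal `curl` call; this assumes the server mirrors OpenAI's `/v1/audio/speech` route, and the `model` and `voice` values are illustrative placeholders (the actual clone voice is chosen by `--prompt_audio` at startup):

```bash
curl http://localhost:9991/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello, nice to meet you.", "voice": "female2"}' \
  --output speech.wav
```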
Alternatively, you can run a command line interface that will read text input:

```bash
python ./cli/tts_cli.py --text "Hello, nice to meet you." --prompt_audio "voice_samples/female2.wav"
```

Or have it read text from a file:

```bash
python ./cli/tts_cli.py --text_file "[path to your .txt file]" --prompt_audio "voice_samples/female2.wav"
```
This will save the output to the `examples/results` folder inside Spark-TTS-cli-api.

Try `--help` to see all options.
Several working (and consistent) voices are provided in `voice_samples`. To switch to another voice, simply change out `female2.wav` for another voice.

Personally, I recommend using `male1.wav` or `female2.wav`. Right now, those voices are hyper-consistent and the most fun to listen to (subjectively).
Provided voice descriptions
- female1.wav: melancholy, rainy
- female2.wav: confident, expressive
- female3.wav: anime VA, cutesy, kinda sad, amazed
- female4.wav: clean, light, a touch of mystery
- male1.wav: cute, super expressive, adventurous
- male2.wav: determined, focused
- male3.wav: surprised, best for short bursts
- male4.wav: deep, rich, noir, serious
- male5.wav: alternate version of male4.wav, expressive, deep, rich, engaging
| Female 1 | Female 2 |
|----------|----------|
| female1.webm | female2.webm |

| Female 3 | Female 4 |
|----------|----------|
| female3.webm | female4.webm |

| Male 1 | Male 2 |
|--------|--------|
| male1.webm | male2.webm |

| Male 3 | Male 4 |
|--------|--------|
| male3.webm | male4.webm |

| Male 5 | Male 6 (not yet found) |
|--------|------------------------|
| male5.webm | Not yet found. |
Voice cloning
To clone another voice, gather a 5 to 20 second `.wav` clip of clear audio of your speaker. Background noise/music should absolutely be avoided. A single long sentence can work great; a single instance of your voice speaking continuously is best. Multiple very short (less than 5 second) clips edited together do not seem to work well. Your clone source file should be in `.wav` format (but others may work).

In the "Basic Usage" commands above, switch out `female2.wav` for your new clone voice, and hope for the best. Follow the second-generation cloning advice below to get better results.
Cloned voices seem to work better when using a second-generation clone. Meaning: clone a voice, generate a few different samples, then use the result you like best as the clone source for future generations.

Simply put:
- Generate a few samples with your desired clone voice
- Find one that has the correct accent/sound
- Use that best result as the clone input to get consistent long generations

I use this phrase to create second-generation clone clips; it has good coverage of emotions and accents:

> In these days of skeptical enlightened civilization one cannot claim to have seen a sea serpent and expect anything but laughter. And this was even more incredible. The ship's commander, within a few hours, even doubted the evidence of his own senses. But from the sailors the tale leaked out. And a whole ship's company cannot be insane, or all similarly drunk at once.
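As a concrete sketch of that two-pass workflow (every file name here is a hypothetical placeholder; the actual output names under `examples/results` will differ):

```bash
# First pass: generate candidates using your raw recording as the clone
# source. Put the phrase above in second_gen_phrase.txt first.
python ./cli/tts_cli.py --text_file "second_gen_phrase.txt" --prompt_audio "my_raw_recording.wav"

# Listen to the takes saved under examples/results, pick the one with the
# right accent and sound, then use it as the clone source from then on:
python ./cli/tts_cli.py --text_file "story.txt" --prompt_audio "examples/results/best_take.wav"
```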
Random voices
If you don't specify a clone source, you must specify `--gender`, and you'll get a random voice for each chunk. This is useful for finding new voices to use as clone sources. However, for normal usage, I don't suggest leaving it on random, as the output will be chaotic.
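For example (the flag name comes from above, but the accepted values are an assumption; check `--help` for the exact choices):

```bash
python ./cli/tts_cli.py --text "Hello, nice to meet you." --gender female
```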
SillyTavern Settings
Change the SillyTavern TTS settings (in the "Extensions" menu at the top) to match this screenshot. "Narrate by paragraphs (when not streaming)" is very important to reduce latency and increase quality.
Open-Webui Settings
In the Open-Webui admin panel, go to "audio" and set the TTS provider to `http://localhost:9991/v1`. Set "response splitting" to "paragraphs" to reduce latency and increase quality.
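Before wiring Open-Webui up, you can confirm the server is reachable. `/v1/models` is part of the standard OpenAI API surface, but whether this fork implements it is an assumption; if it returns an error, use the speech call shown earlier instead:

```bash
curl -i http://localhost:9991/v1/models
```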
- Clone voices can be unstable across many segments if the voice is unfamiliar to the model (for example, switching between US English and UK English interpretations if it can't decide).
- (Mostly only an issue with custom clone voices) The segment retry can cause a bit of a traffic jam. If a segment has to be retried on a slow GPU, there may be a break in streaming output with SillyTavern. It will eventually continue, but there can be a pause while it catches up. This is especially confusing in SillyTavern, since there is no indicator that voice is being generated; it can appear as if the request has failed, even though it is still generating.
- Segments that go off the rails halfway through and then recover much later are not detected and retried.
Tested with an RTX 3060 12 GB GPU on Linux with Python 3.12, and an RTX 4090 GPU on Windows with Python 3.12. CPU is untested and unsupported.
Spark-TTS is an advanced text-to-speech system that uses the power of large language models (LLMs) for highly accurate and natural-sounding voice synthesis. It is designed to be efficient, flexible, and powerful for both research and production use.
- Simplicity and Efficiency: Built entirely on Qwen2.5, Spark-TTS eliminates the need for additional generation models like flow matching. Instead of relying on separate models to generate acoustic features, it directly reconstructs audio from the code predicted by the LLM. This approach streamlines the process, improving efficiency and reducing complexity.
- High-Quality Voice Cloning: Supports zero-shot voice cloning, which means it can replicate a speaker's voice even without specific training data for that voice. This is ideal for cross-lingual and code-switching scenarios, allowing for seamless transitions between languages and voices without requiring separate training for each one.
- Bilingual Support: Supports both Chinese and English, and is capable of zero-shot voice cloning for cross-lingual and code-switching scenarios, enabling the model to synthesize speech in multiple languages with high naturalness and accuracy.
- Controllable Speech Generation: Supports creating virtual speakers by adjusting parameters such as gender, pitch, and speaking rate.
This project provides a zero-shot voice cloning TTS model intended for academic research, educational purposes, and legitimate applications, such as personalized speech synthesis, assistive technologies, and linguistic research.
Please note:

- Do not use this model for unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or any illegal activities.
- Ensure compliance with local laws and regulations when using this model and uphold ethical standards.
- The developers assume no liability for any misuse of this model.
We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles in AI research and applications. If you have any concerns regarding ethics or misuse, please contact us.