A Rust implementation of inference for the Qwen3 Text-to-Speech (TTS) models, using the tch crate for PyTorch/libtorch bindings.
The tch crate (v0.23) requires PyTorch/libtorch 2.10.0. You can set it up in one of two ways.
Install PyTorch via pip and point the build system to it:

```bash
pip install torch==2.10.0
```

Then set the following environment variables before building:

```bash
export LIBTORCH_USE_PYTORCH=1
export LD_LIBRARY_PATH=$(python3 -c "import torch; print(torch.__path__[0])")/lib:$LD_LIBRARY_PATH
```

Alternatively, download a prebuilt libtorch archive for your platform.

Linux x86 CPU:

```bash
curl -LO https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.10.0%2Bcpu.zip
unzip libtorch-cxx11-abi-shared-with-deps-2.10.0+cpu.zip
```

Linux x86 with CUDA 12.8:

```bash
curl -LO https://download.pytorch.org/libtorch/cu128/libtorch-cxx11-abi-shared-with-deps-2.10.0%2Bcu128.zip
unzip libtorch-cxx11-abi-shared-with-deps-2.10.0+cu128.zip
```

macOS on Apple Silicon (M-series):

```bash
curl -LO https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.10.0.zip
unzip libtorch-macos-arm64-2.10.0.zip
```

Then set environment variables (add to ~/.zprofile or ~/.bash_profile to persist):

```bash
export LIBTORCH=$(pwd)/libtorch
export LD_LIBRARY_PATH=$(pwd)/libtorch/lib:$LD_LIBRARY_PATH
```
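Before building the full project, it can be worth confirming that the Rust toolchain finds and links libtorch correctly. A minimal sketch using only the tch crate:

```rust
use tch::{Device, Kind, Tensor};

fn main() {
    // A trivial tensor op proves libtorch was found and linked correctly.
    let a = Tensor::randn(&[2, 3], (Kind::Float, Device::Cpu));
    let b = Tensor::randn(&[3, 2], (Kind::Float, Device::Cpu));
    println!("matmul output shape: {:?}", a.matmul(&b).size());
    println!("CUDA available: {}", tch::Cuda::is_available());
}
```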
Download the Qwen3-TTS model weights. For speech synthesis with named speakers (CustomVoice 0.6B):

```bash
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --local-dir models/Qwen3-TTS-12Hz-0.6B-CustomVoice
```

For instruction-controlled voice synthesis (CustomVoice 1.7B, supports emotion/style control):

```bash
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local-dir models/Qwen3-TTS-12Hz-1.7B-CustomVoice
```

For voice cloning from reference audio (Base):

```bash
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-Base --local-dir models/Qwen3-TTS-12Hz-0.6B-Base
```

The Rust tokenizers crate requires a tokenizer.json file with the full tokenizer configuration (including the pre-tokenizer). Generate it from the Python tokenizer for each model you downloaded:
python3 -c "
from transformers import AutoTokenizer
for model in ['Qwen3-TTS-12Hz-0.6B-CustomVoice', 'Qwen3-TTS-12Hz-0.6B-Base', 'Qwen3-TTS-12Hz-1.7B-CustomVoice']:
path = f'models/{model}'
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
tok.backend_tokenizer.save(f'{path}/tokenizer.json')
print(f'Saved {path}/tokenizer.json')
"git clone https://github.com/juntao/qwen3_tts_rs.git
cd qwen3_tts_rs
cargo build --release
```
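To confirm that a generated tokenizer.json loads from Rust, it can be opened directly with the tokenizers crate. A standalone sketch, separate from the bundled examples:

```rust
use tokenizers::Tokenizer;

fn main() -> anyhow::Result<()> {
    // Load the tokenizer.json produced by the Python conversion step.
    let tokenizer = Tokenizer::from_file("models/Qwen3-TTS-12Hz-0.6B-CustomVoice/tokenizer.json")
        .map_err(|e| anyhow::anyhow!(e))?;
    let encoding = tokenizer
        .encode("Hello world, this is a test.", false)
        .map_err(|e| anyhow::anyhow!(e))?;
    println!("token ids: {:?}", encoding.get_ids());
    Ok(())
}
```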
Generate speech from text:

```bash
cargo run --example tts_demo --release -- <model_path> [text] [speaker] [language] [instruction]
```

Example:

```bash
cargo run --example tts_demo --release -- \
models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
"Hello world, this is a test." \
Vivian \
english
```

This generates an output.wav file with 24kHz audio.
The 1.7B CustomVoice model supports instruction control to modulate voice characteristics like emotion, speaking style, and pace:

```bash
cargo run --example tts_demo --release -- \
models/Qwen3-TTS-12Hz-1.7B-CustomVoice \
"Breaking news! There has been a major development." \
Vivian \
english \
"Speak in an urgent and excited voice"Other instruction examples:
"Speak happily and joyfully""Speak slowly and calmly""Speak in a whisper""Speak with a sad tone"
Clone a voice from a reference audio file using the Base model:

```bash
cargo run --example voice_clone_demo --release -- <model_path> <ref_audio> [text] [language] [ref_text]
```

Preparing reference audio: the reference audio must be a mono 24kHz 16-bit WAV file. Use ffmpeg to convert from other formats:

```bash
ffmpeg -i input.m4a -ac 1 -ar 24000 -sample_fmt s16 reference.wav
```

Sample reference audio files are provided in the reference_audio/ directory.
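If a reference clip fails to load, inspecting its header is a quick diagnostic. A sketch using the hound crate (already one of this project's dependencies) to check the mono/24kHz/16-bit requirements:

```rust
use hound::{SampleFormat, WavReader};

fn main() -> anyhow::Result<()> {
    let spec = WavReader::open("reference.wav")?.spec();
    // The voice cloning demo expects mono, 24 kHz, 16-bit integer PCM.
    assert_eq!(spec.channels, 1, "expected mono audio");
    assert_eq!(spec.sample_rate, 24_000, "expected a 24 kHz sample rate");
    assert_eq!(spec.bits_per_sample, 16, "expected 16-bit samples");
    assert_eq!(spec.sample_format, SampleFormat::Int, "expected integer PCM");
    println!("reference.wav looks good: {spec:?}");
    Ok(())
}
```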
X-vector only mode (no reference text):

```bash
cargo run --example voice_clone_demo --release -- \
models/Qwen3-TTS-12Hz-0.6B-Base \
reference_audio/trump.wav \
"Hello world, this is a voice cloning test." \
english
```

ICL mode (with reference text, higher quality):

```bash
cargo run --example voice_clone_demo --release -- \
models/Qwen3-TTS-12Hz-0.6B-Base \
reference_audio/trump.wav \
"Hello world, this is a voice cloning test." \
english \
"Angered and appalled millions of Americans across the political spectrum"When a reference text transcript is provided as the 5th argument, the demo uses ICL (In-Context Learning) mode, which encodes the reference audio into codec tokens and conditions generation on both the speaker embedding and the reference audio/text. This typically produces higher fidelity voice cloning compared to x-vector only mode.
Output is written to output_voice_clone.wav.
Verify that model weights load correctly:

```bash
cargo run --example test_weights --release -- models/Qwen3-TTS-12Hz-0.6B-CustomVoice
```

Supported speakers include Vivian, Serena, Ryan, Aiden, Uncle_fu, Ono_anna, Sohee, Eric, and Dylan. See the model's config.json spk_id field for the full list.
Supported languages: english, chinese, and others defined in the model config.
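To list the speakers available in a given model programmatically, the spk_id field can be read from config.json with serde_json. A sketch; the exact nesting of spk_id inside config.json may vary between models, so this searches the whole document:

```rust
use serde_json::Value;

fn main() -> anyhow::Result<()> {
    let raw = std::fs::read_to_string("models/Qwen3-TTS-12Hz-0.6B-CustomVoice/config.json")?;
    let config: Value = serde_json::from_str(&raw)?;

    // Recursively search for a "spk_id" object rather than assuming its
    // exact location in the config hierarchy.
    fn find_spk_id(v: &Value) -> Option<&Value> {
        match v {
            Value::Object(map) => map
                .get("spk_id")
                .or_else(|| map.values().find_map(find_spk_id)),
            Value::Array(items) => items.iter().find_map(find_spk_id),
            _ => None,
        }
    }

    if let Some(Value::Object(speakers)) = find_spk_id(&config) {
        for name in speakers.keys() {
            println!("{name}");
        }
    }
    Ok(())
}
```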
Add qwen3_tts as a dependency in your Cargo.toml:

```toml
[dependencies]
qwen3_tts = "0.1"
tch = "0.23"Generate speech using a predefined speaker voice:
use qwen3_tts::audio::write_wav_file;
use qwen3_tts::inference::TTSInference;
use std::path::Path;
use tch::Device;
fn main() -> anyhow::Result<()> {
    let inference = TTSInference::new(
        Path::new("models/Qwen3-TTS-12Hz-0.6B-CustomVoice"),
        Device::Cpu,
    )?;

    let (waveform, sample_rate) = inference.generate(
        "Hello! Welcome to the Qwen3 TTS system.",
        "Vivian",  // speaker name
        "english", // language
    )?;

    write_wav_file("output.wav", &waveform, sample_rate)?;
    Ok(())
}
```

You can customize generation parameters (temperature, top_k, max codes):

```rust
let (waveform, sample_rate) = inference.generate_with_params(
    "This uses custom generation parameters.",
    "Ryan",
    "english",
    0.8,  // temperature
    30,   // top_k
    4096, // max_codes
)?;
```
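The examples here run on CPU. With a CUDA-enabled libtorch build, the same API can target the GPU by swapping the device; a minimal sketch using tch's standard device helper:

```rust
use qwen3_tts::audio::write_wav_file;
use qwen3_tts::inference::TTSInference;
use std::path::Path;
use tch::Device;

fn main() -> anyhow::Result<()> {
    // cuda_if_available() selects Cuda(0) when libtorch detects a GPU,
    // and falls back to Cpu otherwise.
    let device = Device::cuda_if_available();
    let inference = TTSInference::new(
        Path::new("models/Qwen3-TTS-12Hz-0.6B-CustomVoice"),
        device,
    )?;
    let (waveform, sample_rate) =
        inference.generate("Same API on GPU or CPU.", "Vivian", "english")?;
    write_wav_file("output.wav", &waveform, sample_rate)?;
    Ok(())
}
```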
The 1.7B CustomVoice model supports instruction control to modulate voice characteristics:

```rust
use qwen3_tts::audio::write_wav_file;
use qwen3_tts::inference::TTSInference;
use std::path::Path;
use tch::Device;
fn main() -> anyhow::Result<()> {
    let inference = TTSInference::new(
        Path::new("models/Qwen3-TTS-12Hz-1.7B-CustomVoice"),
        Device::Cpu,
    )?;

    // Generate with instruction control
    let (waveform, sample_rate) = inference.generate_with_instruct(
        "Breaking news! There has been a major development.",
        "Vivian",  // speaker name
        "english", // language
        "Speak in an urgent and excited voice", // instruction
        0.9,  // temperature
        50,   // top_k
        2048, // max_codes
    )?;
    write_wav_file("urgent_news.wav", &waveform, sample_rate)?;

    // Without instruction (pass an empty string for a neutral voice)
    let (waveform, sample_rate) = inference.generate_with_instruct(
        "This is a normal announcement.",
        "Vivian",
        "english",
        "", // no instruction
        0.9,
        50,
        2048,
    )?;
    write_wav_file("neutral.wav", &waveform, sample_rate)?;
    Ok(())
}
```

Clone a voice from reference audio using the Base model with TTSInference and SpeakerEncoder:

```rust
use qwen3_tts::audio::{load_wav_file, resample, write_wav_file};
use qwen3_tts::inference::TTSInference;
use qwen3_tts::speaker_encoder::SpeakerEncoder;
use std::path::Path;
use tch::Device;
fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    let model_path = Path::new("models/Qwen3-TTS-12Hz-0.6B-Base");

    // Load model and speaker encoder
    let inference = TTSInference::new(model_path, device)?;
    let se_config = inference.config().speaker_encoder_config.clone();
    let speaker_encoder = SpeakerEncoder::load(inference.weights(), &se_config, device)?;

    // Load and resample reference audio to 24kHz
    let (samples, sr) = load_wav_file("reference.wav")?;
    let samples = if sr != se_config.sample_rate {
        resample(&samples, sr, se_config.sample_rate)?
    } else {
        samples
    };

    // Extract speaker embedding and generate speech
    let speaker_embedding = speaker_encoder.extract_embedding(&samples)?;
    let (waveform, sample_rate) = inference.generate_with_xvector(
        "Hello, this is a voice cloning test.",
        &speaker_embedding,
        "english",
        0.9,  // temperature
        50,   // top_k
        2048, // max_codes
    )?;
    write_wav_file("output.wav", &waveform, sample_rate)?;
    Ok(())
}
```
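Because the speaker embedding is independent of the text being synthesized, it can be extracted once and reused across several generations. A sketch continuing from the example above:

```rust
// Reuse one extracted embedding for several utterances; only the text changes.
for (i, text) in [
    "First sentence with the cloned voice.",
    "Second sentence with the same voice.",
]
.iter()
.enumerate()
{
    let (waveform, sample_rate) = inference.generate_with_xvector(
        text,
        &speaker_embedding,
        "english",
        0.9,  // temperature
        50,   // top_k
        2048, // max_codes
    )?;
    write_wav_file(&format!("clone_{i}.wav"), &waveform, sample_rate)?;
}
```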
For higher-fidelity voice cloning, use ICL mode, which conditions on both the speaker embedding and the reference audio codec tokens with their text transcription:

```rust
use qwen3_tts::audio::{load_wav_file, resample, write_wav_file};
use qwen3_tts::audio_encoder::AudioEncoder;
use qwen3_tts::inference::TTSInference;
use qwen3_tts::speaker_encoder::SpeakerEncoder;
use std::path::Path;
use tch::Device;
fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    let model_path = Path::new("models/Qwen3-TTS-12Hz-0.6B-Base");

    // Load model and speaker encoder
    let inference = TTSInference::new(model_path, device)?;
    let se_config = inference.config().speaker_encoder_config.clone();
    let speaker_encoder = SpeakerEncoder::load(inference.weights(), &se_config, device)?;

    // Load and resample reference audio to 24kHz
    let (samples, sr) = load_wav_file("reference.wav")?;
    let samples = if sr != se_config.sample_rate {
        resample(&samples, sr, se_config.sample_rate)?
    } else {
        samples
    };

    // Extract speaker embedding
    let speaker_embedding = speaker_encoder.extract_embedding(&samples)?;

    // Encode reference audio to codec tokens
    let speech_tokenizer_path = model_path
        .join("speech_tokenizer")
        .join("model.safetensors");
    let audio_encoder = AudioEncoder::load(&speech_tokenizer_path, device)?;
    let ref_codes = audio_encoder.encode(&samples)?;

    // Generate with ICL (reference text + codec tokens + speaker embedding)
    let (waveform, sample_rate) = inference.generate_with_icl(
        "Hello, this is a voice cloning test.",
        "This is the transcript of the reference audio.",
        &ref_codes,
        &speaker_embedding,
        "english",
        0.9,  // temperature
        50,   // top_k
        2048, // max_codes
    )?;
    write_wav_file("output.wav", &waveform, sample_rate)?;
    Ok(())
}
```

The audio module provides helpers for reading and writing audio:

```rust
use qwen3_tts::audio::{load_wav_file, resample, write_wav_file};
// Read a WAV file
let (samples, sample_rate) = load_wav_file("input.wav")?;
// Write WAV to file
write_wav_file("output.wav", &samples, sample_rate)?;
// Resample between sample rates
let resampled = resample(&samples, 44100, 24000)?;
```

The inference pipeline follows the Python reference implementation:
```
Text → Tokenizer → Dual-stream Embeddings → TalkerModel (28-layer Transformer)
                                                   ↓
                                          codec_head → Code 0
                                                   ↓
                          CodePredictor (5-layer Transformer) → Codes 1-15
                                                   ↓
                                     Vocoder → 24kHz Waveform
```
Key components:
| Module | Description |
|---|---|
| `inference.rs` | TalkerModel + CodePredictor with dual-stream (text + codec) input |
| `speaker_encoder.rs` | ECAPA-TDNN speaker encoder for extracting x-vectors from reference audio |
| `audio_encoder.rs` | Mimi/SEANet speech tokenizer encoder for encoding audio to codec tokens (ICL mode) |
| `layers.rs` | RMSNorm, RoPE, GQA attention with QK-Norm, SwiGLU MLP |
| `vocoder.rs` | ResidualVectorQuantizer, 8-layer pre-transformer, ConvNeXt upsampler, Snake decoder |
| `config.rs` | Model configuration deserialization from config.json |
| `audio.rs` | WAV file I/O and audio processing utilities |
- tch (0.23): PyTorch/libtorch bindings for tensor ops and neural network inference
- tokenizers (0.21): HuggingFace tokenizers for BPE text tokenization
- hound: WAV file reading/writing
- serde/serde_json: Configuration deserialization
- thiserror/anyhow: Error handling
Licensed under Apache-2.0.
Based on the original Python implementation by the Alibaba Qwen team.