Add CosyVoice3 Text-to-Speech Model Support #3281
Open: SpenserCai wants to merge 24 commits into huggingface:main from SpenserCai:cosyvoice_support
- Add CosyVoice3Frontend for extracting speech tokens, speaker embeddings, and mel spectrograms directly from audio
- Extend candle-onnx with new operators: Elu, Mod, Round, ReduceProd
- Add AvgPool1d support and fix ReduceSum negative axis handling
- Support asymmetric Conv2d strides and Pad constant mode
- Add optional 'onnx' feature to candle-transformers
…lar filter in the mel domain, matching `get_mel_banks` in `torchaudio`
2. Forward: change to frame-by-frame processing:
   - Extract original frames
   - DC offset removal (frame-by-frame)
   - Pre-emphasis (frame-by-frame, using replicate padding)
   - Povey window
   - FFT
   - Mel energy calculation
…align with the official implementation of CosyVoice3.
All CosyVoice3 functionality has been migrated with equivalent behavior and is ready for CI checks and code review. @ivarflakstad
Summary
This PR adds complete support for CosyVoice3, a state-of-the-art multilingual zero-shot text-to-speech model from FunAudioLLM. CosyVoice3 supports multiple synthesis modes including zero-shot voice cloning, cross-lingual synthesis, and instruction-guided generation.
Features
Model Architecture
CosyVoice3 consists of four main components:
Supported Modes
Key Implementation Details
Changes
candle-transformers
Added a new `cosyvoice` module with the following structure:
Total: ~7,200 lines of new Rust code
candle-onnx
Enhanced ONNX operator support required for CosyVoice3's ONNX models (campplus.onnx, speech_tokenizer_v3.onnx):
- `AveragePool`: `ceil_mode`
- `Conv`
- `Pad`: `constant` mode with custom value
- `GRU`
- `Elu`
- `Mod`
- `Round`
- `ReduceMean`
- `ReduceProd`

candle-examples
Added `cosyvoice3` example with:
- Weight conversion script (`convert_weights.py`)
- Random noise extraction script (`extract_rand_noise.py`) for exact reproducibility

Usage
Basic Usage
Weight Conversion
If you prefer to convert weights manually from the original PyTorch model:

    python candle-examples/examples/cosyvoice3/convert_weights.py \
      --input weights/Fun-CosyVoice3-0.5B-2512 \
      --output weights/CosyVoice3-0.5B-Candle

Random Noise Extraction (Optional)
For exact numerical reproducibility with the Python implementation, you can extract the pre-computed random noise:

    python candle-examples/examples/cosyvoice3/extract_rand_noise.py \
      --output weights/CosyVoice3-0.5B-Candle/rand_noise.safetensors

Note: This file is optional and already included in the pre-converted weights on Hugging Face. Without it, the Candle implementation generates its own deterministic noise using a fallback algorithm. The generated audio will be equally valid but may differ slightly from the Python implementation's output.
Programmatic Usage
Model Weights
Pre-converted Weights (Recommended)
Pre-converted weights are available on Hugging Face:
spensercai/CosyVoice3-0.5B-Candle

    # Download using huggingface-cli
    huggingface-cli download spensercai/CosyVoice3-0.5B-Candle --local-dir weights/CosyVoice3-0.5B-Candle

Manual Conversion
Alternatively, convert from the original Fun-CosyVoice3-0.5B-2512 using the provided script.
Performance
RTF < 1.0 means faster than real-time
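The real-time factor (RTF) is the synthesis time divided by the duration of the audio produced. A minimal sketch of that ratio follows; the function name and parameters are hypothetical and are not taken from the PR.

```rust
use std::time::Duration;

/// Real-time factor: time spent synthesizing divided by the duration of the
/// generated audio. Values below 1.0 mean synthesis ran faster than real time.
/// Hypothetical helper for illustration only.
pub fn real_time_factor(processing: Duration, num_samples: usize, sample_rate: u32) -> f64 {
    // Duration of the generated waveform in seconds.
    let audio_seconds = num_samples as f64 / sample_rate as f64;
    processing.as_secs_f64() / audio_seconds
}
```

For example, producing 4 seconds of audio in 2 seconds of wall-clock time corresponds to an RTF of 0.5.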
Technical Notes
Kaldi Fbank Compatibility
The mel spectrogram extraction follows Kaldi's fbank implementation with:
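The per-frame steps named in the commit notes (DC offset removal, pre-emphasis with replicate padding, Povey window) can be sketched as below. This is an illustration under assumptions, not the PR's code: the function names are hypothetical, the pre-emphasis coefficient is left to the caller (Kaldi's usual default is 0.97), and the FFT and mel filter bank stages are omitted.

```rust
use std::f32::consts::PI;

/// Povey window: a Hann window raised to the power 0.85.
pub fn povey_window(len: usize) -> Vec<f32> {
    // Guard the denominator so that len <= 1 does not divide by zero.
    let denom = (len.max(2) - 1) as f32;
    (0..len)
        .map(|n| (0.5 - 0.5 * (2.0 * PI * n as f32 / denom).cos()).powf(0.85))
        .collect()
}

/// Applies DC offset removal, pre-emphasis and windowing to one frame, in place.
pub fn preprocess_frame(frame: &mut [f32], window: &[f32], preemph: f32) {
    assert_eq!(frame.len(), window.len());
    if frame.is_empty() {
        return;
    }
    // 1. DC offset removal: subtract the mean of this frame only.
    let mean = frame.iter().sum::<f32>() / frame.len() as f32;
    for x in frame.iter_mut() {
        *x -= mean;
    }
    // 2. Pre-emphasis: y[n] = x[n] - c * x[n - 1].
    //    Iterating backwards keeps x[n - 1] unmodified when it is read.
    for n in (1..frame.len()).rev() {
        frame[n] -= preemph * frame[n - 1];
    }
    //    Replicate padding: the sample before the frame is taken to be x[0].
    frame[0] -= preemph * frame[0];
    // 3. Window.
    for (x, w) in frame.iter_mut().zip(window) {
        *x *= *w;
    }
}
```

Processing each frame independently, rather than filtering the whole waveform first, is what makes the edge handling in step 2 matter for matching a reference implementation.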
Flow Matching
Uses Conditional Flow Matching (CFM) with:
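Sampling from a flow matching model amounts to integrating a learned velocity field from t = 0 to t = 1. The sketch below uses a generic fixed-step Euler integrator, which is one common choice; the solver, time schedule and guidance actually used in the PR are not shown on this page, so every name here is an assumption.

```rust
/// Integrates dx/dt = velocity(x, t) from t = 0 to t = 1 with `steps` uniform
/// Euler steps. `velocity` stands in for the model's estimator network.
/// Hypothetical helper for illustration only.
pub fn euler_integrate<F>(mut x: Vec<f32>, steps: usize, mut velocity: F) -> Vec<f32>
where
    F: FnMut(&[f32], f32) -> Vec<f32>,
{
    let dt = 1.0 / steps as f32;
    for k in 0..steps {
        let t = k as f32 * dt;
        // Evaluate the velocity at the current state and time.
        let v = velocity(&x, t);
        // Euler update: x <- x + dt * v.
        for (xi, vi) in x.iter_mut().zip(&v) {
            *xi += dt * vi;
        }
    }
    x
}
```

In a real pipeline `x` would start as noise (for instance the pre-computed noise mentioned earlier) and `velocity` would call the network conditioned on speech tokens and the speaker embedding.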
HiFT Vocoder
Neural Source Filter based vocoder with:
Dependencies
- `candle-core`
- `candle-nn`
- `candle-onnx` (optional, for native feature extraction)
- `tokenizers` (for Qwen2 tokenization)
- `symphonia` (optional, for audio decoding)

References
Checklist
candle-onnx Enhancements (Detailed)
This PR also includes significant enhancements to `candle-onnx` to support the ONNX models used by CosyVoice3:
New Operators
GRU (Gated Recurrent Unit)
Full implementation of the ONNX GRU operator with:
Elu (Exponential Linear Unit)
    // f(x) = x if x > 0, alpha * (exp(x) - 1) if x <= 0

Mod (Modulo)
Supports both floor division (`fmod=0`) and truncated division (`fmod=1`) modes.
Round
Banker's rounding (round to nearest even).
ReduceProd
Product reduction along specified axes with keepdims support.
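The semantics of the new element-wise and reduction operators described above can be sketched on plain scalars and slices. These helpers are hypothetical and use only the standard library; they are meant to illustrate the ONNX definitions, not to mirror the candle-onnx code in this PR.

```rust
/// Elu: x for x > 0, alpha * (exp(x) - 1) otherwise.
pub fn elu(x: f32, alpha: f32) -> f32 {
    if x > 0.0 { x } else { alpha * (x.exp() - 1.0) }
}

/// Mod with fmod=0 on integers: the result takes the sign of the divisor.
/// Panics if `b` is zero.
pub fn mod_floor(a: i64, b: i64) -> i64 {
    ((a % b) + b) % b
}

/// Mod with fmod=1: truncated remainder, the result takes the sign of the
/// dividend. Rust's `%` on floats already behaves this way.
pub fn mod_trunc(a: f32, b: f32) -> f32 {
    a % b
}

/// Round half to even (banker's rounding).
pub fn round_half_even(x: f32) -> f32 {
    // `round` rounds halfway cases away from zero, so correct those cases only.
    let r = x.round();
    if (x - x.trunc()).abs() == 0.5 && r % 2.0 != 0.0 {
        r - x.signum()
    } else {
        r
    }
}

/// ReduceProd over the last axis of a row-major [rows, cols] buffer.
/// `keepdims` only changes the reported shape, not these values.
pub fn reduce_prod_last_axis(data: &[f32], cols: usize) -> Vec<f32> {
    assert!(cols > 0 && data.len() % cols == 0);
    data.chunks(cols)
        .map(|row| row.iter().product::<f32>())
        .collect()
}
```

The two Mod variants differ only when the operands have opposite signs, which is the case worth testing against a reference implementation.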
Enhanced Operators
AveragePool
- `ceil_mode` support for output size calculation
Conv
Pad
- `constant` mode with custom padding value
- `reflect` mode
ReduceMean
- `normalize_axis`

This implementation was developed and tested against the official Python implementation to ensure numerical accuracy.
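Three of the behaviors listed under Enhanced Operators are easy to state in isolation: the effect of `ceil_mode` on pooling output size, constant versus reflect padding, and negative axis handling. The sketch below is a 1-D, standard-library illustration with hypothetical names; it assumes dilation 1 and does not reproduce the PR's tensor code.

```rust
/// Output length of a 1-D pooling window with dilation 1.
/// Assumes `kernel <= input + pad_total` and `stride > 0`.
pub fn pool_output_len(input: usize, kernel: usize, stride: usize, pad_total: usize, ceil_mode: bool) -> usize {
    let span = input + pad_total - kernel;
    // ceil_mode rounds the number of strides up instead of down.
    let steps = if ceil_mode { (span + stride - 1) / stride } else { span / stride };
    steps + 1
}

/// Maps a possibly negative axis to an index in 0..rank, or None if out of range.
pub fn normalize_axis_index(axis: i64, rank: usize) -> Option<usize> {
    let rank = rank as i64;
    let a = if axis < 0 { axis + rank } else { axis };
    if (0..rank).contains(&a) { Some(a as usize) } else { None }
}

#[derive(Clone, Copy)]
pub enum PadMode {
    /// Fill with a caller-supplied value.
    Constant(f32),
    /// Mirror the input without repeating the edge sample.
    Reflect,
}

/// Pads a 1-D signal with `before` leading and `after` trailing samples.
pub fn pad_1d(x: &[f32], before: usize, after: usize, mode: PadMode) -> Vec<f32> {
    let n = x.len();
    let mut out = Vec::with_capacity(before + n + after);
    match mode {
        PadMode::Constant(v) => {
            out.extend(std::iter::repeat(v).take(before));
            out.extend_from_slice(x);
            out.extend(std::iter::repeat(v).take(after));
        }
        PadMode::Reflect => {
            assert!(before < n && after < n, "reflect padding must be smaller than the input");
            out.extend((1..=before).rev().map(|i| x[i]));
            out.extend_from_slice(x);
            out.extend((0..after).map(|j| x[n - 2 - j]));
        }
    }
    out
}
```

With an input of length 7, kernel 2 and stride 2, `ceil_mode` changes the output length from 3 to 4, which is the kind of off-by-one that shows up as a shape mismatch when loading an exported model.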
Development Notes
Verification Process
The implementation was verified against the official Python CosyVoice3 implementation through:
Key Implementation Challenges
- Kaldi Fbank Compatibility: Required careful implementation of Povey window, pre-emphasis, and mel filter bank to match torchaudio's `kaldi_fbank`
- ONNX GRU Weight Reordering: ONNX uses (z, r, h) gate order while candle-nn uses (r, z, n), requiring weight tensor reordering
- HiFT Weight Norm Fusion: The original PyTorch model uses `weight_norm` parametrization which needed to be fused during weight conversion
- Causal Convolution: Implemented proper causal padding for streaming-compatible inference
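Two of the challenges above, GRU gate reordering and weight norm fusion, reduce to small array manipulations. The sketch below works on flat row-major buffers with hypothetical function names; it assumes the norm is taken per output channel (PyTorch's default `dim=0`) and is not the conversion code from this PR.

```rust
/// Reorders a row-major [3 * hidden, cols] GRU matrix from the ONNX gate
/// order (z, r, h) to the (r, z, n) order by swapping the first two blocks.
/// For a bias vector use `cols = 1`.
pub fn reorder_gru_gates(onnx: &[f32], hidden: usize, cols: usize) -> Vec<f32> {
    assert_eq!(onnx.len(), 3 * hidden * cols);
    let block = hidden * cols;
    let (z, rest) = onnx.split_at(block);
    let (r, h) = rest.split_at(block);
    let mut out = Vec::with_capacity(onnx.len());
    out.extend_from_slice(r);
    out.extend_from_slice(z);
    out.extend_from_slice(h); // the candidate gate keeps its position
    out
}

/// Fuses a weight-norm pair into a plain weight:
/// w[o, :] = g[o] * v[o, :] / ||v[o, :]||_2
/// where `v` is row-major with `per_out` values per output channel.
pub fn fuse_weight_norm(v: &[f32], g: &[f32], per_out: usize) -> Vec<f32> {
    assert!(per_out > 0);
    assert_eq!(v.len(), g.len() * per_out);
    v.chunks(per_out)
        .zip(g)
        .flat_map(|(row, &gain)| {
            // L2 norm over everything belonging to this output channel.
            let norm = row.iter().map(|x| x * x).sum::<f32>().sqrt();
            row.iter().map(move |x| gain * x / norm)
        })
        .collect()
}
```

Note that the ONNX GRU bias tensor concatenates input and recurrent biases, so each half would need to be reordered separately before being handed to a layer that expects (r, z, n).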
Screenshots: Metal, CUDA