- ⚡️ Blazing-fast offline inference without complex dependencies
  - Built-in prefix caching, torch compilation, CUDA graphs, etc.
  - Achieves equal or even better inference speed without installing vLLM.
- 📖 Readable codebase with easier hackability
  - Clean & minimal implementation of CosyVoice in pure Python.
  - Lets you DIY any part, such as implementing `ras_sample` (repetition-aware sampling), which is hard to support in the standard vLLM library (see the sketch below).
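Below is a rough sketch of the repetition-aware sampling idea: draw a candidate token with top-k/top-p sampling, then fall back to random sampling from the full distribution when the candidate already dominates the recent decoding window. The hyper-parameters and exact fallback logic here are illustrative assumptions, not the actual FlashCosyVoice implementation.

```python
import torch

def ras_sample(logits: torch.Tensor, decoded_ids: list[int],
               top_k: int = 25, top_p: float = 0.8,
               win_size: int = 10, tau_r: float = 0.1) -> int:
    """Illustrative repetition-aware sampling; hyper-parameters are placeholders."""
    probs = torch.softmax(logits, dim=-1)        # logits: (vocab_size,)

    # 1) nucleus (top-p) sampling restricted to the top-k candidates
    topk_probs, topk_ids = probs.topk(top_k)     # already sorted descending
    cum = topk_probs.cumsum(dim=-1)
    nucleus = topk_probs * (cum - topk_probs < top_p)
    nucleus = nucleus / nucleus.sum()
    token = topk_ids[torch.multinomial(nucleus, 1)].item()

    # 2) if the candidate dominates the last `win_size` tokens,
    #    resample from the full distribution instead
    recent = decoded_ids[-win_size:]
    if recent and recent.count(token) / len(recent) >= tau_r:
        token = torch.multinomial(probs, 1).item()
    return token
```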
If you don't need any modification:

```sh
pip install git+https://github.com/xingchensong/FlashCosyVoice
```

If you want to do some DIY:

```sh
git clone https://github.com/xingchensong/FlashCosyVoice
cd FlashCosyVoice
pip install -e .
```

```sh
# you might need `sudo apt-get install git-lfs` before downloading this model
git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git
```

Prepare a `data.jsonl` file with one JSON object per line:

```
{"key": "uttid_1", "prompt_text": "你好,我是小明。", "text": "你好,我是小红。", "prompt_wav": "/mnt/data/audio/00000000.wav", "wav": "/mnt/data/audio_synthetic/uttid_1.wav"}
...
{"key": "uttid_2", "prompt_text": "你好,我是小红。", "text": "你好,我是小明。", "prompt_wav": "/mnt/data/audio/00000001.wav", "wav": "/mnt/data/audio_synthetic/uttid_2.wav"}
...
```

- `key` is the key of this sample.
- `prompt_text` is the text used for the prompt.
- `text` is the text to be generated.
- `prompt_wav` is the audio used for the prompt.
- `wav` is the path to save the generated audio (we highly recommend pre-defining the save path before running the script).
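For convenience, here is a minimal sketch of writing such a `data.jsonl` with the standard library (the entries and paths are placeholders; field names follow the format above):

```python
import json

# Placeholder entries; field names follow the format described above.
samples = [
    {
        "key": "uttid_1",
        "prompt_text": "你好,我是小明。",
        "text": "你好,我是小红。",
        "prompt_wav": "/mnt/data/audio/00000000.wav",
        "wav": "/mnt/data/audio_synthetic/uttid_1.wav",  # pre-defined save path
    },
]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```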
FlashCosyVoice is built for distributed offline batch inference. Enjoy the ultra speed!
```sh
# 1 node 1 gpu, try to decrease `batch_size_dataloader` & `batch_size_flow` if OOM
torchrun --nproc_per_node=1 --nnodes=1 \
--rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
`which flashcosyvoice` \
--model_path "path to your CosyVoice2-0.5B" \
--data_list "path to your data.jsonl" \
--batch_size_dataloader 1024 \
--batch_size_flow 32 \
--num_workers 8 \
--fp16_flow \
--prefetch 32
```

```sh
# 1 node 8 gpu, try to decrease `batch_size_dataloader` & `batch_size_flow` if OOM
torchrun --nproc_per_node=8 --nnodes=1 \
--rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
`which flashcosyvoice` \
--model_path "path to your CosyVoice2-0.5B" \
--data_list "path to your data.jsonl" \
--batch_size_dataloader 1024 \
--batch_size_flow 32 \
--num_workers 8 \
--fp16_flow \
  --prefetch 32
```

| Method | RTF | Relative speed up | WERs on CV3-Eval Zero-Shot Test Set (zh/en/ja/ko) |
|---|---|---|---|
| CV2 results reported in CV3 paper (Table 5) | N/A | N/A | 4.08 / 6.32 / 9.13 / 19.7 |
| `cosyvoice/pytorch_example.py` (fp32 llm + fp32 flow) | 0.487 | 1x | 4.17 / 6.25 / 12.85 / 8.25 |
| `cosyvoice/pytorch_example.py` (fp16 llm + fp16 flow) | 0.554 | 0.9x | 3.95 / 6.21 / 9.82 / 9.44 |
| `cosyvoice/vllm_example.py` (bf16 llm + fp16 flow) | 0.167 | ~3x | 4.28 / 6.49 / 8.40 / 9.59 |
| FlashCosyVoice (bf16 llm + fp32 flow) | 0.086 | ~5.6x | 3.89 / 6.11 / 8.50 / 10.27 |
| FlashCosyVoice (bf16 llm + fp16 flow) | 0.055 | ~9x | 3.88 / 6.11 / 8.47 / 10.33 |
Conclusion
- Compared with native PyTorch inference, FlashCosyVoice achieves a 9x speed-up while maintaining similar WERs.
- Due to the lack of `ras_sample`, `cosyvoice/vllm_example.py` is less stable and degrades WERs more than FlashCosyVoice in the common languages (Chinese/English).
Test Configuration
- Hardware: 1 * H800 (80GB)
- Model: CosyVoice2-0.5B
- Total Requests: 2000 (500 for each in [zh, en, ja, ko])
- seed: 1986
- text_frontend: False
- batch_size: 1024 (dataloader) / 32 (flow) for FlashCosyVoice and 1 for the others. (When testing FlashCosyVoice, we repeated the requests 100 times to obtain a more accurate RTF; see the sketch below.)
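For context, RTF (real-time factor) is the wall-clock synthesis time divided by the total duration of the generated audio (lower is better). A minimal sketch of computing it, assuming `soundfile` is installed (this is not the exact evaluation script used for the numbers above):

```python
import soundfile as sf

def compute_rtf(elapsed_seconds: float, wav_paths: list[str]) -> float:
    """RTF = synthesis wall-clock time / total duration of the generated audio."""
    total_audio_seconds = 0.0
    for path in wav_paths:
        info = sf.info(path)
        total_audio_seconds += info.frames / info.samplerate
    return elapsed_seconds / total_audio_seconds
```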
- Support online generation for RL training
- Support Ray for ultra-large-scale speech generation
- CosyVoice3 (when it is released, haha)
- This repo is heavily inspired by nano-vllm. We drew on the design of its LLM engine and made the necessary adaptations for CosyVoice.
- This repo also benefits from S3Tokenizer, CosyVoice, and CV3-Eval.
