Dataset and evaluation code of ISDrama (ACM-MM 2025): Immersive Spatial Drama Generation through Multimodal Prompting.
We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts. We then propose ISDrama, the first immersive spatial drama generation model through multimodal prompting.
We provide the evaluation code in this repository.
Moreover, you can visit our Demo Page for the audio samples of our dataset as well as the results of our model.
- 2025.07: We released the evaluation code of MRSDrama!
- 2025.07: We released the full dataset of MRSDrama!
- 2025.07: ISDrama is accepted by ACMMM 2025!
✅ Release the full dataset.
✅ Release the evaluation code.
🔲 Release the main model code.
- We develop MRSDrama, the first multimodal recorded spatial drama dataset, with accompanying videos, scripts, alignments, positions, and textual prompts.
- We introduce ISDrama, the first immersive spatial drama generation model through multimodal prompting. We design the Multimodal Pose Encoder to extract pose from multimodal inputs, while the Immersive Drama Transformer produces binaural speech.
- Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics.
Click to access our full dataset (videos, scripts, alignments, positions, and textual prompts) on Hugging Face for free! Hope our data is helpful for your research.
Please note that by using MRSDrama, you accept the terms of its license.
Our dataset is organized hierarchically.
Each top-level folder contains a set of dramas. Each folder contains a subfolder with cut WAV files, an MP4 video file, and a JSON file containing all annotation information. Additionally, the geometric_pose subdirectory stores NumPy (.npy) sequences—listener‑centric 3D positions, head-orientation quaternions, and radial velocities with respect to the left and right ears. These sequences are aligned at the frame level and generated with a 48 kHz sample rate and a 256-sample hop size.
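As a minimal sketch of the frame-level alignment described above (the constants come from the stated 48 kHz sample rate and 256-sample hop size; the helper name and the example clip length are illustrative assumptions, not part of the dataset spec), the expected number of pose frames for a clip can be computed as:

```python
import numpy as np

SAMPLE_RATE = 48_000   # dataset audio sample rate (Hz)
HOP_SIZE = 256         # audio samples per pose frame

def expected_pose_frames(num_audio_samples: int) -> int:
    """Number of geometric-pose frames expected for a clip of the given length."""
    return num_audio_samples // HOP_SIZE

# Hypothetical example: a 2-second clip at 48 kHz
num_samples = 2 * SAMPLE_RATE
print(expected_pose_frames(num_samples))  # 375

# A loaded .npy pose sequence (e.g., via np.load) should have this many
# frames along its time axis to be aligned with the corresponding audio.
```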
The evaluation process is based on the code and models of "BAT: Learning to Reason about Spatial Sounds with Large Language Models".
A suitable conda environment named `isdrama_eva` can be created and activated with:
conda env create -f environment.yml
bash timm_patch/patch.sh
conda activate isdrama_eva
Please download the finetuned BAT encoder checkpoint and place it at:
./evaluation/ckpt/finetuned.pth
Make sure the path exists (create the `ckpt` directory if necessary).
For evaluation, you must prepare paired ground‑truth audio and generated audio. Place them respectively in:
./evaluation/data/gt
./evaluation/data/infer
The expected directory layout is:
.
├── gt
│   ├── 0000.wav
│   ├── 0001.wav
│   ├── 0002.wav
│   └── 0003.wav
└── infer
    ├── 0000.wav
    ├── 0001.wav
    ├── 0002.wav
    └── 0003.wav
Important:
- The files inside `gt` and `infer` must correspond one-to-one.
- Filenames and counts must match exactly (e.g., `gt/0002.wav` pairs with `infer/0002.wav`).
- Ensure sampling rates and channel configurations are consistent if required by downstream metrics.
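The one-to-one pairing requirement can be verified before running the pipeline. A small sketch (the function name and error format are illustrative, not part of the repository's tooling):

```python
from pathlib import Path

def check_pairing(gt_dir: str, infer_dir: str) -> int:
    """Verify that gt and infer contain exactly the same WAV filenames.

    Returns the number of matched pairs, or raises ValueError on mismatch.
    """
    gt_names = {p.name for p in Path(gt_dir).glob("*.wav")}
    infer_names = {p.name for p in Path(infer_dir).glob("*.wav")}
    missing = sorted(gt_names - infer_names)   # in gt but absent from infer
    extra = sorted(infer_names - gt_names)     # in infer but absent from gt
    if missing or extra:
        raise ValueError(f"Unpaired files; missing in infer: {missing}, "
                         f"extra in infer: {extra}")
    return len(gt_names)

# Example usage:
# n = check_pairing("./evaluation/data/gt", "./evaluation/data/infer")
# print(f"OK: {n} matched pairs")
```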
- Character Error Rate (CER): Assesses transcript/content accuracy.
- Cosine Similarity (SIM): Measures speaker timbre similarity between the generated audio and the prompt/reference audio (e.g., via speaker embeddings).
- F0 Frame Error (FFE): Evaluates prosody fidelity by comparing voiced/unvoiced decisions and pitch (F0) frames.
- IPD MAE: Mean Absolute Error between ground‑truth and generated Interaural Phase Differences.
- ILD MAE: Mean Absolute Error between ground‑truth and generated Interaural Level Differences.
- Angle Cosine Similarity (ANG Cos): Cosine similarity between ground‑truth and generated direction (azimuth / elevation) angle embeddings.
- Distance Cosine Similarity (Dis Cos): Cosine similarity between ground‑truth and generated distance embeddings.
Note: Cosine‑based spatial scores are in the range [-1, 1], with higher values indicating closer alignment of spatial embeddings.
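The exact feature extraction lives in the evaluation scripts; as a hedged NumPy sketch of the spatial metrics above (function names, the dB formulation of ILD, and the vector-level cosine are assumptions for illustration), IPD/ILD MAE and embedding cosine similarity could be computed like this:

```python
import numpy as np

def ipd_ild(stft_left: np.ndarray, stft_right: np.ndarray, eps: float = 1e-8):
    """Interaural phase difference (radians) and level difference (dB)
    from complex left/right STFT matrices of equal shape."""
    ipd = np.angle(stft_left * np.conj(stft_right))  # wrapped to [-pi, pi]
    ild = 20.0 * np.log10((np.abs(stft_left) + eps) / (np.abs(stft_right) + eps))
    return ipd, ild

def mae(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute error between two feature arrays of equal shape."""
    return float(np.mean(np.abs(a - b)))

def cos_sim(u: np.ndarray, v: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))
```

Note that because IPD is a wrapped phase quantity, a production implementation may additionally wrap the IPD difference before averaging.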
Run the following script to perform the evaluation pipeline:
cd evaluation
bash ./evaluate/eval.sh
The script `evaluate/eval.sh` executes the following three stages:
1. Extract angle and distance embeddings using the BAT encoder.
2. Extract IPD & ILD features from paired ground‑truth and generated stereo audio.
3. Compute metrics: MAE (for IPD / ILD) and cosine similarities (for angle and distance).
Ensure that ground‑truth and generated audio files are correctly paired and preprocessed before running the script.
If you find this code useful in your research, please cite our work:
@article{zhang2025isdrama,
title={ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting},
author={Zhang, Yu and Guo, Wenxiang and Pan, Changhao and Zhu, Zhiyuan and Jin, Tao and Zhao, Zhou},
journal={arXiv preprint arXiv:2504.20630},
year={2025}
}
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.