Dataset and evaluation code of ISDrama (ACM-MM 2025): Immersive Spatial Drama Generation through Multimodal Prompting.
We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts. We then propose ISDrama, the first immersive spatial drama generation model through multimodal prompting.
We provide the evaluation code in this repository.
Moreover, you can visit our Demo Page for the audio samples of our dataset as well as the results of our model.
- 2025.07: We released the evaluation code of MRSDrama!
- 2025.07: We released the full dataset of MRSDrama!
- 2025.07: ISDrama is accepted by ACMMM 2025!
✅ Release the full dataset.
✅ Release the evaluation code.
🔲 Release the main model code.
- We develop MRSDrama, the first multimodal recorded spatial drama dataset, with accompanying videos, scripts, alignments, positions, and textual prompts.
- We introduce ISDrama, the first immersive spatial drama generation model through multimodal prompting. We design the Multimodal Pose Encoder to extract pose from multimodal inputs, while the Immersive Drama Transformer produces binaural speech.
- Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics.
Click to access our full dataset (videos, scripts, alignments, positions, and textual prompts) on Hugging Face for free! Hope our data is helpful for your research.
Please note that by using MRSDrama, you accept the terms of its license.
Our dataset is organized hierarchically.
Each top-level folder contains a set of dramas. Each folder contains a subfolder with cut WAV files, an MP4 video file, and a JSON file containing all annotation information. Additionally, the geometric_pose subdirectory stores NumPy (.npy) sequences—listener‑centric 3D positions, head-orientation quaternions, and radial velocities with respect to the left and right ears. These sequences are aligned at the frame level and generated with a 48 kHz sample rate and a 256-sample hop size.
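As a minimal sketch of the frame-level alignment described above (the constants come from the stated 48 kHz sample rate and 256-sample hop size; the helper name and the example clip length are illustrative assumptions, not part of the dataset spec), the expected number of pose frames for a clip can be computed as:

```python
import numpy as np

SAMPLE_RATE = 48_000   # dataset audio sample rate (Hz)
HOP_SIZE = 256         # audio samples per pose frame

def expected_pose_frames(num_audio_samples: int) -> int:
    """Number of geometric-pose frames expected for a clip of the given length."""
    return num_audio_samples // HOP_SIZE

# Hypothetical example: a 2-second clip at 48 kHz
num_samples = 2 * SAMPLE_RATE
print(expected_pose_frames(num_samples))  # 375

# A loaded .npy pose sequence (e.g., via np.load) should have this many
# frames along its time axis to be aligned with the corresponding audio.
```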
The evaluation process is based on the code and models of "BAT: Learning to Reason about Spatial Sounds with Large Language Models".
A suitable conda environment named `isdrama_eva` can be created and activated with:
conda env create -f environment.yml
bash timm_patch/patch.sh
conda activate isdrama_eva
Please download the finetuned BAT encoder checkpoint and place it at:
./evaluation/ckpt/finetuned.pth
Make sure the path exists (create the `ckpt` directory if necessary).
For evaluation, you must prepare paired ground‑truth audio and generated audio. Place them respectively in:
./evaluation/data/gt
./evaluation/data/infer
The expected directory layout is:
.
├── gt
│   ├── 0000.wav
│   ├── 0001.wav
│   ├── 0002.wav
│   └── 0003.wav
└── infer
    ├── 0000.wav
    ├── 0001.wav
    ├── 0002.wav
    └── 0003.wav
Important:
- The files inside `gt` and `infer` must correspond one-to-one.
- Filenames and counts must match exactly (e.g., `gt/0002.wav` pairs with `infer/0002.wav`).
- Ensure sampling rates and channel configurations are consistent if required by downstream metrics.
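The one-to-one pairing requirement can be verified before running the pipeline. A small sketch (the function name and error format are illustrative, not part of the repository's tooling):

```python
from pathlib import Path

def check_pairing(gt_dir: str, infer_dir: str) -> int:
    """Verify that gt and infer contain exactly the same WAV filenames.

    Returns the number of matched pairs, or raises ValueError on mismatch.
    """
    gt_names = {p.name for p in Path(gt_dir).glob("*.wav")}
    infer_names = {p.name for p in Path(infer_dir).glob("*.wav")}
    missing = sorted(gt_names - infer_names)   # in gt but absent from infer
    extra = sorted(infer_names - gt_names)     # in infer but absent from gt
    if missing or extra:
        raise ValueError(f"Unpaired files; missing in infer: {missing}, "
                         f"extra in infer: {extra}")
    return len(gt_names)

# Example usage:
# n = check_pairing("./evaluation/data/gt", "./evaluation/data/infer")
# print(f"OK: {n} matched pairs")
```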
- Character Error Rate (CER): Assesses transcript/content accuracy.
- Cosine Similarity (SIM): Measures speaker timbre similarity between the generated audio and the prompt/reference audio (e.g., via speaker embeddings).
- F0 Frame Error (FFE): Evaluates prosody fidelity by comparing voiced/unvoiced decisions and pitch (F0) frames.
- IPD MAE: Mean Absolute Error between ground‑truth and generated Interaural Phase Differences.
- ILD MAE: Mean Absolute Error between ground‑truth and generated Interaural Level Differences.
- Angle Cosine Similarity (ANG Cos): Cosine similarity between ground‑truth and generated direction (azimuth / elevation) angle embeddings.
- Distance Cosine Similarity (Dis Cos): Cosine similarity between ground‑truth and generated distance embeddings.
Note: Cosine‑based spatial scores are in the range [-1, 1], with higher values indicating closer alignment of spatial embeddings.
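The exact feature extraction lives in the evaluation scripts; as a hedged NumPy sketch of the spatial metrics above (function names, the dB formulation of ILD, and the vector-level cosine are assumptions for illustration), IPD/ILD MAE and embedding cosine similarity could be computed like this:

```python
import numpy as np

def ipd_ild(stft_left: np.ndarray, stft_right: np.ndarray, eps: float = 1e-8):
    """Interaural phase difference (radians) and level difference (dB)
    from complex left/right STFT matrices of equal shape."""
    ipd = np.angle(stft_left * np.conj(stft_right))  # wrapped to [-pi, pi]
    ild = 20.0 * np.log10((np.abs(stft_left) + eps) / (np.abs(stft_right) + eps))
    return ipd, ild

def mae(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute error between two feature arrays of equal shape."""
    return float(np.mean(np.abs(a - b)))

def cos_sim(u: np.ndarray, v: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))
```

Note that because IPD is a wrapped phase quantity, a production implementation may additionally wrap the IPD difference before averaging.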
Run the following script to perform the evaluation pipeline:
cd evaluation
bash ./evaluate/eval.sh
The script `evaluate/eval.sh` executes the following three stages:
1. Extract angle and distance embeddings using the BAT encoder.
2. Extract IPD & ILD features from paired ground‑truth and generated stereo audio.
3. Compute metrics: MAE (for IPD / ILD) and cosine similarities (for angle and distance).
Ensure that ground‑truth and generated audio files are correctly paired and preprocessed before running the script.
If you find this code useful in your research, please cite our work:
@article{zhang2025isdrama,
title={ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting},
author={Zhang, Yu and Guo, Wenxiang and Pan, Changhao and Zhu, Zhiyuan and Jin, Tao and Zhao, Zhou},
journal={arXiv preprint arXiv:2504.20630},
year={2025}
}
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.