Dataset and evaluation code of ISDrama(ACM-MM 2025): Immersive Spatial Drama Generation through Multimodal Prompting

AaronZ345/ISDrama

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting

Yu Zhang*, Wenxiang Guo*, Changhao Pan*, Zhiyuan Zhu*, Tao Jin, Zhou Zhao | Zhejiang University

Dataset and evaluation code of ISDrama (ACM-MM 2025): Immersive Spatial Drama Generation through Multimodal Prompting.

We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audios, scripts, videos, geometric poses, and textual prompts. Then, we propose ISDrama, the first immersive spatial drama generation model through multimodal prompting.

We provide the evaluation code in this repository.

Moreover, you can visit our Demo Page for the audio samples of our dataset as well as the results of our model.

Updates

  • 2025.07: We released the evaluation code of MRSDrama!
  • 2025.07: We released the full dataset of MRSDrama!
  • 2025.07: ISDrama was accepted by ACM-MM 2025!

TODO List

✅ Release the full dataset.

✅ Release the evaluation code.

🔲 Release the main model code.

Key Features

  • We develop MRSDrama, the first multimodal recorded spatial drama dataset, accompanied by videos, scripts, alignments, positions, and textual prompts.
  • We introduce ISDrama, the first immersive spatial drama generation model through multimodal prompting. We design the Multimodal Pose Encoder to extract pose from multimodal inputs, while the Immersive Drama Transformer produces binaural speech.
  • Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics.

Where to download

Our full dataset (videos, scripts, alignments, positions, and textual prompts) is freely available on Hugging Face. We hope it is helpful for your research.

Please note that by using MRSDrama, you accept the terms of the license.

Data Architecture

Our dataset is organized hierarchically.

Each top-level folder contains a set of dramas. Each drama folder contains a subfolder of cut (segmented) WAV files, an MP4 video file, and a JSON file with all annotation information. In addition, the geometric_pose subdirectory stores NumPy (.npy) sequences: listener‑centric 3D positions, head-orientation quaternions, and radial velocities with respect to the left and right ears. These sequences are aligned at the frame level and were generated at a 48 kHz sample rate with a 256-sample hop size.
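The frame alignment described above implies a fixed frame rate of 48000 / 256 = 187.5 pose frames per second. The following minimal sketch shows the conversion between frame indices and time. The function names are hypothetical, and the ceiling convention used for the frame count is an assumption, since the padding scheme of the released sequences is not stated here.

```python
SAMPLE_RATE = 48_000  # Hz, as stated in the dataset description
HOP_SIZE = 256        # audio samples per pose frame, as stated in the dataset description


def frame_to_seconds(frame_index: int) -> float:
    """Return the start time in seconds of the given pose frame."""
    return frame_index * HOP_SIZE / SAMPLE_RATE


def num_frames(num_samples: int) -> int:
    """Number of hop-spaced frames covering num_samples.

    Assumption: ceiling division. The released .npy files may use a
    different padding convention, so compare against the actual array length.
    """
    return -(-num_samples // HOP_SIZE)


if __name__ == "__main__":
    print(SAMPLE_RATE / HOP_SIZE)   # 187.5 frames per second
    print(frame_to_seconds(375))    # 2.0
```

When loading a pose sequence with numpy.load, comparing its first dimension against num_frames for the matching WAV file is a quick way to confirm which convention the data follows.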

Evaluation of ISDrama

The evaluation process is based on the code and models of "BAT: Learning to Reason about Spatial Sounds with Large Language Models".

Dependencies

A suitable conda environment named isdrama_eva can be created and activated with:

conda env create -f environment.yml
bash timm_patch/patch.sh
conda activate isdrama_eva

Preparation

Checkpoint Preparation

Please download the finetuned BAT encoder checkpoint and place it at:

./evaluation/ckpt/finetuned.pth

Make sure the path exists (create the ckpt directory if necessary).

Data Preparation

For evaluation, you must prepare paired ground‑truth audio and generated audio. Place them respectively in:

./evaluation/data/gt
./evaluation/data/infer

The expected directory layout is:

.
├── gt
│   ├── 0000.wav
│   ├── 0001.wav
│   ├── 0002.wav
│   └── 0003.wav
└── infer
    ├── 0000.wav
    ├── 0001.wav
    ├── 0002.wav
    └── 0003.wav

Important:

  • The files inside gt and infer must correspond one‑to‑one.
  • Filenames and counts must match exactly (e.g., gt/0002.wav pairs with infer/0002.wav).
  • Ensure sampling rates and channel configurations are consistent if required by downstream metrics.
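The pairing requirements above can be checked before running the evaluation. The sketch below is not part of the repository; the helper names find_unpaired and check_dirs are hypothetical, and it only compares filenames, not sampling rates or channel counts.

```python
from pathlib import Path


def find_unpaired(gt_names, infer_names):
    """Return (missing_in_infer, missing_in_gt) as sorted lists of filenames."""
    gt, infer = set(gt_names), set(infer_names)
    return sorted(gt - infer), sorted(infer - gt)


def check_dirs(gt_dir: str, infer_dir: str) -> list:
    """Raise ValueError if the WAV filenames in the two folders do not match."""
    gt_names = [p.name for p in Path(gt_dir).glob("*.wav")]
    infer_names = [p.name for p in Path(infer_dir).glob("*.wav")]
    missing_in_infer, missing_in_gt = find_unpaired(gt_names, infer_names)
    if missing_in_infer or missing_in_gt:
        raise ValueError(
            f"Unpaired files. Missing in infer: {missing_in_infer}; "
            f"missing in gt: {missing_in_gt}"
        )
    return sorted(gt_names)


if __name__ == "__main__":
    # Paths follow the layout described above, relative to the repository root.
    paired = check_dirs("./evaluation/data/gt", "./evaluation/data/infer")
    print(f"{len(paired)} paired files found")
```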

Metrics

Semantic & Acoustic Metrics

  • Character Error Rate (CER): Assesses transcript/content accuracy.
  • Cosine Similarity (SIM): Measures speaker timbre similarity between the generated audio and the prompt/reference audio (e.g., via speaker embeddings).
  • F0 Frame Error (FFE): Evaluates prosody fidelity by comparing voiced/unvoiced decisions and pitch (F0) frames.
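As an illustration of the first metric, CER is commonly defined as the character-level edit distance between a reference transcript and a hypothesis transcript, divided by the reference length. The sketch below assumes that definition. How the transcripts are obtained (for example, with a speech recognizer) and any text normalization are not specified in this section and are left out.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed over characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("drama", "drama"))  # 0.0
```

Lower values are better, and the score can exceed 1 when the hypothesis is much longer than the reference.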

Spatial Metrics

  • IPD MAE: Mean Absolute Error between ground‑truth and generated Interaural Phase Differences.
  • ILD MAE: Mean Absolute Error between ground‑truth and generated Interaural Level Differences.
  • Angle Cosine Similarity (ANG Cos): Cosine similarity between ground‑truth and generated direction (azimuth / elevation) angle embeddings.
  • Distance Cosine Similarity (Dis Cos): Cosine similarity between ground‑truth and generated distance embeddings.

Note: Cosine‑based spatial scores are in the range [-1, 1], with higher values indicating closer alignment of spatial embeddings.
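To make the spatial metrics concrete, the sketch below computes IPD and ILD from a stereo signal using a simple NumPy STFT, then applies MAE and cosine similarity. It is an illustration under stated assumptions, not the repository's implementation: the FFT size, hop size, window, and the exact IPD / ILD definitions used by the evaluation scripts may differ, and the angle and distance embeddings in the real pipeline come from the BAT encoder rather than from arbitrary vectors.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT without padding; returns shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)


def ipd_ild(stereo, eps=1e-8):
    """IPD (radians) and ILD (dB) per time-frequency bin.

    stereo: array of shape (2, num_samples), left channel first (assumption).
    """
    left, right = stft(stereo[0]), stft(stereo[1])
    ipd = np.angle(left * np.conj(right))  # phase of left minus phase of right
    ild = 20.0 * np.log10((np.abs(left) + eps) / (np.abs(right) + eps))
    return ipd, ild


def mae(a, b):
    """Mean absolute error between two equally shaped arrays."""
    return float(np.mean(np.abs(a - b)))


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.standard_normal((2, 48_000))     # placeholder for ground-truth audio
    infer = rng.standard_normal((2, 48_000))  # placeholder for generated audio
    gt_ipd, gt_ild = ipd_ild(gt)
    inf_ipd, inf_ild = ipd_ild(infer)
    print("IPD MAE:", mae(gt_ipd, inf_ipd))
    print("ILD MAE:", mae(gt_ild, inf_ild))
```

Note that this plain MAE ignores phase wrap-around, so two IPD values near +π and -π are treated as far apart. Whether the evaluation scripts wrap the phase difference is not stated here.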

Running the Evaluation

Run the following script to perform the evaluation pipeline:

cd evaluation
bash ./evaluate/eval.sh

The script evaluate/eval.sh executes the following three stages:

  1. Extract angle and distance embeddings using the BAT encoder.

  2. Extract IPD & ILD features from paired ground‑truth and generated stereo audio.

  3. Compute metrics: MAE (for IPD / ILD) and cosine similarities (for angle and distance).

Ensure that ground‑truth and generated audio files are correctly paired and preprocessed before running the script.

Citations

If you find this code useful in your research, please cite our work:

@article{zhang2025isdrama,
  title={ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting},
  author={Zhang, Yu and Guo, Wenxiang and Pan, Changhao and Zhu, Zhiyuan and Jin, Tao and Zhao, Zhou},
  journal={arXiv preprint arXiv:2504.20630},
  year={2025}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
