SocioEmoDialog: A Multimodal Dyadic Dialogue Dataset with Sociologically-Aligned Emotion Distribution
Official repository of the SocioEmoDialog dataset - A large-scale Chinese audio-visual dialogue dataset featuring 21,800 professionally acted dialogues (400 hours) with synchronized high-quality video and audio. Includes:
- Actor diversity: 119 actors with varied demographics (age, gender, etc.).
- Emotion annotations: Discrete emotional labels aligned with sociologically grounded distributions.
- Diverse scenarios: Covers real-life interactions with natural conversational flow and emotional expressions.
- Professional recording: Filmed in acoustically treated neutral studios using high-end cameras and microphones.
We propose the first high-quality multimodal dialogue dataset aligned with sociologically grounded distributions of emotional expression in everyday human interaction. The dataset comprises 21,800 dialogue sessions performed by 119 professional actors, spanning 18 emotional categories, with a total duration of 400 hours.
| SocioEmoDialog | Value |
|---|---|
| # actors | 119 |
| # emotions | 18 |
| # dialogues | 21,880 |
| # utterances | 268,404 |
| # gender: male | 58 |
| # gender: female | 61 |
| avg # utterances/dialogue | 12.27 |
| avg age | 26 |
| total length (hr) | 400 |
Comparison of different datasets. SocioEmoDialog excels in expression diversity, actor scale, data scale, and recording quality. S and D denote single and dyadic, respectively.
In real life, people experience a broad spectrum of emotions every day, and the frequency of these emotions is inherently imbalanced. As shown in sociological research, the natural distribution of emotions in everyday life is heavily skewed, with neutral and mildly positive emotions occurring far more frequently than extreme emotions such as anger or fear. Guided by these findings, our dataset is constructed to closely reflect real-world emotional frequencies.
{
    <dialogue_id>: {
        "topic_label": <topic label>,
        "num_utterances": <number of utterances>,
        "utterances": [  # ordered list of conversation turns
            {
                "utterance_id": <utterance id>,
                "speaker_id": 0 or 1,
                "emotion_label": <emotion label>,
                "text": <utterance text>
            },
            ...
        ]
    },
    ...
}
- dialogue_id: A unique identifier for the dialogue instance
- topic_label: A label describing the topic or theme of the dialogue (e.g., "Fashion", "Music")
- num_utterances: The total number of utterances (turns) in the dialogue
- utterances: An ordered list of utterance objects, representing the turns in the dialogue
- utterance_id: A sequential identifier for the utterance within the dialogue
- speaker_id: An identifier for the speaker, typically 0 for one participant and 1 for the other
- emotion_label: The annotated emotion associated with the utterance (e.g., "Joy", "Anger")
- text: The content of the utterance
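For reference, a minimal sketch of loading the script file and iterating over utterances, assuming the JSON layout above and the `data/scripts/SocioEmoDialog_scripts.json` location described in the download section below:

```python
import json
from collections import Counter

# Load the English script file (path follows the layout described later in this README).
with open("data/scripts/SocioEmoDialog_scripts.json", encoding="utf-8") as f:
    dialogues = json.load(f)

# Count emotion labels across all utterances.
emotion_counts = Counter()
for dialogue_id, dialogue in dialogues.items():
    for utt in dialogue["utterances"]:
        emotion_counts[utt["emotion_label"]] += 1

print(f"{len(dialogues)} dialogues loaded")
print(emotion_counts.most_common(5))  # most frequent emotion labels
```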
Each video is named using the pattern `<date>_md5_<side>.mp4`, where `<date>` indicates the recording date and `<side>` is either `left` or `right`, corresponding to the position of the speaker in the video.
git clone https://github.com/KwaiVGI/SocioEmoDialog.git
cd SocioEmoDialog
# create env using conda
conda create -n socioemodialog python==3.8
conda activate socioemodialog
pip install -r requirements.txt
You should install the PyTorch build matching your setup. Visit the PyTorch official website for the installation command corresponding to your CUDA version.
The easiest way to download our dataset is from HuggingFace:
# !pip install -U "huggingface_hub[cli]"
huggingface-cli download SocioEmoDialog/SocioEmoDialog-21.8K --local-dir data --exclude "*.git*" "README.md" "docs"
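Alternatively, a rough equivalent of the CLI command above using the `huggingface_hub` Python API (the ignore patterns mirror the `--exclude` flags; adjust them and the repo type as needed):

```python
from huggingface_hub import snapshot_download

# Download the dataset repo into ./data, skipping git metadata and docs,
# mirroring the huggingface-cli command above.
snapshot_download(
    repo_id="SocioEmoDialog/SocioEmoDialog-21.8K",
    local_dir="data",
    ignore_patterns=["*.git*", "README.md", "docs/*"],
    # add repo_type="dataset" if the files are hosted as a dataset repo
)
```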
Then, place the `raw_videos` folder into `data/videos`, and place the `SocioEmoDialog_scripts.json` and `SocioEmoDialog_scripts_cn.json` files into `data/scripts`.
The original video is `<video_name>.mp4`, whose audio track is stereo: the left channel corresponds to the speech of the actor on the left, and the right channel to the actor on the right. We process the raw data through the following steps:
- **Channel Separation**: We use ffmpeg to extract the left and right audio channels into two separate mono audio files, corresponding to the speech of the left and right actors, respectively (see the sketch after this list).
- **Speaker Diarization**: Each mono audio track is processed independently with speaker diarization; we apply speaker embedding clustering to annotate speech segments from different speakers within the same channel.
- **Automatic Speech Recognition (ASR)**: The Whisper model transcribes each valid speech segment, producing both the transcript and accurate timestamps.
- **Dialogue Clip Generation**: Based on the speaker segmentation and temporal alignment, the original video is split into short clips organized by dialogue turns, facilitating downstream multimodal analysis.
- **Result Saving**: Each transcribed speech segment, along with its timestamp information, is saved as a structured .json file, named consistently with the original audio file.
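A minimal sketch of the channel-separation step, assuming ffmpeg is on your PATH; the output filenames follow the `<video_name>_left.wav` / `<video_name>_right.wav` convention shown below, and `video_processor.py` may implement this differently:

```python
import subprocess
from pathlib import Path

def split_stereo_channels(video_path, out_dir):
    """Split the stereo audio track of a video into left/right mono WAV files."""
    stem = Path(video_path).stem
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", video_path,
            # channelsplit turns the stereo stream into two mono streams
            "-filter_complex", "[0:a]channelsplit=channel_layout=stereo[L][R]",
            "-map", "[L]", str(out / f"{stem}_left.wav"),
            "-map", "[R]", str(out / f"{stem}_right.wav"),
        ],
        check=True,
    )

# Hypothetical example path; point this at a file under data/videos/.
split_stereo_channels("data/videos/example.mp4", "data/processed/example")
```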
**Data Processing Instructions.** To process the data, ensure that the raw video files are located in the following directory:
data/videos/
Then run the processing script:
cd data_tools
python video_processor.py
This will extract and prepare audio segments from the videos for ASR.
**Output Directory Structure.** After processing, you should get the following structure for each video:
<video_name>/
├── video_segments/
│   ├── 000000.mp4
│   └── ...
├── wav_left_segments/
│   ├── 000000.wav
│   └── ...
├── wav_right_segments/
│   ├── 000000.wav
│   └── ...
├── <video_name>_left.wav
├── <video_name>_right.wav
├── <video_name>_left_mute.wav
├── <video_name>_right_mute.wav
├── <video_name>_left_speaker_diarization.log
├── <video_name>_right_speaker_diarization.log
├── <video_name>_left_speaker_diarization_asr.json
└── <video_name>_right_speaker_diarization_asr.json
Description of Each Item:
- <video_name>: The base name of the original video file
- video_segments/: Contains turn-level video clips extracted from the original video
- wav_left_segments/: Contains audio segments from the left channel (left speaker only)
- wav_right_segments/: Contains audio segments from the right channel (right speaker only)
- <video_name>_left.wav / <video_name>_right.wav: The raw audio extracted from the left/right channel
- <video_name>_left_mute.wav / <video_name>_right_mute.wav: The left/right audio with the non-target speaker muted
- <video_name>_left_speaker_diarization.log / ...right...: Log file containing speaker diarization segment info
- <video_name>_left_speaker_diarization_asr.json / ...right...: Transcription results (ASR) for the left/right speaker segments
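For convenience, a small hypothetical helper (not part of `data_tools`) that checks whether a processed folder contains the outputs listed above; the processed directory path is an assumption:

```python
from pathlib import Path
from typing import List

EXPECTED_SUFFIXES = [
    "_left.wav", "_right.wav",
    "_left_mute.wav", "_right_mute.wav",
    "_left_speaker_diarization.log", "_right_speaker_diarization.log",
    "_left_speaker_diarization_asr.json", "_right_speaker_diarization_asr.json",
]

def check_processed_dir(video_dir: str) -> List[str]:
    """Return the expected outputs that are missing for one processed video."""
    d = Path(video_dir)
    name = d.name  # <video_name>
    missing = [s for s in EXPECTED_SUFFIXES if not (d / f"{name}{s}").exists()]
    for sub in ("video_segments", "wav_left_segments", "wav_right_segments"):
        if not (d / sub).is_dir():
            missing.append(sub + "/")
    return missing

print(check_processed_dir("data/processed/<video_name>"))  # replace with a real folder
```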
Run the following command to retrieve the matching script lines for each actor's utterances:
python whisper_asr.py
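For reference, a minimal sketch of segment-level transcription with the `openai-whisper` package; `whisper_asr.py` may use a different model size or interface:

```python
import whisper

# Model size is an assumption; pick a checkpoint that fits your GPU/CPU budget.
model = whisper.load_model("base")

# Transcribe one diarized mono segment (Chinese speech), keeping segment timestamps.
result = model.transcribe("wav_left_segments/000000.wav", language="zh")
print(result["text"])
for seg in result["segments"]:
    print(f'{seg["start"]:.2f}s - {seg["end"]:.2f}s: {seg["text"]}')
```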
To test the emotion classification functionality, run the following command:
cd eval_tools
python emotion/emotion_classfication.py
Note: Replace `openai.api_key = "your-api-key"` in the script with your actual OpenAI API key.
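As a rough illustration only, such a classifier call can look like the sketch below, assuming the legacy `openai<1.0` interface implied by `openai.api_key`; the actual prompt, label set, and model in `emotion_classfication.py` may differ:

```python
import openai

openai.api_key = "your-api-key"  # replace with your actual OpenAI API key

EMOTIONS = ["Neutral", "Joy", "Anger"]  # illustrative subset of the 18 labels

def classify_emotion(utterance):
    """Ask the chat model to pick one emotion label for an utterance."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # model choice is an assumption
        messages=[
            {"role": "system",
             "content": f"Classify the emotion of the utterance. Answer with one of: {', '.join(EMOTIONS)}."},
            {"role": "user", "content": utterance},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"].strip()

print(classify_emotion("I can't believe you did that again!"))
```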
To test the emotion statistics functionality, run the following command:
python emotion/emotion_evaluator.py
First, clone the official repository and follow its instructions to set up the environment:
cd eval_tools/video/BodySegmentationTool
git clone https://github.com/zllrunning/face-parsing.PyTorch.git
Then, refer to the official repository to install dependencies and download the pretrained model (79999_iter.pth). Once face-parsing.PyTorch is set up, run the following command:
python BodySegmentationTool.py
First, clone the official repository and follow its instructions to set up the environment:
cd eval_tools/video/EyesTracking
git clone https://github.com/TadasBaltrusaitis/OpenFace.git
Once OpenFace is set up, run the eye tracking script:
python EyesTracking.py
To generate a heatmap of head pose data, run the following script:
python DataToPlot_HeatMap.py
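For reference, a minimal sketch of the kind of head-pose heatmap this step produces, assuming yaw/pitch angles per frame have already been extracted (the arrays below are placeholders; `DataToPlot_HeatMap.py` may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical input: one yaw/pitch pair (in degrees) per video frame.
yaw = np.random.normal(0, 10, 5000)    # replace with extracted head-pose yaw
pitch = np.random.normal(0, 8, 5000)   # replace with extracted head-pose pitch

# 2D histogram of head orientation, rendered as a heatmap.
heatmap, xedges, yedges = np.histogram2d(yaw, pitch, bins=60)
plt.imshow(heatmap.T, origin="lower", aspect="auto",
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.xlabel("yaw (deg)")
plt.ylabel("pitch (deg)")
plt.colorbar(label="frame count")
plt.title("Head pose distribution")
plt.savefig("head_pose_heatmap.png", dpi=150)
```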
If you find SocioEmoDialog useful for your research, please consider starring this repo and citing our work with the following BibTeX:
@unpublished{socioemodialog2025,
title = {SocioEmoDialog: A Multimodal Dyadic Dialogue Dataset with Sociologically-Aligned Emotion Distribution},
year = {2025}
}