SocioEmoDialog: A Multimodal Dyadic Dialogue Dataset with Sociologically-Aligned Emotion Distribution
Official repository of the SocioEmoDialog dataset - A large-scale Chinese audio-visual dialogue dataset featuring 21,800 professionally acted dialogues (400 hours) with synchronized high-quality video and audio. Includes:
- Actor diversity: 119 actors with varied demographics (age, gender, etc.).
- Emotion annotations: Discrete emotional labels aligned with sociologically grounded distributions.
- Diverse scenarios: Covers real-life interactions with natural conversational flow and emotional expressions.
- Professional recording: Filmed in acoustically treated neutral studios using high-end cameras and microphones.
We propose the first high-quality multimodal dialogue dataset aligned with sociologically grounded distributions of emotional expression in everyday human interaction. The dataset comprises 21,800 dialogue sessions performed by 119 professional actors, spanning 18 emotional categories, with a total duration of 400 hours.
| SocioEmoDialog | Value |
|---|---|
| # actors | 119 |
| # emotions | 18 |
| # dialogues | 21,880 |
| # utterances | 268,404 |
| # gender: male | 58 |
| # gender: female | 61 |
| avg # utterances/dialogue | 12.27 |
| avg age | 26 |
| total length (hr) | 400 |
Comparison of different datasets. SocioEmoDialog excels in expression diversity, actor scale, data scale, and recording quality. S and D denote single and dyadic, respectively.
In real life, people experience a broad spectrum of emotions every day, and the frequency of these emotions is inherently imbalanced. As shown in sociological research, the natural distribution of emotions in everyday life is heavily skewed, with neutral and mildly positive emotions occurring far more frequently than extreme emotions such as anger or fear. Guided by these findings, our dataset is constructed to closely reflect real-world emotional frequencies.
{
    <dialogue_id>: {
        "topic_label": <topic label>,
        "num_utterances": <number of utterances>,
        "utterances": [  # ordered list of conversation turns
            {
                "utterance_id": <utterance id>,
                "speaker_id": 0 or 1,
                "emotion_label": <emotion label>,
                "text": <utterance text>
            },
            ...
        ]
    },
    ...
}
- dialogue_id: A unique identifier for the dialogue instance
- topic_label: A label describing the topic or theme of the dialogue (e.g., "Fashion", "Music")
- num_utterances: The total number of utterances (turns) in the dialogue
- utterances: An ordered list of utterance objects, representing the turns in the dialogue
- utterance_id: A sequential identifier for the utterance within the dialogue
- speaker_id: An identifier for the speaker, typically 0 for one participant and 1 for the other
- emotion_label: The annotated emotion associated with the utterance (e.g., "Joy", "Anger")
- text: The content of the utterance
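For reference, a minimal sketch of loading the script file and iterating over utterances, assuming the JSON layout above and the `data/scripts/SocioEmoDialog_scripts.json` location described in the download section below:

```python
import json
from collections import Counter

# Load the English script file (path follows the layout described later in this README).
with open("data/scripts/SocioEmoDialog_scripts.json", encoding="utf-8") as f:
    dialogues = json.load(f)

# Count emotion labels across all utterances.
emotion_counts = Counter()
for dialogue_id, dialogue in dialogues.items():
    for utt in dialogue["utterances"]:
        emotion_counts[utt["emotion_label"]] += 1

print(f"{len(dialogues)} dialogues loaded")
print(emotion_counts.most_common(5))  # most frequent emotion labels
```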
Each video is named using the pattern `<date>_md5_<side>.mp4`, where `<date>` indicates the recording date and `<side>` is either `left` or `right`, corresponding to the position of the speaker in the video.
git clone https://github.com/KwaiVGI/SocioEmoDialog.git
cd SocioEmoDialog
# create env using conda
conda create -n socioemodialog python==3.8
conda activate socioemodialog
pip install -r requirements.txt
You should install the PyTorch build matching your setup. Visit the PyTorch official website for the installation command corresponding to your CUDA version.
The easiest way to download our dataset is from HuggingFace:
# !pip install -U "huggingface_hub[cli]"
huggingface-cli download SocioEmoDialog/SocioEmoDialog-21.8K --local-dir data --exclude "*.git*" "README.md" "docs"
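Alternatively, a rough equivalent of the CLI command above using the `huggingface_hub` Python API (the ignore patterns mirror the `--exclude` flags; adjust them and the repo type as needed):

```python
from huggingface_hub import snapshot_download

# Download the dataset repo into ./data, skipping git metadata and docs,
# mirroring the huggingface-cli command above.
snapshot_download(
    repo_id="SocioEmoDialog/SocioEmoDialog-21.8K",
    local_dir="data",
    ignore_patterns=["*.git*", "README.md", "docs/*"],
    # add repo_type="dataset" if the files are hosted as a dataset repo
)
```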
Then, place the `raw_videos` folder into `data/videos`, and place the `SocioEmoDialog_scripts.json` and `SocioEmoDialog_scripts_cn.json` files into `data/scripts`.
The original video is `<video_name>.mp4`, whose audio track is stereo: the left channel corresponds to the speech of the actor on the left, and the right channel to the actor on the right. We process the raw data through the following steps:
- **Channel Separation**: We use ffmpeg to extract the left and right audio channels into two separate mono audio files, corresponding to the speech of the left and right actors, respectively (see the sketch after this list).
- **Speaker Diarization**: Each mono audio track is processed independently with speaker diarization; we apply speaker embedding clustering to annotate speech segments from different speakers within the same channel.
- **Automatic Speech Recognition (ASR)**: The Whisper model transcribes each valid speech segment, producing both the transcript and accurate timestamps.
- **Dialogue Clip Generation**: Based on the speaker segmentation and temporal alignment, the original video is split into short clips organized by dialogue turns, facilitating downstream multimodal analysis.
- **Result Saving**: Each transcribed speech segment, along with its timestamp information, is saved as a structured .json file, named consistently with the original audio file.
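A minimal sketch of the channel-separation step, assuming ffmpeg is on your PATH; the output filenames follow the `<video_name>_left.wav` / `<video_name>_right.wav` convention shown below, and `video_processor.py` may implement this differently:

```python
import subprocess
from pathlib import Path

def split_stereo_channels(video_path, out_dir):
    """Split the stereo audio track of a video into left/right mono WAV files."""
    stem = Path(video_path).stem
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", video_path,
            # channelsplit turns the stereo stream into two mono streams
            "-filter_complex", "[0:a]channelsplit=channel_layout=stereo[L][R]",
            "-map", "[L]", str(out / f"{stem}_left.wav"),
            "-map", "[R]", str(out / f"{stem}_right.wav"),
        ],
        check=True,
    )

# Hypothetical example path; point this at a file under data/videos/.
split_stereo_channels("data/videos/example.mp4", "data/processed/example")
```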
**Data Processing Instructions.** To process the data, ensure that the raw video files are located in the following directory:
data/videos/
Then run the processing script:
cd data_tools
python video_processor.py
This will extract and prepare audio segments from the videos for ASR.
**Output Directory Structure.** After processing, you should get the following structure for each video:
<video_name>/
├── video_segments/
│   ├── 000000.mp4
│   └── ...
├── wav_left_segments/
│   ├── 000000.wav
│   └── ...
├── wav_right_segments/
│   ├── 000000.wav
│   └── ...
├── <video_name>_left.wav
├── <video_name>_right.wav
├── <video_name>_left_mute.wav
├── <video_name>_right_mute.wav
├── <video_name>_left_speaker_diarization.log
├── <video_name>_right_speaker_diarization.log
├── <video_name>_left_speaker_diarization_asr.json
└── <video_name>_right_speaker_diarization_asr.json
Description of Each Item:
- <video_name>: The base name of the original video file
- video_segments/: Contains turn-level video clips extracted from the original video
- wav_left_segments/: Contains audio segments from the left channel (left speaker only)
- wav_right_segments/: Contains audio segments from the right channel (right speaker only)
- <video_name>_left.wav / <video_name>_right.wav: The raw audio extracted from the left/right channel
- <video_name>_left_mute.wav / <video_name>_right_mute.wav: The left/right audio with the non-target speaker muted
- <video_name>_left_speaker_diarization.log / ...right...: Log file containing speaker diarization segment info
- <video_name>_left_speaker_diarization_asr.json / ...right...: Transcription results (ASR) for the left/right speaker segments
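For convenience, a small hypothetical helper (not part of `data_tools`) that checks whether a processed folder contains the outputs listed above; the processed directory path is an assumption:

```python
from pathlib import Path
from typing import List

EXPECTED_SUFFIXES = [
    "_left.wav", "_right.wav",
    "_left_mute.wav", "_right_mute.wav",
    "_left_speaker_diarization.log", "_right_speaker_diarization.log",
    "_left_speaker_diarization_asr.json", "_right_speaker_diarization_asr.json",
]

def check_processed_dir(video_dir: str) -> List[str]:
    """Return the expected outputs that are missing for one processed video."""
    d = Path(video_dir)
    name = d.name  # <video_name>
    missing = [s for s in EXPECTED_SUFFIXES if not (d / f"{name}{s}").exists()]
    for sub in ("video_segments", "wav_left_segments", "wav_right_segments"):
        if not (d / sub).is_dir():
            missing.append(sub + "/")
    return missing

print(check_processed_dir("data/processed/<video_name>"))  # replace with a real folder
```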
Run the following command to retrieve the matching script lines for each actor's utterances:
python whisper_asr.py
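For reference, a minimal sketch of segment-level transcription with the `openai-whisper` package; `whisper_asr.py` may use a different model size or interface:

```python
import whisper

# Model size is an assumption; pick a checkpoint that fits your GPU/CPU budget.
model = whisper.load_model("base")

# Transcribe one diarized mono segment (Chinese speech), keeping segment timestamps.
result = model.transcribe("wav_left_segments/000000.wav", language="zh")
print(result["text"])
for seg in result["segments"]:
    print(f'{seg["start"]:.2f}s - {seg["end"]:.2f}s: {seg["text"]}')
```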
To test the emotion classification functionality, run the following command:
cd eval_tools
python emotion/emotion_classfication.py
Note: Replace `openai.api_key = "your-api-key"` in the script with your actual OpenAI API key.
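As a rough illustration only, such a classifier call can look like the sketch below, assuming the legacy `openai<1.0` interface implied by `openai.api_key`; the actual prompt, label set, and model in `emotion_classfication.py` may differ:

```python
import openai

openai.api_key = "your-api-key"  # replace with your actual OpenAI API key

EMOTIONS = ["Neutral", "Joy", "Anger"]  # illustrative subset of the 18 labels

def classify_emotion(utterance):
    """Ask the chat model to pick one emotion label for an utterance."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # model choice is an assumption
        messages=[
            {"role": "system",
             "content": f"Classify the emotion of the utterance. Answer with one of: {', '.join(EMOTIONS)}."},
            {"role": "user", "content": utterance},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"].strip()

print(classify_emotion("I can't believe you did that again!"))
```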
To test the emotion statistics functionality, run the following command:
python emotion/emotion_evaluator.py
First, clone the official repository and follow its instructions to set up the environment:
cd eval_tools/video/BodySegmentationTool
git clone https://github.com/zllrunning/face-parsing.PyTorch.git
Then, refer to the official repository to install dependencies and download the pretrained model (79999_iter.pth). Once face-parsing.PyTorch is set up, run the following command:
python BodySegmentationTool.py
First, clone the official repository and follow its instructions to set up the environment:
cd eval_tools/video/EyesTracking
git clone https://github.com/TadasBaltrusaitis/OpenFace.git
Once OpenFace is set up, run the eye tracking script:
python EyesTracking.py
To generate a heatmap of head pose data, run the following script:
python DataToPlot_HeatMap.py
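For reference, a minimal sketch of the kind of head-pose heatmap this step produces, assuming yaw/pitch angles per frame have already been extracted (the arrays below are placeholders; `DataToPlot_HeatMap.py` may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical input: one yaw/pitch pair (in degrees) per video frame.
yaw = np.random.normal(0, 10, 5000)    # replace with extracted head-pose yaw
pitch = np.random.normal(0, 8, 5000)   # replace with extracted head-pose pitch

# 2D histogram of head orientation, rendered as a heatmap.
heatmap, xedges, yedges = np.histogram2d(yaw, pitch, bins=60)
plt.imshow(heatmap.T, origin="lower", aspect="auto",
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.xlabel("yaw (deg)")
plt.ylabel("pitch (deg)")
plt.colorbar(label="frame count")
plt.title("Head pose distribution")
plt.savefig("head_pose_heatmap.png", dpi=150)
```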
If you find SocioEmoDialog useful for your research, please consider starring this repo and citing our work with the following BibTeX:
@unpublished{socioemodialog2025,
title = {SocioEmoDialog: A Multimodal Dyadic Dialogue Dataset with Sociologically-Aligned Emotion Distribution},
year = {2025}
}