markitantov/MASAI
Multi-Lingual Approach for Multi-Modal Emotion and Sentiment Recognition Based on Triple Fusion

Abstract

Affective state recognition is a challenging task that requires large amounts of input data, such as audio, video, and text. Current multi-modal approaches are often single-task and corpus-specific, resulting in overfitting, poor cross-corpus generalization, and reduced real-world performance. In this work, we address these limitations by: (1) multi-lingual training on corpora that include Russian (RAMAS) and English (MELD, CMU-MOSEI) speech; (2) multi-task learning for joint emotion and sentiment recognition; and (3) a novel Triple Fusion strategy that employs cross-modal integration at both the hierarchical unimodal and fused multi-modal feature levels, enhancing intra- and inter-modal relationships across different affective states and modalities. Additionally, to optimize the performance of the proposed approach, we compare temporal encoders (Transformer-based, Mamba, xLSTM) and fusion strategies (double and triple fusion, with and without a label encoder) to comprehensively understand their capabilities and limitations. On the Test subset of the CMU-MOSEI corpus, the proposed approach achieved a mean weighted F1-score (mWF) of 88.6% for emotion recognition and a weighted F1-score (WF) of 84.8% for sentiment recognition (+9.5% and +6.0% absolute, respectively, over prior multi-task baselines). On the Test subset of the MELD corpus, it achieved a WF of 49.6% for emotion and a WF of 60.0% for sentiment recognition (+8.4% WF for emotion recognition over the strongest multi-task baseline). On the Test subset of the RAMAS corpus, it showed competitive performance, with WFs of 71.8% and 90.0%, respectively.
We also compare the performance of the proposed approach with that of state-of-the-art methods.
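This README does not include implementation details, but the distinction between a double fusion strategy (one fusion step over unimodal features) and the triple fusion strategy (pairwise cross-modal integration at the unimodal level followed by a second fusion at the multi-modal level) can be sketched schematically. In this minimal sketch, all function names are illustrative assumptions, and plain list concatenation stands in for the learned cross-modal fusion blocks of the actual model:

```python
def fuse(*feature_vectors):
    """Stand-in for a learned cross-modal fusion block (e.g. cross-attention).
    Here it simply concatenates feature vectors."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def double_fusion(audio, video, text):
    # Double fusion: integrate all unimodal features in a single step.
    return fuse(audio, video, text)


def triple_fusion(audio, video, text):
    # Step 1: pairwise cross-modal integration at the unimodal feature level.
    av = fuse(audio, video)
    at = fuse(audio, text)
    vt = fuse(video, text)
    # Step 2: integrate the pairwise representations at the multi-modal level.
    return fuse(av, at, vt)


# Toy 2-dimensional features per modality.
audio, video, text = [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]
print(len(double_fusion(audio, video, text)))  # 6
print(len(triple_fusion(audio, video, text)))  # 12
```

The point of the sketch is only structural: triple fusion passes each modality through multiple cross-modal pairings before the final integration, so inter-modal relationships are modeled twice rather than once.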

About

The official repository for the TRIFONES page.
