markitantov/MASAI
Multi-Lingual Approach for Multi-Modal Emotion and Sentiment Recognition Based on Triple Fusion

Abstract

Affective state recognition is a challenging task that requires large amounts of input data, such as audio, video, and text. Current multi-modal approaches are often single-task and corpus-specific, resulting in overfitting, poor cross-corpus generalization, and reduced real-world performance. In this work, we address these limitations by: (1) multi-lingual training on corpora that include Russian (RAMAS) and English (MELD, CMU-MOSEI) speech; (2) multi-task learning for joint emotion and sentiment recognition; and (3) a novel Triple Fusion strategy that employs cross-modal integration at both the hierarchical unimodal and fused multi-modal feature levels, enhancing intra- and inter-modal relationships across different affective states and modalities. Additionally, to optimize the performance of the proposed approach, we compare temporal encoders (Transformer-based, Mamba, xLSTM) and fusion strategies (double and triple fusion, with and without a label encoder) to comprehensively understand their capabilities and limitations. On the Test subset of the CMU-MOSEI corpus, the proposed approach achieved a mean weighted F1-score (mWF) of 88.6% for emotion recognition and a weighted F1-score (WF) of 84.8% for sentiment recognition (+9.5% and +6.0% absolute, respectively, over prior multi-task baselines). On the Test subset of the MELD corpus, it achieved a WF of 49.6% for emotion and a WF of 60.0% for sentiment recognition (+8.4% WF for emotion recognition over the strongest multi-task baseline). On the Test subset of the RAMAS corpus, it showed competitive performance, with WFs of 71.8% and 90.0%, respectively.
We also compare the performance of the proposed approach with that of state-of-the-art methods.
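This README does not include implementation details, but the distinction between a double fusion strategy (one fusion step over unimodal features) and the triple fusion strategy (pairwise cross-modal integration at the unimodal level followed by a second fusion at the multi-modal level) can be sketched schematically. In this minimal sketch, all function names are illustrative assumptions, and plain list concatenation stands in for the learned cross-modal fusion blocks of the actual model:

```python
def fuse(*feature_vectors):
    """Stand-in for a learned cross-modal fusion block (e.g. cross-attention).
    Here it simply concatenates feature vectors."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def double_fusion(audio, video, text):
    # Double fusion: integrate all unimodal features in a single step.
    return fuse(audio, video, text)


def triple_fusion(audio, video, text):
    # Step 1: pairwise cross-modal integration at the unimodal feature level.
    av = fuse(audio, video)
    at = fuse(audio, text)
    vt = fuse(video, text)
    # Step 2: integrate the pairwise representations at the multi-modal level.
    return fuse(av, at, vt)


# Toy 2-dimensional features per modality.
audio, video, text = [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]
print(len(double_fusion(audio, video, text)))  # 6
print(len(triple_fusion(audio, video, text)))  # 12
```

The point of the sketch is only structural: triple fusion passes each modality through multiple cross-modal pairings before the final integration, so inter-modal relationships are modeled twice rather than once.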

About

The official repository for the TRIFONES page.
