Our task is to use a deep learning architecture to identify the underlying emotion in spoken English audio, a task formally known as Speech Emotion Recognition (SER). We would like our model to differentiate between six emotions: anger, disgust, fear, happiness, sadness, and neutrality.
Requirements:
- Python 3.10.16
Packages:
- librosa
- matplotlib
- mamba-ssm
- numpy
- pandas
- pytorch-lightning
- pytorch-metric-learning
- random
- scipy
- seaborn
- soundfile
- torch
- torch-audiomentations
- torch-pitch-shift
- torchaudio
- torchmetrics
- torchvision
- transformers
Installed in a conda environment using:

```
conda create -n yourenv pip
pip install -r requirements.txt
```
Launch a Jupyter notebook with the following settings:
- List of modules to load: miniconda, python, torch, and tensorflow.
- Pre-Launch Command: conda activate your_project_location/envs/mamba-env/
- Default: 5 cores minimum, but you can lower the number of workers if necessary (the sketch below shows where the worker count enters the code).
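For reference, the worker count corresponds to the `num_workers` argument of the PyTorch `DataLoader`. The snippet below is a generic sketch, not code taken from the notebooks; the dummy tensors and batch size are placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors standing in for the real dataset built in utils/data_loader.py.
dataset = TensorDataset(torch.randn(100, 1, 64, 128), torch.randint(0, 6, (100,)))

# num_workers is the setting to lower if fewer CPU cores are available.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
```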
Our project uses four different datasets: CREMA-D, RAVDESS, TESS, and SAVEE. All input files are in WAV format (a minimal loading sketch follows the dataset list).
- CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset): This dataset includes 7,442 audio clips from 91 actors (48 male, 43 female). Each actor spoke 12 sentences representing six emotions (Anger, Disgust, Fear, Happy, Neutral, Sad), expressed at four intensity levels (Low, Medium, High, Unspecified).
- RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song): RAVDESS consists of 7,356 recordings from 24 professional actors (12 male, 12 female) who each spoke two statements. The emotions expressed in this dataset include calm, happy, sad, angry, fearful, surprise, and disgust, each at two levels of emotional intensity, along with a neutral expression.
- TESS (Toronto Emotional Speech Set): This dataset features 1,400 recordings of two female actors, ages 24 and 64, articulating 200 target words. Each word was spoken to convey one of seven emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral.
- SAVEE (Surrey Audio-Visual Expressed Emotion): SAVEE consists of 480 English statements recorded from 4 male actors. Each actor expressed 7 emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral.
The table below shows that the test accuracies for all of our models are above 70%. The Mamba-CNN model has the highest test accuracy at 74.29%. The majority of the models began to overfit around 10 epochs but still maintained decent test accuracy. (An illustrative sketch of the Mamba-CNN hybrid follows the table.)
| Model | Test Accuracy (%) | Training Accuracy (%) |
|---|---|---|
| Base CNN | 73.5 | 99.55 |
| CNN-GRU | 71.11 | 99.98 |
| CNN-Transformer | 70.8 | 99.11 |
| ResNet | 70.89 | 95.53 |
| Mamba | 72.79 | 98.62 |
| Mamba-CNN | 74.29 | 81.95 |
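For readers unfamiliar with the Mamba-CNN combination, the sketch below shows one plausible way to pair a small CNN front-end with the `Mamba` block from the `mamba-ssm` package. The layer sizes and overall layout are illustrative assumptions, not the exact architecture in mambaCNN.ipynb (note that `mamba-ssm` requires a CUDA-capable GPU).

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class MambaCNNSketch(nn.Module):
    """Illustrative CNN front-end + Mamba sequence block over mel-spectrogram input."""

    def __init__(self, n_mels=64, n_classes=6, d_model=128):
        super().__init__()
        # CNN front-end: downsample frequency and time, produce local features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        # Selective state-space block applied along the time axis.
        self.mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                          # x: (batch, 1, n_mels, time)
        f = self.cnn(x)                            # (batch, 64, n_mels//4, time//4)
        f = f.permute(0, 3, 1, 2).flatten(2)       # (batch, time//4, 64 * n_mels//4)
        f = self.proj(f)                           # (batch, time//4, d_model)
        f = self.mamba(f)                          # (batch, time//4, d_model)
        return self.head(f.mean(dim=1))            # (batch, n_classes)
```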
The main.ipynb file contains the results of the CNN-based models (Base CNN, CNN-Transformer, CNN-GRU, ResNet).
The mamba.ipynb file contains the results of the base Mamba model. The mambaCNN.ipynb file contains the results of the Mamba-CNN combined model.
The utils directory contains all of the helper methods. The preprocessing.py file stores the preprocessing and normalization methods. The data_loader.py file stores the dataset loader methods. The train_utils.py file contains the training, testing, and accuracy measurement methods. The models.py file contains the implementations of the CNN-based models.
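As an example of the kind of helper train_utils.py provides, here is a minimal accuracy-evaluation sketch; the function name and structure are assumptions for illustration, not code copied from the repository.

```python
import torch

@torch.no_grad()
def evaluate_accuracy(model, loader, device="cuda"):
    """Return the accuracy (%) of a trained model over a DataLoader."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        preds = model(x).argmax(dim=1)     # predicted emotion class per clip
        correct += (preds == y).sum().item()
        total += y.numel()
    return 100.0 * correct / total
```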
The hf-demo directory contains the code for the UI demo (HuggingFace).
The single-dataset-esting directory contains the results of the Base CNN model when it was run on each of the datasets individually.
- T. V. L. Trinh, D. T. L. T. Dao, L. X. T. Le, and E. Castelli, “Emotional speech recognition using deep neural networks,” Sensors (Basel), vol. 22, no. 4, p. 1414, Feb. 2022, doi: 10.3390/s22041414.
- R. A. Khalil, E. Jones, M. I. Babar, T. Jan, M. H. Zafar, and T. Alhussain, "Speech emotion recognition using deep learning techniques: A review," IEEE Access, vol. 7, pp. 117327–117345, 2019, doi: 10.1109/ACCESS.2019.2936124.
- S. Han, F. Leng, and Z. Jin, “Speech emotion recognition with a ResNet-CNN-Transformer parallel neural network,” 2021 International Conference on Communications, Information System and Computer Engineering (CISCE), vol. 2021, pp. 803–807, Beijing, China, 2021, doi: 10.1109/CISCE52179.2021.9445906.
- R. R. Subramanian, Y. Sireesha, Y. S. P. K. Reddy, T. Bindamrutha, M. Harika, and R. R. Sudharsan, "Audio emotion recognition by deep neural networks and machine learning algorithms," 2021 International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Coimbatore, India, 2021, pp. 1–6, doi: 10.1109/ICAECA52838.2021.9675492.
- NeuroByte, “Speech Emotion Recognition with TensorFlow: A CNN & CRNN Guide,” NeuroByte, Jan. 19, 2025. Available: https://neurobyte.org/guides/speech-emotion-recognition-cnns-crnns-tensorflow/.
- Kempner Institute, "Repeat after me: Transformers are better than state space models at copying," Harvard University. Accessed: Mar. 4, 2025. Available: https://kempnerinstitute.harvard.edu/research/deeper-learning/repeat-after-me-transformers-are-better-than-state-space-models-at-copying/
- A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.
- S. Sharanyaa, T. J. Mercy and S. V.G, "Emotion Recognition Using Speech Processing," 2023 3rd International Conference on Intelligent Technologies (CONIT), Hubli, India, 2023, pp. 1-5, doi: 10.1109/CONIT59222.2023.10205935.
This project was done in Spring 2025 for EC 523 Deep Learning at Boston University.
Anish Sinha, James Knee, Nathan Strahs, Tyler Nguyen, Varsha Singh