This repository contains a Python project that builds an emotion classification model from text data using a Multinomial Naive Bayes classifier. The workflow includes loading and preprocessing the data, feature extraction with TF-IDF, model training, evaluation, and making predictions on new text.
- Overview
- Data Preparation
- Exploratory Data Analysis
- Text Preprocessing
- Feature Extraction
- Model Training and Evaluation
- Example Prediction
- Dependencies
- Usage
The project performs emotion classification by:
- Reading a dataset containing text and corresponding emotion labels.
- Replacing numerical labels with emotion names.
- Preprocessing the text to remove stopwords and perform lemmatization.
- Converting text into numerical features using TF-IDF.
- Splitting the data into training and testing sets.
- Training a Naive Bayes classifier.
- Evaluating the model's performance and making a prediction on new input.
- Data Loading: The dataset is loaded from a CSV file (`emotions.csv`) using pandas.
- Label Mapping: The code replaces numerical labels (0-5) with corresponding emotion names:
  - 0 → "sadness"
  - 1 → "joy"
  - 2 → "love"
  - 3 → "anger"
  - 4 → "fear"
  - 5 → "surprise"
- Subsetting Data: For efficiency, only the first 5000 rows are used for further analysis.
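
The data preparation steps could look roughly like the sketch below. It is an illustration, not the project's exact code: the column name `label` and the sample rows are assumptions, and the real file would be read with `pd.read_csv("emotions.csv")` rather than from an in-memory string. It also uses `Series.map` for the label replacement, which may differ from the call the project uses.

```python
import io

import pandas as pd

# Tiny in-memory stand-in so the sketch runs without the real file.
# In the project this would be: df = pd.read_csv("emotions.csv")
sample_csv = io.StringIO(
    "text,label\n"
    "i feel so alone tonight,0\n"
    "what a wonderful day,1\n"
    "i adore my family,2\n"
    "this makes me furious,3\n"
    "i am scared of the dark,4\n"
    "i did not expect that at all,5\n"
)
df = pd.read_csv(sample_csv)

# Integer codes 0-5 mapped to emotion names, as listed in this README.
label_names = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise",
}
df["label"] = df["label"].map(label_names)

# Keep at most the first 5000 rows (the sample has fewer, so all are kept).
df = df.head(5000)
print(df)
```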
- Dataset Shape & Missing Values: The code prints the shape of the dataset and checks for any missing values.
- Label Distribution: It prints the frequency distribution of the emotion labels to understand class balance.
- Data Information: General information and summary statistics about the dataset are displayed.
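
A minimal sketch of these checks with pandas follows. The small DataFrame and the `label` column name are made up for illustration, and the project may use slightly different calls.

```python
import pandas as pd

# Hypothetical sample standing in for the loaded dataset.
df = pd.DataFrame(
    {
        "text": [
            "what a wonderful day",
            "i am so happy for you",
            "i feel so alone tonight",
            "this makes me furious",
            "i am scared of the dark",
        ],
        "label": ["joy", "joy", "sadness", "anger", "fear"],
    }
)

# Shape as (rows, columns).
print(df.shape)

# Number of missing values per column.
missing = df.isnull().sum()
print(missing)

# Frequency of each emotion label, to judge class balance.
counts = df["label"].value_counts()
print(counts)

# Column dtypes and non-null counts, then summary statistics.
df.info()
print(df.describe(include="all"))
```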
- spaCy Initialization: The spaCy English model (`en_core_web_sm`) is loaded to perform NLP tasks.
- Preprocessing Function: A function `preprocess_text` is defined to:
  - Convert text to lowercase.
  - Tokenize the text.
  - Remove stopwords.
  - Keep only alphabetic tokens.
  - Apply lemmatization.
- Applying Preprocessing: The function is applied to the `text` column in the dataset, and a new column `cleaned_text` is created.
- TF-IDF Vectorization:
  - The `TfidfVectorizer` is initialized with a maximum of 1000 features.
  - The cleaned text data is transformed into a TF-IDF feature matrix.
- Train-Test Split: The feature matrix (X) and target labels (y) are split into training and testing sets using an 80/20 ratio.
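
The vectorization and split could be sketched as follows. The ten toy sentences are invented and stand in for the `cleaned_text` column. `random_state=42` is an assumption added for reproducibility and is not stated in this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned_text column and its labels.
cleaned_text = [
    "feel alone tonight",
    "wonderful day",
    "adore family",
    "make furious",
    "scared dark",
    "expect surprise party",
    "cry miss friend",
    "happy smile sunshine",
    "angry shout traffic",
    "afraid spider",
]
labels = [
    "sadness", "joy", "love", "anger", "fear",
    "surprise", "sadness", "joy", "anger", "fear",
]

# Cap the vocabulary at the 1000 most frequent terms.
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(cleaned_text)  # sparse matrix, one row per document
y = labels

# 80/20 split; random_state is an assumption, used only to make the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X.shape, X_train.shape, X_test.shape)
```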
- Training: A Multinomial Naive Bayes classifier is instantiated and trained on the training data.
- Prediction and Evaluation:
  - The classifier predicts emotions on the test set.
  - The model's accuracy is calculated.
  - A classification report is generated, showing precision, recall, and F1-score for each emotion.
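
A minimal sketch of training and evaluation with scikit-learn follows. The toy corpus is invented, so the printed scores say nothing about the real dataset. `random_state=42` and `zero_division=0` are assumptions, the latter added only to avoid warnings on such a small test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the preprocessed dataset.
texts = [
    "feel alone tonight", "cry miss friend", "sad lonely night", "tear alone",
    "wonderful happy day", "happy smile sunshine", "joyful happy morning", "smile laugh",
    "make furious", "angry shout traffic", "furious angry boss", "shout rage",
]
labels = ["sadness"] * 4 + ["joy"] * 4 + ["anger"] * 4

X = TfidfVectorizer(max_features=1000).fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

# Fit Multinomial Naive Bayes on the training portion only.
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predict on the held-out portion and score the predictions.
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(report)
```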
An example is provided where a new text ("He seemed mesmerized") is:
- Preprocessed using the same `preprocess_text` function.
- Transformed using the trained TF-IDF vectorizer.
- Classified using the trained Naive Bayes classifier, outputting the predicted emotion.
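
The three steps above might be sketched like this. To keep the example runnable without spaCy, `preprocess_text` here is a simplified stand-in that only lowercases and keeps alphabetic tokens. The training sentences are invented, so the printed label is illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def preprocess_text(text):
    # Simplified stand-in: the project's version also removes stopwords
    # and lemmatizes with spaCy.
    return " ".join(tok for tok in text.lower().split() if tok.isalpha())


# Toy training data, invented for illustration.
train_texts = [
    "feel alone tonight",
    "wonderful happy day",
    "adore family",
    "make furious",
    "scared dark",
    "looked mesmerized view",
]
train_labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]

vectorizer = TfidfVectorizer(max_features=1000)
X_train = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X_train, train_labels)

new_text = "He seemed mesmerized"
cleaned = preprocess_text(new_text)

# Use transform (not fit_transform) so the new text is mapped onto
# the vocabulary learned from the training data.
features = vectorizer.transform([cleaned])
predicted = classifier.predict(features)[0]
print(f"Predicted emotion: {predicted}")
```

If none of the words in the new text appear in the fitted vocabulary, the feature vector is all zeros and the prediction falls back to the class priors.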
- Python Libraries: `numpy`, `pandas`, `scikit-learn`, `spacy`
- spaCy Model: `en_core_web_sm` (make sure to download it via `python -m spacy download en_core_web_sm`)
- Clone the Repository: Clone this repository to your local machine.
- Install Dependencies: Install the required libraries using pip, then download the spaCy model:
  - `pip install numpy pandas scikit-learn spacy`
  - `python -m spacy download en_core_web_sm`