- Project Overview
- Dataset
- Solution Approach
- Performance
- Training Visualization
- Implementation Details
- Running the Project
- Future Improvements
## Project Overview

This repository contains my solution for the Kaggle competition Quora Insincere Questions Classification. The challenge focuses on identifying and filtering out toxic and misleading questions on the Quora platform.
Quora is a platform that empowers people to learn from each other through asking questions and connecting with others who provide unique insights. A significant challenge for the platform is to filter out "insincere" questions—those founded upon false premises or intended to make statements rather than seek genuine answers.
An insincere question may have these characteristics:
- Non-neutral or exaggerated tone targeting groups of people
- Rhetorical questions implying statements about groups
- Disparaging or inflammatory content
- Discriminatory ideas or stereotype confirmation
- Attacks against specific individuals or groups
- Outlandish premises about groups of people
- Disparagement of immutable characteristics
- Questions not grounded in reality or based on false information
- Sexual content used for shock value rather than seeking genuine answers
## Dataset

The dataset consists of over 1.3 million questions with binary labels indicating whether a question is sincere (0) or insincere (1).

- Training set: 1,306,122 questions
- Test set: 56,370 questions
- Features:
  - `qid`: Unique question identifier
  - `question_text`: The text of the Quora question
  - `target`: Binary label (1 for insincere, 0 for sincere)
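Because insincere questions make up only a small fraction of the labels, a stratified train/validation split keeps the class ratio identical in both splits. A minimal sketch with scikit-learn, using a hypothetical toy frame with the same columns as the competition's `train.csv`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame mirroring train.csv's columns (values are made up)
df = pd.DataFrame({
    "qid": [f"q{i}" for i in range(100)],
    "question_text": ["example question?"] * 100,
    "target": [1 if i < 10 else 0 for i in range(100)],  # 10% positive
})

# stratify=df["target"] preserves the insincere ratio in both splits
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

print(train_df["target"].mean(), val_df["target"].mean())  # both 0.1
```

Without `stratify`, a random split of a heavily imbalanced dataset can leave the validation set with a noticeably different positive rate, which distorts the F1 estimate.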
## Solution Approach

I implemented a deep learning approach in PyTorch, fine-tuning RoBERTa (FacebookAI/roberta-base) as the base model. The architecture includes:
- RoBERTa pre-trained language model for feature extraction
- A custom classification head with dropout for regularization
- Binary cross-entropy loss with logits for handling class imbalance
- Utilized stratified sampling to maintain class distribution
- Implemented gradient clipping to prevent exploding gradients
- Used AdamW optimizer with learning rate scheduling
- Applied exponential learning rate decay to optimize convergence
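The classification head and loss described above can be sketched as follows. The hidden size (768 for roberta-base), the dropout probability, and the `pos_weight` value are illustrative assumptions; in the real model the random tensor would be replaced by RoBERTa's `last_hidden_state`:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Dropout + linear layer over the first-token (<s>) embedding."""
    def __init__(self, hidden_size=768, dropout_prob=0.3):  # values are assumptions
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state):
        cls = last_hidden_state[:, 0, :]           # RoBERTa's <s> token embedding
        return self.out(self.dropout(cls)).squeeze(-1)

head = ClassificationHead()
# Stand-in for roberta(input_ids, attention_mask).last_hidden_state
hidden = torch.randn(4, 64, 768)                   # (batch, seq_len, hidden)
logits = head(hidden)                              # shape: (4,)

# BCE with logits; pos_weight upweights the rare insincere class
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([15.0]))  # ratio is illustrative
loss = loss_fn(logits, torch.tensor([0.0, 1.0, 0.0, 0.0]))
```

`BCEWithLogitsLoss` combines the sigmoid and the binary cross-entropy in one numerically stable op, and its `pos_weight` argument is the standard hook for class-imbalance weighting.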
## Performance

- Final F1 score: 0.70198 on the test set
- Ranking: 560 out of 1,397 teams
- Key training metrics:
  - Best validation loss: 0.086472
  - Best validation F1 score: 0.72935
  - Convergence after 15,999 steps with a minibatch size of 64
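Since the competition metric is F1 on a rare positive class, the decision threshold on the sigmoid outputs matters as much as the raw probabilities. A hedged sketch of a threshold sweep on synthetic validation data (the labels and probabilities below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Hypothetical validation labels and predicted probabilities
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, size=1000), 0.0, 1.0)

# Sweep candidate thresholds and keep the one maximizing F1
thresholds = np.linspace(0.1, 0.9, 81)
scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(best_t, max(scores))
```

Picking the threshold on held-out validation data (rather than defaulting to 0.5) is a common, cheap way to gain F1 on imbalanced problems like this one.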
## Implementation Details

The model implementation is divided into several modules:

- `configurations.py`: Configuration parameters for training and inference
- `dataset.py`: Data loading and preprocessing pipeline
- `model.py`: Model architecture definition
- `trainer.py`: Training loop with validation logic
- `inference.py`: Prediction functionality
- `main.py`: Entry point with CLI interface
- Handling Class Imbalance: Implemented weighted loss functions to address the imbalance between sincere and insincere questions.
- Exploding Gradients: Identified and mitigated exploding gradients during training with gradient clipping (max norm of 5000).
- Optimization: Dynamic learning rate scheduling to improve model convergence and performance.
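The optimizer setup from the points above (AdamW, gradient-norm clipping, exponential learning-rate decay) can be sketched in a few lines. A tiny linear model stands in for the RoBERTa classifier, and the learning rate and decay factor are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the RoBERTa classifier
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
# Exponential decay: lr <- lr * gamma on every scheduler.step()
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(3):
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,)).float()
    optimizer.zero_grad()
    loss = loss_fn(model(x).squeeze(-1), y)
    loss.backward()
    # Clip the total gradient norm before the update (this README uses max norm 5000)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5000.0)
    optimizer.step()
    scheduler.step()

final_lr = optimizer.param_groups[0]["lr"]  # 2e-5 * 0.95**3
```

`clip_grad_norm_` rescales all gradients when their combined norm exceeds `max_norm`, which bounds the worst-case update size without distorting typical steps.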
## Running the Project

Requirements:

```
torch>=1.8.0
transformers>=4.5.0
pandas>=1.2.0
numpy>=1.19.0
scikit-learn>=0.24.0
rich>=10.0.0
wandb>=0.12.0  # optional
```
Train the model:

```
python main.py --mode train
```

Run inference on a CSV of questions:

```
python main.py --mode inference --data_path test.csv
```

## Future Improvements

- Experiment with ensemble methods combining multiple pre-trained language models



