- Project Overview
- Dataset
- Solution Approach
- Performance
- Training Visualization
- Implementation Details
- Running the Project
- Future Improvements
## Project Overview

This repository contains my solution for the Kaggle competition Quora Insincere Questions Classification. The challenge focuses on identifying and filtering out toxic and misleading questions on the Quora platform.
Quora is a platform that empowers people to learn from each other through asking questions and connecting with others who provide unique insights. A significant challenge for the platform is to filter out "insincere" questions—those founded upon false premises or intended to make statements rather than seek genuine answers.
An insincere question may have these characteristics:
- Non-neutral or exaggerated tone targeting groups of people
- Rhetorical questions implying statements about groups
- Disparaging or inflammatory content
- Discriminatory ideas or stereotype confirmation
- Attacks against specific individuals or groups
- Outlandish premises about groups of people
- Disparagement of immutable characteristics
- Questions not grounded in reality or based on false information
- Sexual content used for shock value rather than seeking genuine answers
## Dataset

The dataset consists of over 1.3 million questions with binary labels indicating whether a question is sincere (0) or insincere (1).

- Training set: 1,306,122 questions
- Test set: 56,370 questions
- Features:
  - `qid`: Unique question identifier
  - `question_text`: The text of the Quora question
  - `target`: Binary label (1 for insincere, 0 for sincere)
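Because insincere questions make up only a small fraction of the labels, a stratified train/validation split keeps the class ratio identical in both splits. A minimal sketch with scikit-learn, using a hypothetical toy frame with the same columns as the competition's `train.csv`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame mirroring train.csv's columns (values are made up)
df = pd.DataFrame({
    "qid": [f"q{i}" for i in range(100)],
    "question_text": ["example question?"] * 100,
    "target": [1 if i < 10 else 0 for i in range(100)],  # 10% positive
})

# stratify=df["target"] preserves the insincere ratio in both splits
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

print(train_df["target"].mean(), val_df["target"].mean())  # both 0.1
```

Without `stratify`, a random split of a heavily imbalanced dataset can leave the validation set with a noticeably different positive rate, which distorts the F1 estimate.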
## Solution Approach

I implemented a deep learning approach in PyTorch, fine-tuning RoBERTa (FacebookAI/roberta-base) as the base model. The architecture includes:
- RoBERTa pre-trained language model for feature extraction
- A custom classification head with dropout for regularization
- Binary cross-entropy loss with logits for handling class imbalance
- Utilized stratified sampling to maintain class distribution
- Implemented gradient clipping to prevent exploding gradients
- Used AdamW optimizer with learning rate scheduling
- Applied exponential learning rate decay to optimize convergence
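The classification head and loss described above can be sketched as follows. The hidden size (768 for roberta-base), the dropout probability, and the `pos_weight` value are illustrative assumptions; in the real model the random tensor would be replaced by RoBERTa's `last_hidden_state`:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Dropout + linear layer over the first-token (<s>) embedding."""
    def __init__(self, hidden_size=768, dropout_prob=0.3):  # values are assumptions
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state):
        cls = last_hidden_state[:, 0, :]           # RoBERTa's <s> token embedding
        return self.out(self.dropout(cls)).squeeze(-1)

head = ClassificationHead()
# Stand-in for roberta(input_ids, attention_mask).last_hidden_state
hidden = torch.randn(4, 64, 768)                   # (batch, seq_len, hidden)
logits = head(hidden)                              # shape: (4,)

# BCE with logits; pos_weight upweights the rare insincere class
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([15.0]))  # ratio is illustrative
loss = loss_fn(logits, torch.tensor([0.0, 1.0, 0.0, 0.0]))
```

`BCEWithLogitsLoss` combines the sigmoid and the binary cross-entropy in one numerically stable op, and its `pos_weight` argument is the standard hook for class-imbalance weighting.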
## Performance

- Final F1 score: 0.70198 on the test set
- Ranking: 560 out of 1,397 teams
- Key training metrics:
  - Best validation loss: 0.086472
  - Best validation F1 score: 0.72935
  - Convergence after 15,999 steps with a minibatch size of 64
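Since the competition metric is F1 on a rare positive class, the decision threshold on the sigmoid outputs matters as much as the raw probabilities. A hedged sketch of a threshold sweep on synthetic validation data (the labels and probabilities below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Hypothetical validation labels and predicted probabilities
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, size=1000), 0.0, 1.0)

# Sweep candidate thresholds and keep the one maximizing F1
thresholds = np.linspace(0.1, 0.9, 81)
scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(best_t, max(scores))
```

Picking the threshold on held-out validation data (rather than defaulting to 0.5) is a common, cheap way to gain F1 on imbalanced problems like this one.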
## Implementation Details

The model implementation is divided into several modules:

- `configurations.py`: Configuration parameters for training and inference
- `dataset.py`: Data loading and preprocessing pipeline
- `model.py`: Model architecture definition
- `trainer.py`: Training loop with validation logic
- `inference.py`: Prediction functionality
- `main.py`: Entry point with CLI interface
- Handling Class Imbalance: Implemented weighted loss functions to address the imbalance between sincere and insincere questions.
- Exploding Gradients: Identified and mitigated exploding gradients during training with gradient clipping (max norm of 5000).
- Optimization: Dynamic learning rate scheduling to improve model convergence and performance.
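The optimizer setup from the points above (AdamW, gradient-norm clipping, exponential learning-rate decay) can be sketched in a few lines. A tiny linear model stands in for the RoBERTa classifier, and the learning rate and decay factor are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the RoBERTa classifier
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
# Exponential decay: lr <- lr * gamma on every scheduler.step()
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(3):
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,)).float()
    optimizer.zero_grad()
    loss = loss_fn(model(x).squeeze(-1), y)
    loss.backward()
    # Clip the total gradient norm before the update (this README uses max norm 5000)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5000.0)
    optimizer.step()
    scheduler.step()

final_lr = optimizer.param_groups[0]["lr"]  # 2e-5 * 0.95**3
```

`clip_grad_norm_` rescales all gradients when their combined norm exceeds `max_norm`, which bounds the worst-case update size without distorting typical steps.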
## Running the Project

Requirements:

```
torch>=1.8.0
transformers>=4.5.0
pandas>=1.2.0
numpy>=1.19.0
scikit-learn>=0.24.0
rich>=10.0.0
wandb>=0.12.0  # optional
```
Train the model:

```
python main.py --mode train
```

Run inference on a CSV of questions:

```
python main.py --mode inference --data_path test.csv
```

## Future Improvements

- Experiment with ensemble methods combining multiple pre-trained language models



