Detect fake vs real news articles using Machine Learning, TF-IDF, and Logistic Regression, complete with training scripts, evaluation charts, and an interactive Streamlit web app.
- Overview
- Demo
- Project Structure
- Installation
- Dataset
- Training the Model
- Evaluation & Charts
- How It Works
- Running the Streamlit App
- Code Modules
- Technologies Used
- License
- Author
- Future Improvements
The Fake News & Misinformation Detector is a complete end-to-end Natural Language Processing (NLP) project that classifies news headlines and articles as REAL or FAKE.
It combines TF-IDF feature extraction with a Logistic Regression classifier, achieving perfect accuracy on the cleaned dataset.
The project also includes:
- Model evaluation with visual charts
- Interactive Streamlit web app
- Reusable and modular code structure
When launched, the app allows you to paste or type any news headline or paragraph and analyze its credibility in real time.
- Prediction: REAL or FAKE
- Probability bar visualization
- Adjustable fake-detection threshold
fake-news-detector/
│
├── data/
│ ├── True.csv # Real news (999 rows)
│ ├── Fake.csv # Fake news (999 rows)
│
├── outputs/
│ ├── model.joblib # Trained Logistic Regression model
│ ├── vectorizer.joblib # TF-IDF vectorizer
│ ├── pipeline.joblib # Combined pipeline (optional)
│ ├── metrics.json # Model performance report
│ ├── confusion_matrix.png # Confusion Matrix plot
│ ├── roc_curve.png # ROC curve plot
│ └── pr_curve.png # Precision-Recall curve plot
│
├── src/
│ ├── text_clean.py # Text preprocessing utilities
│ ├── utils.py # I/O helpers
│ ├── train_model.py # Training and evaluation script
│ ├── detect_fake_news.py # CLI prediction script
│ └── streamlit_app.py # Streamlit web application
│
└── README.md
git clone https://github.com/yourusername/fake-news-detector.git
cd fake-news-detectorpip install -r requirements.txtOr install manually:
pip install pandas numpy scikit-learn matplotlib streamlit joblib| File | Type | Rows | Columns |
|---|---|---|---|
True.csv |
Real news | 999 | title, text, subject, date |
Fake.csv |
Fake news | 999 | title, text, subject, date |
Dataset Source:
This project uses and modifies the Fake and Real News Dataset by Clément Bisaillon (Kaggle).
Data was cleaned, header-fixed, and downsampled to 999 REAL and 999 FAKE news articles for balanced training and clear visualization.
Used purely for educational and research purposes.
Run the following command from the project root:
python src/train_model.py --real data/True.csv --fake data/Fake.csv --text-col text --outdir outputsThis script will:
- Load both datasets (real and fake).
- Clean and merge them using
text_clean.py. - Extract TF-IDF features.
- Train a Logistic Regression classifier.
- Save outputs:
outputs/model.jobliboutputs/vectorizer.jobliboutputs/metrics.json- Performance charts (
confusion_matrix.png,roc_curve.png,pr_curve.png)
After training, the model achieves perfect classification accuracy on this dataset.
| True Label | Predicted REAL | Predicted FAKE |
|---|---|---|
| REAL | 999 ✅ | 0 ❌ |
| FAKE | 0 ❌ | 999 ✅ |
The model correctly classified all 1,998 samples.
The ROC curve touches the top-left corner AUC = 1.00
Perfect separability between classes.
Both precision and recall reach 1.00, meaning zero false predictions.
| Metric | Value |
|---|---|
| Accuracy | 100 % |
| Precision (FAKE) | 1.00 |
| Recall (FAKE) | 1.00 |
| F1-Score | 1.00 |
| ROC-AUC | 1.00 |
Although perfect accuracy is achieved on this dataset, it’s a controlled sample. Real-world news data will naturally introduce noise and uncertainty.
- Text Cleaning → Remove punctuation, URLs, emails, non-ASCII chars.
- TF-IDF Vectorization → Convert words into weighted numerical features.
- Logistic Regression → Predict probability of “FAKE” label.
- Thresholding → If
p(fake) ≥ 0.5→ FAKE, else REAL.
python src/detect_fake_news.py --model outputs/model.joblib --vectorizer outputs/vectorizer.joblib --text "It s tough sometimes to imagine that Donald Trump has five children since it s clear from Monday s speech in front of 40,000 Boy Scouts and other attendees at the Boy Scouts Jamboree in West Virginia that he has absolutely no idea what kind of talk is appropriate for children.While most adults would take this opportunity to offer some pearls of adult wisdom or cheerlead the Boy Scouts toward their futures, Trump chose to deliver a tirade of Trumpisms.Like almost any time Trump has tried to string together more than a couple of words at a time, most of his speech was an inarticulate mess which consisted of his trademark whining, a wee bit of swearing and a pointless anecdote about a burned out rich guy at a cocktail party."Output:
Label: FAKE | Fake probability: 0.560 | Threshold: 0.40
streamlit run src/streamlit_app.pyThen open the local web interface:
http://localhost:8501
- Paste any headline or paragraph
- Analyze with one click
- Adjust FAKE probability threshold
- See model file locations and loaded status in sidebar
| Module | Purpose |
|---|---|
text_clean.py |
Handles text normalization (lowercasing, regex-based cleaning) |
utils.py |
Ensures output directories exist and handles JSON I/O |
train_model.py |
Loads data, trains the model, and generates metrics and plots |
detect_fake_news.py |
CLI script for predicting individual samples |
streamlit_app.py |
Streamlit web app for interactive user testing |
- Python 3.10+
- scikit-learn → TF-IDF Vectorizer, Logistic Regression
- pandas / numpy → Data manipulation
- matplotlib → Model visualization
- joblib → Model persistence
- Streamlit → Web interface
- Integrate BERT / DistilBERT for contextual language understanding
- Extend dataset for multi-language fake news detection
- Add Explainable AI (LIME / SHAP) for model transparency
- Deploy live on Streamlit Cloud or Hugging Face Spaces