- Node.js (for frontend)
- Python 3.13+ (for backend)
- uv (recommended) or pip for Python package management
First, crawl the publication data from the source website:

```bash
cd backend/search_engine
python crawler.py
```

This will:

- Crawl publication data from Coventry University's research portal
- Respect robots.txt and implement polite crawling delays
- Extract publication titles, authors, abstracts, and dates
- Save the data to `backend/search_engine/data/crawled_data.json`
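As a rough illustration of the polite-crawling pattern described above, the sketch below checks robots.txt and sleeps between requests. The portal URL, delay value, and output record shape are assumptions for illustration; the actual logic lives in `crawler.py`.

```python
# Minimal sketch of the polite-crawling pattern; URLs, delay, and record
# shape are assumptions -- see crawler.py for the real implementation.
import json
import time
from urllib import robotparser

import requests

BASE_URL = "https://pureportal.coventry.ac.uk"  # assumed portal root
CRAWL_DELAY_SECONDS = 2                         # assumed polite delay

robots = robotparser.RobotFileParser(BASE_URL + "/robots.txt")
robots.read()

def fetch(url: str) -> str | None:
    """Fetch a page only if robots.txt allows it, then wait politely."""
    if not robots.can_fetch("*", url):
        return None
    response = requests.get(url, timeout=30)
    time.sleep(CRAWL_DELAY_SECONDS)
    response.raise_for_status()
    return response.text

# Each crawled publication would be assembled into a record like this and
# written to the JSON file consumed by the indexer.
record = {"title": "...", "authors": ["..."], "abstract": "...", "date": "..."}
with open("data/crawled_data.json", "w", encoding="utf-8") as f:
    json.dump([record], f, indent=2)
```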
After crawling, build the search indexes:

```bash
cd backend/search_engine
python indexer.py
```

This will:

- Process the crawled data and create field-based positional indexes
- Build TF-IDF matrices for titles, authors, and abstracts
- Save the indexes to `backend/search_engine/data/index.joblib`
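The sketch below shows one way such per-field TF-IDF matrices could be built with scikit-learn and persisted with joblib. The field names follow the list above, but the exact index layout is an assumption, not the structure `indexer.py` actually produces.

```python
# Minimal sketch of per-field TF-IDF indexing; the index layout is an
# assumption -- see indexer.py for the real structure.
import json

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

with open("data/crawled_data.json", encoding="utf-8") as f:
    docs = json.load(f)

index = {}
for field in ("title", "authors", "abstract"):
    # Join author lists into strings; other fields are used as-is.
    texts = [
        " ".join(d[field]) if isinstance(d.get(field), list) else str(d.get(field, ""))
        for d in docs
    ]
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    matrix = vectorizer.fit_transform(texts)  # one TF-IDF matrix per field
    index[field] = {"vectorizer": vectorizer, "matrix": matrix}

joblib.dump(index, "data/index.joblib")
```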
Train the document classification model:

```bash
cd backend/classification
python classifier.py
```

This will:

- Load training data from `backend/classification/data/cleaned_data.csv`
- Train a Naive Bayes classifier with TF-IDF features
- Evaluate the model using K-fold cross-validation
- Save the trained model to `backend/classification/data/document_classifier.joblib`
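For reference, a minimal version of this training flow with scikit-learn might look like the following. The CSV column names and the choice of k are assumptions, not the settings used in `classifier.py`.

```python
# Minimal sketch of TF-IDF + Naive Bayes training with K-fold evaluation.
# Column names ("text", "label") and k=5 are assumptions.
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

data = pd.read_csv("data/cleaned_data.csv")
X, y = data["text"], data["label"]

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())

# K-fold cross-validation (k=5 here as an example)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.3f}")

# Fit on the full data set and persist the trained pipeline
model.fit(X, y)
joblib.dump(model, "data/document_classifier.joblib")
```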
Install the Python dependencies:

```bash
# Using pip
pip install -r backend/requirements.txt

# Or using uv (recommended)
uv pip install -r backend/requirements.txt
```

Install the frontend dependencies:

```bash
npm install
```

Run the backend:

```bash
# Using the FastAPI CLI
fastapi dev backend/api.py

# Or using uv (recommended)
uv run fastapi dev backend/api.py
```

The backend will be available at http://127.0.0.1:8000, with interactive API docs at http://127.0.0.1:8000/docs.
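Once the backend is running, you can confirm it is reachable by fetching the OpenAPI schema that FastAPI serves by default at `/openapi.json`. This assumes nothing about the project's own route names; it simply lists whatever the running app exposes.

```python
# Quick sanity check against the running backend using FastAPI's default
# /openapi.json endpoint; prints whatever routes the app exposes.
import requests

schema = requests.get("http://127.0.0.1:8000/openapi.json", timeout=5).json()
print("Available endpoints:")
for path, methods in schema["paths"].items():
    print(f"  {', '.join(m.upper() for m in methods)}  {path}")
```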
Run the frontend:

```bash
npm run dev
```

The frontend will be available at http://localhost:5173/.
- Crawl Data: `python backend/search_engine/crawler.py`
- Build Index: `python backend/search_engine/indexer.py`
- Train Classifier: `python backend/classification/classifier.py`
- Install Dependencies: `pip install -r backend/requirements.txt` and `npm install`
- Run Backend: `fastapi dev backend/api.py`
- Run Frontend: `npm run dev`
- Web Crawling: Automated data collection from research publications
- Search Engine: Field-based search with TF-IDF ranking and phrase queries
- Document Classification: Automatic categorization of research documents
- Interactive Web Interface: Modern React-based frontend for search and exploration
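To make the phrase-query idea concrete, here is a toy positional-index matcher for a single field. It is a simplified illustration only, not the project's actual index structure or TF-IDF ranking code.

```python
# Toy illustration of phrase matching against a positional index
# (not the project's actual data structures).
from collections import defaultdict

def build_positional_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map each term to {doc_id: [positions]} for one field."""
    index: dict[str, dict[int, list[int]]] = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_match(index, phrase: str) -> set[int]:
    """Return doc ids where the phrase's terms appear at consecutive positions."""
    terms = phrase.lower().split()
    results = set()
    for doc_id in set(index.get(terms[0], {})):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index.get(t, {}).get(doc_id, [])
                   for i, t in enumerate(terms)):
                results.add(doc_id)
                break
    return results

titles = {1: "Deep learning for medical imaging", 2: "Learning deep representations"}
index = build_positional_index(titles)
print(phrase_match(index, "deep learning"))  # {1}
```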
```
backend/
├── search_engine/
│   └── data/
│       ├── crawled_data.json           # Raw crawled publication data
│       └── index.joblib                # Search indexes and TF-IDF models
└── classification/
    └── data/
        ├── cleaned_data.csv            # Training data for classifier
        └── document_classifier.joblib  # Trained classification model
```