A robust asynchronous web scraper that extracts manga information from AnimeClick.it. Built with Python, it handles pagination, tag-based browsing, and detailed manga information extraction while respecting the site's resources. It uses Datpulse for proxy management and anti-detection measures.
- 🚀 Asynchronous web scraping with AsyncWebCrawler
- 📚 Complete manga information extraction
- 🏷️ Tag-based manga categorization
- 💾 Structured JSON and CSV output
- 🤖 Anti-bot detection measures via Datpulse proxies
- ⏱️ Rate limiting and polite crawling
- 🔄 Automatic deduplication of manga entries
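The deduplication step can be pictured roughly as follows. This is a minimal sketch, not the project's actual code: it assumes entries are dictionaries keyed by `href` (the shape used in the tag output) and that the same manga may appear under several tags. The name `dedupe_manga` is hypothetical.

```python
from typing import Iterable


def dedupe_manga(entries: Iterable[dict]) -> list[dict]:
    """Return entries with repeated hrefs removed, keeping the first occurrence."""
    seen: set[str] = set()
    unique: list[dict] = []
    for entry in entries:
        href = entry.get("href")
        if not href or href in seen:
            continue  # skip entries without a link and entries already collected
        seen.add(href)
        unique.append(entry)
    return unique
```

Keeping the first occurrence preserves the original crawl order, which makes the resulting list stable across runs as long as tags are visited in the same order.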
- Clone the repository:
    git clone https://github.com/yourusername/animeclick-manga-scraper.git
    cd animeclick-manga-scraper
    uv sync
- Create and activate a virtual environment:
    python -m venv .venv
    source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Configure environment variables:
    cp .env.example .env
    # Edit .env with your Datpulse API key and other settings

python main.py
python dataset_maker.py
python recommendation.py
This script provides three types of analysis for manga plot summaries:
- Genre Classification
  - Uses the Jiva/xlm-roberta-large-it-mnli model for zero-shot classification
  - Classifies manga into multiple genres
  - Shows top 3 most likely genres with confidence scores
- Emotion Analysis (MilaNLProc)
  - Uses the MilaNLProc/feel-it-italian-emotion model
  - Analyzes emotions: anger, fear, joy, sadness
  - Provides confidence scores as percentages
- Extended Emotion Analysis (aiknowyou)
  - Uses the aiknowyou/it-emotion-analyzer model
  - Analyzes six emotions: sadness, joy, love, anger, fear, surprise
  - Shows top 3 emotions with confidence scores
All analyses are color-coded in the terminal output for better readability using the colorama package.
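To illustrate the "top 3 with confidence scores" step, here is a small sketch. It assumes the genre model is called through a Hugging Face `transformers` zero-shot pipeline, whose result is a dictionary with parallel `labels` and `scores` lists; the README does not confirm how the script invokes the model, and `top_k` is a hypothetical helper.

```python
def top_k(result: dict, k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-scoring labels as (label, percentage) pairs."""
    # Pair each label with its score and sort by score, highest first.
    pairs = sorted(
        zip(result["labels"], result["scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Convert probabilities to percentages rounded to one decimal place.
    return [(label, round(score * 100, 1)) for label, score in pairs[:k]]
```

Sorting explicitly, rather than trusting the input order, keeps the helper usable for outputs that do not arrive pre-sorted.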
{
"tag": "robot",
"tag_url": "/manga/tags/robot",
"extraction_date": "2024-02-27T12:00:00",
"manga_count": 42,
"manga_list": [
{
"href": "/manga/12345/manga-title",
"title": "Manga Title"
}
]
}

{
"url": "https://www.animeclick.it/manga/12345/manga-title",
"extraction_date": "2024-02-27T12:00:00",
"details": {
"titolo_originale": "Original Japanese Title",
"titolo_inglese": "English Title",
"titolo_kanji": "漢字タイトル",
"nazionalita": "Giappone",
"casa_editrice": "Publisher Name",
"storia": "Story Author",
"disegni": "Artist Name",
"categorie": ["Shounen", "Seinen"],
"generi": ["Action", "Adventure"],
"anno": "2024",
"volumi": "10",
"capitoli": "42",
"stato_patria": "completato",
"stato_italia": "inedito",
"serializzato_su": "Magazine Name",
"trama": "Plot summary of the manga..."
}
}

{
"url": "https://www.animeclick.it/manga/59956/1-2-3-de-kimeteageru",
"titolo_originale": "Original Japanese Title",
"titolo_inglese": "English Title",
"titolo_kanji": "漢字タイトル",
"nazionalita": "Giappone",
"casa_editrice": "Publisher Name",
"storia": "Story Author",
"disegni": "Artist Name",
"categorie": ["Shounen", "Seinen"],
"anno": "2024",
"volumi": "10",
"capitoli": "42",
"stato_patria": "completato",
"stato_italia": "inedito",
"serializzato_su": "Magazine Name",
"trama": "Plot summary of the manga...",
"generi": [
"Scolastico",
"Sport"
],
"tag_generici": [
"club-scolastico",
"Wrestling"
]
}

The final dataset files contain the following fields for each manga:
- url: Unique identifier and source URL
- titolo_originale: Original title
- titolo_inglese: English title
- titolo_kanji: Title in kanji
- nazionalita: Nationality/origin
- casa_editrice: Publisher
- storia: Story author
- disegni: Artist
- anno: Year of publication
- stato_patria: Status in original country
- stato_italia: Status in Italy
- serializzato_su: Serialization magazine
- trama: Plot summary
- generi: List of genres
- tag_generici: List of generic tags
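One way to get from a per-manga detail record (with its nested `details` object) to a flat dataset row is sketched below. This is an illustration under assumptions, not the code in `dataset_maker.py`: the function names, the `"; "` separator for list fields, and the idea that tags are passed in separately are all hypothetical.

```python
import csv
import io

# Field order taken from the list above.
FIELDS = [
    "url", "titolo_originale", "titolo_inglese", "titolo_kanji",
    "nazionalita", "casa_editrice", "storia", "disegni", "anno",
    "stato_patria", "stato_italia", "serializzato_su", "trama",
    "generi", "tag_generici",
]


def flatten_record(record: dict, tags: list[str]) -> dict:
    """Lift the nested details to the top level and attach the tag list."""
    details = record.get("details", {})
    row = {"url": record.get("url")}
    for name in FIELDS[1:-1]:
        row[name] = details.get(name)  # missing fields become None
    row["tag_generici"] = list(tags)
    return row


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text, joining list-valued fields with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {k: "; ".join(v) if isinstance(v, list) else v for k, v in row.items()}
        )
    return buffer.getvalue()
```

The same rows can be written as JSON with `json.dump(rows, fp, ensure_ascii=False)`, which keeps kanji titles readable in the output file.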
- Headless mode for efficient operation
- Datpulse proxy integration for IP rotation
- Anti-detection measures
- Configurable viewport settings
- Precise CSS selectors for data extraction
- Data cleaning and transformation
- Null value handling
- Error recovery mechanisms
- 2-second delay between manga requests
- 1-second delay between tag requests
- Configurable delay settings
- Proxy rotation via Datpulse
- Network error recovery
- JSON validation
- Empty result detection
- Continuous operation on individual failures
- Proxy fallback mechanisms
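The combination of fixed delays, retry on network errors, empty result detection, and continuing past individual failures could look roughly like the loop below. It is a sketch under assumptions: `fetch` stands in for whatever coroutine actually downloads a page, and `crawl_politely`, the retry count, and the growing back-off are hypothetical choices rather than the project's settings.

```python
import asyncio
from typing import Awaitable, Callable


async def crawl_politely(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    delay: float = 2.0,   # seconds between requests (2 s for manga pages above)
    retries: int = 2,     # extra attempts after the first failure (assumed value)
) -> tuple[dict[str, str], list[str]]:
    """Fetch URLs one by one, pausing between requests and surviving failures."""
    results: dict[str, str] = {}
    failed: list[str] = []
    for index, url in enumerate(urls):
        if index:
            await asyncio.sleep(delay)  # polite pause before every request but the first
        for attempt in range(1, retries + 2):
            try:
                data = await fetch(url)
                if not data:
                    raise ValueError("empty result")  # treat empty pages as failures
                results[url] = data
                break
            except Exception:
                if attempt > retries:
                    failed.append(url)  # give up on this URL, keep crawling the rest
                else:
                    await asyncio.sleep(delay * attempt)  # back off before retrying
    return results, failed
```

Returning the failed URLs separately makes it easy to rerun only those entries later instead of restarting the whole crawl.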
The following environment variables are required in the .env file:
PROXY_USERNAME=YOUR USER
PROXY_PASSWORD=YOUR PASSWORD
PROXY_ADDRESS=gw.dataimpulse.com
PROXY_PORT=823
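These four values are typically combined into a single proxy URL. The sketch below shows one way to do that with the standard library. It assumes the variables have already been loaded into the process environment (for example by a dotenv loader) and that the proxy speaks plain HTTP; neither is stated in this README, and `proxy_url_from_env` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote

REQUIRED = ("PROXY_USERNAME", "PROXY_PASSWORD", "PROXY_ADDRESS", "PROXY_PORT")


def proxy_url_from_env(env: Optional[Mapping[str, str]] = None) -> str:
    """Build an http://user:pass@host:port proxy URL from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Percent-encode credentials so characters like '@' or spaces stay valid in a URL.
    user = quote(env["PROXY_USERNAME"], safe="")
    password = quote(env["PROXY_PASSWORD"], safe="")
    # The http:// scheme is an assumption; adjust it if the provider requires another.
    return f"http://{user}:{password}@{env['PROXY_ADDRESS']}:{env['PROXY_PORT']}"
```

Failing fast on missing variables surfaces configuration mistakes before the first request is sent.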
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
This scraper is for educational purposes only. Please respect AnimeClick.it's terms of service and robots.txt when using this tool.