Reddit topic scraper using the Piloterr Website Crawler API.
This project is a modular, production-ready web scraping pipeline for Reddit using the Piloterr API. It is built to support high-volume data extraction for use cases like:
- NLP / LLM training
- Sentiment analysis
- AI dataset generation
- Reddit trend monitoring
Even for large-scale scraping, Piloterr avoids blocking issues by leveraging a vast proxy pool and robust anti-bot bypass mechanisms.
This tool covers a complete Reddit scraping workflow:
```python
# Step 1: Get all Reddit topics
topics = scrape_all()

# Step 2: Scrape posts from a topic page
posts = scrape_post("https://www.reddit.com/t/a_bird_story/", scroll=2)

# Step 3: Scrape comments from a specific post
comments = scrape_comment("https://www.reddit.com/r/movies/comments/aa1vas/mr_rogers_biopic_starring_tom_hanks_officially/")
```

The output can be stored in CSVs or JSON-like structures, making it directly usable for analytics, model fine-tuning, or database integration.
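As an illustration, here is a minimal sketch of persisting the `posts` list from Step 2 (the `output/posts.*` paths are placeholders chosen for this example, not part of the library):

```python
import csv
import json
import os

os.makedirs("output", exist_ok=True)

# JSON keeps the post dicts exactly as returned
with open("output/posts.json", "w", encoding="utf-8") as f:
    json.dump(posts, f, ensure_ascii=False, indent=2)

# CSV gives one flat row per post (assumes posts is non-empty)
with open("output/posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=posts[0].keys())
    writer.writeheader()
    writer.writerows(posts)
```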
Install the dependencies:

```bash
pip install requests beautifulsoup4
```

Copy the example credentials:

```bash
cp credential.exemple.py credential.py
```

Edit credential.py and paste your API key (visit piloterr.com if you don't have one):

```python
x_api_key = "your_actual_api_key_here"
```

Then run:

```bash
python main.py
```

This runs the full pipeline:
- Extracts all Reddit topics
- Saves them to output/all_reddit_topics.csv
- Scrapes sample posts and their comments
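Put together, the pipeline is roughly the following sketch. The module names `topic_scraper`, `post_scraper`, and `comment_scraper` are assumptions for illustration; the function names come from this README, and the actual main.py may be organized differently:

```python
# Hypothetical condensation of main.py; the real module layout may differ.
from topic_scraper import scrape_all, save_csv  # assumed module name
from post_scraper import scrape_post            # assumed module name
from comment_scraper import scrape_comment      # assumed module name

topics = scrape_all()   # {"Movies": "/t/movies/", ...}
save_csv(topics)        # -> output/all_reddit_topics.csv

# Sample one topic, then drill into its posts and their comments
posts = scrape_post("https://www.reddit.com/t/science/", scroll=2)
for post in posts[:3]:
    result = scrape_comment(post["link"])
    print(post["title"], "->", len(result["comment_details"]), "comments")
```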
The scraper uses two Piloterr fetching modes:

- Basic HTML fetcher (no JS rendering), used for static pages like the Reddit Topics directory
- Full browser simulation (scrolling included), ideal for dynamic content: post lists and comment trees

Scroll settings:

- Use `scroll=0` for fast loading (e.g., post headers only)
- Use `scroll≥2` for full post feeds and deep comment threads
- A wait time helps ensure the page fully loads before parsing
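Under the hood, each fetch is a plain HTTP GET against Piloterr with the API key from credential.py sent in a header. A minimal sketch of the basic-fetch case (the endpoint path and parameter names should be verified against the current Piloterr docs):

```python
import requests

from credential import x_api_key

# Fetch a static page through Piloterr's crawler endpoint (no JS rendering)
response = requests.get(
    "https://piloterr.com/api/v2/website/crawler",  # verify path in Piloterr docs
    headers={"x-api-key": x_api_key},
    params={"query": "https://www.reddit.com/topics/"},
    timeout=60,
)
response.raise_for_status()
html = response.text  # raw HTML, ready for BeautifulSoup
```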
The topic scraper extracts all categorized topics from reddit.com/topics.
| Function | Description | 
|---|---|
| get_letter_pages() | Lists A-Z topic index pages | 
| get_subpages() | Fetches pagination for each letter section | 
| scrape_topics() | Extracts topics + links from one page | 
| scrape_all() | Full extraction pipeline across all pages | 
| save_csv() | Stores topic list to output/all_reddit_topics.csv | 
Each topic is returned as a name-to-link pair, e.g. `{"Movies": "/t/movies/"}`.
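A plausible sketch of the parsing inside scrape_topics, assuming topic links all start with `/t/` (as the sample pair above suggests); the real selectors may differ:

```python
from bs4 import BeautifulSoup

def scrape_topics(html: str) -> dict:
    """Extract {topic_name: topic_link} pairs from one topics index page."""
    soup = BeautifulSoup(html, "html.parser")
    topics = {}
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("/t/"):  # topic links look like /t/movies/
            topics[a.get_text(strip=True)] = a["href"]
    return topics
```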
The post scraper extracts post data from a topic link like https://www.reddit.com/t/science/.
Returns a list of post dicts:

```json
{
  "title": "How AI is Changing Healthcare",
  "author": "TechNerd23",
  "link": "https://www.reddit.com/r/Health/comments/abc123/",
  "date": "1681234567",
  "comment_count": "45",
  "score": "210"
}
```

🧠 Best used with `scroll=2` or higher for infinite feeds.
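Note that every field comes back as a string. A small normalization pass makes the data analysis-ready (this assumes `date` is a Unix timestamp, which the sample value suggests; `normalize_post` is a helper written for this example):

```python
from datetime import datetime, timezone

def normalize_post(post: dict) -> dict:
    """Convert the string fields of a scraped post into native types."""
    return {
        **post,
        "date": datetime.fromtimestamp(int(post["date"]), tz=timezone.utc),
        "comment_count": int(post["comment_count"]),
        "score": int(post["score"]),
    }

clean_posts = [normalize_post(p) for p in posts]
```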
The comment scraper retrieves both post metadata and structured comment trees.
Returns:

```json
{
  "post_details": {
    "title": "...",
    "author": "...",
    "score": "...",
    ...
  },
  "comment_details": [
    {
      "author": "user1",
      "score": "12",
      "depth": "1",
      "content": ["Paragraph 1", "Paragraph 2"]
    }
  ]
}
```

Supports:
- Nested replies via `depth`
- Metadata like `parent_id`, `score`, `timestamp`
🧠 Ideal for NLP/sentiment studies or reply-chain reconstruction.
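For reply-chain reconstruction specifically, the flat `comment_details` list can be re-nested using the `depth` field. A sketch, assuming the list is in document order and top-level comments have depth "1" (as the sample above shows); `build_tree` is a helper written for this example:

```python
def build_tree(comments: list[dict]) -> list[dict]:
    """Nest a flat, depth-annotated comment list into a reply tree."""
    roots = []
    stack = []  # stack[i] holds the most recent comment at depth i + 1
    for comment in comments:
        node = {**comment, "replies": []}
        depth = int(comment["depth"])
        del stack[depth - 1:]  # discard siblings and deeper branches
        if stack:
            stack[-1]["replies"].append(node)  # attach to nearest ancestor
        else:
            roots.append(node)                 # top-level comment
        stack.append(node)
    return roots

tree = build_tree(comments["comment_details"])
```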