
🔍 Reddit Scraper using Piloterr API

A Reddit topic scraper built on the Piloterr Website Crawler API.

This project is a modular, production-ready web-scraping pipeline for Reddit built on the Piloterr API. It supports high-volume data extraction for use cases such as:

  • NLP / LLM training
  • Sentiment analysis
  • AI dataset generation
  • Reddit trend monitoring

Even at large scale, Piloterr avoids blocking by routing requests through a vast proxy pool and robust anti-bot bypass mechanisms.


1️⃣ Introduction & Basic Usage

This tool covers a complete Reddit scraping workflow:

# Step 1: Get all Reddit topics
topics = scrape_all()

# Step 2: Scrape posts from a topic page
posts = scrape_post("https://www.reddit.com/t/a_bird_story/", scroll=2)

# Step 3: Scrape comments from a specific post
comments = scrape_comment("https://www.reddit.com/r/movies/comments/aa1vas/mr_rogers_biopic_starring_tom_hanks_officially/")

The output can be stored in CSVs or JSON-like structures, making it directly usable for analytics, model fine-tuning, or database integration.
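For example, a minimal persistence sketch using only the standard library (it assumes the return shapes documented in section 3: posts as a list of flat dicts, comments as a nested dict):

import csv
import json

# Write the scraped posts to CSV (keys match the post dict shown in section 3c)
with open("output/posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "link", "date", "comment_count", "score"])
    writer.writeheader()
    writer.writerows(posts)

# Write a post's comment tree to JSON
with open("output/comments.json", "w", encoding="utf-8") as f:
    json.dump(comments, f, ensure_ascii=False, indent=2)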


2️⃣ Setup, Dependencies & Running main.py

Install dependencies

pip install requests beautifulsoup4

Add your Piloterr API key

Copy the example credentials:

cp credential.exemple.py credential.py

Edit credential.py and paste your API key (visit piloterr.com if you don't have one):

x_api_key = "your_actual_api_key_here"

Run the main script

python main.py

This runs the full pipeline:

  1. Extracts all Reddit topics
  2. Saves them to output/all_reddit_topics.csv
  3. Scrapes sample posts and their comments
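Condensed, the pipeline is roughly equivalent to the following sketch (function names come from the modules described in section 3; the sample URLs used by the actual main.py may differ):

from reddit_topics import scrape_all, save_csv
from reddit_posts import scrape_post
from reddit_comments import scrape_comment

# 1. Extract every topic listed on reddit.com/topics
topics = scrape_all()

# 2. Save them to output/all_reddit_topics.csv
save_csv(topics)

# 3. Scrape a sample topic feed, then the comments of its first post
posts = scrape_post("https://www.reddit.com/t/a_bird_story/", scroll=2)
if posts:
    comments = scrape_comment(posts[0]["link"])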

3️⃣ Function Breakdown

a) piloterr.py – API Integration

website_crawler(site_url)

  • Basic HTML fetcher (no JS rendering)
  • Used for static pages like the Reddit topics directory (reddit.com/topics)

website_rendering(site_url, wait_in_seconds, scroll)

  • Simulates a full browser (scrolling included)
  • Ideal for dynamic content: post lists, comment trees

🧠 Tips:

  • Use scroll=0 for fast loading (e.g., post headers only)
  • Use scroll≥2 for full post feeds and deep comment threads
  • Wait time helps ensure full page load before parsing
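As an illustration only, the two wrappers might look like the sketch below. The endpoint paths and parameter names (query, wait, scroll) are assumptions, not confirmed Piloterr API values; piloterr.py and the Piloterr docs are authoritative.

import requests

from credential import x_api_key

BASE_URL = "https://piloterr.com/api/v2"  # hypothetical base URL for this sketch

def website_crawler(site_url):
    # Plain HTML fetch, no JS rendering
    response = requests.get(
        f"{BASE_URL}/website/crawler",
        headers={"x-api-key": x_api_key},
        params={"query": site_url},  # parameter name is an assumption
    )
    response.raise_for_status()
    return response.text

def website_rendering(site_url, wait_in_seconds=2, scroll=0):
    # Headless-browser rendering with optional scrolling for dynamic pages
    response = requests.get(
        f"{BASE_URL}/website/rendering",
        headers={"x-api-key": x_api_key},
        params={"query": site_url, "wait": wait_in_seconds, "scroll": scroll},
    )
    response.raise_for_status()
    return response.text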

b) reddit_topics.py – Scraping Reddit Topics

Extracts all categorized topics from reddit.com/topics.

Functions:

Function             Description
get_letter_pages()   Lists the A-Z topic index pages
get_subpages()       Fetches the pagination for each letter section
scrape_topics()      Extracts topics + links from one page
scrape_all()         Runs the full extraction pipeline across all pages
save_csv()           Stores the topic list to output/all_reddit_topics.csv

Each topic is returned as a single name/link pair, e.g. { "Movies": "/t/movies/" }.
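A short usage sketch (it assumes scrape_all() returns a list of such single-pair dicts and that links are site-relative, as shown above):

from reddit_topics import scrape_all, save_csv

topics = scrape_all()  # e.g. [{"Movies": "/t/movies/"}, ...]
save_csv(topics)       # writes output/all_reddit_topics.csv

# Build absolute URLs to feed into scrape_post()
topic_urls = [
    "https://www.reddit.com" + link
    for topic in topics
    for link in topic.values()
]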


c) reddit_posts.py – Scraping Posts in a Topic

Extracts post data from a topic link like https://www.reddit.com/t/science/.

scrape_post(topic_url, wait_in_seconds, scroll)

Returns a list of post dicts:

{
  "title": "How AI is Changing Healthcare",
  "author": "TechNerd23",
  "link": "https://www.reddit.com/r/Health/comments/abc123/",
  "date": "1681234567",
  "comment_count": "45",
  "score": "210"
}

🧠 Best used with scroll=2 or higher for infinite feeds.
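Note that every field comes back as a string, and date looks like a Unix epoch. A small post-processing sketch (assuming exactly the dict shape above):

from datetime import datetime, timezone

from reddit_posts import scrape_post

posts = scrape_post("https://www.reddit.com/t/science/", wait_in_seconds=2, scroll=2)

for post in posts:
    # Cast the numeric strings and decode the epoch timestamp
    post["score"] = int(post["score"])
    post["comment_count"] = int(post["comment_count"])
    post["date"] = datetime.fromtimestamp(int(post["date"]), tz=timezone.utc)

# Example: keep only well-scored posts
top_posts = [p for p in posts if p["score"] >= 100]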


d) reddit_comments.py – Scraping Comments from a Post

Retrieves both post metadata and structured comment trees.

scrape_comment(post_url, wait_in_seconds, scroll)

Returns:

{
  "post_details": {
    "title": "...",
    "author": "...",
    "score": "...",
    ...
  },
  "comment_details": [
    {
      "author": "user1",
      "score": "12",
      "depth": "1",
      "content": ["Paragraph 1", "Paragraph 2"]
    }
  ]
}

Supports:

  • Nested replies via depth
  • Metadata like parent_id, score, timestamp

🧠 Ideal for NLP/sentiment studies or reply-chain reconstruction.
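For instance, a sketch that rebuilds the reply layout from depth (assuming the return shape above, with depth stored as a string):

from reddit_comments import scrape_comment

result = scrape_comment(
    "https://www.reddit.com/r/movies/comments/aa1vas/"
    "mr_rogers_biopic_starring_tom_hanks_officially/"
)

# Indent each comment by its depth to visualize the reply tree
for comment in result["comment_details"]:
    indent = "  " * int(comment["depth"])
    text = " ".join(comment["content"])
    print(f"{indent}{comment['author']} ({comment['score']}): {text}")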

