GitHub Stargazers Scraper and Email Finder

A set of Python scripts to scrape GitHub stargazers from a repository and optionally find their email addresses through various GitHub data sources.

Project Overview

This project consists of two main scripts:

github_stargazers.py: Scrapes GitHub stargazers' usernames and URLs from a repository
github_emails.py: Finds email addresses for GitHub users from various sources

Process Flow

Use github_stargazers.py to gather all stargazers from a GitHub repository
Use github_emails.py to find email addresses for the collected users

Requirements

Python 3.6 or higher
Required Python packages:
- requests
- beautifulsoup4

Installation

Clone this repository or download the script files.
Install the required Python packages:

pip install requests beautifulsoup4

Usage

Scraping Stargazers

python github_stargazers.py [repo_url] --output [output_file]

Arguments

repo_url: The URL of the GitHub stargazers page (e.g., https://github.com/username/repo/stargazers)
--output or -o: The name of the output CSV file (default: stargazers.csv)
--start or -s: Page to start scraping from (default: 1)
--retries or -r: Maximum number of retry attempts for failed requests (default: 3)

Examples

# Scrape stargazers from page 1
python github_stargazers.py https://github.com/browser-use/browser-use/stargazers

# Scrape stargazers starting from a specific page
python github_stargazers.py https://github.com/browser-use/browser-use/stargazers --start 3

# Scrape stargazers and save to a custom file
python github_stargazers.py https://github.com/browser-use/browser-use/stargazers --output browser_use_fans.csv
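
The --start and --retries options above can be pictured with the following minimal sketch. It is not the code in github_stargazers.py: the helper names `page_url` and `fetch_with_retries`, the `?page=N` query parameter, the doubling delay, and the reading of --retries as the total number of attempts are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def page_url(stargazers_url: str, page: int) -> str:
    """Build the URL for one results page (assumes a ?page=N query parameter)."""
    return f"{stargazers_url.rstrip('/')}?page={page}"


def fetch_with_retries(
    fetch: Callable[[str], Optional[str]],
    url: str,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url) up to max_retries times, doubling the delay after each failure."""
    for attempt in range(max_retries):
        try:
            result = fetch(url)
            if result is not None:
                return result
        except Exception:
            pass  # Any exception counts as a failed attempt in this sketch.
        if attempt < max_retries - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

Here `fetch` is any callable that returns the page body or None, for example a thin wrapper around `requests.get`. Injecting it keeps the retry logic easy to test without network access.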

Finding Email Addresses

Use the github_emails.py script to find email addresses for GitHub users from their profiles, commits, or events.

python github_emails.py --input [input_csv] --output [output_csv] --token [github_token]

Arguments

--input or -i: CSV file containing GitHub usernames and URLs (default: complete_stargazers.csv)
--output or -o: Output CSV file to save results (default: github_emails.csv)
--token or -t: GitHub API token for authentication (highly recommended to avoid rate limits)
--delay or -d: Delay between API requests in seconds (default: 1.0)
--max-retries or -r: Maximum number of retry attempts for failed requests (default: 3)
--start or -s: Starting position in the input file (1-indexed, default: 1)
--stop or -e: Stopping position in the input file (useful for batch processing)
--resume: Resume from where you left off, skipping already processed users

Examples

# Basic usage (reads complete_stargazers.csv and writes github_emails.csv by default)
python github_emails.py

# With GitHub token (recommended)
python github_emails.py --token YOUR_GITHUB_TOKEN

# Custom input and output files
python github_emails.py --input my_stargazers.csv --output my_emails.csv

# Adjust delay between requests
python github_emails.py --delay 2.0

# Process users from position 1 to 100
python github_emails.py --start 1 --stop 100

# Process the next batch of 100 users
python github_emails.py --start 101 --stop 200

# Process a specific range of users
python github_emails.py --start 500 --stop 550

# Resume processing from where you left off
python github_emails.py --resume

Batch Processing Strategy

For large datasets, you can use one of these strategies:

Sequential Ranges: Process users in specific ranges

python github_emails.py --start 1 --stop 100
python github_emails.py --start 101 --stop 200
python github_emails.py --start 201 --stop 300

Resume-Based Processing: Start processing and resume if interrupted
```
python github_emails.py --resume
```

Split Input File: Split your stargazers CSV into smaller files and process each separately

split -l 100 complete_stargazers.csv batch_
python github_emails.py --input batch_aa --output emails_batch_aa.csv
python github_emails.py --input batch_ab --output emails_batch_ab.csv
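
As a rough illustration of how --start, --stop, and --resume could combine, the sketch below selects a 1-indexed range of users and skips any that already appear in the output file. The function names are hypothetical, and treating --stop as inclusive is an assumption; the actual script may behave differently.

```python
import csv
import io
from typing import AbstractSet, List, Optional, Sequence, Set


def load_processed(output_csv_text: str) -> Set[str]:
    """Collect usernames already present in a previous output CSV."""
    reader = csv.DictReader(io.StringIO(output_csv_text))
    return {row["Username"] for row in reader if row.get("Username")}


def select_batch(
    usernames: Sequence[str],
    start: int = 1,
    stop: Optional[int] = None,
    processed: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Return users in positions start..stop (1-indexed, inclusive) not yet processed."""
    end = len(usernames) if stop is None else stop
    window = usernames[start - 1:end]
    return [name for name in window if name not in processed]
```

With this shape, rerunning the same range after an interruption only touches users that are missing from the output.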

GitHub API Rate Limits

The email finder script makes extensive use of the GitHub API, which has rate limits:

Unauthenticated requests: 60 requests per hour
Authenticated requests: 5,000 requests per hour

Due to these limits, it's highly recommended to use a GitHub personal access token when running the email finder script, especially on large datasets.
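
GitHub reports the remaining quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers, the latter as a Unix timestamp in seconds. A minimal sketch of how a client might decide how long to pause is shown below; `seconds_until_reset` is a hypothetical helper, not a function from github_emails.py.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before the next request, based on rate-limit headers."""
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0  # Quota left, no need to wait.
    reset_at = int(lowered.get("x-ratelimit-reset", "0"))
    current = time.time() if now is None else now
    return float(max(0.0, reset_at - current))
```

The `headers` mapping of a `requests` response can be passed in directly, followed by `time.sleep()` on the returned value.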

How to Get a GitHub Personal Access Token

Go to your GitHub account settings
Click on "Developer settings" in the left sidebar
Select "Personal access tokens" → "Tokens (classic)"
Click "Generate new token" → "Generate new token (classic)"
Give your token a description (e.g., "GitHub Email Finder")
Select the following scopes:
- public_repo (to access public repositories)
- read:user (to read user profile data)
Click "Generate token"
Copy the generated token and use it with the --token parameter

Note: Keep your token secure! Don't share it or commit it to version control.
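
One way to keep the token out of shell history and source files is to read it from an environment variable. The sketch below assumes a variable named `GITHUB_TOKEN`, which is a hypothetical choice: the script itself only documents the --token parameter. The helper `build_headers` is likewise illustrative.

```python
import os
from typing import Dict, Mapping, Optional


def build_headers(
    token: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Build GitHub API request headers, preferring an explicit token over the environment."""
    headers = {"Accept": "application/vnd.github+json"}
    token = token or env.get("GITHUB_TOKEN")  # GITHUB_TOKEN is an assumed variable name.
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

The resulting dictionary can be passed as the `headers` argument of `requests.get`. Without a token the request is sent unauthenticated and falls under the lower rate limit.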

How Email Finding Works

The email finder script tries multiple methods to find a user's email:

Public Profile: Checks if the user has made their email public on their GitHub profile
Commit Data: Examines the user's recent commits for email information
Event History: Looks through the user's public events for email data
Patch Data: As a fallback, tries to extract emails from commit patches

Note: Using a GitHub API token is highly recommended to avoid rate limiting.
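
The fallback order above can be sketched as a chain of small functions that each inspect already-fetched JSON. This is an illustration of the idea, not the logic in github_emails.py. It covers only the profile and event steps, assumes the profile is the JSON from the GitHub users endpoint, and assumes push events carry a `commits` list with author emails, which may not hold for every API response. All function names are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional, Tuple


def is_usable(email: Optional[str]) -> bool:
    """Reject missing values and GitHub noreply placeholder addresses."""
    return bool(email) and "@" in email and "noreply" not in email.lower()


def email_from_profile(profile: Dict[str, Any]) -> Optional[str]:
    """Return the public profile email, if one is set."""
    email = profile.get("email")
    return email if is_usable(email) else None


def email_from_events(events: Iterable[Dict[str, Any]]) -> Optional[str]:
    """Return the first usable commit author email found in push events."""
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        commits = (event.get("payload") or {}).get("commits") or []
        for commit in commits:
            email = (commit.get("author") or {}).get("email")
            if is_usable(email):
                return email
    return None


def find_email(profile: Dict[str, Any], events: Iterable[Dict[str, Any]]) -> Tuple[Optional[str], str]:
    """Try each source in order and report which one produced the email."""
    email = email_from_profile(profile)
    if email:
        return email, "Profile"
    email = email_from_events(events)
    if email:
        return email, "Event"
    return None, "None"
```

The second element of the returned tuple matches the Source values used in the output CSV.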

Output

Stargazers Script

The script creates a CSV file with two columns:

Username: The GitHub username of the stargazer
GitHub URL: The URL to the stargazer's GitHub profile

Email Finder Script

The script creates a CSV file with the following columns:

Username: The GitHub username of the user
GitHub URL: The URL to the user's GitHub profile
Email: The email address found (if any)
Location: The user's location from their GitHub profile (if available)
Organization: The user's company/organization from their GitHub profile (if available)
Source: Where the email was found (Profile, Commit, Event, Patch, or None)

Notes

Both scripts use delays between requests to be respectful to GitHub's servers.
If you encounter rate limiting, authenticate with a GitHub API token or increase the delay between requests.
Processing a large number of users can take a significant amount of time, even with authentication.
The email finder script will automatically append to an existing output file, so you can process users in batches without overwriting previous results.
The email finder script also collects additional profile information (location and organization) when available.
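
The append behavior mentioned in the notes can be approximated as follows. This is a minimal sketch, assuming the column names listed in the Output section; `append_result` is a hypothetical helper and the actual script may write its file differently.

```python
import csv
import os
from typing import Dict

FIELDS = ["Username", "GitHub URL", "Email", "Location", "Organization", "Source"]


def append_result(path: str, row: Dict[str, str]) -> None:
    """Append one result row, writing the header only when the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)  # Missing columns are written as empty strings.
```

Writing each row as soon as it is found means an interrupted run loses at most the user currently being processed.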

Ethical Considerations

When collecting and using email addresses:

Respect Privacy: Only use collected emails in accordance with privacy regulations (GDPR, CCPA, etc.)
Avoid Spam: Don't use the emails for unsolicited mass emails
Secure Data: Store collected emails securely and don't share them publicly
Proper Disclosure: When contacting people, disclose how you obtained their email
Opt-out Option: Always provide a way for people to opt out of communications

Discord Message Fetcher

A Python script to fetch messages from Discord channels and extract user information.

Features

Fetch messages from Discord channels using the Discord API
Bidirectional pagination to get messages before and after a reference point
Extract user information from messages (usernames, IDs, avatars, etc.)
Export user data to CSV files
Export message data to JSON files
Find messages from specific users

Usage

Discord Authentication

To use the Discord message fetcher, you need a Discord authentication token. This can be obtained from your Discord web client:

Open Discord in your web browser
Press F12 to open Developer Tools
Go to the Network tab
Perform an action (like sending a message)
Find a request to the Discord API
Look for the "authorization" header in the request headers
Copy the token value

Note: Treat your Discord token like a password. Never share it publicly or commit it to version control.

Running the Script

python discord_dm.py

When running the script without arguments, you will be prompted to enter your Discord authentication token and other configuration options.

You can also provide command-line arguments:

python discord_dm.py --token YOUR_DISCORD_TOKEN --channel CHANNEL_ID --user USER_ID

Command-line Arguments

-t, --token: Discord authorization token
-c, --channel: Discord channel ID to fetch messages from (default: 1303749221354311752)
-u, --user: Target user ID to focus on (leave empty for all users)
-r, --reference: Reference message ID for bidirectional fetching
-b, --before: Maximum messages to fetch before reference point (default: 250)
-a, --after: Maximum messages to fetch after reference point (default: 250)
-o, --output-dir: Directory to save output files (default: discord_output)

Examples

# Basic usage with token
python discord_dm.py --token YOUR_DISCORD_TOKEN

# Specify channel and target user
python discord_dm.py --token YOUR_DISCORD_TOKEN --channel 1234567890 --user 9876543210

# Fetch messages around a specific reference point
python discord_dm.py --token YOUR_DISCORD_TOKEN --reference 1122334455

# Customize message count limits
python discord_dm.py --token YOUR_DISCORD_TOKEN --before 500 --after 300

# Change output directory
python discord_dm.py --token YOUR_DISCORD_TOKEN --output-dir my_discord_data

# Fetch a long history (up to 17000 earlier messages) from a specific channel
python discord_dm.py --token YOUR_DISCORD_TOKEN --channel CHANNEL_ID --before 17000 --output-dir discord_output

Environment Variables

You can also set your Discord token as an environment variable:

export DISCORD_TOKEN="your_discord_token_here"

Output

The script generates two output files:

A CSV file containing user information with the following fields:
- id: Discord user ID
- username: Discord username
- global_name: Global display name
- discriminator: User discriminator
- avatar: Avatar hash
- bot: Boolean indicating if the user is a bot
A JSON file containing all messages with their complete metadata
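
To make the CSV layout concrete, here is a small sketch of how unique users might be collected from fetched messages. It assumes each message is a dictionary with an `author` object, as in Discord's message JSON. The helper names are hypothetical and are not taken from discord_dm.py.

```python
import csv
import io
from typing import Any, Dict, Iterable, List

USER_FIELDS = ["id", "username", "global_name", "discriminator", "avatar", "bot"]


def unique_users(messages: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one record per distinct author, in order of first appearance."""
    users: Dict[str, Dict[str, Any]] = {}
    for message in messages:
        author = message.get("author") or {}
        user_id = author.get("id")
        if user_id and user_id not in users:
            record = {field: author.get(field) for field in USER_FIELDS}
            record["bot"] = bool(author.get("bot", False))  # The bot flag is optional.
            users[user_id] = record
    return list(users.values())


def users_to_csv(users: Iterable[Dict[str, Any]]) -> str:
    """Serialize user records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=USER_FIELDS)
    writer.writeheader()
    writer.writerows(users)
    return buffer.getvalue()
```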

Technical Details

The script uses Discord's snowflake IDs for pagination, allowing it to retrieve messages before and after specific points in the conversation. It also handles rate limiting by implementing delays between API requests.
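
Snowflake IDs work for pagination because the upper bits encode a creation timestamp in milliseconds since the Discord epoch (the first second of 2015, UTC), so a larger ID is a newer message. The sketch below decodes that timestamp and builds the `before`/`after` query parameters accepted by Discord's channel messages endpoint. The function names are hypothetical, and the script's own implementation may differ.

```python
from datetime import datetime, timezone
from typing import Dict

DISCORD_EPOCH_MS = 1420070400000  # 2015-01-01T00:00:00Z in Unix milliseconds.


def snowflake_to_ms(snowflake: int) -> int:
    """Extract the creation time of a snowflake as Unix milliseconds."""
    return (int(snowflake) >> 22) + DISCORD_EPOCH_MS


def snowflake_to_datetime(snowflake: int) -> datetime:
    """Extract the creation time of a snowflake as a UTC datetime."""
    return datetime.fromtimestamp(snowflake_to_ms(snowflake) / 1000, tz=timezone.utc)


def ms_to_snowflake(unix_ms: int) -> int:
    """Build the smallest snowflake for a given time, useful as a pagination cursor."""
    return (unix_ms - DISCORD_EPOCH_MS) << 22


def page_params(reference_id: int, direction: str, limit: int = 100) -> Dict[str, str]:
    """Build query parameters for one page before or after a reference message."""
    if direction not in ("before", "after"):
        raise ValueError("direction must be 'before' or 'after'")
    return {direction: str(reference_id), "limit": str(min(max(limit, 1), 100))}
```

To page backwards, the oldest message ID in each response becomes the next `before` value. Paging forwards uses the newest ID as the next `after` value.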

Troubleshooting

API Errors: Ensure your Discord token is valid and has not expired
Rate Limiting: The script includes delays to manage rate limits, but you may need to adjust these if you encounter issues
No Messages Found: Verify that the channel ID is correct and accessible with your token

Troubleshooting (GitHub Scripts)

Rate Limiting: Increase delay between requests or use a GitHub token
No Emails Found: Some users don't have public emails or haven't made commits
Script Crashes: Use the --resume option to continue from where you left off
Empty Results: Make sure your input CSV has the correct format (Username, GitHub URL)
