A set of Python scripts to scrape GitHub stargazers from a repository and optionally find their email addresses through various GitHub data sources.
This project consists of two main scripts:
- github_stargazers.py: Scrapes GitHub stargazers' usernames and URLs from a repository
- github_emails.py: Finds email addresses for GitHub users from various sources
- Use `github_stargazers.py` to gather all stargazers from a GitHub repository
- Use `github_emails.py` to find email addresses for the collected users
- Python 3.6 or higher
- Required Python packages:
- requests
- beautifulsoup4
- Clone this repository or download the script file.
- Install the required Python packages:
pip install requests beautifulsoup4

python github_stargazers.py [repo_url] --output [output_file]

- `repo_url`: The URL of the GitHub stargazers page (e.g., https://github.com/username/repo/stargazers)
- `--output` or `-o`: The name of the output CSV file (default: stargazers.csv)
- `--start` or `-s`: Page to start scraping from (default: 1)
- `--retries` or `-r`: Maximum number of retry attempts for failed requests (default: 3)
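At its core, the scraper pulls profile links out of each stargazers page with BeautifulSoup. A minimal sketch of that extraction step — the HTML snippet and the `h3 a` selector are illustrative assumptions, since GitHub's real markup may differ and can change at any time:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for a stargazers page; the real markup is an
# assumption and the selector below is a sketch, not GitHub's actual DOM.
SAMPLE_HTML = """
<ol>
  <li><h3><a href="/alice">alice</a></h3></li>
  <li><h3><a href="/bob">bob</a></h3></li>
</ol>
"""

def extract_stargazers(html: str) -> list[tuple[str, str]]:
    """Return (username, profile_url) pairs found in a stargazers page."""
    soup = BeautifulSoup(html, "html.parser")
    users = []
    for link in soup.select("h3 a[href]"):
        username = link.get_text(strip=True)
        users.append((username, f"https://github.com{link['href']}"))
    return users

print(extract_stargazers(SAMPLE_HTML))
# [('alice', 'https://github.com/alice'), ('bob', 'https://github.com/bob')]
```

The `--retries` flag exists because pages occasionally fail to load; a real run would wrap the page fetch in a retry loop around this parsing step.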
# Scrape stargazers from page 1
python github_stargazers.py https://github.com/browser-use/browser-use/stargazers
# Scrape stargazers starting from a specific page
python github_stargazers.py https://github.com/browser-use/browser-use/stargazers --start 3
# Scrape stargazers and save to a custom file
python github_stargazers.py https://github.com/browser-use/browser-use/stargazers --output browser_use_fans.csv

Use the `github_emails.py` script to find email addresses for GitHub users from their profiles, commits, or events.
python github_emails.py --input [input_csv] --output [output_csv] --token [github_token]

- `--input` or `-i`: CSV file containing GitHub usernames and URLs (default: complete_stargazers.csv)
- `--output` or `-o`: Output CSV file to save results (default: github_emails.csv)
- `--token` or `-t`: GitHub API token for authentication (highly recommended to avoid rate limits)
- `--delay` or `-d`: Delay between API requests in seconds (default: 1.0)
- `--max-retries` or `-r`: Maximum number of retry attempts for failed requests (default: 3)
- `--start` or `-s`: Starting position in the input file (1-indexed, default: 1)
- `--stop` or `-e`: Stopping position in the input file (useful for batch processing)
- `--resume`: Resume from where you left off, skipping already processed users
# Basic usage
python github_emails.py
# With GitHub token (recommended)
python github_emails.py --token YOUR_GITHUB_TOKEN
# Custom input and output files
python github_emails.py --input my_stargazers.csv --output my_emails.csv
# Adjust delay between requests
python github_emails.py --delay 2.0
# Process users from position 1 to 100
python github_emails.py --start 1 --stop 100
# Process the next batch of 100 users
python github_emails.py --start 101 --stop 200
# Process a specific range of users
python github_emails.py --start 500 --stop 550
# Resume processing from where you left off
python github_emails.py --resume

For large datasets, you can use one of these strategies:
- Sequential Ranges: Process users in specific ranges

  python github_emails.py --start 1 --stop 100
  python github_emails.py --start 101 --stop 200
  python github_emails.py --start 201 --stop 300

- Resume-Based Processing: Start processing and resume if interrupted

  python github_emails.py --resume

- Split Input File: Split your stargazers CSV into smaller files and process each separately

  split -l 100 complete_stargazers.csv batch_
  python github_emails.py --input batch_aa --output emails_batch_aa.csv
  python github_emails.py --input batch_ab --output emails_batch_ab.csv
The email finder script makes extensive use of the GitHub API, which has rate limits:
- Unauthenticated requests: 60 requests per hour
- Authenticated requests: 5,000 requests per hour
Due to these limits, it's highly recommended to use a GitHub personal access token when running the email finder script, especially on large datasets.
- Go to your GitHub account settings
- Click on "Developer settings" in the left sidebar
- Select "Personal access tokens" → "Tokens (classic)"
- Click "Generate new token" → "Generate new token (classic)"
- Give your token a description (e.g., "GitHub Email Finder")
- Select the following scopes:
  - `public_repo` (to access public repositories)
  - `read:user` (to read user profile data)
- Click "Generate token"
- Copy the generated token and use it with the `--token` parameter
Note: Keep your token secure! Don't share it or commit it to version control.
The email finder script tries multiple methods to find a user's email:
- Public Profile: Checks if the user has made their email public on their GitHub profile
- Commit Data: Examines the user's recent commits for email information
- Event History: Looks through the user's public events for email data
- Patch Data: As a fallback, tries to extract emails from commit patches
Note: Using a GitHub API token is highly recommended to avoid rate limiting.
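The commit and event methods work because PushEvent entries in a user's public event feed carry commit author emails. A sketch of that extraction, skipping GitHub's anonymised noreply addresses — the sample payload below is hypothetical, shaped like the events API response:

```python
# Hypothetical PushEvent shaped like the GitHub events API; the commit
# author email lives under payload.commits[].author.email.
sample_event = {
    "type": "PushEvent",
    "payload": {
        "commits": [
            {"author": {"name": "Alice", "email": "alice@example.com"}},
            {"author": {"name": "Alice", "email": "12345+alice@users.noreply.github.com"}},
        ]
    },
}

def emails_from_events(events: list[dict]) -> list[str]:
    """Collect commit author emails from PushEvents, skipping GitHub's
    anonymised noreply addresses and duplicates."""
    found: list[str] = []
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        for commit in event.get("payload", {}).get("commits", []):
            email = commit.get("author", {}).get("email", "")
            if email and "noreply.github.com" not in email and email not in found:
                found.append(email)
    return found

print(emails_from_events([sample_event]))  # ['alice@example.com']
```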
The script creates a CSV file with two columns:
- Username: The GitHub username of the stargazer
- GitHub URL: The URL to the stargazer's GitHub profile
The script creates a CSV file with the following columns:
- Username: The GitHub username of the user
- GitHub URL: The URL to the user's GitHub profile
- Email: The email address found (if any)
- Location: The user's location from their GitHub profile (if available)
- Organization: The user's company/organization from their GitHub profile (if available)
- Source: Where the email was found (Profile, Commit, Event, Patch, or None)
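Those columns map naturally onto `csv.DictWriter`; a sketch of how one result row might be written, with the header emitted only for a brand-new file so that batches can accumulate in the same CSV (the helper name is illustrative):

```python
import csv
import io

FIELDS = ["Username", "GitHub URL", "Email", "Location", "Organization", "Source"]

def append_result(fh, row: dict, write_header: bool) -> None:
    """Append one result row; the header is written only when the output
    file is new, so repeated batch runs don't duplicate it."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow(row)

buf = io.StringIO()  # stands in for an open output file
append_result(buf, {
    "Username": "alice", "GitHub URL": "https://github.com/alice",
    "Email": "alice@example.com", "Location": "Berlin",
    "Organization": "ACME", "Source": "Commit",
}, write_header=True)
print(buf.getvalue())
```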
- Both scripts use delays between requests to be respectful to GitHub's servers.
- If you encounter rate limiting, increase the delay between requests or authenticate with a GitHub token.
- Processing a large number of users can take a significant amount of time, even with authentication.
- The email finder script will automatically append to an existing output file, so you can process users in batches without overwriting previous results.
- The script collects additional profile information (location and organization) when available.
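Because the output file is append-only, resuming boils down to reading the usernames already present in it and skipping them on the next run. A sketch of one way the `--resume` logic might work (not necessarily the script's exact implementation):

```python
import csv
import io

def processed_usernames(fh) -> set[str]:
    """Read an existing output CSV and return the usernames already
    handled, so a rerun can skip them."""
    reader = csv.DictReader(fh)
    return {row["Username"] for row in reader if row.get("Username")}

# io.StringIO stands in for a previously written output file:
existing = io.StringIO(
    "Username,GitHub URL,Email\n"
    "alice,https://github.com/alice,alice@example.com\n"
    "bob,https://github.com/bob,\n"
)
done = processed_usernames(existing)
print(sorted(done))  # ['alice', 'bob']
```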
When collecting and using email addresses:
- Respect Privacy: Only use collected emails in accordance with privacy regulations (GDPR, CCPA, etc.)
- Avoid Spam: Don't use the emails for unsolicited mass emails
- Secure Data: Store collected emails securely and don't share them publicly
- Proper Disclosure: When contacting people, disclose how you obtained their email
- Opt-out Option: Always provide a way for people to opt out of communications
A Python script to fetch messages from Discord channels and extract user information.
- Fetch messages from Discord channels using the Discord API
- Bidirectional pagination to get messages before and after a reference point
- Extract user information from messages (usernames, IDs, avatars, etc.)
- Export user data to CSV files
- Export message data to JSON files
- Find messages from specific users
To use the Discord message fetcher, you need a Discord authentication token. This can be obtained from your Discord web client:
- Open Discord in your web browser
- Press F12 to open Developer Tools
- Go to the Network tab
- Perform an action (like sending a message)
- Find a request to the Discord API
- Look for the "authorization" header in the request headers
- Copy the token value
Note: Treat your Discord token like a password. Never share it publicly or commit it to version control.
python discord_dm.py

When running the script without arguments, you will be prompted to enter your Discord authentication token and other configuration options.
You can also provide command-line arguments:
python discord_dm.py --token YOUR_DISCORD_TOKEN --channel CHANNEL_ID --user USER_ID

- `-t`, `--token`: Discord authorization token
- `-c`, `--channel`: Discord channel ID to fetch messages from (default: 1303749221354311752)
- `-u`, `--user`: Target user ID to focus on (leave empty for all users)
- `-r`, `--reference`: Reference message ID for bidirectional fetching
- `-b`, `--before`: Maximum messages to fetch before reference point (default: 250)
- `-a`, `--after`: Maximum messages to fetch after reference point (default: 250)
- `-o`, `--output-dir`: Directory to save output files (default: discord_output)
# Basic usage with token
python discord_dm.py --token YOUR_DISCORD_TOKEN
# Specify channel and target user
python discord_dm.py --token YOUR_DISCORD_TOKEN --channel 1234567890 --user 9876543210
# Fetch messages around a specific reference point
python discord_dm.py --token YOUR_DISCORD_TOKEN --reference 1122334455
# Customize message count limits
python discord_dm.py --token YOUR_DISCORD_TOKEN --before 500 --after 300
# Change output directory
python discord_dm.py --token YOUR_DISCORD_TOKEN --output-dir my_discord_data
python discord_dm.py --token YOUR_DISCORD_TOKEN --channel CHANNEL_ID --before 17000 --output-dir discord_output

You can also set your Discord token as an environment variable:

export DISCORD_TOKEN="your_discord_token_here"

The script generates two output files:
- A CSV file containing user information with the following fields:
- id: Discord user ID
- username: Discord username
- global_name: Global display name
- discriminator: User discriminator
- avatar: Avatar hash
- bot: Boolean indicating if the user is a bot
- A JSON file containing all messages with their complete metadata
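The user CSV is essentially a deduplication of message authors: every Discord message embeds its author object, so collecting unique users is a pass over the fetched messages. A sketch with hypothetical messages shaped like Discord's channel-messages response (only the fields the export needs are shown):

```python
# Hypothetical messages shaped like Discord's GET /channels/{id}/messages
# response; each message embeds its author object.
messages = [
    {"id": "111", "content": "hi", "author": {
        "id": "42", "username": "alice", "global_name": "Alice",
        "discriminator": "0", "avatar": "abc123", "bot": False}},
    {"id": "112", "content": "yo", "author": {
        "id": "42", "username": "alice", "global_name": "Alice",
        "discriminator": "0", "avatar": "abc123", "bot": False}},
    {"id": "113", "content": "beep", "author": {
        "id": "99", "username": "helperbot", "global_name": None,
        "discriminator": "0001", "avatar": None, "bot": True}},
]

def unique_users(msgs: list[dict]) -> list[dict]:
    """Deduplicate message authors by user ID, preserving first-seen order."""
    seen: dict[str, dict] = {}
    for msg in msgs:
        author = msg["author"]
        seen.setdefault(author["id"], author)
    return list(seen.values())

users = unique_users(messages)
print([u["username"] for u in users])  # ['alice', 'helperbot']
```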
The script uses Discord's snowflake IDs for pagination, allowing it to retrieve messages before and after specific points in the conversation. It also handles rate limiting by implementing delays between API requests.
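Snowflake pagination works because the top 42 bits of every Discord ID are a millisecond timestamp relative to the Discord epoch (2015-01-01 UTC), so paging by ID with the API's `before`/`after` query parameters is equivalent to paging through time. A small sketch of the decoding:

```python
from datetime import datetime, timezone

DISCORD_EPOCH_MS = 1_420_070_400_000  # 2015-01-01T00:00:00Z in Unix ms

def snowflake_to_datetime(snowflake: int) -> datetime:
    """Decode the timestamp embedded in a Discord snowflake ID.

    The timestamp occupies the bits above the low 22 (worker, process,
    and sequence bits), measured in ms since the Discord epoch."""
    ms = (snowflake >> 22) + DISCORD_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# A snowflake whose timestamp field is exactly one hour past the epoch:
sf = 3_600_000 << 22
print(snowflake_to_datetime(sf).isoformat())  # 2015-01-01T01:00:00+00:00
```

This is why passing a message ID as the `--reference` point lets the script fetch everything older and newer than that moment in the channel.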
- API Errors: Ensure your Discord token is valid and has not expired
- Rate Limiting: The script includes delays to manage rate limits, but you may need to adjust these if you encounter issues
- No Messages Found: Verify that the channel ID is correct and accessible with your token
- Rate Limiting: Increase delay between requests or use a GitHub token
- No Emails Found: Some users don't have public emails or haven't made commits
- Script Crashes: Use the `--resume` option to continue from where you left off
- Empty Results: Make sure your input CSV has the correct format (Username, GitHub URL)