Bucket Harvest

A streamlined toolkit for efficiently collecting and analyzing GitHub repository and issue data using parallel bucket processing strategies.

Overview

Bucket Harvest provides two workflows:

Organization Analysis (org_to_repos) - Analyze all repositories in a GitHub organization
Issue Collection (repo_to_issues) - Collect and analyze recent issues from specific repositories

Both workflows use intelligent bucketing and parallel processing to efficiently handle large datasets while respecting GitHub API rate limits.

Architecture

GitHub Data Hierarchy

graph TD
    A[GitHub Organization<br/>e.g., google, microsoft, stripe] -->|contains| B1[Repository 1<br/>google/guava]
    A -->|contains| B2[Repository 2<br/>google/material-design]
    A -->|contains| B3[Repository 3<br/>google/gson]
    A -->|contains| B4[...]

    B1 -->|contains| C1[Issue #1<br/>Bug: Feature X broken]
    B1 -->|contains| C2[Issue #2<br/>Enhancement: Add Y]
    B1 -->|contains| C3[Issue #3<br/>Documentation update]
    B1 -->|contains| C4[...]

    B2 -->|contains| D1[Issue #1]
    B2 -->|contains| D2[Issue #2]
    B2 -->|contains| D3[...]

    style A fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    style B1 fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    style B2 fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    style B3 fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    style C1 fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style C2 fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style C3 fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style D1 fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style D2 fill:#f3e5f5,stroke:#4a148c,stroke-width:2px

Bucket-Harvest Workflow

graph LR
    A[User Chooses<br/>Organization<br/>e.g., 'google'] -->|input| B[Bucket-Harvest<br/>org_to_repos]
    B -->|analyzes all repos<br/>ranks by activity| C[Output:<br/>Highest Activity Repos<br/>CSV Report]
    C -->|user reviews| D[User Chooses<br/>Repository<br/>e.g., 'google/guava']
    D -->|input| E[Bucket-Harvest<br/>repo_to_issues]
    E -->|collects most<br/>recent 100 issues| F[Output:<br/>100 Recent Issues<br/>Markdown Files]

    style A fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
    style B fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    style C fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style D fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
    style E fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    style F fill:#e3f2fd,stroke:#1565c0,stroke-width:2px

Key Points:

Green boxes: User decisions (you choose the org and repo)
Yellow boxes: Bucket-Harvest automation (tool finds and collects data)
Blue boxes: Outputs (data delivered to you for analysis)

Quick Start

Prerequisites

Python 3.7+ with dependencies:
```
pip install -r requirements.txt
```

GitHub Token - Create a .env file containing your token (stored as API_GITHUB_TOKEN):

cp .env.template .env
# Edit .env and add your token

GitHub CLI (for issue collection):
```
gh auth login
```

Workflow 1: Organization Repository Analysis

Analyze all active repositories in any GitHub organization:

# Step 1: Create buckets of active repositories
python scripts/bucket_harvest/org_to_repos/create_org_buckets.py <org_name>

# Step 2: Process buckets in parallel to collect detailed metrics
python scripts/bucket_harvest/org_to_repos/process_org_buckets.py <org_name>

Examples:

# Analyze Google's repositories
python scripts/bucket_harvest/org_to_repos/create_org_buckets.py google
python scripts/bucket_harvest/org_to_repos/process_org_buckets.py google

# Analyze Stripe with custom settings
python scripts/bucket_harvest/org_to_repos/create_org_buckets.py stripe --buckets 5 --days 14
python scripts/bucket_harvest/org_to_repos/process_org_buckets.py stripe --workers 3

Output: scripts/bucket_harvest/org_to_repos/data/<org>/.<org>_analysis.csv
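
The two steps above (create buckets, then process them in parallel) can be pictured roughly as follows. This is a minimal sketch, not the toolkit's actual implementation: `make_buckets`, `process_bucket`, and `process_all` are hypothetical names, round-robin assignment is an assumed bucketing strategy, and the per-repository work is stubbed out so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def make_buckets(repos: List[str], n_buckets: int) -> List[List[str]]:
    """Distribute repos round-robin into at most n_buckets non-empty buckets."""
    if n_buckets < 1:
        raise ValueError("n_buckets must be at least 1")
    buckets = [repos[i::n_buckets] for i in range(n_buckets)]
    return [bucket for bucket in buckets if bucket]


def process_bucket(bucket: List[str]) -> List[Dict[str, object]]:
    """Stub for per-bucket work; a real worker would query GitHub here."""
    return [{"repo": name, "name_length": len(name)} for name in bucket]


def process_all(buckets: List[List[str]], workers: int) -> List[Dict[str, object]]:
    """Process buckets concurrently and flatten the results in bucket order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_bucket = list(pool.map(process_bucket, buckets))
    return [row for rows in per_bucket for row in rows]


# Placeholder repository names, for illustration only.
repos = ["guava", "gson", "material-design", "repo-4", "repo-5"]
buckets = make_buckets(repos, n_buckets=2)
rows = process_all(buckets, workers=2)
print(buckets)
print(len(rows))
```

Fixed-size buckets keep each worker's share of API calls bounded, which is one way to stay within rate limits while still working in parallel.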

Workflow 2: Repository Issue Collection

Collect the 100 most recent issues from any repository:

# Single command to collect issues
python scripts/bucket_harvest/repo_to_issues/collect_recent_issues.py <owner/repo>

# Or use the convenient wrapper
python bucket-harvest.py <owner/repo>

Examples:

# Collect issues from React
python bucket-harvest.py facebook/react

# Collect issues from TypeScript
python bucket-harvest.py microsoft/typescript

# Analyze collected issues
python scripts/bucket_harvest/parallel_issue_analyzer.py facebook/react

Output: scripts/bucket_harvest/repo_to_issues/data/<owner_repo>/.<repo>/

Workflows in Detail

Organization Analysis

Purpose: Get comprehensive metrics on all active repositories in an organization

Metrics Collected:

Stars, forks, contributors
Commits and closed PRs in the last 30 days
Primary language
Repository health score
Description and metadata

Use Cases:

Competitive analysis
Technology stack research
Open source ecosystem mapping
Identifying active projects

Command Options:

# create_org_buckets.py
--buckets N    # Number of buckets (default: 10)
--days N       # Activity window in days (default: 30)

# process_org_buckets.py
--workers N    # Parallel workers (default: 5)
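
For readers who want to see how these options and defaults fit together, here is a hedged `argparse` sketch. The option names and defaults come from the list above; the function name `build_create_parser` and the help strings are assumptions, and the real script may parse its arguments differently.

```python
import argparse


def build_create_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented create_org_buckets.py options."""
    parser = argparse.ArgumentParser(prog="create_org_buckets.py")
    parser.add_argument("org_name", help="GitHub organization, e.g. google")
    parser.add_argument("--buckets", type=int, default=10, help="Number of buckets")
    parser.add_argument("--days", type=int, default=30, help="Activity window in days")
    return parser


# Equivalent to: create_org_buckets.py stripe --buckets 5 --days 14
args = build_create_parser().parse_args(["stripe", "--buckets", "5", "--days", "14"])
print(args.org_name, args.buckets, args.days)
```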

Issue Collection & Analysis

Purpose: Collect and analyze the most recent issues from a repository

What's Collected:

Issue title, description, labels
Author and creation date
Full comment threads
Issue state and metadata

Use Cases:

Finding contribution opportunities
Understanding project pain points
Community sentiment analysis
Bug bounty research

Features:

Parallel processing (10 threads)
Markdown output for easy analysis
Excludes pull requests
Handles rate limiting automatically
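
How the collector excludes pull requests is not shown in this README. One common approach, when reading GitHub's REST issues endpoint, relies on the fact that pull requests appear in that listing with a `pull_request` key. The sketch below assumes that approach; the function names and sample records are illustrative only.

```python
from typing import Any, Dict, List


def exclude_pull_requests(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only true issues; pull requests carry a 'pull_request' key."""
    return [item for item in items if "pull_request" not in item]


def most_recent(items: List[Dict[str, Any]], limit: int = 100) -> List[Dict[str, Any]]:
    """Sort by ISO-8601 'created_at' (newest first) and keep the first `limit`."""
    return sorted(items, key=lambda item: item["created_at"], reverse=True)[:limit]


# Made-up records shaped like entries from the REST issues listing.
sample = [
    {"number": 1, "created_at": "2025-01-10T08:00:00Z"},
    {"number": 2, "created_at": "2025-01-12T08:00:00Z", "pull_request": {}},
    {"number": 3, "created_at": "2025-01-15T08:00:00Z"},
]
issues = most_recent(exclude_pull_requests(sample))
print([issue["number"] for issue in issues])
```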

Directory Structure

bucket-harvest/
├── bucket-harvest.py              # Convenient wrapper script
├── bucket-harvest.bat             # Windows batch wrapper
├── .env.template                  # Environment template
├── requirements.txt               # Python dependencies
├── README.md                      # This file
└── scripts/
    └── bucket_harvest/
        ├── parallel_issue_analyzer.py  # AI-powered issue analysis
        ├── agents/                     # Agent prompt templates
        ├── utils/                      # Shared utilities
        ├── org_to_repos/              # Organization analysis
        │   ├── create_org_buckets.py
        │   ├── process_org_buckets.py
        │   ├── README.md
        │   └── data/                  # Output directory
        └── repo_to_issues/            # Issue collection
            ├── collect_recent_issues.py
            ├── create_issue_buckets.py
            ├── generate_bucket_scripts.py
            ├── BUCKET_strategy.md
            ├── README.md
            └── data/                  # Output directory

Output Formats

Organization Analysis Output

CSV file (.<org>_analysis.csv) with columns:

repo - Repository name
star_count - GitHub stars
contributor_count - Number of contributors
github_url - Repository URL
primary_language - Main programming language
description - Repository description
commits_last_30d - Recent commits
closed_pr_last_30d - Recent closed PRs
repo_health_score - Calculated health metric

Health Score Formula: (commits_last_30d + closed_pr_last_30d) / 2
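
As a worked example of the formula, the sketch below computes the score and ranks a few rows by it. The column names match the list above; the CSV rows are made up for illustration, and `repo_health_score` is a hypothetical helper rather than the toolkit's own function.

```python
import csv
import io


def repo_health_score(commits_last_30d: int, closed_pr_last_30d: int) -> float:
    """Average of recent commits and recently closed PRs, per the formula above."""
    return (commits_last_30d + closed_pr_last_30d) / 2


# Made-up rows using a subset of the documented columns.
SAMPLE_CSV = """repo,commits_last_30d,closed_pr_last_30d
repo-a,40,10
repo-b,5,3
repo-c,12,30
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    row["repo_health_score"] = repo_health_score(
        int(row["commits_last_30d"]), int(row["closed_pr_last_30d"])
    )

# Highest score first, mirroring a "rank by activity" review of the report.
ranked = sorted(rows, key=lambda r: r["repo_health_score"], reverse=True)
print([(r["repo"], r["repo_health_score"]) for r in ranked])
```

With these sample numbers, repo-a scores (40 + 10) / 2 = 25.0, repo-c scores 21.0, and repo-b scores 4.0.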

Issue Collection Output

Individual markdown files per issue:

# Issue #1234: Fix authentication bug

**GitHub URL:** https://github.com/owner/repo/issues/1234
**Created:** 2025-01-15
**Author:** username
**State:** open
**Labels:** bug; authentication

---

## Issue Description
[Full issue description]

---

## Comments
[All comments with authors and dates]
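
The layout above can be produced with a small rendering function. This is a sketch under assumptions: `render_issue_markdown` and the dictionary keys are hypothetical, and the per-comment format is a guess, since the sample only shows a placeholder for comments.

```python
from typing import Any, Dict, List


def render_issue_markdown(issue: Dict[str, Any]) -> str:
    """Render one issue dict in the layout shown above."""
    lines: List[str] = [
        f"# Issue #{issue['number']}: {issue['title']}",
        "",
        f"**GitHub URL:** {issue['url']}",
        f"**Created:** {issue['created']}",
        f"**Author:** {issue['author']}",
        f"**State:** {issue['state']}",
        f"**Labels:** {'; '.join(issue['labels'])}",
        "",
        "---",
        "",
        "## Issue Description",
        issue["body"],
        "",
        "---",
        "",
        "## Comments",
    ]
    # Assumed comment layout: one line per comment with author and date.
    for comment in issue["comments"]:
        lines.append(f"**{comment['author']}** ({comment['date']}): {comment['body']}")
    return "\n".join(lines) + "\n"


example = {
    "number": 1234,
    "title": "Fix authentication bug",
    "url": "https://github.com/owner/repo/issues/1234",
    "created": "2025-01-15",
    "author": "username",
    "state": "open",
    "labels": ["bug", "authentication"],
    "body": "[Full issue description]",
    "comments": [{"author": "reviewer", "date": "2025-01-16", "body": "Can reproduce."}],
}
text = render_issue_markdown(example)
print(text)
```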

Advanced Usage

Customizing Organization Analysis

# Large organization with many repositories
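python scripts/bucket_harvest/org_to_repos/create_org_buckets.py microsoft --buckets 20 --days 60
python scripts/bucket_harvest/org_to_repos/process_org_buckets.py microsoft --workers 10

# Small/focused organization
python scripts/bucket_harvest/org_to_repos/create_org_buckets.py stripe --buckets 3 --days 14
python scripts/bucket_harvest/org_to_repos/process_org_buckets.py stripe --workers 2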

Issue Analysis Pipeline

# Step 1: Collect issues
python bucket-harvest.py shopify/cli

# Step 2: Define selection criteria
# Edit user/selection-criteria.md with your criteria

# Step 3: Analyze with AI
python scripts/bucket_harvest/parallel_issue_analyzer.py shopify/cli

# Results saved to .notes/issue-analysis-shopify_cli-{timestamp}.md

Using with Claude Code

This toolkit is designed to work seamlessly with Claude Code:

User asks: "Analyze Google's GitHub repositories"
Claude runs the organization analysis workflow
Claude reads the output CSV and provides insights
Claude can drill down into specific repositories

Performance

Organization Analysis:

~150 repositories: 5-8 minutes
~80 repositories: 3-5 minutes
Parallel processing with configurable workers
Built-in rate limit handling

Issue Collection:

100 issues: 1-2 minutes (parallel)
100 issues: 5-8 minutes (sequential)
4-6x speedup with parallel processing

Error Handling

Both workflows include robust error handling:

Automatic retry with exponential backoff
Rate limit detection and waiting
Graceful handling of deleted/private repositories
Partial results on failure
Clear error messages
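
To illustrate the first item, here is a generic retry helper with exponential backoff. It is a sketch, not the toolkit's code: `retry_with_backoff`, its parameters, and the default delays are assumptions. The `sleep` argument is injectable so the example runs instantly.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, waiting base_delay * 2**attempt between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")


# Demo: an operation that fails twice, then succeeds. Delays are recorded
# instead of slept so the example finishes immediately.
delays: List[float] = []
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Doubling the delay after each failure gives a struggling service progressively more time to recover instead of retrying at a constant rate.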

Troubleshooting

"API_GITHUB_TOKEN not found"

Create a .env file with your GitHub token:

API_GITHUB_TOKEN=your_token_here
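
If the error persists, it can help to check what a loader would actually see in the file. The sketch below uses only the standard library; `parse_env` and `require_token` are hypothetical helpers, and the toolkit itself may load `.env` differently (for example, through a dotenv library).

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_token(values: Dict[str, str]) -> str:
    """Return the token, or raise with the message from the heading above."""
    token = values.get("API_GITHUB_TOKEN", "")
    if not token:
        raise RuntimeError("API_GITHUB_TOKEN not found")
    return token


env_text = "# GitHub credentials\nAPI_GITHUB_TOKEN=your_token_here\n"
print(require_token(parse_env(env_text)))
```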

"Organization not found (404)"

Check organization name spelling
Ensure organization is public
Use lowercase: google not Google

"Rate limit exceeded"

Wait for rate limit reset (shown in error)
Reduce number of workers: --workers 2
Check authentication: gh auth status
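
The wait time in the first item can be derived from GitHub's rate limit response headers, `x-ratelimit-remaining` and `x-ratelimit-reset` (the reset time in UTC epoch seconds). The helper below is a sketch of that arithmetic; `seconds_until_reset` is a hypothetical name, and how the toolkit reads these headers is not shown in this README.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before retrying, or 0.0 if requests remain."""
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(lowered.get("x-ratelimit-reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)


# Made-up header values; `now` is fixed so the result is deterministic.
exhausted = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"}
print(seconds_until_reset(exhausted, now=1700000000.0))
```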

"gh: command not found"

Install GitHub CLI: https://cli.github.com/

Best Practices

Start Small - Test with small organizations/repos first
Monitor Rate Limits - Use fewer workers if hitting limits
Save Outputs - Results are cached in data/ directories
Use Exclusions - Add unwanted issues to user/exclusions.txt
Document Criteria - Update user/selection-criteria.md for analysis

Supported Platforms

Windows (native and WSL)
macOS
Linux

No bash-specific features required - pure Python implementation.

Contributing

This toolkit follows these principles:

Generic and reusable (works with any org/repo)
Respects GitHub API guidelines
Clean, documented code
Defensive use only (analysis, not exploitation)

License

MIT

Support

For issues or questions:

Check workflow-specific READMEs in subdirectories
Review error messages carefully
Verify authentication and permissions
Test with smaller datasets first

Part of the GitHub Detective toolkit for bounty hunting and open source research.
