mollyfud/jfk-dl

 
 

JFK Documents Bulk Downloader

A command-line tool for downloading ZIP files from the National Archives JFK Bulk Download page or any other URL containing ZIP file links.

Note: This tool has been tested on macOS. It should work on Windows and Linux as described, but it has not been extensively tested there. Feedback and bug reports are welcome!

Features

  • Downloads all ZIP files or a specified subset
  • Handles connection errors with retries
  • Supports resuming interrupted downloads
  • Shows a progress bar for downloads
  • Verifies downloaded files
  • Checks for existing files and prompts before overwriting
  • Configurable input URL and output directory
  • Parallel downloading capability
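
As an illustration of the retry and resume features listed above, here is a minimal sketch. It is not taken from bulk_download.py: the helper names `with_retries` and `resume_headers` are hypothetical, and the sketch assumes that resuming relies on the standard HTTP `Range` header and that connection failures surface as `OSError` subclasses such as `ConnectionError`.

```python
import os
import time


def resume_headers(path):
    """Return HTTP headers asking only for the bytes not yet on disk (hypothetical helper)."""
    existing = os.path.getsize(path) if os.path.exists(path) else 0
    # "Range: bytes=N-" requests everything from byte offset N onward.
    return {"Range": f"bytes={existing}-"} if existing else {}


def with_retries(action, attempts=3, delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:  # ConnectionError and TimeoutError are OSError subclasses
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            sleep(delay * attempt)  # wait a little longer after each failure
```

A download function would pass `resume_headers(path)` with its request and append the response body to the partial file, all wrapped in `with_retries`. The `attempts=3` default mirrors the documented `--retry` default; the growing delay is an assumption.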

Installation

Option 1: Using venv (Standard Python)

# Clone the repository
git clone https://github.com/mollyfud/jfk-dl.git
cd jfk-dl

# Create and activate virtual environment
# On Windows:
python -m venv venv
venv\Scripts\activate

# On macOS/Linux:
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Option 2: Using uv (Faster alternative)

# Clone the repository
git clone https://github.com/mollyfud/jfk-dl.git
cd jfk-dl

# Install dependencies with uv
# If you don't have uv installed:
# pip install uv

# On all platforms:
uv venv  # Creates a .venv directory by default
uv pip install -r requirements.txt

# Activate the virtual environment
# On Windows: .venv\Scripts\activate
# On macOS/Linux: source .venv/bin/activate

Quick Start

Once installed, you can download all JFK document ZIP files with:

# Activate the virtual environment if needed
# For standard venv:
# On Windows: venv\Scripts\activate
# On macOS/Linux: source venv/bin/activate
#
# For uv:
# On Windows: .venv\Scripts\activate
# On macOS/Linux: source .venv/bin/activate

# Run the downloader with default settings
# (--cowboyup is needed here: without any arguments, the script only displays help)
python bulk_download.py --cowboyup

Usage

python bulk_download.py [OPTIONS]

Options

  • --url URL: URL containing ZIP files to download (default: https://www.archives.gov/research/jfk/jfkbulkdownload)
  • --output-dir DIR: Directory to save downloaded files (default: auto-generated based on URL)
  • --max-files N: Maximum number of files to download (default: 0, which downloads all files)
  • --retry ATTEMPTS: Maximum number of retry attempts (default: 3)
  • --workers N: Number of parallel downloads (default: 4)
  • --force: Force download without prompting, even if files exist
  • --skip-existing: Skip files that already exist without prompting (default: True)
  • --no-skip-existing: Prompt for each existing file
  • --smart-check: Skip existing files whose size matches (default: True)
  • --no-smart-check: Disable smart file size checking
  • --filter PATTERN: Filter files by filename pattern (e.g., 'doc-*')
  • --extension EXT: File extension to look for (e.g., 'zip', 'pdf', 'docx') without the dot (default: zip)
  • --cowboyup: Run with defaults without showing help message (only needed when running with no other arguments)
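
The options for existing files (`--force`, `--skip-existing`, `--smart-check`) interact, so a small sketch of one plausible decision order may help. This is an assumption for illustration, not the script's actual logic: the function `decide`, its return values, and the precedence shown (force first, then the size check, then skip or prompt) are all hypothetical.

```python
import os


def decide(path, remote_size, force=False, skip_existing=True, smart_check=True):
    """Return "download", "skip", or "prompt" for one target file (hypothetical policy)."""
    if force or not os.path.exists(path):
        return "download"  # nothing on disk, or the user asked to overwrite
    if smart_check and remote_size is not None:
        # Same size: assume the file is complete. Different size: assume it is
        # partial or stale and fetch it again.
        return "skip" if os.path.getsize(path) == remote_size else "download"
    return "skip" if skip_existing else "prompt"
```

Here `remote_size` stands for the size the server reports for the file (for example from a `Content-Length` header), and `None` means the size is unknown, in which case the sketch falls back to the plain skip-or-prompt behavior.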

Examples

1. Download all ZIP files from the JFK Archives (default)

## Download 2016 to 2023 bulk files
python bulk_download.py --cowboyup

2. Download files from the 2025 JFK Archives release

## Download 2025 files; existing files are skipped, which helps because new ones keep being added
python bulk_download.py --url https://www.archives.gov/research/jfk/release-2025 --output-dir data/raw/archive_gov/2025 --extension pdf

## For testing, you can limit to just a few files
python bulk_download.py --url https://www.archives.gov/research/jfk/release-2025 --output-dir data/raw/archive_gov/2025 --extension pdf --max-files 5

Additional Options:

  • Use --max-files 5 to download only the first 5 files
  • Use --filter "record-*" to download files matching a pattern
  • Use --workers 8 to increase parallel downloads for faster performance
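
To make the three options above concrete, here is a rough sketch of how filtering, capping, and parallel execution could fit together. It is not the script's implementation: `select` and `run_parallel` are hypothetical names, and the sketch assumes that `--filter` patterns are shell-style globs (as the `record-*` example suggests) and that parallel downloads use a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatchcase


def select(names, pattern=None, max_files=0):
    """Keep names matching the glob pattern, then cap the count (0 means no cap)."""
    # fnmatchcase is case-sensitive on every platform, unlike fnmatch.
    kept = [n for n in names if pattern is None or fnmatchcase(n, pattern)]
    return kept[:max_files] if max_files > 0 else kept


def run_parallel(names, fetch, workers=4):
    """Call fetch(name) for every name using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, even if calls finish out of order.
        return list(pool.map(fetch, names))
```

For example, `select(names, "record-*", 5)` would keep at most five matching names, and `run_parallel(selected, fetch, workers=8)` would process them eight at a time, where `fetch` is whatever function downloads a single file.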

License

MIT

About

JFK file download for LLM analysis
