🦀🦀🦀 Powerful and flexible website scraper written in Rust 🦀🦀🦀
Inspired by scrape-it.

## Features
- Simple YAML Configuration: Define scraping rules using intuitive YAML syntax
- CSS Selectors: Extract data using standard CSS selectors
- Nested Data: Support for complex nested data structures
- Data Transformations: Built-in text processing (regex, case conversion, type casting)
- Multiple URLs: Batch scrape multiple pages in one configuration
- HTTP Features: Custom headers, proxy support, timeout, and retry logic
- Rate Limiting: Respectful scraping with configurable delays
- Caching: Cache responses for faster development and testing
- Multiple Output Formats: Export to JSON or CSV
- Parallel Processing: Fast extraction using Rayon
- Type Conversion: Automatic conversion to numbers and booleans
- Pagination Support: Automatically scrape multiple pages using URL patterns or "next" links
## Installation

```bash
cargo install --path .
```

## Usage

```bash
# Basic usage
main config.krk.yaml

# Save to file
main config.krk.yaml -o output.json

# Export to CSV
main config.krk.yaml -o output.csv -f csv
```

## Basic Configuration

```yaml
config:
  url: https://example.com
data:
  title:
    selector: h1
  description:
    selector: .description
```
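For instance, run against a page whose `<h1>` reads "Example Domain", the configuration above would produce output shaped roughly like this (values are illustrative):

```json
{
  "title": "Example Domain",
  "description": "An illustrative description."
}
```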
## HTTP Options

```yaml
config:
  # Single URL
  url: https://example.com
  # OR multiple URLs for batch scraping
  urls:
    - https://example.com/page1
    - https://example.com/page2
  # Custom headers
  headers:
    User-Agent: "Mozilla/5.0"
    Cookie: "session=abc123"
  # Timeout in seconds (default: 30)
  timeout: 60
  # Number of retry attempts (default: 0)
  retries: 3
  # Delay between requests in milliseconds (default: 0)
  delay: 1000
  # Proxy configuration
  proxy: http://proxy.example.com:8080
  # Cache configuration
  cacheDir: ./.cache
  useCache: true
```
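This README doesn't pin down the HTTP stack, but as a rough sketch of how `timeout`, `retries`, `delay`, and `proxy` typically map onto a Rust HTTP client (assuming reqwest's blocking API; this is an illustration, not Karkinos internals):

```rust
// Requires reqwest with the "blocking" feature in Cargo.toml.
use std::{thread, time::Duration};

// Hypothetical helper mirroring the options above; not a Karkinos API.
fn fetch_with_retries(url: &str) -> Result<String, reqwest::Error> {
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(60)) // timeout: 60
        .proxy(reqwest::Proxy::all("http://proxy.example.com:8080")?) // proxy
        .build()?;

    let (retries, delay_ms) = (3u32, 1000u64); // retries: 3, delay: 1000
    let mut last_err = None;
    for attempt in 0..=retries {
        if attempt > 0 {
            thread::sleep(Duration::from_millis(delay_ms)); // rate limiting
        }
        match client.get(url).send().and_then(|resp| resp.text()) {
            Ok(body) => return Ok(body),
            Err(e) => last_err = Some(e), // keep last error, retry
        }
    }
    Err(last_err.expect("at least one attempt was made"))
}
```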
## Data Extraction

```yaml
data:
  # Simple text extraction
  title:
    selector: h1
  # Extract from attribute
  image:
    selector: img.featured
    attr: src
  # Select nth element (0-indexed)
  firstParagraph:
    selector: p
    nth: 0
  # Default value if not found
  author:
    selector: .author
    default: "Unknown"
  # Disable trimming
  rawText:
    selector: .content
    trim: false
```
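To make `attr` and `nth` concrete: against markup like the following (illustrative), `image` captures the `src` attribute rather than the element text, and `firstParagraph` takes the first matching `<p>`:

```html
<img class="featured" src="/hero.png">
<p>First paragraph.</p>
<p>Second paragraph.</p>
```

yielding, roughly, `"image": "/hero.png"` and `"firstParagraph": "First paragraph."`.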
## Transformations

```yaml
data:
  # Extract using regex
  price:
    selector: .price
    regex: '\d+\.\d+'
    toNumber: true
  # Text replacement
  cleanTitle:
    selector: h1
    replace: ["Breaking: ", ""]
  # Case conversion
  upperTitle:
    selector: h1
    uppercase: true
  lowerTitle:
    selector: h1
    lowercase: true
  # Strip HTML tags
  cleanText:
    selector: .content
    stripHtml: true
  # Type conversion
  rating:
    selector: .rating
    toNumber: true
  isActive:
    selector: .status
    toBoolean: true
```
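As an illustration of chained transformations: if `.price` contains the text `Price: $19.99`, the `regex` extracts `19.99` and `toNumber` emits it as a JSON number; likewise a `.rating` of `4.5` and a `.status` that parses as a boolean would come out as (values illustrative):

```json
{ "price": 19.99, "rating": 4.5, "isActive": true }
```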
## Nested Data

```yaml
data:
  articles:
    selector: article
    data:
      title:
        selector: h2
      author:
        selector: .author
      tags:
        selector: .tag
        data:
          name:
            selector: span
```
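Because a nested `data` block collects every element matched by its parent selector, the result is an array of objects; the configuration above would produce output shaped roughly like this (values illustrative):

```json
{
  "articles": [
    {
      "title": "First post",
      "author": "Ada",
      "tags": [{ "name": "rust" }, { "name": "scraping" }]
    }
  ]
}
```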
## Pagination

Automatically scrape multiple pages:

```yaml
# Strategy 1: URL pattern with page numbers
config:
  url: https://example.com/products
  pagination:
    pagePattern: "?page={page}"
    startPage: 1
    endPage: 10
    stopOnEmpty: false
```

```yaml
# Strategy 2: Follow "next" links
config:
  url: https://example.com/blog
  pagination:
    nextSelector: "a.next-page"
    maxPages: 20
    stopOnEmpty: true
```

```yaml
# Strategy 3: Full URL pattern
config:
  url: https://example.com
  pagination:
    pagePattern: "https://example.com/search?q=rust&page={page}"
    startPage: 1
    maxPages: 5
```
Pagination options:

- `pagePattern`: URL pattern with a `{page}` placeholder (expanded as sketched after this list)
- `nextSelector`: CSS selector for the "next page" link
- `startPage`: Starting page number (default: 1)
- `maxPages`: Maximum pages to scrape (0 = unlimited for `nextSelector`)
- `endPage`: Ending page number (for `pagePattern`)
- `stopOnEmpty`: Stop if no results are found on a page
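As a sketch of how the `pagePattern` strategy expands into a list of page URLs (the helper name `expand_page_pattern` is hypothetical, not a Karkinos API):

```rust
// Expand a pagePattern into concrete page URLs. A pattern that is itself a
// full URL (Strategy 3) replaces the base URL instead of being appended.
fn expand_page_pattern(base_url: &str, pattern: &str, start: u32, end: u32) -> Vec<String> {
    (start..=end)
        .map(|page| {
            let expanded = pattern.replace("{page}", &page.to_string());
            if expanded.starts_with("http") {
                expanded
            } else {
                format!("{base_url}{expanded}")
            }
        })
        .collect()
}

fn main() {
    // Mirrors Strategy 1: pages 1 through 10 of /products.
    for url in expand_page_pattern("https://example.com/products", "?page={page}", 1, 10) {
        println!("{url}");
    }
}
```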
## Examples

### Hacker News Stories

```yaml
config:
  url: https://news.ycombinator.com
data:
  stories:
    selector: .athing
    data:
      title:
        selector: .titleline > a
      score:
        selector: .score
        toNumber: true
```
### Batch Scraping with Custom Headers

```yaml
config:
  urls:
    - https://example.com/page1
    - https://example.com/page2
  delay: 2000
  headers:
    User-Agent: "Karkinos/1.0"
data:
  title:
    selector: h1
  content:
    selector: article
```
### Product Listings with Transformations

```yaml
config:
  url: https://example.com/products
data:
  products:
    selector: .product
    data:
      name:
        selector: .product-name
        stripHtml: true
      price:
        selector: .price
        regex: '\d+\.\d+'
        toNumber: true
      inStock:
        selector: .availability
        toBoolean: true
```
### Caching and Retries

```yaml
config:
  url: https://example.com
  cacheDir: ./.scrape-cache
  useCache: true
  timeout: 30
  retries: 2
data:
  content:
    selector: .main-content
```
### Paginated Blog

```yaml
config:
  url: https://example.com/blog
  pagination:
    nextSelector: "a.next-page"
    maxPages: 10
    stopOnEmpty: true
  delay: 1000
data:
  articles:
    selector: article
    data:
      title:
        selector: h2
      date:
        selector: .post-date
```

## Output Formats

```bash
# Export to JSON
main config.krk.yaml -o output.json

# Export to CSV
main config.krk.yaml -o output.csv -f csv
```

Note: CSV output flattens simple fields. Nested arrays are JSON-encoded.
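For instance, a result with a simple `title` field plus a nested `articles` array would flatten roughly like this (illustrative):

```csv
title,articles
"My Blog","[{""title"":""First post"",""date"":""2024-01-01""}]"
```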
## Schema Generation

```bash
cargo run --bin gen
```

This creates `krk-schema.json` for configuration validation.
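One common way to use the generated schema, assuming an editor that runs the YAML language server, is a modeline at the top of your config file (the relative path is illustrative):

```yaml
# yaml-language-server: $schema=./krk-schema.json
config:
  url: https://example.com
```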
## Testing

```bash
cargo test
```

## License

MIT