karkinos Καρκινος

🦀🦀🦀 A powerful and flexible website scraper written in Rust 🦀🦀🦀

Inspired by scrape-it

Features

  • Simple YAML Configuration: Define scraping rules using intuitive YAML syntax
  • CSS Selectors: Extract data using standard CSS selectors
  • Nested Data: Support for complex nested data structures
  • Data Transformations: Built-in text processing (regex, case conversion, type casting)
  • Multiple URLs: Batch scrape multiple pages in one configuration
  • HTTP Features: Custom headers, proxy support, timeout, and retry logic
  • Rate Limiting: Respectful scraping with configurable delays
  • Caching: Cache responses for faster development and testing
  • Multiple Output Formats: Export to JSON or CSV
  • Parallel Processing: Fast extraction using Rayon
  • Type Conversion: Automatic conversion to numbers and booleans
  • Pagination Support: Automatically scrape multiple pages using URL patterns or "next" links

Installation

cargo install --path .
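
cargo install --path . installs from a local checkout, so clone the repository first. A typical sequence (assuming the repository URL matches the project page, https://github.com/ggagosh/karkinos) is:

git clone https://github.com/ggagosh/karkinos
cd karkinos
cargo install --path .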

Usage

# Basic usage
main config.krk.yaml

# Save to file
main config.krk.yaml -o output.json

# Export to CSV
main config.krk.yaml -o output.csv -f csv

Configuration

Basic Structure

config:
  url: https://example.com
data:
  title:
    selector: h1
  description:
    selector: .description

Configuration Options

HTTP Configuration

config:
  # Single URL
  url: https://example.com

  # OR multiple URLs for batch scraping
  urls:
    - https://example.com/page1
    - https://example.com/page2

  # Custom headers
  headers:
    User-Agent: "Mozilla/5.0"
    Cookie: "session=abc123"

  # Timeout in seconds (default: 30)
  timeout: 60

  # Number of retry attempts (default: 0)
  retries: 3

  # Delay between requests in milliseconds (default: 0)
  delay: 1000

  # Proxy configuration
  proxy: http://proxy.example.com:8080

  # Cache configuration
  cacheDir: ./.cache
  useCache: true

Data Extraction

data:
  # Simple text extraction
  title:
    selector: h1

  # Extract from attribute
  image:
    selector: img.featured
    attr: src

  # Select nth element (0-indexed)
  firstParagraph:
    selector: p
    nth: 0

  # Default value if not found
  author:
    selector: .author
    default: "Unknown"

  # Disable trimming
  rawText:
    selector: .content
    trim: false

Data Transformations

data:
  # Extract using regex
  price:
    selector: .price
    regex: '\d+\.\d+'
    toNumber: true

  # Text replacement
  cleanTitle:
    selector: h1
    replace: ["Breaking: ", ""]

  # Case conversion
  upperTitle:
    selector: h1
    uppercase: true

  lowerTitle:
    selector: h1
    lowercase: true

  # Strip HTML tags
  cleanText:
    selector: .content
    stripHtml: true

  # Type conversion
  rating:
    selector: .rating
    toNumber: true

  isActive:
    selector: .status
    toBoolean: true

Nested Data

data:
  articles:
    selector: article
    data:
      title:
        selector: h2
      author:
        selector: .author
      tags:
        selector: .tag
        data:
          name:
            selector: span
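
When a field carries its own data block, the selector matches every element and each match is extracted into a list item. For the configuration above, the output is shaped roughly as follows (field values are purely illustrative):

{
  "articles": [
    {
      "title": "First article",
      "author": "Jane Doe",
      "tags": [
        { "name": "rust" },
        { "name": "scraping" }
      ]
    }
  ]
}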

Pagination

Automatically scrape multiple pages:

# Strategy 1: URL pattern with page numbers
config:
  url: https://example.com/products
  pagination:
    pagePattern: "?page={page}"
    startPage: 1
    endPage: 10
    stopOnEmpty: false

# Strategy 2: Follow "next" links
config:
  url: https://example.com/blog
  pagination:
    nextSelector: "a.next-page"
    maxPages: 20
    stopOnEmpty: true

# Strategy 3: Full URL pattern
config:
  url: https://example.com
  pagination:
    pagePattern: "https://example.com/search?q=rust&page={page}"
    startPage: 1
    maxPages: 5

Pagination Options:

  • pagePattern: URL pattern with {page} placeholder
  • nextSelector: CSS selector for "next page" link
  • startPage: Starting page number (default: 1)
  • maxPages: Maximum pages to scrape (0 = unlimited for nextSelector)
  • endPage: Ending page number (for pagePattern)
  • stopOnEmpty: Stop scraping when a page yields no results

Examples

Example 1: Basic Scraping

config:
  url: https://news.ycombinator.com
data:
  stories:
    selector: .athing
    data:
      title:
        selector: .titleline > a
      score:
        selector: .score
        toNumber: true

Example 2: Multiple URLs with Rate Limiting

config:
  urls:
    - https://example.com/page1
    - https://example.com/page2
  delay: 2000
  headers:
    User-Agent: "Karkinos/1.0"
data:
  title:
    selector: h1
  content:
    selector: article

Example 3: Advanced Transformations

config:
  url: https://example.com/products
data:
  products:
    selector: .product
    data:
      name:
        selector: .product-name
        stripHtml: true
      price:
        selector: .price
        regex: '\d+\.\d+'
        toNumber: true
      inStock:
        selector: .availability
        toBoolean: true

Example 4: Cached Development

config:
  url: https://example.com
  cacheDir: ./.scrape-cache
  useCache: true
  timeout: 30
  retries: 2
data:
  content:
    selector: .main-content

Example 5: Pagination

config:
  url: https://example.com/blog
  pagination:
    nextSelector: "a.next-page"
    maxPages: 10
    stopOnEmpty: true
  delay: 1000
data:
  articles:
    selector: article
    data:
      title:
        selector: h2
      date:
        selector: .post-date

Output Formats

JSON (default)

main config.krk.yaml -o output.json

CSV

main config.krk.yaml -o output.csv -f csv

Note: CSV output flattens simple fields. Nested arrays are JSON-encoded.
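
For example, a result with a simple title field and a nested articles array might be written along these lines (illustrative values):

title,articles
"Example Site","[{""title"":""First post""},{""title"":""Second post""}]"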

Development

Generate JSON Schema

cargo run --bin gen

This creates krk-schema.json for configuration validation.
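
Editors that support the yaml-language-server schema directive (for example VS Code with the Red Hat YAML extension) can then validate and autocomplete configurations. A hypothetical setup is to reference the generated schema at the top of a config file:

# yaml-language-server: $schema=./krk-schema.json
config:
  url: https://example.com
data:
  title:
    selector: h1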

Run Tests

cargo test

License

MIT
