karkinos Καρκινος

🦀🦀🦀 A powerful and flexible website scraper written in Rust 🦀🦀🦀

Inspired by scrape-it

Features

  • Simple YAML Configuration: Define scraping rules using intuitive YAML syntax
  • CSS Selectors: Extract data using standard CSS selectors
  • Nested Data: Support for complex nested data structures
  • Data Transformations: Built-in text processing (regex, case conversion, type casting)
  • Multiple URLs: Batch scrape multiple pages in one configuration
  • HTTP Features: Custom headers, proxy support, timeout, and retry logic
  • Rate Limiting: Respectful scraping with configurable delays
  • Caching: Cache responses for faster development and testing
  • Multiple Output Formats: Export to JSON or CSV
  • Parallel Processing: Fast extraction using Rayon
  • Type Conversion: Automatic conversion to numbers and booleans
  • Pagination Support: Automatically scrape multiple pages using URL patterns or "next" links

Installation

cargo install --path .
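
cargo install --path . installs from a local checkout, so clone the repository first. A typical sequence (assuming the repository URL matches the project page, https://github.com/ggagosh/karkinos) is:

git clone https://github.com/ggagosh/karkinos
cd karkinos
cargo install --path .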

Usage

# Basic usage
main config.krk.yaml

# Save to file
main config.krk.yaml -o output.json

# Export to CSV
main config.krk.yaml -o output.csv -f csv

Configuration

Basic Structure

config:
  url: https://example.com
data:
  title:
    selector: h1
  description:
    selector: .description

Configuration Options

HTTP Configuration

config:
  # Single URL
  url: https://example.com

  # OR multiple URLs for batch scraping
  urls:
    - https://example.com/page1
    - https://example.com/page2

  # Custom headers
  headers:
    User-Agent: "Mozilla/5.0"
    Cookie: "session=abc123"

  # Timeout in seconds (default: 30)
  timeout: 60

  # Number of retry attempts (default: 0)
  retries: 3

  # Delay between requests in milliseconds (default: 0)
  delay: 1000

  # Proxy configuration
  proxy: http://proxy.example.com:8080

  # Cache configuration
  cacheDir: ./.cache
  useCache: true

Data Extraction

data:
  # Simple text extraction
  title:
    selector: h1

  # Extract from attribute
  image:
    selector: img.featured
    attr: src

  # Select nth element (0-indexed)
  firstParagraph:
    selector: p
    nth: 0

  # Default value if not found
  author:
    selector: .author
    default: "Unknown"

  # Disable trimming
  rawText:
    selector: .content
    trim: false

Data Transformations

data:
  # Extract using regex
  price:
    selector: .price
    regex: '\d+\.\d+'
    toNumber: true

  # Text replacement
  cleanTitle:
    selector: h1
    replace: ["Breaking: ", ""]

  # Case conversion
  upperTitle:
    selector: h1
    uppercase: true

  lowerTitle:
    selector: h1
    lowercase: true

  # Strip HTML tags
  cleanText:
    selector: .content
    stripHtml: true

  # Type conversion
  rating:
    selector: .rating
    toNumber: true

  isActive:
    selector: .status
    toBoolean: true

Nested Data

data:
  articles:
    selector: article
    data:
      title:
        selector: h2
      author:
        selector: .author
      tags:
        selector: .tag
        data:
          name:
            selector: span
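
When a field carries its own data block, the selector matches every element and each match is extracted into a list item. For the configuration above, the output is shaped roughly as follows (field values are purely illustrative):

{
  "articles": [
    {
      "title": "First article",
      "author": "Jane Doe",
      "tags": [
        { "name": "rust" },
        { "name": "scraping" }
      ]
    }
  ]
}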

Pagination

Automatically scrape multiple pages:

# Strategy 1: URL pattern with page numbers
config:
  url: https://example.com/products
  pagination:
    pagePattern: "?page={page}"
    startPage: 1
    endPage: 10
    stopOnEmpty: false

# Strategy 2: Follow "next" links
config:
  url: https://example.com/blog
  pagination:
    nextSelector: "a.next-page"
    maxPages: 20
    stopOnEmpty: true

# Strategy 3: Full URL pattern
config:
  url: https://example.com
  pagination:
    pagePattern: "https://example.com/search?q=rust&page={page}"
    startPage: 1
    maxPages: 5

Pagination Options:

  • pagePattern: URL pattern with {page} placeholder
  • nextSelector: CSS selector for "next page" link
  • startPage: Starting page number (default: 1)
  • maxPages: Maximum pages to scrape (0 = unlimited for nextSelector)
  • endPage: Ending page number (for pagePattern)
  • stopOnEmpty: Stop scraping when a page yields no results

Examples

Example 1: Basic Scraping

config:
  url: https://news.ycombinator.com
data:
  stories:
    selector: .athing
    data:
      title:
        selector: .titleline > a
      score:
        selector: .score
        toNumber: true

Example 2: Multiple URLs with Rate Limiting

config:
  urls:
    - https://example.com/page1
    - https://example.com/page2
  delay: 2000
  headers:
    User-Agent: "Karkinos/1.0"
data:
  title:
    selector: h1
  content:
    selector: article

Example 3: Advanced Transformations

config:
  url: https://example.com/products
data:
  products:
    selector: .product
    data:
      name:
        selector: .product-name
        stripHtml: true
      price:
        selector: .price
        regex: '\d+\.\d+'
        toNumber: true
      inStock:
        selector: .availability
        toBoolean: true

Example 4: Cached Development

config:
  url: https://example.com
  cacheDir: ./.scrape-cache
  useCache: true
  timeout: 30
  retries: 2
data:
  content:
    selector: .main-content

Example 5: Pagination

config:
  url: https://example.com/blog
  pagination:
    nextSelector: "a.next-page"
    maxPages: 10
    stopOnEmpty: true
  delay: 1000
data:
  articles:
    selector: article
    data:
      title:
        selector: h2
      date:
        selector: .post-date

Output Formats

JSON (default)

main config.krk.yaml -o output.json

CSV

main config.krk.yaml -o output.csv -f csv

Note: CSV output flattens simple fields. Nested arrays are JSON-encoded.
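
For example, a result with a simple title field and a nested articles array might be written along these lines (illustrative values):

title,articles
"Example Site","[{""title"":""First post""},{""title"":""Second post""}]"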

Development

Generate JSON Schema

cargo run --bin gen

This creates krk-schema.json for configuration validation.
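
Editors that support the yaml-language-server schema directive (for example VS Code with the Red Hat YAML extension) can then validate and autocomplete configurations. A hypothetical setup is to reference the generated schema at the top of a config file:

# yaml-language-server: $schema=./krk-schema.json
config:
  url: https://example.com
data:
  title:
    selector: h1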

Run Tests

cargo test

License

MIT
