09 Dec 08:49

60d6173

Release v0.7.8 Latest

Latest

🎉 Crawl4AI v0.7.8 Released!

📦 Installation

PyPI:

pip install crawl4ai==0.7.8

Docker:

docker pull unclecode/crawl4ai:0.7.8
docker pull unclecode/crawl4ai:latest

Note: Docker images are being built and will be available shortly.
Check the Docker Release workflow for build status.

📝 What's Changed

See CHANGELOG.md for details.

Assets 2

14 Nov 09:28

github-actions

v0.7.7

6244f56

Release v0.7.7

🎉 Crawl4AI v0.7.7 Released!

This release introduces a complete self-hosting platform with enterprise-grade real-time monitoring. This release transforms Crawl4AI Docker from a simple containerized crawler into a production-ready platform with full operational transparency and control.

🚀 What's New

Major Feature: Real-time Monitoring & Self-Hosting Platform

Docker deployment now includes:

📊 Interactive Monitoring Dashboard (/dashboard)
🔌 Comprehensive Monitor API
⚡ WebSocket Streaming
🔥 Smart Browser Pool (3-tier architecture)
🧹 Janitor System
📈 Production-Ready

🐛 Critical Bug Fixes

Fixed async LLM extraction blocking issue (#1055) - now supports true parallel processing
Fixed CDP endpoint verification with exponential backoff (#1445)
Fixed arun_many to always return a list, even on exception

Configuration & Features

Updated browser and crawler config documentation to match implementation
Enhanced DFS deep crawl strategy with seen URL tracking
Fixed sitemap parsing and URL normalization in AsyncUrlSeeder (#1559)
Fixed viewport configuration in managed browsers (#1490)
Fixed remove_overlay_elements functionality (#1396)

Docker & Infrastructure

Fixed LLM API key handling for multi-provider support
Standardized Docker port to 11235 across all configs
Improved error handling with comprehensive status codes
Fixed fit_html serialization in /crawl and /crawl/stream endpoints

Security

Updated pyOpenSSL from >=24.3.0 to >=25.3.0 (security vulnerability fix)
Added verification tests for security updates

📦 Installation

PyPI:

pip install crawl4ai==0.7.7

Docker:

docker pull unclecode/crawl4ai:0.7.7
docker pull unclecode/crawl4ai:latest

Note: Docker images are being built and will be available shortly.
Check the Docker Release workflow for build status.

📝 What's Changed

See CHANGELOG.md for details.

Assets 2

22 Oct 12:06

github-actions

v0.7.6

ca100c6

Release v0.7.6

🎉 Crawl4AI v0.7.6 Released!

Crawl4AI v0.7.6 - Webhook Support for Docker Job Queue API

Users can now:

Use webhooks with both /crawl/job and /llm/job endpoints
Get real-time notifications instead of polling
Configure webhook delivery with custom headers
Include full data in webhook payloads
Set global webhook URLs in config.yml
Benefit from automatic retry with exponential backoff

📦 Installation

PyPI:

pip install crawl4ai==0.7.6

Docker:

docker pull unclecode/crawl4ai:0.7.6
docker pull unclecode/crawl4ai:latest

Note: Docker images are being built and will be available shortly.
Check the Docker Release workflow for build status.

📝 What's Changed

See CHANGELOG.md for details.

Assets 2

21 Oct 08:15

github-actions

v0.7.5

f6a02c4

Release v0.7.5

🚀 Crawl4AI v0.7.5: Docker Hooks & Security Update

🎯 What's New

🔧 Docker Hooks System

Inject custom Python functions at 8 key pipeline points for authentication, performance optimization, and content processing.

Function-Based API with IDE support:

from crawl4ai import hooks_to_string

async def on_page_context_created(page, context, **kwargs):
    """Block images to speed up crawling"""
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    return page

hooks_code = hooks_to_string({"on_page_context_created": on_page_context_created})

8 Available Hook Points:
on_browser_created, on_page_context_created, before_goto, after_goto, on_user_agent_updated, on_execution_started, before_retrieve_html, before_return_html

🤖 Enhanced LLM Integration

Custom temperature parameter for creativity control
Multi-provider support (OpenAI, Gemini, custom endpoints)
base_url configuration for self-hosted models
Improved Docker API integration

🔒 HTTPS Preservation

New preserve_https_for_internal_links option maintains secure protocols throughout crawling — critical for authenticated sessions and security-conscious applications.

🛠️ Major Bug Fixes

URL Processing: Fixed '+' sign preservation in query parameters (#1332)
JWT Authentication: Resolved Docker JWT validation issues (#1442)
Playwright Stealth: Fixed stealth features integration (#1481)
Proxy Configuration: Enhanced parsing with new proxy_config structure
Memory Management: Fixed leaks in long-running sessions
Docker Serialization: Resolved JSON encoding errors (#1419)
LLM Providers: Fixed custom provider integration for adaptive crawler (#1291)
Performance: Resolved backoff strategy failures (#989)

📦 Installation

PyPI:
pip install crawl4ai==0.7.5

Docker:
docker pull unclecode/crawl4ai:0.7.5
docker pull unclecode/crawl4ai:latest

Platforms Supported: Linux/AMD64, Linux/ARM64 (Apple Silicon, AWS Graviton)

⚠️ Breaking Changes

Python 3.10+ Required (upgraded from 3.9)
Proxy Parameter Deprecated - Use new proxy_config structure
New Dependency - cssselect added for better CSS handling

📚 Resources

📖 Full Release Notes: https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md
📘 Documentation: https://docs.crawl4ai.com
💬 Discord Community: https://discord.gg/jP8KfhDhyN
🐛 Issues: https://github.com/unclecode/crawl4ai/issues

🙏 Contributors

Thank you to everyone who reported issues, provided feedback, and contributed to this release!

Full Changelog: v0.7.4...v0.7.5

Assets 2

17 Aug 12:12

github-actions

v0.7.4

e651e04

Release v0.7.4

🎉 Crawl4AI v0.7.4 Released!

📦 Installation

PyPI:

pip install crawl4ai==0.7.4

Docker:

docker pull unclecode/crawl4ai:0.7.4
docker pull unclecode/crawl4ai:latest

📝 What's Changed

See CHANGELOG.md for details.

Assets 2

09 Aug 12:38

github-actions

v0.7.3

21f79fe

Release v0.7.3

🚀 Crawl4AI v0.7.3: The Multi-Config Intelligence Update

Welcome to Crawl4AI v0.7.3! This release brings powerful new capabilities for stealth crawling, intelligent URL configuration, memory optimization, and enhanced data extraction. Whether you're dealing with bot-protected sites, mixed content types, or large-scale crawling operations, this update has you covered.

💖 GitHub Sponsors Now Live!

After powering 51,000+ developers and becoming the #1 trending web crawler, we're launching GitHub Sponsors to ensure Crawl4AI stays independent and innovative forever.

🏆 Be a Founding Sponsor (First 50 Only!)

🌱 Believer ($5/mo): Join the movement + sponsors-only Discord
🚀 Builder ($50/mo): Priority support + early feature access
💼 Growing Team ($500/mo): Bi-weekly syncs + optimization help
🏢 Data Infrastructure Partner ($2000/mo): Full partnership + dedicated support

Why sponsor? Own your data pipeline. No API limits. Direct access to the creator.

Become a Sponsor → | See Benefits

🎯 Major Features

🕵️ Undetected Browser Support

Break through sophisticated bot detection systems with our new stealth capabilities:

from crawl4ai import AsyncWebCrawler, BrowserConfig

# Enable stealth mode for undetectable crawling
browser_config = BrowserConfig(
    browser_type="undetected",  # Use undetected Chrome
    headless=True,              # Can run headless with stealth
    extra_args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-web-security"
    ]
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    # Successfully bypass Cloudflare, Akamai, and custom bot detection
    result = await crawler.arun("https://protected-site.com")
    print(f"✅ Bypassed protection! Content: {len(result.markdown)} chars")

What it enables:

Access previously blocked corporate sites and databases
Gather competitor data from protected sources
Monitor pricing on e-commerce sites with anti-bot measures
Collect news and social media content despite protection systems

🎨 Multi-URL Configuration System

Apply different crawling strategies to different URL patterns automatically:

from crawl4ai import CrawlerRunConfig

# Define specialized configs for different content types
configs = [
    # Documentation sites - aggressive caching, include links
    CrawlerRunConfig(
        url_matcher=["*docs*", "*documentation*"],
        cache_mode="write",
        markdown_generator_options={"include_links": True}
    ),
    
    # News/blog sites - fresh content, scroll for lazy loading
    CrawlerRunConfig(
        url_matcher=lambda url: 'blog' in url or 'news' in url,
        cache_mode="bypass",
        js_code="window.scrollTo(0, document.body.scrollHeight/2);"
    ),
    
    # API endpoints - structured extraction
    CrawlerRunConfig(
        url_matcher=["*.json", "*api*"],
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o-mini",
            extraction_type="structured"
        )
    ),
    
    # Default fallback for everything else
    CrawlerRunConfig()
]

# Crawl multiple URLs with perfect configurations
results = await crawler.arun_many([
    "https://docs.python.org/3/",      # → Uses documentation config
    "https://blog.python.org/",        # → Uses blog config  
    "https://api.github.com/users",    # → Uses API config
    "https://example.com/"             # → Uses default config
], config=configs)

Perfect for:

Mixed content sites (blogs, docs, downloads)
Multi-domain crawling with different needs per domain
Eliminating complex conditional logic in extraction code
Optimizing performance by giving each URL exactly what it needs

🧠 Memory Monitoring & Optimization

Track and optimize memory usage during large-scale operations:

from crawl4ai.memory_utils import MemoryMonitor

# Monitor memory during crawling
monitor = MemoryMonitor()
monitor.start_monitoring()

# Perform memory-intensive operations
results = await crawler.arun_many([
    "https://heavy-js-site.com",
    "https://large-images-site.com", 
    "https://dynamic-content-site.com"
] * 100)  # Large batch

# Get detailed memory report
report = monitor.get_report()
print(f"Peak memory usage: {report['peak_mb']:.1f} MB")
print(f"Memory efficiency: {report['efficiency']:.1f}%")

# Automatic optimization suggestions
if report['peak_mb'] > 1000:  # > 1GB
    print("💡 Consider batch size optimization")
    print("💡 Enable aggressive garbage collection")

Benefits:

Prevent memory-related crashes in production services
Right-size server resources based on actual usage patterns
Identify bottlenecks for performance optimization
Plan horizontal scaling based on memory requirements

📊 Enhanced Table Extraction

Direct pandas DataFrame conversion from web tables:

result = await crawler.arun("https://site-with-tables.com")

# New streamlined approach
if result.tables:
    print(f"Found {len(result.tables)} tables")
    
    import pandas as pd
    for i, table in enumerate(result.tables):
        # Instant DataFrame conversion
        df = pd.DataFrame(table['data'])
        print(f"Table {i}: {df.shape[0]} rows × {df.shape[1]} columns")
        print(df.head())
        
        # Rich metadata available
        print(f"Source: {table.get('source_xpath', 'Unknown')}")
        print(f"Headers: {table.get('headers', [])}")

# Old way (now deprecated)
# tables_data = result.media.get('tables', [])  # ❌ Don't use this

Improvements:

Faster transition from web data to analysis-ready DataFrames
Cleaner integration with data processing pipelines
Simplified table extraction for automated reporting
Better table structure preservation

🐳 Docker LLM Provider Flexibility

Switch between LLM providers without rebuilding images:

# Option 1: Direct environment variables
docker run -d \
  -e LLM_PROVIDER="groq/llama-3.2-3b-preview" \
  -e GROQ_API_KEY="your-key" \
  -p 11235:11235 \
  unclecode/crawl4ai:0.7.3

# Option 2: Using .llm.env file (recommended for production)
docker run -d \
  --env-file .llm.env \
  -p 11235:11235 \
  unclecode/crawl4ai:0.7.3

Create .llm.env file:

LLM_PROVIDER=openai/gpt-4o-mini
OPENAI_API_KEY=your-openai-key
GROQ_API_KEY=your-groq-key

Override per request when needed:

# Use cheaper models for simple tasks, premium for complex ones
response = requests.post("http://localhost:11235/crawl", json={
    "url": "https://complex-page.com",
    "extraction_strategy": {
        "type": "llm",
        "provider": "openai/gpt-4"  # Override default
    }
})

🔧 Bug Fixes & Improvements

URL Matcher Fallback: Resolved edge cases in pattern matching logic
Memory Management: Fixed memory leaks in long-running sessions
Sitemap Processing: Improved redirect handling in sitemap fetching
Table Extraction: Enhanced detection and extraction accuracy
Error Handling: Better messages and recovery from network failures

📚 Documentation & Architecture

Architecture Refactoring: Moved 2,450+ lines to backup for cleaner codebase
Real-World Examples: Added practical use cases with actual URLs
Migration Guides: Complete transition from result.media to result.tables
Comprehensive Guides: Full documentation for undetected browsers and multi-config

📦 Installation & Upgrade

PyPI Installation

# Fresh install
pip install crawl4ai==0.7.3

# Upgrade from previous version
pip install --upgrade crawl4ai==0.7.3

Docker Images

# Specific version
docker pull unclecode/crawl4ai:0.7.3

# Latest (points to 0.7.3)
docker pull unclecode/crawl4ai:latest

# Version aliases
docker pull unclecode/crawl4ai:0.7    # Minor version
docker pull unclecode/crawl4ai:0      # Major version

Migration Notes

result.tables replaces result.media.get('tables')
Undetected browser requires browser_type="undetected"
Multi-config uses url_matcher parameter in CrawlerRunConfig

🎉 What's Next?

This release sets the foundation for even more advanced features coming in v0.8:

AI-powered content understanding
Advanced crawling strategies
Enhanced data pipeline integrations
More stealth and anti-detection capabilities

📝 Complete Documentation

Full Release Notes - Detailed technical explanations
Changelog - Complete list of changes
Documentation - Full API reference and guides
Discord Community - Get help and share experiences

Live Long and import crawl4ai

Crawl4AI continues to evolve with your needs. This release makes it stealthier, smarter, and more scalable. Try the new undetected browser and multi-config features—they're game changers!

- The Crawl4AI Team

📝 This release draft was composed and edited by human but rewritten and finalized by AI. If you notice any mistakes, please raise an issue.

Assets 2

25 Jul 10:19

unclecode

v0.7.2

4864730

v0.7.2: CI/CD & Dependency Optimization Update

🚀 Crawl4AI v0.7.2: CI/CD & Dependency Optimization Update

July 25, 2025 • 3 min read

This release introduces automated CI/CD pipelines for seamless releases and optimizes dependencies for a lighter, more efficient package.

🎯 What's New

🔄 Automated Release Pipeline

GitHub Actions CI/CD: Automated PyPI and Docker Hub releases on tag push
Multi-platform Docker images: Support for both AMD64 and ARM64 architectures
Version consistency checks: Ensures tag, package, and Docker versions align
Automated release notes: GitHub releases created automatically

📦 Dependency Optimization

Moved sentence-transformers to optional dependencies: Significantly reduces default installation size
Lighter Docker images: Optimized Dockerfile for faster builds and smaller images
Better dependency management: Core vs. optional dependencies clearly separated

🏗️ CI/CD Pipeline

The new automated release process ensures consistent, reliable releases:

# Trigger releases with a simple tag
git tag v0.7.2
git push origin v0.7.2

# Automatically:
# ✅ Validates version consistency
# ✅ Builds and publishes to PyPI
# ✅ Builds multi-platform Docker images
# ✅ Pushes to Docker Hub with proper tags
# ✅ Creates GitHub release

💾 Lighter Installation

Default installation is now significantly smaller:

# Core installation (smaller, faster)
pip install crawl4ai==0.7.2

# With ML features (includes sentence-transformers)
pip install crawl4ai[transformer]==0.7.2

# Full installation
pip install crawl4ai[all]==0.7.2

🐳 Docker Improvements

Enhanced Docker support with multi-platform images:

# Pull the latest version
docker pull unclecode/crawl4ai:0.7.2
docker pull unclecode/crawl4ai:latest

# Available tags:
# - unclecode/crawl4ai:0.7.2 (specific version)
# - unclecode/crawl4ai:0.7 (minor version)
# - unclecode/crawl4ai:0 (major version)
# - unclecode/crawl4ai:latest

🔧 Technical Details

Dependency Changes

sentence-transformers moved from required to optional dependencies
Reduces default installation by ~500MB
No impact on functionality when transformer features aren't needed

CI/CD Configuration

GitHub Actions workflows for automated releases
Version validation before publishing
Parallel PyPI and Docker Hub deployments
Automatic tagging strategy for Docker images

🚀 Installation

pip install crawl4ai==0.7.2

No breaking changes - direct upgrade from v0.7.0 or v0.7.1.

Questions? Issues?

P.S. The new CI/CD pipeline will make future releases faster and more reliable. Thanks for your patience as we improve our release process!

Assets 2

17 Jul 09:48

unclecode

v0.7.1

0163bd7

v0.7.1:Update

🛠️ Crawl4AI v0.7.1: Minor Cleanup Update

July 17, 2025 • 2 min read

A small maintenance release that removes unused code and improves documentation.

🎯 What's Changed

Removed unused StealthConfig from crawl4ai/browser_manager.py
Updated documentation with better examples and parameter explanations
Fixed virtual scroll configuration examples in docs

🧹 Code Cleanup

Removed unused StealthConfig import and configuration that wasn't being used anywhere in the codebase. The project uses its own custom stealth implementation through JavaScript injection instead.

# Removed unused code:
from playwright_stealth import StealthConfig
stealth_config = StealthConfig(...)  # This was never used

📖 Documentation Updates

Fixed adaptive crawling parameter examples
Updated session management documentation
Corrected virtual scroll configuration examples

🚀 Installation

pip install crawl4ai==0.7.1

No breaking changes - upgrade directly from v0.7.0.

Questions? Issues?

GitHub: github.com/unclecode/crawl4ai
Discord: discord.gg/crawl4ai

Assets 2

12 Jul 11:13

unclecode

v0.7.0

7b9ba30

v0.7.0: The Adaptive Intelligence Update

🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update

January 28, 2025 • 10 min read

Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.

🎯 What's New at a Glance

Adaptive Crawling: Your crawler now learns and adapts to website patterns
Virtual Scroll Support: Complete content extraction from infinite scroll pages
Link Preview with 3-Layer Scoring: Intelligent link analysis and prioritization
Async URL Seeder: Discover thousands of URLs in seconds with intelligent filtering
PDF Parsing: Extract data from PDF documents
Performance Optimizations: Significant speed and memory improvements

🧠 Adaptive Crawling: Intelligence Through Pattern Learning

The Problem: Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.

My Solution: I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.

Technical Deep-Dive

The Adaptive Crawler maintains a persistent state for each domain, tracking:

Pattern success rates
Selector stability over time
Content structure variations
Extraction confidence scores

from crawl4ai import AdaptiveCrawler, AdaptiveConfig, CrawlState

# Initialize with custom learning parameters
config = AdaptiveConfig(
    confidence_threshold=0.7,    # Min confidence to use learned patterns
    max_history=100,            # Remember last 100 crawls per domain
    learning_rate=0.2,          # How quickly to adapt to changes
    patterns_per_page=3,        # Patterns to learn per page type
    extraction_strategy='css'   # 'css' or 'xpath'
)

adaptive_crawler = AdaptiveCrawler(config)

# First crawl - crawler learns the structure
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://news.example.com/article/12345",
        config=CrawlerRunConfig(
            adaptive_config=config,
            extraction_hints={  # Optional hints to speed up learning
                "title": "article h1",
                "content": "article .body-content"
            }
        )
    )
    
    # Crawler identifies and stores patterns
    if result.success:
        state = adaptive_crawler.get_state("news.example.com")
        print(f"Learned {len(state.patterns)} patterns")
        print(f"Confidence: {state.avg_confidence:.2%}")

# Subsequent crawls - uses learned patterns
result2 = await crawler.arun(
    "https://news.example.com/article/67890",
    config=CrawlerRunConfig(adaptive_config=config)
)
# Automatically extracts using learned patterns!

Expected Real-World Impact:

News Aggregation: Maintain 95%+ extraction accuracy even as news sites update their templates
E-commerce Monitoring: Track product changes across hundreds of stores without constant maintenance
Research Data Collection: Build robust academic datasets that survive website redesigns
Reduced Maintenance: Cut selector update time by 80% for frequently-changing sites

🌊 Virtual Scroll: Complete Content Capture

The Problem: Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.

My Solution: I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.

Implementation Details

from crawl4ai import VirtualScrollConfig

# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
    container_selector="[data-testid='primaryColumn']",
    scroll_count=20,                    # Number of scrolls
    scroll_by="container_height",       # Smart scrolling by container size
    wait_after_scroll=1.0,             # Let content load
    capture_method="incremental",       # Capture new content on each scroll
    deduplicate=True                   # Remove duplicate elements
)

# For e-commerce product grids (Instagram style)
grid_config = VirtualScrollConfig(
    container_selector="main .product-grid",
    scroll_count=30,
    scroll_by=800,                     # Fixed pixel scrolling
    wait_after_scroll=1.5,             # Images need time
    stop_on_no_change=True            # Smart stopping
)

# For news feeds with lazy loading
news_config = VirtualScrollConfig(
    container_selector=".article-feed",
    scroll_count=50,
    scroll_by="page_height",           # Viewport-based scrolling
    wait_after_scroll=0.5,
    wait_for_selector=".article-card",  # Wait for specific elements
    timeout=30000                      # Max 30 seconds total
)

# Use it in your crawl
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://twitter.com/trending",
        config=CrawlerRunConfig(
            virtual_scroll_config=twitter_config,
            # Combine with other features
            extraction_strategy=JsonCssExtractionStrategy({
                "tweets": {
                    "selector": "[data-testid='tweet']",
                    "fields": {
                        "text": {"selector": "[data-testid='tweetText']", "type": "text"},
                        "likes": {"selector": "[data-testid='like']", "type": "text"}
                    }
                }
            })
        )
    )
    
    print(f"Captured {len(result.extracted_content['tweets'])} tweets")

Key Capabilities:

DOM Recycling Awareness: Detects and handles virtual DOM element recycling
Smart Scroll Physics: Three modes - container height, page height, or fixed pixels
Content Preservation: Captures content before it's destroyed
Intelligent Stopping: Stops when no new content appears
Memory Efficient: Streams content instead of holding everything in memory

Expected Real-World Impact:

Social Media Analysis: Capture entire Twitter threads with hundreds of replies, not just top 10
E-commerce Scraping: Extract 500+ products from infinite scroll catalogs vs. 20-50 with traditional methods
News Aggregation: Get all articles from modern news sites, not just above-the-fold content
Research Applications: Complete data extraction from academic databases using virtual pagination

🔗 Link Preview: Intelligent Link Analysis and Scoring

The Problem: You crawl a page and get 200 links. Which ones matter? Which lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.

My Solution: I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.

The Three-Layer Scoring System

from crawl4ai import LinkPreviewConfig

# Configure intelligent link analysis
link_config = LinkPreviewConfig(
    # What to analyze
    include_internal=True,
    include_external=True,
    max_links=100,              # Analyze top 100 links
    
    # Relevance scoring
    query="machine learning tutorials",  # Your interest
    score_threshold=0.3,        # Minimum relevance score
    
    # Performance
    concurrent_requests=10,     # Parallel processing
    timeout_per_link=5000,      # 5s per link
    
    # Advanced scoring weights
    scoring_weights={
        "intrinsic": 0.3,       # Link quality indicators
        "contextual": 0.5,      # Relevance to query
        "popularity": 0.2       # Link prominence
    }
)

# Use in your crawl
result = await crawler.arun(
    "https://tech-blog.example.com",
    config=CrawlerRunConfig(
        link_preview_config=link_config,
        score_links=True
    )
)

# Access scored and sorted links
for link in result.links["internal"][:10]:  # Top 10 internal links
    print(f"Score: {link['total_score']:.3f}")
    print(f"  Intrinsic: {link['intrinsic_score']:.1f}/10")  # Position, attributes
    print(f"  Contextual: {link['contextual_score']:.1f}/1")  # Relevance to query
    print(f"  URL: {link['href']}")
    print(f"  Title: {link['head_data']['title']}")
    print(f"  Description: {link['head_data']['meta']['description'][:100]}...")

Scoring Components:

Intrinsic Score (0-10): Based on link quality indicators
- Position on page (navigation, content, footer)
- Link attributes (rel, title, class names)
- Anchor text quality and length
- URL structure and depth
Contextual Score (0-1): Relevance to your query
- Semantic similarity using embeddings
- Keyword matching in link text and title
- Meta description analysis
- Content preview scoring
Total Score: Weighted combination for final ranking

Expected Real-World Impact:

Research Efficiency: Find relevant papers 10x faster by following only high-score links
Competitive Analysis: Automatically identify important pages on competitor sites
Content Discovery: Build topic-focused crawlers that stay on track
SEO Audits: Identify and prioritize high-value internal linking opportunities

🎣 Async URL Seeder: Automated URL Discovery at Scale

The Problem: You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.

My Solution: I built Async URL Seeder—a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.

Technical Architecture

...

Assets 2

12 May 13:44

unclecode

v0.6.3

897e017

v0.6.3

Release 0.6.3 (unreleased)

Features

extraction: add RegexExtractionStrategy for pattern-based extraction, including built-in patterns for emails, URLs, phones, dates, support for custom regexes, an LLM-assisted pattern generator, optimized HTML preprocessing via fit_html, and enhanced network response body capture (9b5ccac)
docker-api: introduce job-based polling endpoints—POST /crawl/job & GET /crawl/job/{task_id} for crawls, POST /llm/job & GET /llm/job/{task_id} for LLM tasks—backed by Redis task management with configurable TTL, moved schemas to schemas.py, and added demo_docker_polling.py example (94e9959)
browser: improve profile management and cleanup—add process cleanup for existing Chromium instances on Windows/Unix, fix profile creation by passing full browser config, ship detailed browser/CLI docs and initial profile-creation test, bump version to 0.6.3 (9499164)

Fixes

crawler: remove automatic page closure in take_screenshot and take_screenshot_naive, preventing premature teardown; callers now must explicitly close pages (BREAKING CHANGE) (a3e9ef9)

Documentation

format bash scripts in docs/apps/linkdin/README.md so examples copy & paste cleanly (87d4b0f)
update the same README with full litellm argument details for correct script usage (bd5a9ac)

Refactoring

logger: centralize color codes behind an Enum in async_logger, browser_profiler, content_filter_strategy and related modules for cleaner, type-safe formatting (cd2b490)

Experimental

start migration of logging stack to rich (WIP, work ongoing) (b2f3cb0)

Assets 2

Uh oh!

Releases: unclecode/crawl4ai

Release v0.7.8

🎉 Crawl4AI v0.7.8 Released!

📦 Installation

📝 What's Changed

Uh oh!

Release v0.7.7

🎉 Crawl4AI v0.7.7 Released!

🚀 What's New

📦 Installation

📝 What's Changed

Uh oh!

Release v0.7.6

🎉 Crawl4AI v0.7.6 Released!

📦 Installation

📝 What's Changed

Uh oh!

Release v0.7.5

🚀 Crawl4AI v0.7.5: Docker Hooks & Security Update

🎯 What's New

🔧 Docker Hooks System

Uh oh!

Release v0.7.4

🎉 Crawl4AI v0.7.4 Released!

📦 Installation

📝 What's Changed

Uh oh!

Release v0.7.3

🚀 Crawl4AI v0.7.3: The Multi-Config Intelligence Update

💖 GitHub Sponsors Now Live!

🏆 Be a Founding Sponsor (First 50 Only!)

🎯 Major Features

🕵️ Undetected Browser Support

🎨 Multi-URL Configuration System

🧠 Memory Monitoring & Optimization

📊 Enhanced Table Extraction

🐳 Docker LLM Provider Flexibility

🔧 Bug Fixes & Improvements

📚 Documentation & Architecture

📦 Installation & Upgrade

PyPI Installation

Docker Images

Migration Notes

🎉 What's Next?

📝 Complete Documentation

Uh oh!

v0.7.2: CI/CD & Dependency Optimization Update

🚀 Crawl4AI v0.7.2: CI/CD & Dependency Optimization Update

🎯 What's New

🔄 Automated Release Pipeline

📦 Dependency Optimization

🏗️ CI/CD Pipeline

💾 Lighter Installation

🐳 Docker Improvements

🔧 Technical Details

Dependency Changes

CI/CD Configuration

🚀 Installation

Uh oh!

v0.7.1:Update

🛠️ Crawl4AI v0.7.1: Minor Cleanup Update

🎯 What's Changed

🧹 Code Cleanup

📖 Documentation Updates

🚀 Installation

Uh oh!

v0.7.0: The Adaptive Intelligence Update

🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update

🎯 What's New at a Glance

🧠 Adaptive Crawling: Intelligence Through Pattern Learning

Technical Deep-Dive

🌊 Virtual Scroll: Complete Content Capture

Implementation Details

🔗 Link Preview: Intelligent Link Analysis and Scoring

The Three-Layer Scoring System

🎣 Async URL Seeder: Automated URL Discovery at Scale

Technical Architecture

Uh oh!

v0.6.3