Commit f6f7f1b
Release v0.8.0: Crash Recovery, Prefetch Mode & Security Fixes (#1712)
* Fix: Use correct URL variable for raw HTML extraction (#1116)
- Prevents full HTML content from being passed as the URL to extraction strategies
- Added unit tests covering both raw HTML and regular URL processing
* Fix #1181: Preserve whitespace in code blocks during HTML scraping
The remove_empty_elements_fast() method was removing whitespace-only
span elements inside <pre> and <code> tags, causing import statements
like "import torch" to become "importtorch". Now skips elements inside
code blocks where whitespace is significant.
* Refactor Pydantic model configuration to use ConfigDict for arbitrary types
* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621
* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638
* fix: ensure BrowserConfig.to_dict serializes proxy_config
* feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
through LLMExtractionStrategy, LLMContentFilter, table extraction, and
Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
and document them in the md_v2 guides
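A minimal sketch of what the end-to-end wiring might look like; the exact backoff field names on LLMConfig are assumptions inferred from the bullet above, not confirmed signatures.
```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Assumed field names for the new backoff knobs ("delay/attempt/factor" per above).
llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token="env:OPENAI_API_KEY",
    backoff_base_delay=2.0,     # seconds before the first retry (assumed name)
    backoff_max_attempts=5,     # total retry attempts (assumed name)
    backoff_factor=2.0,         # exponential multiplier between retries (assumed name)
)

strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    instruction="Extract the article title and author as JSON.",
)
```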
* reproduced AttributeError from #1642
* pass timeout parameter to docker client request
* added missing deep crawling objects to init
* generalized query in ContentRelevanceFilter to be a str or list
* import modules from enhanceable deserialization
* parameterized tests
* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268
* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412
* Add browser_context_id and target_id parameters to BrowserConfig
Enable Crawl4AI to connect to pre-created CDP browser contexts, which is
essential for cloud browser services that pre-create isolated contexts.
Changes:
- Add browser_context_id and target_id parameters to BrowserConfig
- Update from_kwargs() and to_dict() methods
- Modify BrowserManager.start() to use existing context when provided
- Add _get_page_by_target_id() helper method
- Update get_page() to handle pre-existing targets
- Add test for browser_context_id functionality
This enables cloud services to:
1. Create isolated CDP contexts before Crawl4AI connects
2. Pass context/target IDs to BrowserConfig
3. Have Crawl4AI reuse existing contexts instead of creating new ones
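A hedged sketch of how a cloud service might hand the pre-created IDs to Crawl4AI; the CDP endpoint and ID strings are placeholders, and additional connection flags may be needed depending on the setup.
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    cdp_url="http://localhost:9222",              # placeholder CDP endpoint
    browser_context_id="PRE_CREATED_CONTEXT_ID",  # isolated context made by the service
    target_id="PRE_CREATED_TARGET_ID",            # existing page/target in that context
)

async def crawl_in_existing_context():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")
        return result.markdown
```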
* Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios
* Fix: add cdp_cleanup_on_close to from_kwargs
* Fix: find context by target_id for concurrent CDP connections
* Fix: use target_id to find correct page in get_page
* Fix: use CDP to find context by browserContextId for concurrent sessions
* Revert context matching attempts - Playwright cannot see CDP-created contexts
* Add create_isolated_context flag for concurrent CDP crawls
When True, forces creation of a new browser context instead of reusing
the default context. Essential for concurrent crawls on the same browser
to prevent navigation conflicts.
* Add context caching to create_isolated_context branch
Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts
for multiple URLs with same config. Still creates new page per crawl
for navigation isolation. Benefits batch/deep crawls.
* Add init_scripts support to BrowserConfig for pre-page-load JS injection
This adds the ability to inject JavaScript that runs before any page loads,
useful for stealth evasions (canvas/audio fingerprinting, userAgentData).
- Add init_scripts parameter to BrowserConfig (list of JS strings)
- Apply init_scripts in setup_context() via context.add_init_script()
- Update from_kwargs() and to_dict() for serialization
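For example, injecting a small fingerprinting patch that runs before any page script (the snippet itself is illustrative, not part of the change):
```python
from crawl4ai import BrowserConfig

# Runs in every page of the context before page scripts, via context.add_init_script().
hide_webdriver = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

browser_config = BrowserConfig(
    init_scripts=[hide_webdriver],  # list of JS strings, applied in setup_context()
)
```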
* Fix CDP connection handling: support WS URLs and proper cleanup
Changes to browser_manager.py:
1. _verify_cdp_ready(): Support multiple URL formats
- WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly
- HTTP URLs with query params: Properly parse with urlparse to preserve query string
- Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params
2. close(): Proper cleanup when cdp_cleanup_on_close=True
- Close all sessions (pages)
- Close all contexts
- Call browser.close() to disconnect (doesn't terminate browser, just releases connection)
- Wait 1 second for CDP connection to fully release
- Stop Playwright instance to prevent memory leaks
This enables:
- Connecting to specific browsers via WS URL
- Reusing the same browser with multiple sequential connections
- No user wait needed between connections (internal 1s delay handles it)
Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.
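Sequential reuse of one remote browser over a WebSocket CDP URL might then look like the sketch below; the ws:// address is a placeholder.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def crawl_sequentially(urls):
    config = BrowserConfig(
        cdp_url="ws://127.0.0.1:9222/devtools/browser/PLACEHOLDER-ID",
        cdp_cleanup_on_close=True,  # close pages/contexts and release the CDP connection
    )
    for url in urls:
        # The internal ~1s delay described above means no manual wait between runs.
        async with AsyncWebCrawler(config=config) as crawler:
            result = await crawler.arun(url=url)
            print(url, result.success)

asyncio.run(crawl_sequentially(["https://example.com", "https://example.org"]))
```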
* Update gitignore
* Some debugging for caching
* Add _generate_screenshot_from_html for raw: and file:// URLs
Implements the missing method that was being called but never defined.
Now raw: and file:// URLs can generate screenshots by:
1. Loading HTML into a browser page via page.set_content()
2. Taking screenshot using existing take_screenshot() method
3. Cleaning up the page afterward
This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.
* Add PDF and MHTML support for raw: and file:// URLs
- Replace _generate_screenshot_from_html with _generate_media_from_html
- New method handles screenshot, PDF, and MHTML in one browser session
- Update raw: and file:// URL handlers to use new method
- Enables cached HTML to generate all media types
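From the caller's side, a cached HTML snapshot can now produce all three media types in one crawl; a sketch assuming the standard screenshot/pdf/capture_mhtml flags on CrawlerRunConfig:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def media_from_cached_html(html: str):
    config = CrawlerRunConfig(screenshot=True, pdf=True, capture_mhtml=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=f"raw:{html}", config=config)
        # result.screenshot (base64), result.pdf (bytes), result.mhtml (str) per existing fields
        return result.screenshot, result.pdf, result.mhtml
```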
* Add crash recovery for deep crawl strategies
Add optional resume_state and on_state_change parameters to all deep
crawl strategies (BFS, DFS, Best-First) for cloud deployment crash
recovery.
Features:
- resume_state: Pass saved state to resume from checkpoint
- on_state_change: Async callback fired after each URL for real-time
state persistence to external storage (Redis, DB, etc.)
- export_state(): Get last captured state manually
- Zero overhead when features are disabled (None defaults)
State includes visited URLs, pending queue/stack, depths, and
pages_crawled count. All state is JSON-serializable.
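A hedged sketch of resuming a BFS deep crawl from a checkpoint; the load/save helpers below are hypothetical stand-ins for Redis or database calls.
```python
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

def load_checkpoint():
    # Hypothetical: fetch the last saved state from external storage; None = start fresh.
    return None

def save_checkpoint(state_json: str):
    # Hypothetical: persist the JSON state to Redis/DB.
    pass

async def on_state_change(state: dict):
    save_checkpoint(json.dumps(state))  # fired after each crawled URL

strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    resume_state=load_checkpoint(),   # resume from checkpoint; None keeps zero overhead
    on_state_change=on_state_change,  # real-time persistence callback
)

async def run():
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        await crawler.arun(url="https://example.com", config=config)
        return strategy.export_state()  # manual snapshot of the last captured state
```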
* Fix: HTTP strategy raw: URL parsing truncates at # character
The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract
content from raw: URLs. This caused HTML with CSS color codes like #eee
to be truncated because # is treated as a URL fragment delimiter.
Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
After: raw:body{background:#eee} -> raw_content = 'body{background:#eee}'
Fix: Strip the raw: or raw:// prefix directly instead of using urlparse,
matching how the browser strategy handles it.
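The essence of the fix, sketched outside the actual strategy code:
```python
from urllib.parse import urlparse

url = "raw:body{background:#eee}"

# Before: urlparse() treats '#' as a fragment delimiter and truncates the content.
truncated = urlparse(url).path                 # -> 'body{background:'

# After: strip the raw:/raw:// prefix directly, so '#' survives.
if url.startswith("raw://"):
    raw_content = url[len("raw://"):]
elif url.startswith("raw:"):
    raw_content = url[len("raw:"):]            # -> 'body{background:#eee}'
```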
* Add base_url parameter to CrawlerRunConfig for raw HTML processing
When processing raw: HTML (e.g., from cache), the URL parameter is meaningless
for markdown link resolution. This adds a base_url parameter that can be set
explicitly to provide proper URL resolution context.
Changes:
- Add base_url parameter to CrawlerRunConfig.__init__
- Add base_url to CrawlerRunConfig.from_kwargs
- Update aprocess_html to use base_url for markdown generation
Usage:
    config = CrawlerRunConfig(base_url='https://example.com')
    result = await crawler.arun(url=f'raw:{html}', config=config)
* Add prefetch mode for two-phase deep crawling
- Add `prefetch` parameter to CrawlerRunConfig
- Add `quick_extract_links()` function for fast link extraction
- Add short-circuit in aprocess_html() for prefetch mode
- Add 42 tests (unit, integration, regression)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
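A sketch of phase one of a two-phase crawl using the new flag (phase two would re-crawl the selected links with full processing):
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def prefetch_internal_links(url: str):
    # prefetch=True short-circuits aprocess_html(), so only links are gathered quickly.
    config = CrawlerRunConfig(prefetch=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return [link["href"] for link in result.links.get("internal", [])]
```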
* Updates on proxy rotation and proxy configuration
* Add proxy support to HTTP crawler strategy
* Add browser pipeline support for raw:/file:// URLs
- Add process_in_browser parameter to CrawlerRunConfig
- Route raw:/file:// URLs through _crawl_web() when browser operations needed
- Use page.set_content() instead of goto() for local content
- Fix cookie handling for non-HTTP URLs in browser_manager
- Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
- Maintain fast path for raw:/file:// without browser params
Fixes #310
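For example, a cached snapshot that needs JavaScript execution and a screenshot can be pushed through the browser pipeline; a sketch using the parameters listed above:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def render_cached_html(html: str):
    config = CrawlerRunConfig(
        process_in_browser=True,   # explicitly route raw:/file:// through _crawl_web()
        js_code="document.querySelectorAll('.ad').forEach(el => el.remove());",
        screenshot=True,           # browser-only options like this also auto-trigger it
    )
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(url=f"raw:{html}", config=config)
```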
* Add smart TTL cache for sitemap URL seeder
- Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from old .jsonl to new .json format
- Fixes bug where incomplete cache was used indefinitely
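Typical usage might look like the sketch below, with the two new parameters from the list above; the domain is a placeholder.
```python
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def seed_urls():
    config = SeedingConfig(
        source="sitemap",
        cache_ttl_hours=24,             # refresh the cached sitemap after 24 hours
        validate_sitemap_lastmod=True,  # also invalidate when the sitemap's lastmod changes
    )
    async with AsyncUrlSeeder() as seeder:
        return await seeder.urls("example.com", config)
```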
* Update URL seeder docs with smart TTL cache parameters
- Add cache_ttl_hours and validate_sitemap_lastmod to parameter table
- Document smart TTL cache validation with examples
- Add cache-related troubleshooting entries
- Update key features summary
* Add MEMORY.md to gitignore
* Docs: Add multi-sample schema generation section
Add documentation explaining how to pass multiple HTML samples
to generate_schema() for stable selectors that work across pages
with varying DOM structures.
Includes:
- Problem explanation (fragile nth-child selectors)
- Solution with code example
- Key points for multi-sample queries
- Comparison table of fragile vs stable selectors
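A hedged sketch, assuming the documented approach combines several sample pages into the single html argument of generate_schema() (the exact multi-sample convention lives in the new docs section):
```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Assumption: samples are concatenated so the LLM sees DOM variations and prefers
# stable class/attribute selectors over fragile nth-child ones.
samples = [open(path).read() for path in ("listing_page_1.html", "listing_page_2.html")]
combined_html = "\n<!-- NEXT SAMPLE -->\n".join(samples)

schema = JsonCssExtractionStrategy.generate_schema(
    html=combined_html,
    query="Extract product name, price, and rating",
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
)
```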
* Fix critical RCE and LFI vulnerabilities in Docker API deployment
Security fixes for vulnerabilities reported by ProjectDiscovery:
1. Remote Code Execution via Hooks (CVE pending)
- Remove __import__ from allowed_builtins in hook_manager.py
- Prevents arbitrary module imports (os, subprocess, etc.)
- Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var
2. Local File Inclusion via file:// URLs (CVE pending)
- Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
- Block file://, javascript:, data: and other dangerous schemes
- Only allow http://, https://, and raw: (where appropriate)
3. Security hardening
- Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks)
- Add security warning comments in config.yml
- Add validate_url_scheme() helper for consistent validation
Testing:
- Add unit tests (test_security_fixes.py) - 16 tests
- Add integration tests (run_security_tests.py) for live server
Affected endpoints:
- POST /crawl (hooks disabled by default)
- POST /crawl/stream (hooks disabled by default)
- POST /execute_js (URL validation added)
- POST /screenshot (URL validation added)
- POST /pdf (URL validation added)
- POST /html (URL validation added)
Breaking changes:
- Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
- file:// URLs no longer work on API endpoints (use library directly)
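A sketch of the kind of check validate_url_scheme() performs, based on the allow/block lists above (not the actual server code):
```python
from urllib.parse import urlparse

def validate_url_scheme(url: str, allow_raw: bool = False) -> None:
    allowed = {"http", "https"} | ({"raw"} if allow_raw else set())
    scheme = urlparse(url).scheme.lower()
    if scheme not in allowed:
        # file://, javascript:, data: and other dangerous schemes are rejected
        # before the URL ever reaches the browser.
        raise ValueError(f"URL scheme '{scheme or 'none'}' is not allowed on this endpoint")
```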
* Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests
* Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
* Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
Documentation for v0.8.0 release:
- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes
Breaking changes documented:
- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints
Security fixes credited to Neo by ProjectDiscovery
* Add examples for deep crawl crash recovery and prefetch mode in documentation
* Release v0.8.0: The v0.8.0 Update
- Updated version to 0.8.0
- Added comprehensive demo and release notes
- Updated all documentation
* Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery
* Add async agenerate_schema method for schema generation
- Extract prompt building to shared _build_schema_prompt() method
- Add agenerate_schema() async version using aperform_completion_with_backoff
- Refactor generate_schema() to use shared prompt builder
- Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI)
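In an async context such as a FastAPI handler, usage might look like this sketch, assuming the async method mirrors generate_schema()'s arguments:
```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def build_schema(html: str) -> dict:
    # Async variant backed by aperform_completion_with_backoff, per the notes above.
    return await JsonCssExtractionStrategy.agenerate_schema(
        html=html,
        query="Extract the article title and author",
        llm_config=LLMConfig(provider="gemini/gemini-1.5-flash", api_token="env:GEMINI_API_KEY"),
    )
```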
* Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility
O-series (o1, o3) and GPT-5 models only support temperature=1.
Setting litellm.drop_params=True auto-drops unsupported parameters
instead of throwing UnsupportedParamsError.
Fixes temperature=0.01 error for these models in LLM extraction.
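For reference, the equivalent setting when using LiteLLM directly (drop_params is a documented LiteLLM global flag):
```python
import litellm

# Unsupported parameters (e.g. temperature != 1 for o1/o3/GPT-5 models) are
# silently dropped instead of raising UnsupportedParamsError.
litellm.drop_params = True
```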
---------
Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
58 files changed (+11,966 / −2,435 lines) across crawl4ai (deep_crawling, deploy/docker, tests), docs (blog, examples, md_v2), and tests (browser, cache_validation, deep_crawling, docker, proxy).