feat(core): release version 0.5.0 with deep crawling and CLI

unclecode · unclecode · commit 367cd71db9e4 · 2025-02-21T19:55:02.000+08:00
This major release adds deep crawling capabilities, memory-adaptive dispatcher,
multiple crawling strategies, Docker deployment, and a new CLI. It also includes
significant improvements to proxy handling, PDF processing, and LLM integration.

BREAKING CHANGES:
- Add memory-adaptive dispatcher as default for arun_many()
- Move max_depth to CrawlerRunConfig
- Replace ScrapingMode enum with strategy pattern
- Update BrowserContext API
- Make model fields optional with defaults
- Remove content_filter parameter from CrawlerRunConfig
- Remove synchronous WebCrawler and old CLI
- Update Docker deployment configuration
- Replace FastFilterChain with FilterChain
- Change license to Apache 2.0 with attribution clause
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,10 +5,109 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
----
+
+## Version 0.5.0 (2025-02-21)
+
+### Added
+
+- *(crawler)* [**breaking**] Add memory-adaptive dispatcher with rate limiting
+- *(scraping)* [**breaking**] Add LXML-based scraping mode for improved performance
+- *(content-filter)* Add LLMContentFilter for intelligent markdown generation
+- *(dispatcher)* [**breaking**] Add streaming support for URL processing
+- *(browser)* [**breaking**] Improve browser context management and add shared data support
+- *(config)* [**breaking**] Add streaming support and config cloning
+- *(crawler)* Add URL redirection tracking
+- *(extraction)* Add LLM-powered schema generation utility
+- *(proxy)* Add proxy configuration support to CrawlerRunConfig
+- *(robots)* Add robots.txt compliance support
+- *(release)* [**breaking**] Prepare v0.4.3 beta release
+- *(proxy)* Add proxy rotation support and documentation
+- *(browser)* Add CDP URL configuration support
+- *(demo)* Uncomment feature demos and add fake-useragent dependency
+- *(pdf)* Add PDF processing capabilities
+- *(crawler)* [**breaking**] Enhance JavaScript execution and PDF processing
+- *(docker)* Add Docker deployment configuration and API server
+- *(docker)* Add Docker service integration and config serialization
+- *(docker)* [**breaking**] Enhance Docker deployment setup and configuration
+- *(api)* Improve cache handling and add API tests
+- *(crawler)* [**breaking**] Add deep crawling capabilities with BFS strategy
+- *(proxy)* [**breaking**] Add proxy rotation strategy
+- *(deep-crawling)* Add DFS strategy and update exports; refactor CLI entry point
+- *(cli)* Add command line interface with comprehensive features
+- *(config)* Enhance serialization and add deep crawling exports
+- *(crawler)* Add HTTP crawler strategy for lightweight web scraping
+- *(docker)* [**breaking**] Implement supervisor and secure API endpoints
+- *(docker)* [**breaking**] Add JWT authentication and improve server architecture
 
 ### Changed
-Okay, here's a detailed changelog in Markdown format, generated from the provided git diff and commit history. I've focused on user-facing changes, fixes, and features, and grouped them as requested:
+
+- *(browser)* Update browser channel default to 'chromium' in BrowserConfig.from_args method
+- *(crawler)* Optimize response handling and default settings
+- *(crawler)* Update hello_world example with proper content filtering
+- Update hello_world.py example
+- *(docs)* [**breaking**] Reorganize documentation structure and update styles
+- *(dispatcher)* [**breaking**] Migrate to modular dispatcher system with enhanced monitoring
+- *(scraping)* [**breaking**] Replace ScrapingMode enum with strategy pattern
+- *(browser)* Improve browser path management
+- *(models)* Rename final_url to redirected_url for consistency
+- *(core)* [**breaking**] Improve type hints and remove unused file
+- *(docs)* Improve code formatting in features demo
+- *(user-agent)* Improve user agent generation system
+- *(core)* [**breaking**] Reorganize project structure and remove legacy code
+- *(docker)* Clean up import statements in server.py
+- *(docker)* Remove unused models and utilities for cleaner codebase
+- *(docker)* [**breaking**] Improve server architecture and configuration
+- *(deep-crawl)* [**breaking**] Reorganize deep crawling functionality into dedicated module
+- *(deep-crawling)* [**breaking**] Reorganize deep crawling strategies and add new implementations
+- *(crawling)* [**breaking**] Improve type hints and code cleanup
+- *(crawler)* [**breaking**] Improve HTML handling and cleanup codebase
+- *(crawler)* [**breaking**] Remove content filter functionality
+- *(examples)* Update API usage in features demo
+- *(config)* [**breaking**] Enhance serialization and config handling
+
+### Docs
+
+- Add Code of Conduct for the project (#410)
+
+### Documentation
+
+- *(extraction)* Add clarifying comments for CSS selector behavior
+- *(readme)* Update personal story and project vision
+- *(urls)* [**breaking**] Update documentation URLs to new domain
+- *(api)* Add streaming mode documentation and examples
+- *(readme)* Update version and feature announcements for v0.4.3b1
+- *(examples)* Update demo scripts and fix output formats
+- *(examples)* Update v0.4.3 features demo to v0.4.3b2
+- *(readme)* Update version references and fix links
+- *(multi-url)* [**breaking**] Improve documentation clarity and update examples
+- *(examples)* Update proxy rotation demo and disable other demos
+- *(api)* Improve formatting and readability of API documentation
+- *(examples)* Add SERP API project example
+- *(urls)* Update documentation URLs to new domain
+- *(readme)* Resolve merge conflict and update version info
+
+### Fixed
+
+- *(browser)* Update default browser channel to chromium and simplify channel selection logic
+- *(browser)* [**breaking**] Default to Chromium channel for new headless mode (#387)
+- *(browser)* Resolve merge conflicts in browser channel configuration
+- Prevent memory leaks by ensuring proper closure of Playwright pages
+- Long-page screenshot not working (#403)
+- *(extraction)* JsonCss selector and crawler improvements
+- *(models)* [**breaking**] Make model fields optional with default values
+- *(dispatcher)* Adjust memory threshold and fix dispatcher initialization
+- *(install)* Ensure proper exit after running doctor command
+
+### Miscellaneous Tasks
+
+- *(cleanup)* Remove unused files and improve type hints
+- Add .gitattributes file
+
+## License Update
+
+Crawl4AI v0.5.0 updates the license to Apache 2.0 *with a required attribution clause*.  This means you are free to use, modify, and distribute Crawl4AI (even commercially), but you *must* clearly attribute the project in any public use or distribution.  See the updated `LICENSE` file for the full legal text and specific requirements.
+
+---
 
 ## Version 0.4.3b2 (2025-01-21)
 
@@ -286,12 +385,6 @@ This release introduces several powerful new features, including robots.txt comp
 - Fixed potential viewport mismatches by ensuring consistent use of `self.viewport_width` and `self.viewport_height` throughout the code.
 - Improved robustness of dynamic content loading to avoid timeouts and failed evaluations.
 
-
-
-
-
-
-
 ## [0.3.75] December 1, 2024
 
 ### PruningContentFilter
diff --git a/LICENSE b/LICENSE
@@ -48,4 +48,22 @@ You may add Your own copyright statement to Your modifications and may provide a
 
 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
 
-END OF TERMS AND CONDITIONS
+END OF TERMS AND CONDITIONS
+
+---
+Attribution Requirement
+
+All distributions, publications, or public uses of this software, or derivative works based on this software, must include the following attribution:
+
+"This product includes software developed by UncleCode (https://x.com/unclecode) as part of the Crawl4AI project (https://github.com/unclecode/crawl4ai)."
+
+This attribution must be displayed in a prominent and easily accessible location, such as:
+
+-   For software distributions: In a NOTICE file, README file, or equivalent documentation.
+-   For publications (research papers, articles, blog posts): In the acknowledgments section or a footnote.
+-   For websites/web applications: In an "About" or "Credits" section.
+-   For command-line tools: In the help/usage output.
+
+This requirement ensures proper credit is given for the use of Crawl4AI and helps promote the project.
+
+---
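One way a command-line tool might satisfy the "help/usage output" option above is to put the attribution string in its help epilog. The following is a minimal sketch, not an official recipe: the tool name `mytool` and the helper `build_parser` are hypothetical, and only the attribution text itself is taken from the LICENSE addition.

```python
import argparse

# Attribution text quoted from the LICENSE addition in this commit.
ATTRIBUTION = (
    "This product includes software developed by UncleCode "
    "(https://x.com/unclecode) as part of the Crawl4AI project "
    "(https://github.com/unclecode/crawl4ai)."
)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose --help output ends with the attribution line.

    "mytool" is a hypothetical program name used only for illustration.
    """
    return argparse.ArgumentParser(
        prog="mytool",
        description="Example tool built on Crawl4AI.",
        epilog=ATTRIBUTION,
        # Keep the epilog on one line instead of re-wrapping it.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )


if __name__ == "__main__":
    # Print the help text, which ends with the attribution string.
    build_parser().print_help()
```

Whether this placement counts as "prominent and easily accessible" is a judgment the LICENSE text leaves to the distributor.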
diff --git a/README.md b/README.md
@@ -574,9 +574,83 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
 
 We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
 
-## 📄 License 
 
-Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
+## 📄 License & Attribution
+
+This project is licensed under the Apache License 2.0 with a required attribution clause. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
+
+### Attribution Requirements
+When using Crawl4AI, you must include one of the following attribution methods:
+
+#### 1. Badge Attribution (Recommended)
+Add one of these badges to your README, documentation, or website:
+
+| Theme | Badge |
+|-------|-------|
+| **Disco Theme (Animated)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/></a> |
+| **Night Theme (Dark with Neon)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/></a> |
+| **Dark Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/></a> |
+| **Light Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/></a> |
+ 
+
+HTML code for adding the badges:
+```html
+<!-- Disco Theme (Animated) -->
+<a href="https://github.com/unclecode/crawl4ai">
+  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
+</a>
+
+<!-- Night Theme (Dark with Neon) -->
+<a href="https://github.com/unclecode/crawl4ai">
+  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/>
+</a>
+
+<!-- Dark Theme (Classic) -->
+<a href="https://github.com/unclecode/crawl4ai">
+  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/>
+</a>
+
+<!-- Light Theme (Classic) -->
+<a href="https://github.com/unclecode/crawl4ai">
+  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/>
+</a>
+
+<!-- Simple Shield Badge -->
+<a href="https://github.com/unclecode/crawl4ai">
+  <img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/>
+</a>
+```
+
+#### 2. Text Attribution
+Add this line to your documentation:
+```
+This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
+```
+
+## 📚 Citation
+
+If you use Crawl4AI in your research or project, please cite:
+
+```bibtex
+@software{crawl4ai2024,
+  author = {UncleCode},
+  title = {Crawl4AI: Open-source LLM Friendly Web Crawler \& Scraper},
+  year = {2024},
+  publisher = {GitHub},
+  journal = {GitHub Repository},
+  howpublished = {\url{https://github.com/unclecode/crawl4ai}},
+  commit = {Please use the commit hash you're working with}
+}
+```
+
+Text citation format:
+```
+UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software]. 
+GitHub. https://github.com/unclecode/crawl4ai
+```
 
 ## 📧 Contact 
 
diff --git a/cliff.toml b/cliff.toml
@@ -0,0 +1,24 @@
+[changelog]
+# Template format
+header = """
+# Changelog\n
+All notable changes to this project will be documented in this file.\n
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).\n
+"""
+
+# Organize commits by type
+[git]
+conventional_commits = true
+filter_unconventional = true
+commit_parsers = [
+    { message = "^feat", group = "Added"},
+    { message = "^fix", group = "Fixed"},
+    { message = "^doc", group = "Documentation"},
+    { message = "^perf", group = "Performance"},
+    { message = "^refactor", group = "Changed"},
+    { message = "^style", group = "Changed"},
+    { message = "^test", group = "Testing"},
+    { message = "^chore\\(release\\): prepare for", skip = true},
+    { message = "^chore", group = "Miscellaneous Tasks"},
+]
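The effect of the `commit_parsers` table can be illustrated with a small stand-alone sketch. It assumes that parsers are tried in order and the first matching pattern decides the group, which is why the `chore(release)` skip rule is listed before the generic `^chore` rule. The `classify` function is a hypothetical helper, not part of git-cliff, and returning `None` for unmatched messages is a simplification.

```python
import re

# (pattern, group) pairs mirroring commit_parsers in cliff.toml.
# A group of None stands for `skip = true`.
COMMIT_PARSERS = [
    (r"^feat", "Added"),
    (r"^fix", "Fixed"),
    (r"^doc", "Documentation"),
    (r"^perf", "Performance"),
    (r"^refactor", "Changed"),
    (r"^style", "Changed"),
    (r"^test", "Testing"),
    (r"^chore\(release\): prepare for", None),
    (r"^chore", "Miscellaneous Tasks"),
]


def classify(message: str):
    """Return the changelog group for a commit message.

    Returns None when the commit is skipped or no pattern matches
    (a simplification of what the real tool does).
    """
    for pattern, group in COMMIT_PARSERS:
        if re.search(pattern, message):
            return group  # first matching parser wins (assumed)
    return None


if __name__ == "__main__":
    for msg in (
        "feat(core): release version 0.5.0 with deep crawling and CLI",
        "chore(release): prepare for v0.5.0",
        "chore: add .gitattributes file",
    ):
        print(f"{msg!r} -> {classify(msg)}")
```

If the two `chore` rules were swapped, release-preparation commits would land under "Miscellaneous Tasks" instead of being skipped.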
diff --git a/docs/assets/powered-by-dark.svg b/docs/assets/powered-by-dark.svg
@@ -0,0 +1,25 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="120" height="35" viewBox="0 0 120 35">
+  <!-- Dark Theme -->
+  <g>
+    <defs>
+      <pattern id="halftoneDark" width="4" height="4" patternUnits="userSpaceOnUse">
+        <circle cx="2" cy="2" r="1" fill="#eee" opacity="0.1"/>
+      </pattern>
+      <pattern id="halftoneTextDark" width="3" height="3" patternUnits="userSpaceOnUse">
+        <circle cx="1.5" cy="1.5" r="2" fill="#aaa" opacity="0.2"/>
+      </pattern>
+    </defs>
+    <!-- Outer rectangle forms the near-black (#111) border -->
+    <rect width="120" height="35" rx="5" fill="#111"/>
+    <!-- Dark background slightly smaller to show thicker border -->
+    <rect x="2" y="2" width="116" height="31" rx="4" fill="#1a1a1a"/>
+    <rect x="2" y="2" width="116" height="31" rx="4" fill="url(#halftoneDark)"/>
+    
+    <!-- Logo with halftone -->
+    <path d="M30 17.5 a7.5 7.5 0 1 1 -15 0 a7.5 7.5 0 1 1 15 0" fill="none" stroke="#eee" stroke-width="2"/>
+    <path d="M18 17.5 L27 17.5" stroke="#eee" stroke-width="2"/>
+    <circle cx="22.5" cy="17.5" r="2" fill="#eee"/>
+    
+    <text x="40" y="23" fill="#eee" font-family="Arial, sans-serif" font-weight="500" font-size="14">Crawl4AI</text>
+  </g>
+</svg>
diff --git a/docs/assets/powered-by-disco.svg b/docs/assets/powered-by-disco.svg
@@ -0,0 +1,64 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="120" height="35" viewBox="0 0 120 35">
+  <g>
+    <defs>
+      <pattern id="cyberdots" width="4" height="4" patternUnits="userSpaceOnUse">
+        <circle cx="2" cy="2" r="1">
+          <animate attributeName="fill" 
+                   values="#FF2EC4;#8B5CF6;#0BC5EA;#FF2EC4" 
+                   dur="6s" 
+                   repeatCount="indefinite"/>
+          <animate attributeName="opacity" 
+                   values="0.2;0.4;0.2" 
+                   dur="4s" 
+                   repeatCount="indefinite"/>
+        </circle>
+      </pattern>
+      <filter id="neonGlow" x="-20%" y="-20%" width="140%" height="140%">
+        <feGaussianBlur stdDeviation="1" result="blur"/>
+        <feFlood flood-color="#FF2EC4" flood-opacity="0.2">
+          <animate attributeName="flood-color"
+                   values="#FF2EC4;#8B5CF6;#0BC5EA;#FF2EC4"
+                   dur="8s"
+                   repeatCount="indefinite"/>
+        </feFlood>
+        <feComposite in2="blur" operator="in"/>
+        <feMerge>
+          <feMergeNode/>
+          <feMergeNode in="SourceGraphic"/>
+        </feMerge>
+      </filter>
+    </defs>
+    
+    <rect width="120" height="35" rx="5" fill="#0A0A0F"/>
+    <rect x="2" y="2" width="116" height="31" rx="4" fill="#16161E"/>
+    <rect x="2" y="2" width="116" height="31" rx="4" fill="url(#cyberdots)"/>
+    
+    <!-- Logo with animated neon -->
+    <path d="M30 17.5 a7.5 7.5 0 1 1 -15 0 a7.5 7.5 0 1 1 15 0" fill="none" stroke="#8B5CF6" stroke-width="2" filter="url(#neonGlow)">
+      <animate attributeName="stroke"
+               values="#FF2EC4;#8B5CF6;#0BC5EA;#FF2EC4"
+               dur="8s"
+               repeatCount="indefinite"/>
+    </path>
+    <path d="M18 17.5 L27 17.5" stroke="#8B5CF6" stroke-width="2" filter="url(#neonGlow)">
+      <animate attributeName="stroke"
+               values="#FF2EC4;#8B5CF6;#0BC5EA;#FF2EC4"
+               dur="8s"
+               repeatCount="indefinite"/>
+    </path>
+    <circle cx="22.5" cy="17.5" r="2" fill="#0BC5EA">
+      <animate attributeName="fill" 
+               values="#0BC5EA;#FF2EC4;#8B5CF6;#0BC5EA" 
+               dur="8s" 
+               repeatCount="indefinite"/>
+    </circle>
+    
+    <text x="40" y="23" font-family="Arial, sans-serif" font-weight="500" font-size="14" filter="url(#neonGlow)">
+      <animate attributeName="fill"
+               values="#FF2EC4;#8B5CF6;#0BC5EA;#FF2EC4"
+               dur="8s"
+               repeatCount="indefinite"/>
+      Crawl4AI
+    </text>
+  </g>
+</svg>
diff --git a/docs/assets/powered-by-light.svg b/docs/assets/powered-by-light.svg
@@ -0,0 +1,21 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="120" height="35" viewBox="0 0 120 35">
+  <g>
+    <defs>
+      <pattern id="halftoneLight" width="4" height="4" patternUnits="userSpaceOnUse">
+        <circle cx="2" cy="2" r="1" fill="#111" opacity="0.1"/>
+      </pattern>
+    </defs>
+    <!-- Light gray border -->
+    <rect width="120" height="35" rx="5" fill="#DDD"/>
+    <!-- Light background -->
+    <rect x="2" y="2" width="116" height="31" rx="4" fill="#fff"/>
+    <rect x="2" y="2" width="116" height="31" rx="4" fill="url(#halftoneLight)"/>
+    
+    <!-- Logo -->
+    <path d="M30 17.5 a7.5 7.5 0 1 1 -15 0 a7.5 7.5 0 1 1 15 0" fill="none" stroke="#111" stroke-width="2"/>
+    <path d="M18 17.5 L27 17.5" stroke="#111" stroke-width="2"/>
+    <circle cx="22.5" cy="17.5" r="2" fill="#111"/>
+    
+    <text x="40" y="23" fill="#111" font-family="Arial, sans-serif" font-weight="500" font-size="14">Crawl4AI</text>
+  </g>
+</svg>
diff --git a/docs/assets/powered-by-night.svg b/docs/assets/powered-by-night.svg
diff --git a/docs/md_v2/blog/index.md b/docs/md_v2/blog/index.md
diff --git a/docs/md_v2/blog/releases/0.5.0.md b/docs/md_v2/blog/releases/0.5.0.md