Description
crawl4ai version
0.6.3
Expected Behavior
When using BFSDeepCrawlStrategy(max_depth=0) with delay_before_return_html=20.0 in CrawlerRunConfig, the crawler should:
- Wait for the specified 20 seconds before capturing HTML
- Allow JavaScript content to fully load during this delay
- Capture the fully rendered page with all dynamic content loaded
For JavaScript-heavy pages that show a loading spinner ("Loading...") while content loads asynchronously, the 20-second delay should be sufficient for content to appear.
Expected result: ~40,000+ characters of fully loaded content with document links extracted.
Current Behavior
When BFSDeepCrawlStrategy(max_depth=0) is present in the configuration, the delay_before_return_html parameter appears to be completely ignored or significantly reduced, resulting in:
- The page is captured before JavaScript finishes loading
- Only the initial loading screen is captured (loading spinner + error message)
- Content length is only 821 characters instead of 40,000+
- The page shows "Loading..." and "Cannot connect to server"
Important: The exact same configuration WITHOUT BFSDeepCrawlStrategy works perfectly and captures the full 40,904 characters with all content loaded.
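The symptom described above (a very short capture that still contains the loading text) can be checked mechanically. The following is a minimal sketch, not part of crawl4ai: `looks_premature`, the `LOADING_MARKERS` tuple, and the 5,000-character threshold are hypothetical names and values chosen for illustration, based on the 821 vs. 40,904 character counts reported here.

```python
# Hypothetical helper: flags a capture that looks like the loading screen.
# The marker strings come from this report; the threshold is an assumption.
LOADING_MARKERS = ("Loading...", "Cannot connect to server")


def looks_premature(markdown: str, min_chars: int = 5_000,
                    markers=LOADING_MARKERS) -> bool:
    """Return True when the text is short AND still shows a loading marker."""
    is_short = len(markdown) < min_chars
    has_marker = any(marker in markdown for marker in markers)
    return is_short and has_marker
```

A check like this could be run on `result.markdown.raw_markdown` after each crawl. It deliberately requires both conditions, so a legitimately short page without loading text is not flagged.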
Is this reproducible?
Yes
Inputs Causing the Bug
Test case: Any page with JavaScript-rendered content requiring 15-20 seconds to load (e.g., pages with heavy AJAX, infinite scroll, or dynamic content loading)
Settings:
# Browser Config
BrowserConfig(
    headless=True,
    java_script_enabled=True,
    viewport_width=1920,
    viewport_height=1080,
)
# Crawler Config (WITH bug)
CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=0),  # Bug trigger
    delay_before_return_html=20.0,  # Ignored when above is present
    magic=True,
    simulate_user=True,
    override_navigator=True,
)
Steps to Reproduce
1. Install crawl4ai 0.6.3
pip install crawl4ai==0.6.3
2. Create a test script with the reproduction code (see Code Snippets section below)
3. Run Test 1 - WITH BFSDeepCrawlStrategy(max_depth=0):
- Observe: Only ~800 characters captured
- Observe: Content shows loading screen ("Loading..." text/spinner)
- Note: Timer shows ~20 seconds but content is incomplete
4. Run Test 2 - WITHOUT BFSDeepCrawlStrategy:
- Observe: ~40,000 characters captured
- Observe: Full page content with all dynamic elements loaded
- Note: Same ~20 second timing but content is complete
5. Compare results:
- WITH strategy: Premature capture (bug)
- WITHOUT strategy: Correct behavior
Code snippets
Minimal reproduction script
#!/usr/bin/env python3
"""Minimal script to reproduce the bug."""
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator


async def reproduce_bug():
    """Compare behavior with and without BFSDeepCrawlStrategy."""
    browser_config = BrowserConfig(
        headless=True,
        java_script_enabled=True,
        viewport_width=1920,
        viewport_height=1080,
    )
    md_gen = DefaultMarkdownGenerator(options={"ignore_links": True})
    url = "https://example.com"  # Replace with JS-heavy test URL

    # Test 1: WITH BFSDeepCrawlStrategy (bug)
    config_bug = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=0),  # Bug trigger
        markdown_generator=md_gen,
        delay_before_return_html=20.0,  # Ignored!
        magic=True,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=config_bug)
        r = result[0] if isinstance(result, list) else result
        print(f"WITH strategy: {len(r.markdown.raw_markdown):,} chars")

    # Test 2: WITHOUT BFSDeepCrawlStrategy (works)
    config_work = CrawlerRunConfig(
        # No deep_crawl_strategy
        markdown_generator=md_gen,
        delay_before_return_html=20.0,  # Works!
        magic=True,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=config_work)
        r = result[0] if isinstance(result, list) else result
        print(f"WITHOUT strategy: {len(r.markdown.raw_markdown):,} chars")


asyncio.run(reproduce_bug())
---
Workaround: Config builder for single-page crawls
def build_single_page_config(delay_seconds: float = 20.0) -> CrawlerRunConfig:
    """
    Correct configuration for single-page crawling without link following.

    NOTE: Do NOT use BFSDeepCrawlStrategy(max_depth=0) - it breaks delay_before_return_html.
    Simply omit deep_crawl_strategy entirely.
    """
    return CrawlerRunConfig(
        # ❌ DON'T: deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=0)
        # ✅ DO: Omit it completely for single-page crawls
        markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        delay_before_return_html=delay_seconds,
        wait_for="css:body",
        magic=True,
        simulate_user=True,
        override_navigator=True,
    )
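When configuration is assembled from keyword arguments elsewhere in a codebase, the same workaround can be applied as a guard before the config object is built. This is only a sketch: `drop_depth_zero_strategy` is a hypothetical helper, and it assumes the strategy object exposes its depth limit as a `max_depth` attribute, which this report does not confirm.

```python
from typing import Any, Dict


def drop_depth_zero_strategy(config_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config_kwargs without a depth-0 deep-crawl strategy."""
    cleaned = dict(config_kwargs)  # shallow copy; the caller's dict is untouched
    strategy = cleaned.get("deep_crawl_strategy")
    # Assumption: the strategy stores its limit in a `max_depth` attribute.
    if strategy is not None and getattr(strategy, "max_depth", None) == 0:
        del cleaned["deep_crawl_strategy"]
    return cleaned
```

The cleaned dictionary could then be passed on as `CrawlerRunConfig(**drop_depth_zero_strategy(kwargs))`. Strategies with a depth greater than zero are left in place, since the report only covers the `max_depth=0` case.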
---
Helper: Quick comparison function
async def compare_delay_behavior(url: str, delay: float = 20.0):
    """Quick test to verify if bug exists on a given URL."""
    browser = BrowserConfig(headless=True, java_script_enabled=True)
    # With bug
    cfg_bug = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=0),
        delay_before_return_html=delay,
    )
    # Without bug
    cfg_ok = CrawlerRunConfig(
        delay_before_return_html=delay,
    )
    async with AsyncWebCrawler(config=browser) as crawler:
        r1 = await crawler.arun(url, config=cfg_bug)
        chars_bug = len(r1[0].markdown.raw_markdown if isinstance(r1, list) else r1.markdown.raw_markdown)
        r2 = await crawler.arun(url, config=cfg_ok)
        chars_ok = len(r2[0].markdown.raw_markdown if isinstance(r2, list) else r2.markdown.raw_markdown)
    print(f"WITH BFSDeepCrawlStrategy: {chars_bug:,} chars")
    print(f"WITHOUT: {chars_ok:,} chars")
    print(f"Bug present: {chars_bug < chars_ok * 0.5}")  # >50% content loss


# Usage
asyncio.run(compare_delay_behavior("https://example.com"))
OS
macOS (Darwin 24.6.0), also reproducible on Linux
Python version
3.11+
Browser
Chromium (via Playwright)
Browser version
Chrome 131.0.0.0 (via crawl4ai's Playwright integration)
Error logs & Screenshots (if applicable)
WITH BFSDeepCrawlStrategy (bug):
[FETCH]... | ✓ | ⏱: 20.92s
Content length: 821 chars
Preview: [Loading spinner and "Loading..." text visible]
WITHOUT BFSDeepCrawlStrategy (works):
[FETCH]... | ✓ | ⏱: 21.63s
Content length: 40,904 chars
Preview: [Full page content visible]
Note: Timer shows ~20s in both cases, but first test captures prematurely.
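To make the comparison in these logs explicit, the two runs can be parsed and compared side by side. The sketch below is illustrative only: `parse_run` and the two regular expressions are hypothetical and assume the exact line shapes shown above (`⏱: <seconds>s` and `Content length: <n> chars`).

```python
import re

# Patterns match the log line shapes quoted in this report (an assumption).
FETCH_RE = re.compile(r"⏱:\s*([\d.]+)s")
LENGTH_RE = re.compile(r"Content length:\s*([\d,]+)\s*chars")


def parse_run(log_text: str) -> tuple[float, int]:
    """Extract (elapsed_seconds, content_chars) from one run's log text."""
    fetch = FETCH_RE.search(log_text)
    length = LENGTH_RE.search(log_text)
    if fetch is None or length is None:
        raise ValueError("log text does not contain timing and length lines")
    return float(fetch.group(1)), int(length.group(1).replace(",", ""))


with_strategy = "[FETCH]... | ✓ | ⏱: 20.92s\nContent length: 821 chars"
without_strategy = "[FETCH]... | ✓ | ⏱: 21.63s\nContent length: 40,904 chars"

secs_bug, chars_bug = parse_run(with_strategy)
secs_ok, chars_ok = parse_run(without_strategy)
print(f"Elapsed difference: {secs_ok - secs_bug:.2f}s")
print(f"Content captured with strategy: {chars_bug / chars_ok:.1%} of full page")
```

The elapsed times differ by well under a second while the captured content differs by a large factor, which is the mismatch this report describes.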