Description
crawl4ai version
0.5.0.post4
Expected Behavior
The crawler should crawl the whole site (all 483 pages) without crashing.
Current Behavior
I get the following error
[ERROR]... × https://out-door.co.il/product/%d7%a4%d7%90%d7%a0%... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 528 in wrap_api_call (venv/lib/python3.12/site- │
│ packages/playwright/_impl/_connection.py): │
│ Error: Page.content: Target page, context or browser has been closed │
│ │
│ Code context: │
│ 523 parsed_st = _extract_stack_trace_information_from_stack(st, is_internal) │
│ 524 self._api_zone.set(parsed_st) │
│ 525 try: │
│ 526 return await cb() │
│ 527 except Exception as error: │
│ 528 → raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None │
│ 529 finally: │
│ 530 self._api_zone.set(None) │
│ 531 │
│ 532 def wrap_api_call_sync( │
│ 533 self, cb: Callable[[], Any], is_internal: bool = False │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
This happens after about 50 to 100 pages. I'm running on an EC2 t2.large instance, and this is my code:
@app.post("/crawl", response_model=CrawlResponse)
async def crawl(request: CrawlRequest):
"""
Run the crawler on the specified URL
"""
print(request)
try:
# Convert UUID to string for the query
crawler_config = execute_select_query(f"SELECT * FROM crawls WHERE id = '{request.crawler_id}'")
if not crawler_config:
raise HTTPException(
status_code=404,
detail=f"Crawler config not found for id: {request.crawler_id}"
)
crawler_config = crawler_config[0]
root_url = crawler_config['root_url']
logger.info(f"🔍 Starting crawl for URL: {root_url}")
depth = crawler_config.get('depth', 1)
include_external = crawler_config.get('include_external', False)
max_pages = crawler_config.get('max_pages', 5)
# Step 1: Create a pruning filter
prune_filter = PruningContentFilter(
# Lower → more content retained, higher → more content pruned
threshold=0.45,
# "fixed" or "dynamic"
threshold_type="dynamic",
# Ignore nodes with <5 words
min_word_threshold=5
)
# Step 2: Insert it into a Markdown Generator
md_generator = DefaultMarkdownGenerator(content_filter=prune_filter) #, options={"ignore_links": True}
# Step 3: Pass it to CrawlerRunConfig
# Configure the crawler
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=depth,
include_external=include_external,
max_pages=max_pages
),
scraping_strategy=LXMLWebScrapingStrategy(),
stream=True,
verbose=True,
markdown_generator=md_generator
)
crawled_pages = []
page_count = 0
# Run the crawler
async with AsyncWebCrawler() as crawler:
try:
async for result in await crawler.arun(crawler_config['root_url'], config=config):
processed_result = await process_crawl_result(crawler_config, result)
crawled_pages.append(processed_result)
page_count += 1
logger.info(f"Processed page {page_count}: {result.url}")
except Exception as crawl_error:
logger.error(f"Error during crawling: {str(crawl_error)}")
raise HTTPException(
status_code=500,
detail=f"Crawling process failed: {str(crawl_error)}"
)
result = {
"url": root_url,
"depth": depth,
"pages_crawled": page_count,
"crawled_pages": crawled_pages
}
return CrawlResponse(
status="success",
data=result
)
except Exception as e:
logger.error(f"Crawling error: {str(e)}")
raise HTTPException(
status_code=500,
detail=f"Crawling failed: {str(e)}"
)
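One thing I haven't tried yet is passing extra Chromium flags through BrowserConfig in case this is memory or shared-memory pressure on the instance. A minimal sketch of what I mean (the flag choice is my assumption, not something from the docs):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Assumption: Chromium tabs are dying under memory pressure, so move
    # shared memory off /dev/shm and turn on verbose browser-side logging.
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        extra_args=["--disable-dev-shm-usage"],
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://out-door.co.il/", config=CrawlerRunConfig())
        print(result.success)

asyncio.run(main())
```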
Any idea on how to debug this? What does this error mean?
My guess is that the headless browser is crashing, but I'm not sure how to confirm that or why it would happen; I plan to watch the browser's memory while the crawl runs (sketch below).
When I run a crawler with a simple fetch I can crawl all 483 pages on the site, but with crawl4ai it crashes after about 50 to 100 pages and just prints a list of these errors.
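For reference, this is roughly how I intend to watch the browser's memory (a psutil sketch; matching processes by name is a guess on my part):

```python
import time
import psutil

def watch_browser_memory(interval: float = 1.0) -> None:
    """Log the combined RSS of all Chromium/headless-shell processes."""
    while True:
        total_rss = 0
        for proc in psutil.process_iter(attrs=["name", "memory_info"]):
            name = (proc.info["name"] or "").lower()
            mem = proc.info["memory_info"]
            if mem and ("chrom" in name or "headless" in name):
                total_rss += mem.rss
        print(f"browser RSS: {total_rss / (1024 * 1024):.1f} MiB")
        time.sleep(interval)
```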
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
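A minimal reproduction with the FastAPI and database layers stripped out (same crawl4ai 0.5.0.post4 calls as in the endpoint above; the depth and page limits are placeholders for the values I normally load from the DB):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=3,            # placeholder; deep enough to pass ~100 pages
            include_external=False,
            max_pages=500,          # placeholder; the site has ~483 pages
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True,
    )
    count = 0
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://out-door.co.il/", config=config):
            count += 1
            print(count, result.url, result.success)

asyncio.run(main())
```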
OS
Ubuntu (EC2 t2.large)
Python version
3.12.3
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response