Different results between Crawl4AI 0.5.0 and 0.6.0 #1018

SECVBulRep · 2025-04-23T12:12:06Z

Hi, team!

Why does the same code give me completely different results with Crawl4AI 0.5.0 and 0.6.0?

```python
# Imports were omitted from the original post; the crawl4ai paths below are assumed.
import asyncio
import json
import time
from pathlib import Path
from urllib.parse import urlparse

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.models import CrawlResultContainer

lc = LLMConfig(provider="")

prompt = """
You are given the text content of a website.
bla bla bla
return []."""

extraction_strategy = LLMExtractionStrategy(
    llm_config=lc,
    extraction_type="schema",
    instruction=prompt,
    chunk_token_threshold=1200,
    overlap_rate=0.1,
    apply_chunking=True,
    extra_args={"temperature": 0.1},
    verbose=True,
)


async def crawl_single_url(url: str, base_output_dir: Path):
    # One output directory per domain, e.g. crawl_results/depic.me/raw.json
    parsed = urlparse(url)
    domain_name = parsed.netloc.replace("www.", "")
    output_dir = base_output_dir / domain_name
    output_dir.mkdir(parents=True, exist_ok=True)
    output_file = output_dir / "raw.json"

    config = CrawlerRunConfig(
        # cache_mode=CacheMode.ENABLED,
        deep_crawl_strategy=DFSDeepCrawlStrategy(
            max_depth=3,  # Crawl initial page + 3 levels deep
            include_external=False,  # Stay within the same domain
            # max_pages=30,  # Maximum number of pages to crawl (optional)
            # score_threshold=0.5,  # Minimum score for URLs to be crawled (optional)
        ),
        scroll_delay=1,
        scan_full_page=True,
        wait_for_images=True,
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True,
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        delay_before_return_html=2,
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler() as crawler:
        try:
            results: CrawlResultContainer = await crawler.arun(url=url, config=config)
            print(f"\n Finished: {url}")

            if not hasattr(results, "__iter__"):
                print(f" Warning: Result from {url} is not iterable")
                return

            all_extracted = []

            for result in results:
                if not hasattr(result, "extracted_content") or not result.extracted_content:
                    continue

                extracted = result.extracted_content.strip()

                try:
                    parsed = json.loads(extracted)

                    # Skip if error=true
                    if (isinstance(parsed, list) and parsed and isinstance(parsed[0], dict) and parsed[0].get(
                            "error") is True) or \
                            (isinstance(parsed, dict) and parsed.get("error") is True):
                        print(f" Skipped due to error in content: {url}")
                        continue

                    all_extracted.append(parsed)

                except Exception as e:
                    print(f" Could not parse JSON from extracted content at {url}: {e}")
                    continue

            # Write the file only if at least one page produced content
            if all_extracted:
                with open(output_file, "w", encoding="utf-8") as f:
                    json.dump(all_extracted, f, ensure_ascii=False, indent=2)

        except Exception as e:
            print(f" Error crawling {url}: {e}")


async def crawl_all(urls: list[str]):
    base_output_dir = Path("crawl_results")
    base_output_dir.mkdir(exist_ok=True)

    start = time.perf_counter()

    # Crawl all URLs concurrently
    tasks = [asyncio.create_task(crawl_single_url(url, base_output_dir)) for url in urls]

    results = await asyncio.gather(*tasks, return_exceptions=True)

    end = time.perf_counter()
    duration = end - start
    print(f"\nTotal crawl time: {duration:.2f} seconds")

    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            print(f" Failed: {url} - {result}")
        else:
            print(f"Success: {url}")


if __name__ == "__main__":
    urls = [
        "https://depic.me",
    ]
    asyncio.run(crawl_all(urls))
```

Version 0.6.0 returns:

[INIT].... → Crawl4AI 0.6.0
[FETCH]... ↓ https://depic.me | ✓ | ⏱: 4.49s
[SCRAPE].. ◆ https://depic.me | ✓ | ⏱: 0.01s
[LOG] Call LLM for https://depic.me - block index: 0
[LOG] Extracted 0 blocks from URL: https://depic.me block index: 0
[EXTRACT]. ■ Completed for https://depic.me... | Time: 13.930713609996019s
[COMPLETE] ● https://depic.me | ✓ | ⏱: 18.43s

Finished: https://depic.me

⏱ Total crawl time: 19.52 seconds
Success: https://depic.me

Version 0.5.0 returns much more.
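
To put a number on the difference, the `raw.json` files from a run under each version could be compared with something like the sketch below. It assumes each run was saved to its own base directory; the directory names in the usage note are hypothetical, and `count_items` and `compare_runs` are illustrative helpers, not part of Crawl4AI.

```python
import json
from pathlib import Path


def count_items(raw_file: Path) -> int:
    """Count extracted items in a raw.json file written by crawl_single_url.

    The file holds a list with one entry per crawled page; each entry is
    either a list of extracted blocks or a single dict.
    """
    if not raw_file.exists():
        # crawl_single_url writes no file when no page yielded content
        return 0
    pages = json.loads(raw_file.read_text(encoding="utf-8"))
    return sum(len(page) if isinstance(page, list) else 1 for page in pages)


def compare_runs(dir_a: Path, dir_b: Path, domain: str) -> dict[str, int]:
    """Return item counts for the same domain from two result directories."""
    return {
        str(dir_a): count_items(dir_a / domain / "raw.json"),
        str(dir_b): count_items(dir_b / domain / "raw.json"),
    }
```

For example, `compare_runs(Path("crawl_results_050"), Path("crawl_results_060"), "depic.me")` would return one count per directory, with 0 for a run that wrote no file.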

Replies: 0 comments
