[Bug]: Deep crawl configured with arun_many deep crawls the site but only returns result of pages in 'urls' parameter #1679

@ankitrajmehta

Description

crawl4ai version

0.7.4

Expected Behavior

When arun_many() is passed a config with a deep_crawl_strategy, it should either deep-crawl from every URL in the 'urls' parameter and return all discovered pages, or ignore the strategy and crawl only the URLs passed in.

Current Behavior

The sites are deep-crawled starting from the 'urls' passed to arun_many(), but only the results for the seed pages themselves are returned. The other crawled pages are silently discarded and inaccessible. (See code and output below.)

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Code snippets

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def deep_crawl():

    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,             # Crawl N levels deep
            include_external=False,  # Stay within domain
            max_pages=5              # Limit for efficiency
        ),
        simulate_user=True,
        magic=True,
        cache_mode=CacheMode.BYPASS,
        exclude_external_links=True,
        verbose=True
    )


    async with AsyncWebCrawler() as crawler:
        
        try:
            results = await crawler.arun_many(["https://www.vercel.com", "https://www.google.com"], config=config)
            
            print(f"\n Discovered and crawled {len(results)} pages")
            print("results urls:")
            for result in results:
                print(result.url)
        except Exception as e:
            print(f"Error during crawling: {e}")


import asyncio
asyncio.run(deep_crawl())

OS

Windows

Python version

3.11.9

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

Output of above code:

[INIT].... → Crawl4AI 0.7.4 
[FETCH]... ↓ https://www.vercel.com                                                                               | ✓ | ⏱: 2.03s 
[SCRAPE].. ◆ https://www.vercel.com                                                                               | ✓ | ⏱: 0.09s 
[COMPLETE] ● https://www.vercel.com                                                                               | ✓ | ⏱: 2.12s 
[FETCH]... ↓ https://vercel.com/agent                                                                             | ✓ | ⏱: 1.10s 
[SCRAPE].. ◆ https://vercel.com/agent                                                                             | ✓ | ⏱: 0.07s 
[COMPLETE] ● https://vercel.com/agent                                                                             | ✓ | ⏱: 1.17s 
[FETCH]... ↓ https://www.google.com                                                                               | ✓ | ⏱: 3.61s 
[SCRAPE].. ◆ https://www.google.com                                                                               | ✓ | ⏱: 0.03s 
[COMPLETE] ● https://www.google.com                                                                               | ✓ | ⏱: 3.64s 
[FETCH]... ↓ https://www.google.com/imghp?hl=ne                                                                   | ✓ | ⏱: 3.29s 
[SCRAPE].. ◆ https://www.google.com/imghp?hl=ne                                                                   | ✓ | ⏱: 0.02s 
[COMPLETE] ● https://www.google.com/imghp?hl=ne                                                                   | ✓ | ⏱: 3.31s 
[FETCH]... ↓ https://vercel.com/ai                                                                                | ✓ | ⏱: 3.56s 
[SCRAPE].. ◆ https://vercel.com/ai                                                                                | ✓ | ⏱: 0.09s 
[COMPLETE] ● https://vercel.com/ai                                                                                | ✓ | ⏱: 3.66s 
[FETCH]... ↓ https://accounts.google.com/ServiceLogin?continu...c=futura_exp_og_so_72776762_e&hl=ne&passive=true  | ✓ | ⏱: 4.12s 
[SCRAPE].. ◆ https://accounts.google.com/ServiceLogin?continu...c=futura_exp_og_so_72776762_e&hl=ne&passive=true  | ✓ | ⏱: 0.06s 
[COMPLETE] ● https://accounts.google.com/ServiceLogin?continu...c=futura_exp_og_so_72776762_e&hl=ne&passive=true  | ✓ | ⏱: 4.18s 
[FETCH]... ↓ https://mail.google.com/mail/&ogbl                                                                   | ✓ | ⏱: 4.50s 
[SCRAPE].. ◆ https://mail.google.com/mail/&ogbl                                                                   | ✓ | ⏱: 0.11s 
[COMPLETE] ● https://mail.google.com/mail/&ogbl                                                                   | ✓ | ⏱: 4.61s 
[FETCH]... ↓ https://vercel.com/ai-gateway                                                                        | ✓ | ⏱: 4.45s 
[SCRAPE].. ◆ https://vercel.com/ai-gateway                                                                        | ✓ | ⏱: 0.09s 
[COMPLETE] ● https://vercel.com/ai-gateway                                                                        | ✓ | ⏱: 4.55s 
[FETCH]... ↓ https://vercel.com/home                                                                              | ✓ | ⏱: 5.60s 
[SCRAPE].. ◆ https://vercel.com/home                                                                              | ✓ | ⏱: 0.09s 
[COMPLETE] ● https://vercel.com/home                                                                              | ✓ | ⏱: 5.69s 

 Discovered and crawled 2 pages
results urls:
https://www.google.com
https://www.vercel.com

Clearly 5 pages were deep-crawled from vercel.com, but only the result for the seed 'vercel.com' page itself was returned. The other crawled pages were discarded.
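As a possible workaround until this is fixed: calling arun() once per seed URL (instead of arun_many()) appears to return the full list of deep-crawled results, since arun() with a deep_crawl_strategy yields one result per crawled page in non-streaming mode. The helper below is only a sketch under that assumption; `deep_crawl_each` is a name I made up, and it works with any crawler object exposing an async `arun(url, config=...)` that returns a list.

```python
async def deep_crawl_each(crawler, urls, config):
    """Deep-crawl each seed URL separately and collect every page result.

    Assumes crawler.arun() returns a list of results (one per crawled
    page) when config carries a deep_crawl_strategy, so nothing is dropped.
    """
    all_results = []
    for url in urls:
        results = await crawler.arun(url, config=config)
        all_results.extend(results)
    return all_results
```

With the repro above, this would be `await deep_crawl_each(crawler, ["https://www.vercel.com", "https://www.google.com"], config)` inside the `async with AsyncWebCrawler()` block, at the cost of crawling the seeds sequentially rather than concurrently.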
