Open
Labels
🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)
Description
crawl4ai version
0.7.4
Expected Behavior
When arun_many() is passed a config with deep_crawl_strategy set, it should either deep crawl from every URL in 'urls' and return all crawled pages, or ignore the strategy and crawl only the pages passed in the 'urls' parameter.
Current Behavior
Each site is deep crawled starting from the 'urls' passed into arun_many(), but only the results for the pages listed in 'urls' are returned. The other crawled pages are simply discarded and inaccessible. (See code and output below.)
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy, BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,             # Crawl N levels deep
            include_external=False,  # Stay within domain
            max_pages=5              # Limit for efficiency
        ),
        simulate_user=True,
        magic=True,
        cache_mode=CacheMode.BYPASS,
        exclude_external_links=True,
        verbose=True
    )
    async with AsyncWebCrawler() as crawler:
        try:
            results = await crawler.arun_many(["https://www.vercel.com", "https://www.google.com"], config=config)
            print(f"\n Discovered and crawled {len(results)} pages")
            print("results urls:")
            for result in results:
                print(result.url)
        except Exception as e:
            print(f"Error during crawling: {e}")

asyncio.run(deep_crawl())
OS
Windows
Python version
3.11.9
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
Output of above code:
[INIT].... → Crawl4AI 0.7.4
[FETCH]... ↓ https://www.vercel.com | ✓ | ⏱: 2.03s
[SCRAPE].. ◆ https://www.vercel.com | ✓ | ⏱: 0.09s
[COMPLETE] ● https://www.vercel.com | ✓ | ⏱: 2.12s
[FETCH]... ↓ https://vercel.com/agent | ✓ | ⏱: 1.10s
[SCRAPE].. ◆ https://vercel.com/agent | ✓ | ⏱: 0.07s
[COMPLETE] ● https://vercel.com/agent | ✓ | ⏱: 1.17s
[FETCH]... ↓ https://www.google.com | ✓ | ⏱: 3.61s
[SCRAPE].. ◆ https://www.google.com | ✓ | ⏱: 0.03s
[COMPLETE] ● https://www.google.com | ✓ | ⏱: 3.64s
[FETCH]... ↓ https://www.google.com/imghp?hl=ne | ✓ | ⏱: 3.29s
[SCRAPE].. ◆ https://www.google.com/imghp?hl=ne | ✓ | ⏱: 0.02s
[COMPLETE] ● https://www.google.com/imghp?hl=ne | ✓ | ⏱: 3.31s
[FETCH]... ↓ https://vercel.com/ai | ✓ | ⏱: 3.56s
[SCRAPE].. ◆ https://vercel.com/ai | ✓ | ⏱: 0.09s
[COMPLETE] ● https://vercel.com/ai | ✓ | ⏱: 3.66s
[FETCH]... ↓ https://accounts.google.com/ServiceLogin?continu...c=futura_exp_og_so_72776762_e&hl=ne&passive=true | ✓ | ⏱: 4.12s
[SCRAPE].. ◆ https://accounts.google.com/ServiceLogin?continu...c=futura_exp_og_so_72776762_e&hl=ne&passive=true | ✓ | ⏱: 0.06s
[COMPLETE] ● https://accounts.google.com/ServiceLogin?continu...c=futura_exp_og_so_72776762_e&hl=ne&passive=true | ✓ | ⏱: 4.18s
[FETCH]... ↓ https://mail.google.com/mail/&ogbl | ✓ | ⏱: 4.50s
[SCRAPE].. ◆ https://mail.google.com/mail/&ogbl | ✓ | ⏱: 0.11s
[COMPLETE] ● https://mail.google.com/mail/&ogbl | ✓ | ⏱: 4.61s
[FETCH]... ↓ https://vercel.com/ai-gateway | ✓ | ⏱: 4.45s
[SCRAPE].. ◆ https://vercel.com/ai-gateway | ✓ | ⏱: 0.09s
[COMPLETE] ● https://vercel.com/ai-gateway | ✓ | ⏱: 4.55s
[FETCH]... ↓ https://vercel.com/home | ✓ | ⏱: 5.60s
[SCRAPE].. ◆ https://vercel.com/home | ✓ | ⏱: 0.09s
[COMPLETE] ● https://vercel.com/home | ✓ | ⏱: 5.69s
Discovered and crawled 2 pages
results urls:
https://www.google.com
https://www.vercel.com
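To compare the log with the two returned results, the [COMPLETE] lines can be tallied per host. This is only a small sketch for checking the output above. The helper name completed_per_host is made up for illustration, and it assumes the log line layout shown in this issue ("[COMPLETE] ● <url> | ✓ | ⏱: ..."), which may differ in other crawl4ai versions.

```python
from collections import Counter
from urllib.parse import urlsplit


def completed_per_host(log_text: str) -> Counter:
    """Count [COMPLETE] log lines per host, dropping a leading 'www.'."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        # Only lines of the assumed form "[COMPLETE] ● <url> | ✓ | ⏱: 2.12s" count.
        if not line.startswith("[COMPLETE]") or "●" not in line:
            continue
        url = line.split("●", 1)[1].split("|", 1)[0].strip()
        host = urlsplit(url).netloc
        counts[host.removeprefix("www.")] += 1  # str.removeprefix needs Python 3.9+
    return counts


sample = """[COMPLETE] ● https://www.vercel.com | ✓ | ⏱: 2.12s
[COMPLETE] ● https://vercel.com/agent | ✓ | ⏱: 1.17s
[COMPLETE] ● https://www.google.com | ✓ | ⏱: 3.64s"""
print(completed_per_host(sample))
```

Run against the full log above, the same tally would count nine completed pages in total, against the two results that arun_many() returned.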
Clearly, 5 pages were deep-crawled from vercel.com, but only the result for the main page (https://www.vercel.com) was returned. The other crawled pages were discarded.
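A possible workaround until this is triaged is to call arun() once per seed URL and flatten the results, instead of using arun_many(). The sketch below is not a fix and has not been verified against 0.7.4. It assumes that arun() with a deep_crawl_strategy set (and streaming off) returns a list with one result per crawled page, as the deep crawling docs describe. deep_crawl_each is a hypothetical helper name.

```python
import asyncio
from typing import Any, Iterable, List


async def deep_crawl_each(crawler: Any, urls: Iterable[str], config: Any) -> List[Any]:
    """Hypothetical helper: deep crawl each seed via arun() and flatten the results."""
    collected: List[Any] = []
    for url in urls:
        # Assumption: with a deep_crawl_strategy in config, arun() returns a list
        # covering the seed page and the pages discovered from it.
        results = await crawler.arun(url, config=config)
        # Be defensive in case a single result object comes back instead of a list.
        collected.extend(results if isinstance(results, list) else [results])
    return collected
```

Inside the `async with AsyncWebCrawler() as crawler:` block of the reproduction, this would replace the arun_many() call: `results = await deep_crawl_each(crawler, ["https://www.vercel.com", "https://www.google.com"], config)`. The seeds are crawled one after another here, so the concurrency that arun_many() provides across seeds is lost.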