How to deep crawl a website that also contains PDF URLs? #1190
Unanswered
gauravmindzk asked this question in Forums - Q&A
Replies: 2 comments
-
can you send the website url?
-
@gauravmindzk Hi. The deep crawler discovers PDF URLs but doesn't process them by default, because PDFs require different handling than HTML pages. Here are two approaches.

Option 1: Collect the PDF URLs during the deep crawl, then process them separately.

```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFContentScrapingStrategy


async def demo_deep_crawl():
    ...  # deep-crawl `config` setup goes here (elided in the original answer; see sketch below)
    os.makedirs("DirectoryName", exist_ok=True)  # output directory
    pdf_urls = []  # PDF links discovered during the crawl
    i = 0
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://website_to_crawl.org",
                                               config=config):
            i += 1
            # Save HTML page markdown
            if result.markdown:
                filename = os.path.join("DirectoryName", f"{i}.md")
                with open(filename, "w", encoding="utf-8") as f:
                    f.write(result.markdown.fit_markdown)
            # Collect PDF URLs from links
            if result.links:
                internal_links = (result.links.get("internal", [])
                                  if isinstance(result.links, dict)
                                  else result.links.internal)
                external_links = (result.links.get("external", [])
                                  if isinstance(result.links, dict)
                                  else result.links.external)
                for link in internal_links + external_links:
                    href = link.get("href", "") if isinstance(link, dict) else link.href
                    if href.lower().endswith('.pdf'):
                        pdf_urls.append(href)

    # Now process the PDF URLs
    print(f"\nFound {len(pdf_urls)} PDF URLs. Processing...")
    pdf_config = CrawlerRunConfig(
        scraping_strategy=PDFContentScrapingStrategy(extract_images=False),
        cache_mode="bypass"
    )
    async with AsyncWebCrawler() as crawler:
        for j, pdf_url in enumerate(set(pdf_urls)):  # dedupe
            try:
                result = await crawler.arun(pdf_url, config=pdf_config)
                if result.markdown:
                    filename = os.path.join("DirectoryName", f"pdf_{j+1}.md")
                    with open(filename, "w", encoding="utf-8") as f:
                        f.write(result.markdown.raw_markdown)
                    print(f"✓ Processed: {pdf_url}")
            except Exception as e:
                print(f"✗ Failed {pdf_url}: {e}")


asyncio.run(demo_deep_crawl())
```
Option 2: Use arun_many for batch PDF processing (more efficient).

```python
# After collecting pdf_urls from Option 1...
# (run this inside an async function, e.g. at the end of demo_deep_crawl above)
from crawl4ai import CrawlResult  # optional: type of the items in `results`

async with AsyncWebCrawler() as crawler:
    pdf_config = CrawlerRunConfig(
        scraping_strategy=PDFContentScrapingStrategy(),
    )
    results = await crawler.arun_many(
        urls=list(set(pdf_urls)),  # dedupe before batching
        config=pdf_config
    )
    for j, result in enumerate(results):
        if result.success and result.markdown:
            with open(f"DirectoryName/pdf_{j+1}.md", "w", encoding="utf-8") as f:
                f.write(result.markdown.raw_markdown)
```
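Note that, depending on the crawl4ai version, the PDF documentation pairs `PDFContentScrapingStrategy` with a `PDFCrawlerStrategy` passed to the crawler itself. A hedged sketch of that variant for a single PDF URL (treat the import path and the pairing as assumptions to check, not as part of the answer above):

```python
# Sketch: fetch one PDF with a dedicated PDF crawler strategy. Assumes
# crawl4ai.processors.pdf exposes PDFCrawlerStrategy and PDFContentScrapingStrategy;
# verify against your installed version.
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy


async def crawl_one_pdf(pdf_url: str) -> str:
    async with AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as crawler:
        result = await crawler.arun(
            pdf_url,
            config=CrawlerRunConfig(scraping_strategy=PDFContentScrapingStrategy()),
        )
        return result.markdown.raw_markdown if result.success and result.markdown else ""
```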
-
Hello everyone,
I want to deep crawl a website. The code that I've come up with is below:
The webpage URLs are getting scraped properly and I have the fit markdown for each of them, but my current code is not able to scrape the PDF URLs on the website. I want to scrape both the webpages and the PDF URLs hosted by the website. How do I do it?