How modify page_timeout in crawler.arun_many mode #455

Closed · 1933211129 opened this issue Jan 15, 2025 · 13 comments
@1933211129 commented Jan 15, 2025

Hi @unclecode ,
I have been using crawl4ai for a while and I am excited about every update. Thank you for your contributions!

Issue #436 says that page_timeout does not work for crawler.arun_many. Now I want to shorten page_timeout in arun_many mode, but whether I pass a config or modify the parameters directly in the source files async_crawler_strategy.py or config.py, it never takes effect. Looking forward to your reply!

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def extract_urls_and_descriptions(url_list: list):
    """
    Crawl the internal links and descriptions of multiple URLs.
    """
    results = {}
    index = 1 

    async with AsyncWebCrawler(verbose=False) as crawler:
        
        try:
            config = CrawlerRunConfig(
                  page_timeout=5000
              )
            crawled_results = await crawler.arun_many(
                urls=url_list,
                config=config
            )

            # Process the results
            for result in crawled_results:
                if result.success:  
                    for category in ['internal']:  
                        for link in result.links.get(category, []):
                            link_url = link.get('href')
                            description = link.get('text', "")

                            
                            if link_url and (link_url.startswith("http") or link_url.startswith("https")):
                                results[index] = {link_url: description}  
                                index += 1  

        except Exception as e:
            print(f"爬取出错: {e}\n")

    return results
async def main():
    url_list = [
        "http://www.people.com.cn/",
        "http://www.xinhuanet.com/",
        "https://news.sina.com.cn/",
        "https://news.qq.com/",
        "https://www.ccdi.gov.cn/",
    ]
    results = await extract_urls_and_descriptions(url_list)
    print(results)

asyncio.run(main())

× Unexpected error in _crawl_web at line 1205 in _crawl_web (../usr/local/lib/python3.10/dist-packages/crawl4ai/async_crawler_strategy.py):
  Error: Failed on navigating ACS-GOTO:
  Page.goto: Timeout 60000ms exceeded.
  Call log:
  - navigating to "https://www.ccdi.gov.cn/", waiting until "domcontentloaded"

  Code context:
  1200
  1201     response = await page.goto(
  1202         url, wait_until=config.wait_until, timeout=config.page_timeout
  1203     )
  1204 except Error as e:
  1205 →   raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
  1206
  1207 await self.execute_hook("after_goto", page, context=context, url=url, response=response)
  1208
  1209 if response is None:
  1210     status_code = 200

No matter how I modify the page_timeout parameter, it always reports 'Page.goto: Timeout 60000ms exceeded.'

@1933211129 (Author)

I pass in a lot of links at once to get their sub-links, but some of the links in the middle may fail to load. I want to skip them as quickly as possible rather than spend too much time on them, but the current default of 60 seconds is too long for me, and I can't adjust it right now. 😭

@unclecode (Owner)

@1933211129 Hello again, I have very good news for you. Tomorrow, I will drop a new version, and arun_many() has changed drastically. I made tons of optimizations for much faster and better parallel crawling. I tested it, and I will release it as a beta, so perhaps you can help test and debug it and provide your feedback. I will record a video to explain it.

Regarding setting the timeout, there shouldn't be any problem; look at the code below:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig

async def main():
    config = BrowserConfig(
        headless=True,
    )

    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            page_timeout=1,
        )
        result = await crawler.arun(
            url="https://crawl4ai.com",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Look at the error message:

[INIT].... → Crawl4AI 0.4.248
[ERROR]... × https://crawl4ai.com... | Error:
× Unexpected error in _crawl_web at line 1260 in _crawl_web (crawl4ai/async_crawler_strategy.py):
  Error: Failed on navigating ACS-GOTO:
  Page.goto: Timeout 1ms exceeded.
  Call log:

As you can see, it says Page.goto: Timeout 1ms exceeded.

Let me know if you have any problems with it. In any case, wait for the new version.

unclecode self-assigned this Jan 15, 2025
@1933211129 (Author)

Yes, just like the code you wrote above, the page_timeout setting works effectively for crawler.arun, but it doesn't take effect for crawler.arun_many.
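As a stopgap until that's fixed, one option is to fan the URL list out over individual arun calls, since those do honor page_timeout; a rough sketch (the helper name and concurrency limit here are illustrative, not from this thread):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def crawl_urls_with_timeout(urls, timeout_ms=5000, max_concurrency=5):
    """Crawl each URL via arun (which honors page_timeout) instead of arun_many."""
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, page_timeout=timeout_ms)
    semaphore = asyncio.Semaphore(max_concurrency)  # keep the number of open pages bounded

    async with AsyncWebCrawler(verbose=False) as crawler:

        async def crawl_one(url):
            async with semaphore:
                try:
                    return await crawler.arun(url=url, config=config)
                except Exception as exc:
                    # A slow or broken page only costs timeout_ms before being skipped
                    print(f"Skipped {url}: {exc}")
                    return None

        return await asyncio.gather(*(crawl_one(u) for u in urls))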

@1933211129 (Author)

Additionally, I have another issue to report, related to result.markdown_v2.fit_markdown and result.links.

In the current version 0.4.24, fit_markdown doesn't seem to work for the links I tested previously; it just returns the raw_markdown. In version 0.4.21, however, fit_markdown was able to return very clean results, which is quite strange.

The same issue also appears with result.links. In versions prior to 0.4.x it worked fine for retrieving the URLs of the links, in version 0.4.1 it returned empty results without any errors, and after upgrading to 0.4.24 it started working normally again.

This makes application development a bit frustrating: to get cleaner markdown I have to use version 0.4.21, but to get more stable result.links I have to upgrade to version 0.4.24. This is very strange, and I've tested it multiple times; it doesn't seem to be an issue with my network environment.

@unclecode (Owner)

@1933211129 Sorry to hear that. Can you share the link for this one? Perhaps I can test it before releasing the new version.

@1933211129 (Author)

@unclecode
http://www.las.cas.cn/

This link produces different results in version 0.4.0 and version 0.4.2, even when using the same code. This occurs in both fit_markdown and links, specifically in arun_many mode.

I noticed that many others have also mentioned this issue in other threads. There seems to be a problem with the extraction of fit_markdown in arun_many mode.

By the way, do you have any updates on when the new version of arun_many() will be released? Looking forward to it!

@unclecode (Owner)

@1933211129 I checked the link, and I am making sure there is no hidden bug between the two versions. I confirm that I will release the new version; my hope is to do so before the weekend.

@unclecode (Owner)

@1933211129 In the meantime, check this: https://docs.crawl4ai.com/advanced/multi-url-crawling/

@unclecode (Owner)

Please check this file and let me know if this is the expected result you need. This is the dumped version of the crawl result.

result.json

@1933211129 (Author)

Yes, this is the content of the webpage, but fit_markdown in arun_many mode isn't functioning as intended, and this issue occurs with other links as well. Therefore, in version 0.4.24x, I'm resorting to raw_markdown so that many links don't consistently come back empty.
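Concretely, that fallback is just the following (a sketch using the attribute names from this thread):

# Prefer fit_markdown when the content filter produced output;
# otherwise fall back to the unfiltered raw_markdown.
markdown = result.markdown_v2.fit_markdown or result.markdown_v2.raw_markdown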

@unclecode (Owner)

@1933211129 Please check this fit markdown and let me know whether this is what you used to have.

markdown.md

@1933211129 (Author)

@unclecode I apologize for only seeing your reply now. The results are fantastic, and there's no noise at all. Regarding the bug I previously reported with arun_many, I've temporarily adopted the solution suggested in #461, explicitly calling the filter_content function to clean up the content. I'm really looking forward to the new version on Monday! Thank you once again! 😊
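For reference, the #461 workaround looks roughly like the sketch below; the PruningContentFilter class and its parameters are my assumption about the 0.4.2x content-filter API, not something confirmed in this thread, and may need adjusting:

from crawl4ai.content_filter_strategy import PruningContentFilter  # assumed import path

# Run the content filter on the crawled HTML explicitly, since fit_markdown
# comes back empty in arun_many mode.
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed")
filtered_chunks = content_filter.filter_content(result.html)  # list of cleaned HTML fragments
cleaned_html = "\n".join(filtered_chunks)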

@unclecode (Owner)

@1933211129 Glad to hear that. I'll release it by Monday :)
