How modify page_timeout in crawler.arun_many mode #455

Closed · 1933211129 opened this issue Jan 15, 2025 · 13 comments
@1933211129 commented Jan 15, 2025

Hi @unclecode ,
I have been using crawl4ai for a while and I am excited about every update. Thank you for your contributions!

Issue #436 says that page_timeout does not work for crawler.arun_many. Now I want to shorten page_timeout in arun_many mode, but whether I pass a config or modify the parameters directly in the source files async_crawler_strategy.py or config.py, it never takes effect. Looking forward to your reply!

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def extract_urls_and_descriptions(url_list: list):
    """
    Crawl the internal links and descriptions of multiple URLs.
    """
    results = {}
    index = 1 

    async with AsyncWebCrawler(verbose=False) as crawler:
        
        try:
            config = CrawlerRunConfig(
                  page_timeout=5000
              )
            crawled_results = await crawler.arun_many(
                urls=url_list,
                config=config
            )

            # Process the results
            for result in crawled_results:
                if result.success:  
                    for category in ['internal']:  
                        for link in result.links.get(category, []):
                            link_url = link.get('href')
                            description = link.get('text', "")

                            
                            if link_url and (link_url.startswith("http") or link_url.startswith("https")):
                                results[index] = {link_url: description}  
                                index += 1  

        except Exception as e:
            print(f"爬取出错: {e}\n")

    return results
async def main():
    url_list = [
        "http://www.people.com.cn/",
        "http://www.xinhuanet.com/",
        "https://news.sina.com.cn/",
        "https://news.qq.com/",
        "https://www.ccdi.gov.cn/",
    ]
    results = await extract_urls_and_descriptions(url_list)
    print(results)

asyncio.run(main())

× Unexpected error in _crawl_web at line 1205 in _crawl_web (../usr/local/lib/python3.10/dist-packages/crawl4ai/async_crawler_strategy.py):
  Error: Failed on navigating ACS-GOTO:
  Page.goto: Timeout 60000ms exceeded.
  Call log:
  - navigating to "https://www.ccdi.gov.cn/", waiting until "domcontentloaded"

  Code context:
  1200
  1201     response = await page.goto(
  1202         url, wait_until=config.wait_until, timeout=config.page_timeout
  1203     )
  1204 except Error as e:
  1205 →   raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
  1206
  1207 await self.execute_hook("after_goto", page, context=context, url=url, response=response)
  1208
  1209 if response is None:
  1210     status_code = 200

No matter how I modify the page_timeout parameter, it always reports 'Page.goto: Timeout 60000ms exceeded.'

@1933211129 (Author)

I pass in a lot of links at once to get their sub-links, but some of the links in the middle may fail to load. I want to skip them as quickly as possible rather than spend too much time on them, but the current default of 60 seconds is too long for me, and I can't adjust it right now. 😭

@unclecode (Owner)

@1933211129 Hello again, I have very good news for you. Tomorrow, I will drop a new version, and arun_many() has changed drastically. I made tons of optimizations for much faster and better parallel crawling. I tested it, and I will release it as a beta, so perhaps you can help test and debug it and provide your feedback. I will record a video to explain it.

Regarding setting the timeout, there shouldn't be any problem; look at the code below:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig

async def main():
    config = BrowserConfig(
        headless=True,
    )

    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            page_timeout=1,
        )
        result = await crawler.arun(
            url="https://crawl4ai.com",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Look at the error message:

[INIT].... → Crawl4AI 0.4.248
[ERROR]... × https://crawl4ai.com... | Error:
× Unexpected error in _crawl_web at line 1260 in _crawl_web (crawl4ai/async_crawler_strategy.py):
  Error: Failed on navigating ACS-GOTO:
  Page.goto: Timeout 1ms exceeded.
  Call log:

As you can see, it says Page.goto: Timeout 1ms exceeded.

Let me know if you have any problems with it. In any case, wait for the new version.

unclecode self-assigned this Jan 15, 2025
@1933211129 (Author)

Yes, just like the code you wrote above, the page_timeout setting works effectively for crawler.arun, but it doesn't take effect for crawler.arun_many.
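As a stopgap until that's fixed, one option is to fan the URL list out over individual arun calls, since those do honor page_timeout; a rough sketch (the helper name and concurrency limit here are illustrative, not from this thread):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def crawl_urls_with_timeout(urls, timeout_ms=5000, max_concurrency=5):
    """Crawl each URL via arun (which honors page_timeout) instead of arun_many."""
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, page_timeout=timeout_ms)
    semaphore = asyncio.Semaphore(max_concurrency)  # keep the number of open pages bounded

    async with AsyncWebCrawler(verbose=False) as crawler:

        async def crawl_one(url):
            async with semaphore:
                try:
                    return await crawler.arun(url=url, config=config)
                except Exception as exc:
                    # A slow or broken page only costs timeout_ms before being skipped
                    print(f"Skipped {url}: {exc}")
                    return None

        return await asyncio.gather(*(crawl_one(u) for u in urls))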

@1933211129 (Author)

Additionally, I have another issue to report, related to result.markdown_v2.fit_markdown and result.links.

In the current version 0.4.24, fit_markdown doesn't seem to work for the links I tested previously; it just returns the raw_markdown. In version 0.4.21, however, fit_markdown was able to return very clean results, which is quite strange.

The same issue also appears with result.links. In versions prior to 0.4.x it worked fine for retrieving the URLs of the links, in version 0.4.1 it returned empty results without any errors, and after upgrading to 0.4.24 it started working normally again.

This makes application development a bit frustrating: to get cleaner markdown I have to use version 0.4.21, but to get more stable result.links I have to upgrade to version 0.4.24. This is very strange, and I've tested it multiple times; it doesn't seem to be an issue with my network environment.

@unclecode (Owner)

@1933211129 Sorry to hear that. Can you share the link for this one? Perhaps I can test it before releasing the new version.

@1933211129 (Author)

@unclecode
http://www.las.cas.cn/

This link produces different results in version 0.4.0 and version 0.4.2, even when using the same code. This occurs in both fit_markdown and links, specifically in arun_many mode.

I noticed that many others have also mentioned this issue in other threads. There seems to be a problem with the extraction of fit_markdown in arun_many mode.

By the way, do you have any updates on when the new version of arun_many() will be released? Looking forward to it!

@unclecode (Owner)

@1933211129 I checked the link, and I am making sure there is no hidden bug between the two versions. I confirm that I will release the new version; my hope is to do so before the weekend.

@unclecode (Owner)

@1933211129 In the meantime, check this: https://docs.crawl4ai.com/advanced/multi-url-crawling/

@unclecode (Owner)

Please check this file and let me know if this is the expected result you need. This is the dumped version of the crawl result.

result.json

@1933211129 (Author)

Yes, this is the content of the webpage, but fit_markdown in arun_many mode isn't functioning as intended, and this issue occurs with other links as well. Therefore, in version 0.4.24x, I'm resorting to raw_markdown so that many links don't consistently come back empty.
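Concretely, that fallback is just the following (a sketch using the attribute names from this thread):

# Prefer fit_markdown when the content filter produced output;
# otherwise fall back to the unfiltered raw_markdown.
markdown = result.markdown_v2.fit_markdown or result.markdown_v2.raw_markdown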

@unclecode (Owner)

@1933211129 Please check this fit markdown and let me know whether this is what you used to have.

markdown.md

@1933211129 (Author)

@unclecode I apologize for only seeing your reply now. The results are fantastic, and there's no noise at all. Regarding the bug I previously reported with arun_many, I've temporarily adopted the solution suggested in #461, explicitly calling the filter_content function to clean up the content. I'm really looking forward to the new version on Monday! Thank you once again! 😊
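For reference, the #461 workaround looks roughly like the sketch below; the PruningContentFilter class and its parameters are my assumption about the 0.4.2x content-filter API, not something confirmed in this thread, and may need adjusting:

from crawl4ai.content_filter_strategy import PruningContentFilter  # assumed import path

# Run the content filter on the crawled HTML explicitly, since fit_markdown
# comes back empty in arun_many mode.
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed")
filtered_chunks = content_filter.filter_content(result.html)  # list of cleaned HTML fragments
cleaned_html = "\n".join(filtered_chunks)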

@unclecode (Owner)

@1933211129 Glad to hear that. I'll release it by Monday :)
