How to modify page_timeout in crawler.arun_many mode #455
I pass in a lot of links at once to get their sub-links, but some of the links in the middle may hit access errors. I want to skip those as quickly as possible rather than retrying for a long time, but the current default of 60 seconds is too long for me, and I can't adjust it. 😭
@1933211129 Hello again, I have very good news for you. Tomorrow I will drop a new version. Regarding setting the timeout, there shouldn't be any problem; look at the code below:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig

async def main():
    config = BrowserConfig(
        headless=True,
    )
    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            page_timeout=1,
        )
        result = await crawler.arun(
            url="https://crawl4ai.com",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

Look at the error message:
As you can see, it says the 1ms timeout was exceeded, so the page_timeout value is being respected. Let me know if you have any problem with it. Anyway, wait for the new version.
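(For reference, a minimal sketch of the same timeout applied to a batch crawl, assuming a version in which arun_many accepts the same CrawlerRunConfig as arun — the rest of this thread suggests that did not yet work in 0.4.24. The URLs and the 5000 ms value are illustrative.)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig

async def main():
    browser_config = BrowserConfig(headless=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            page_timeout=5000,  # 5 s instead of the 60 s default
        )
        # Assumption: arun_many forwards this config, including
        # page_timeout, to every per-URL crawl.
        results = await crawler.arun_many(
            urls=["https://crawl4ai.com", "https://example.com"],
            config=crawl_config,
        )
        for result in results:
            print(result.url, result.success)

if __name__ == "__main__":
    asyncio.run(main())
```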
Yes, just like the code you wrote above, the page_timeout setting works with arun.
Additionally, I have another issue to report, related to fit_markdown. In the current version 0.4.24, this function doesn't seem to work effectively for the links I tested previously. The same issue also appears with the extracted links. This makes it a bit frustrating for my application development: to get cleaner markdown I have to use version 0.4.21, but to get more stable results for the links I have to use 0.4.24.
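(For context, a hedged sketch of the 0.4.2x-era pattern for producing fit_markdown, assuming the documented approach of attaching a content filter to the markdown generator; the module paths, threshold values, and the markdown_v2 accessor are assumptions based on that era's docs, not taken from this thread.)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # fit_markdown is the noise-filtered markdown produced by the
    # content filter; the raw markdown keeps the full page conversion.
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            threshold=0.48, threshold_type="fixed"  # illustrative values
        )
    )
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://crawl4ai.com", config=config)
        print(result.markdown_v2.fit_markdown)

if __name__ == "__main__":
    asyncio.run(main())
```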
@1933211129 Sorry to hear that. Can you share the link for this one? Perhaps I can test it before releasing the new version.
@unclecode This link produces different results in version 0.4.21 and version 0.4.24, even when using the same code. This occurs in both fit_markdown and links. I noticed that many others have also mentioned this issue in other threads; there seems to be a problem with the extraction. By the way, do you have any updates on when the new version of crawl4ai will be released?
@1933211129 I checked the link, and I am making sure there is no hidden bug between the two versions. I confirm that I will release it; my aim is to do so before the weekend.
@1933211129 In the meantime, check this: https://docs.crawl4ai.com/advanced/multi-url-crawling/
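(Following that docs page, a sketch of the original goal — skipping failed URLs quickly instead of retrying them — assuming each item returned by arun_many is a CrawlResult with success and error_message fields.)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # A short page_timeout makes dead URLs fail fast instead of
    # holding the whole batch for the 60-second default.
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, page_timeout=5000)
    urls = ["https://crawl4ai.com", "https://www.ccdi.gov.cn/"]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=config)
        for result in results:
            if result.success:
                print(f"OK   {result.url}")
            else:
                print(f"SKIP {result.url}: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```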
Please check this file and let me know if this is the result you expect. This is the dumped version of the crawl result.
Yes, this is the content of the webpage, but the fit_markdown is the part I need to check.
@1933211129 Please check this fit_markdown and let me know: is this what you used to get?
@unclecode I apologize for only seeing your reply now. The results are fantastic, and there's no noise at all. Regarding the bug I previously reported with the links extraction, do you have an update on when the fix will be released?
@1933211129 Glad to hear that. I'll release it by Monday :)
Hi @unclecode,
I have been using crawl4ai for a while and I am excited about every update; thank you for your contributions!
Issue #436 says page_timeout does not work for crawler.arun_many. I want to modify page_timeout in arun_many mode, but whether I pass a config or modify the parameters in the source files async_crawler_strategy.py or config.py, it never takes effect in arun_many mode. I want to make the timeout shorter, but currently I can't. Looking forward to your reply!
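(A quick sanity check, not from this thread: print the timeout actually stored on the config object, since the traceback below shows page.goto receiving config.page_timeout directly.)

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(page_timeout=5000)
# The traceback below shows page.goto(..., timeout=config.page_timeout),
# so whatever prints here is what Playwright should receive. If the
# error still reports 60000 ms, this config object is not the one
# reaching the crawl strategy in arun_many.
print(config.page_timeout)  # expected: 5000
```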
```
× Unexpected error in _crawl_web at line 1205 in _crawl_web
  (../usr/local/lib/python3.10/dist-packages/crawl4ai/async_crawler_strategy.py):
  Error: Failed on navigating ACS-GOTO:
  Page.goto: Timeout 60000ms exceeded.
  Call log:
  - navigating to "https://www.ccdi.gov.cn/", waiting until "domcontentloaded"

  Code context:
  1200
  1201       response = await page.goto(
  1202           url, wait_until=config.wait_until, timeout=config.page_timeout
  1203       )
  1204   except Error as e:
  1205 →     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
  1206
  1207   await self.execute_hook("after_goto", page, context=context, url=url, response=response)
  1208
  1209   if response is None:
  1210       status_code = 200
```
No matter how I modify the page_timeout parameter, it always reports: Page.goto: Timeout 60000ms exceeded.
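(Until a fixed release lands, one possible workaround sketch: enforce a hard wall-clock cap around each arun call with asyncio.wait_for, independent of whether page_timeout is honored internally. The 10-second cap and URLs are illustrative; note that cancelling a crawl mid-flight may leave its browser page for the context manager to clean up.)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

HARD_TIMEOUT = 10  # seconds; an illustrative cap, tune as needed

async def crawl_one(crawler, url, config):
    # wait_for cancels the crawl if it runs past HARD_TIMEOUT,
    # regardless of what timeout the library applies internally.
    try:
        return await asyncio.wait_for(
            crawler.arun(url=url, config=config), timeout=HARD_TIMEOUT
        )
    except asyncio.TimeoutError:
        print(f"Skipped {url}: exceeded {HARD_TIMEOUT}s")
        return None

async def main():
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    urls = ["https://crawl4ai.com", "https://www.ccdi.gov.cn/"]
    async with AsyncWebCrawler() as crawler:
        results = await asyncio.gather(
            *(crawl_one(crawler, url, config) for url in urls)
        )
    ok = [r for r in results if r is not None]
    print(f"{len(ok)}/{len(urls)} URLs succeeded")

if __name__ == "__main__":
    asyncio.run(main())
```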