Description
crawl4ai version
0.6.0
Expected Behavior
A crawl should successfully handle a site that actively manages client request rates.
Current Behavior
The current RateLimiter implementation uses a simple last-request-time and current-delay calculation, which can lead to uneven request distribution when multiple requests are made in quick succession.
As a result, the more links discovered by a single request, the more likely it is that the crawler triggers 429 and 503 response codes; combined with max_retries, this causes the crawler to fail to process all pages when the site implements rate limiting.
An example site: https://gamesjobslive.niceboard.co/
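A simplified sketch (illustrative only, not crawl4ai's actual code) of why a limiter built solely on a last-request timestamp and a current delay lets a burst of newly discovered links go out nearly back to back:

```python
import asyncio
import random
import time

class NaiveRateLimiter:
    """Illustrative only -- a simplified last-request/current-delay limiter,
    not crawl4ai's actual implementation."""

    def __init__(self, base_delay=(1.0, 3.0)):
        self.base_delay = base_delay
        self.last_request = 0.0

    async def wait_if_needed(self):
        delay = random.uniform(*self.base_delay)
        elapsed = time.monotonic() - self.last_request
        if elapsed < delay:
            await asyncio.sleep(delay - elapsed)
        self.last_request = time.monotonic()

async def fetch(limiter, i):
    await limiter.wait_if_needed()
    print(f"request {i} sent at {time.monotonic():.2f}")

async def main():
    limiter = NaiveRateLimiter()
    # When one page yields many links fetched in quick succession, every task
    # sees the same stale last_request value, so the requests go out nearly
    # back to back and the burst trips the site's 429/503 limits.
    await asyncio.gather(*(fetch(limiter, i) for i in range(10)))

asyncio.run(main())
```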
In addition, it is currently not possible to configure the rate limiter for a deep crawl, as there is no way to set the dispatcher; see the sketch below.
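For context, the rate limiter is normally configured through a dispatcher passed to arun_many, while a deep crawl driven through arun exposes no equivalent parameter. A minimal sketch assuming the documented MemoryAdaptiveDispatcher/RateLimiter API (import paths and parameter names may differ between versions):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    # arun_many accepts a dispatcher, so the rate limiter is configurable here.
    dispatcher = MemoryAdaptiveDispatcher(
        rate_limiter=RateLimiter(
            base_delay=(1.0, 3.0),
            max_delay=60.0,
            max_retries=3,
            rate_limit_codes=[429, 503],
        )
    )

    async with AsyncWebCrawler() as crawler:
        # Works: an explicit URL list dispatched with the custom rate limiter.
        await crawler.arun_many(
            urls=["https://gamesjobslive.niceboard.co/"],
            config=CrawlerRunConfig(),
            dispatcher=dispatcher,
        )

        # Deep crawl: arun() has no dispatcher argument, so internally
        # discovered links are fetched without the configured RateLimiter.
        await crawler.arun(
            url="https://gamesjobslive.niceboard.co/",
            config=CrawlerRunConfig(
                deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2)
            ),
        )

asyncio.run(main())
```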
Finally, the rate limiter does not adapt to sites that report their limits via the standard rate-limiting headers (e.g. Retry-After), which significantly increases the number of retries and, ultimately, failures.
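A hedged sketch of the header-aware backoff being asked for here, using a hypothetical delay_from_headers helper that is not part of crawl4ai:

```python
import time

def delay_from_headers(headers: dict, fallback: float) -> float:
    """Hypothetical helper (not part of crawl4ai): derive the next delay from
    standard rate-limiting response headers instead of retrying blindly."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)  # delay-seconds form of Retry-After
        except ValueError:
            pass  # the HTTP-date form would need date parsing

    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining == "0" and reset is not None:
        try:
            reset_val = float(reset)
            # Servers report either a delta in seconds or an epoch timestamp.
            if reset_val > time.time():
                reset_val -= time.time()
            return max(reset_val, 0.0)
        except ValueError:
            pass

    return fallback
```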
Is this reproducible?
Yes