
[Bug]: RateLimiter provides ineffective protection against failures #1095

@stevenh

Description


crawl4ai version

0.6.0

Expected Behavior

A crawl should successfully handle a site which actively manages client request rates.

Current Behavior

The current RateLimiter implementation uses a simple last-request-time and current-delay calculation, which can lead to uneven request distribution when multiple requests are made in quick succession.

As a result, the more links a single request discovers, the more likely we are to trigger 429 and 503 response codes. Combined with max_retries, this causes the crawler to fail to process all pages when the site implements rate limiting.

An example site: https://gamesjobslive.niceboard.co/
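
For reference, this is roughly how the rate limiter is wired into a multi-URL crawl today (a minimal sketch; import paths and parameter names are from recent crawl4ai releases and may differ slightly, and the URL list is illustrative):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MemoryAdaptiveDispatcher, RateLimiter


async def main():
    # The rate limiter backs off on 429/503, but each delay is computed only
    # from the last request time and the current delay, so a burst of URLs
    # discovered on a single page still hits the domain almost at once.
    dispatcher = MemoryAdaptiveDispatcher(
        rate_limiter=RateLimiter(
            base_delay=(1.0, 3.0),        # random delay range between requests
            max_delay=60.0,               # cap for the exponential backoff
            max_retries=3,                # after this, the page is marked failed
            rate_limit_codes=[429, 503],
        ),
    )

    urls = [  # hypothetical list of links discovered on one page
        "https://gamesjobslive.niceboard.co/",
        # ... many more listing URLs from the same domain
    ]

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls, config=CrawlerRunConfig(), dispatcher=dispatcher
        )
        print(sum(1 for r in results if not r.success), "pages failed")


asyncio.run(main())
```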

In addition, it is currently not possible to configure the rate limiter for a deep crawl, as there is no way to set the dispatcher.
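
For context, a deep crawl is currently configured roughly like this (a sketch based on the documented BFS strategy; exact parameters may vary). `arun()` only takes a `config`, so there is no hook through which a dispatcher carrying a custom RateLimiter could be supplied:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def main():
    run_config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, include_external=False),
    )
    async with AsyncWebCrawler() as crawler:
        # Unlike arun_many(), arun() has no dispatcher argument, so the
        # rate limiting behaviour of the deep crawl cannot be tuned here.
        results = await crawler.arun("https://gamesjobslive.niceboard.co/", config=run_config)
        print(len(results), "pages crawled")


asyncio.run(main())
```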

Finally, the rate limiter doesn't adapt to sites that report their limits via the standard rate-limiting headers (such as Retry-After and X-RateLimit-*), which significantly increases the number of retries and, ultimately, failures.
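
To illustrate what header-aware adaptation could look like (purely a sketch of the common Retry-After / X-RateLimit-* conventions, not existing crawl4ai behaviour; the helper name is hypothetical):

```python
import time
from email.utils import parsedate_to_datetime


def delay_from_headers(headers: dict[str, str], fallback: float) -> float:
    """Derive how long to wait before the next request from rate-limit headers."""
    h = {k.lower(): v for k, v in headers.items()}

    # Retry-After may be a number of seconds or an HTTP date (RFC 9110).
    retry_after = h.get("retry-after")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            return max(0.0, parsedate_to_datetime(retry_after).timestamp() - time.time())

    # De-facto X-RateLimit-* headers: spread the remaining request budget
    # evenly over the time left in the current window.
    remaining = h.get("x-ratelimit-remaining")
    reset = h.get("x-ratelimit-reset")  # commonly a Unix timestamp
    if remaining is not None and reset is not None:
        window_left = max(0.0, float(reset) - time.time())
        requests_left = float(remaining)
        return window_left if requests_left <= 0 else window_left / requests_left

    return fallback
```

A rate limiter that fed such a per-domain delay back into its scheduling, rather than relying only on exponential backoff after a 429, would avoid most of the retries described above.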

Is this reproducible?

Yes
