Skip to content

Conversation

@Ghirtor
Copy link

@Ghirtor Ghirtor commented Mar 11, 2025

Summary

Fixes #794

List of files changed and why

async_crawler_strategy.py - removed the http client session context manager (only used in _handle_http method) which caused the problem when trying to crawl more than one url with more than one session allowed from the dispatcher (max_session_permit attribute). The problem occured when the context manager closed the session after completing a given task and another task was using this session to do its http request. The http client session lifecycle is already managed by the web crawler using its context manager or using explicit lifecycle management.

How Has This Been Tested?

Unit tests have been added in a new file 'test_http_crawler.py' to test:

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link

@perretv perretv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice detective work!
Does crawl4ai have any automated CI/CD to run tests?

@stevenh
Copy link

stevenh commented Mar 26, 2025

Not sure if there are automated tests yet, but I've done a bunch of work getting the existing tests into a work state #891.

It's still work in progress as I clean down and spilt the fixes into a more manageable sizes, but once thats in it should be pretty simple to setup tests to run in GitHub actions.

@aravindkarnam
Copy link
Collaborator

@Ghirtor I already patched this as part of #867.

Unfortunately I did not realise that #794 is duplicate of #867. You seem to have root caused the issue correctly (to session management). However since that patch is already merged, I closing this PR without merging. I really appreciate your time and effort in fixing this issue(and your solution is correct!), therefore I'll list you in the contributors for upcoming release.

Looking forward to seeing more contributions from you in the future. Thanks again !!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: AsyncHTTPCrawlerStrategy fails with arun_many when len(urls)>2

4 participants