fix: caching, downloads, js_code and browser manager #969
Closed
Conversation
Force-pushed from a026f1b to 2a80bd3
Add relevant tests for the markdown generator and the http crawler strategy.
Add ruff settings to pyproject.toml to prevent deprecation warnings beyond our control and to disable formatting for now to avoid the noise it would generate.
Force-pushed from 2a80bd3 to 712b033
Author: While this removes the browser manager singleton optimisation, we may want to revisit that if it aligns with a desired API use case. However, that would require more careful design and discussion, hence I went down the correct-over-performant route for now.
Author: Closing as this never got any traction, so we've moved away from crawl4ai. If someone wants to pick up the branch and reuse it, feel free.
Summary
Fixes missing metadata in the page cache, incomplete file downloads, the handling of js_code, and Playwright browser reuse, alongside adjacent issues in the affected files; see below for details.
Details
Fix file download handling in AsyncPlaywrightCrawlerStrategy, which wasn't waiting for the download to complete before returning, resulting in race conditions and incomplete or missing downloads.
Fix incorrect field use for headers and user_agent in AsyncPlaywrightCrawlerStrategy.
Fix the logger type in AsyncPlaywrightCrawlerStrategy to the base class Logger, so the logger can be used correctly without type errors.
Add missing abstract methods to AsyncCrawlerStrategy and implement in its subclasses.
Fix the extraction of the error message when processing console error log in log_console.
Fix the handling of js_code in AsyncPlaywrightCrawlerStrategy so that scripts starting with a variable or constant declaration are executed correctly and their result is returned.
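The js_code fix above can be sketched as follows. This is a minimal illustration, not crawl4ai's actual internals: the function name and wrapping strategy are assumptions, on the premise that the script is ultimately run through something like Playwright's page.evaluate(), which wants an expression rather than a bare declaration.

```python
def wrap_js_code(script: str) -> str:
    # Hypothetical helper: a snippet that opens with a declaration
    # cannot be evaluated as a bare expression, so wrap it in an IIFE
    # so it executes and its value can be returned.
    stripped = script.lstrip()
    if stripped.startswith(("let ", "const ", "var ")):
        return f"(() => {{ {script} }})()"
    return script
```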
Remove unused time variables and commented-out code to improve readability.
Remove duplicate Exception handling.
Fix the handling of raw requests in AsyncHTTPCrawlerStrategy, which was previously using the parsed path, corrupting the request.
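To illustrate why the parsed path corrupts raw requests: urlparse(url).path silently drops any ? or # portion of the embedded document. A hypothetical helper (the function name and exact prefix handling are assumptions) that takes the body verbatim instead:

```python
def extract_raw_body(url: str) -> str:
    # Hypothetical helper: slice off the scheme verbatim instead of
    # using urlparse(url).path, which would strip anything after a
    # '?' or '#' in the embedded content and corrupt the request.
    for prefix in ("raw://", "raw:"):
        if url.startswith(prefix):
            return url[len(prefix):]
    raise ValueError(f"not a raw request: {url}")
```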
Fix the browser manager to ensure that it always returns a valid browser instance. This eliminates the class-level playwright instance, which wasn't being cleaned up on close, so subsequent callers received a closed instance and hit errors.
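The lifecycle problem can be sketched independently of Playwright. Both class names below are hypothetical; Handle stands in for the playwright instance (real code would call async_playwright().start() and .stop()), and the point is that the handle lives on the instance and is recreated after close, never shared at class level:

```python
class Handle:
    # Stand-in for a playwright instance.
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Manager:
    def __init__(self):
        self._handle = None  # per-instance state, never class-level

    def get(self):
        # Recreate the handle if it was never started or has been
        # closed, so a caller can never receive a stale, closed one.
        if self._handle is None or self._handle.closed:
            self._handle = Handle()
        return self._handle

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None
```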
Fix the storage of links, media and metadata to ensure that the correct values are stored and returned. This prevents incorrect results when using the cached results.
Use Field for default values in Media, Links and ScrapingResult pydantic models to prevent invalid results.
Handle the literal string "undefined" being passed to MediaItem width.
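A minimal sketch of that coercion; the function name is hypothetical, and the real model presumably does this inside a pydantic validator rather than a free function:

```python
def coerce_width(value):
    # Hypothetical coercion: pages sometimes report the literal string
    # "undefined" (or other junk) for an image width; treat anything
    # that isn't an integer as missing instead of raising.
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```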
Add parameters to the CrawlResult.__init__ method to provide type hints for the fields, enabling better type checking.
Fix database query retry handling for the case where a table is missing from the SQLite database. This prevents the crawler from failing when the table is not present, and stops tests from failing after one which drops the table.
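The retry shape can be sketched as below; the helper name, schema, and crawl_cache table name are assumptions for illustration, not the actual crawl4ai schema:

```python
import sqlite3

# Hypothetical schema for illustration only.
SCHEMA = "CREATE TABLE IF NOT EXISTS crawl_cache (url TEXT PRIMARY KEY, markdown TEXT)"

def execute_with_retry(conn, query, params=()):
    # If the table was dropped (e.g. by an earlier test), recreate it
    # and run the query once more instead of failing the crawl.
    try:
        return conn.execute(query, params)
    except sqlite3.OperationalError as exc:
        if "no such table" not in str(exc):
            raise
        conn.execute(SCHEMA)
        return conn.execute(query, params)
```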
Remove commit calls for queries which don't need them.
Sync the table definition in legacy code to prevent mixed legacy and new code tests from conflicting with each other.
Fix the caching of the markdown field in the DB / files, which was only storing a single value, causing failures when using cached results.
Export the markdown field in StringCompatibleMarkdown, so we don't need a private field to ensure the value is serialised correctly.
Correctly initialise BrowserConfig and CrawlerRunConfig from kwargs to ensure legacy parameters work as expected.
Add ruff settings to pyproject.toml to prevent deprecation warnings beyond our control and to disable formatting for now to avoid the noise it would generate.
How Has This Been Tested?
This has been tested both separately with the updated tests and in combination with the wider test suite as part of #891.
Checklist: