Conversation
@stevenh stevenh commented Apr 10, 2025

Summary

Fixes missing metadata in the page cache, incomplete file downloads, the handling of js_code, and Playwright browser reuse, along with adjacent issues in the affected files. See below for details.

Details

Fix file download handling in AsyncPlaywrightCrawlerStrategy, which didn't wait for downloads to complete before returning, resulting in race conditions and incomplete or missing downloads.
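The corrected flow can be sketched roughly like this, assuming Playwright's async page API (the selector and path names are illustrative, not the actual crawl4ai code):

```python
async def download_file(page, trigger_selector: str, save_path: str) -> str:
    # Sketch of the corrected flow: expect_download() resolves once the
    # download starts, and save_as() blocks until the file is fully
    # written, so the strategy no longer returns while the browser is
    # still writing the file.
    async with page.expect_download() as download_info:
        await page.click(trigger_selector)  # whatever triggers the download
    download = await download_info.value
    await download.save_as(save_path)
    return save_path
```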

Fix incorrect field use for headers and user_agent in AsyncPlaywrightCrawlerStrategy.

Fix the type used for the logger in AsyncPlaywrightCrawlerStrategy to the base class Logger, allowing the logger to be used correctly and preventing type errors.

Add missing abstract methods to AsyncCrawlerStrategy and implement in its subclasses.

Fix extraction of the error message when processing console error logs in log_console.

Fix the handling of js_code in AsyncPlaywrightCrawlerStrategy so that the code executes correctly and its result is returned even when the script starts with a variable or constant declaration.
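One plausible approach to the declaration problem (a hypothetical sketch, not crawl4ai's exact code): a script beginning with `const`/`let`/`var` evaluates to undefined in the browser, so such scripts can be wrapped in an async IIFE before being handed to `page.evaluate`:

```python
def wrap_js_for_evaluate(js_code: str) -> str:
    # If the user script starts with a declaration, wrap it in an async
    # IIFE so page.evaluate() has a proper expression to return; scripts
    # that are already expressions are passed through unchanged.
    tokens = js_code.strip().split(None, 1)
    if tokens and tokens[0] in ("const", "let", "var"):
        return f"(async () => {{ {js_code} }})()"
    return js_code
```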

Remove unused time variables and commented out code to improve readability.

Remove duplicate Exception handling.

Fix the handling of raw requests in AsyncHTTPCrawlerStrategy, which previously used the parsed path, corrupting the request.
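The idea behind the fix, as a sketch (the exact prefixes and helper name are assumptions for illustration): for a "raw" URL the payload is the content itself, so it must be sliced off the original string rather than round-tripped through URL parsing, which would decode and normalise it:

```python
def raw_content(url: str) -> str:
    # Slice the payload off the original string; running it through
    # urllib.parse and taking the parsed path would corrupt the content.
    for prefix in ("raw://", "raw:"):
        if url.startswith(prefix):
            return url[len(prefix):]
    raise ValueError(f"not a raw URL: {url!r}")
```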

Fix the browser manager to ensure that it always returns a valid browser instance. This eliminates the class level playwright instance which wasn't being cleaned up on close, resulting in subsequent callers receiving a closed instance and causing errors.
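The instance-scoped lifecycle could look roughly like this (names are illustrative, not the real crawl4ai API):

```python
class BrowserManager:
    """Sketch: the Playwright instance lives on the manager instance
    rather than the class, so closing one manager cannot hand a stale,
    closed instance to the next caller."""

    def __init__(self) -> None:
        self._playwright = None  # instance-level, not class-level
        self._browser = None

    async def start(self):
        # Imported lazily so this sketch loads without Playwright installed.
        from playwright.async_api import async_playwright
        self._playwright = await async_playwright().start()
        self._browser = await self._playwright.chromium.launch()
        return self._browser

    async def close(self) -> None:
        if self._browser is not None:
            await self._browser.close()
            self._browser = None
        if self._playwright is not None:
            # Previously the shared class-level instance was never stopped.
            await self._playwright.stop()
            self._playwright = None
```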

Fix the storage of links, media and metadata to ensure that the correct values are stored and returned. This prevents incorrect results when using the cached results.

Use Field for default values in Media, Links and ScrapingResult pydantic models to prevent invalid results.

Handle undefined string passed to MediaItem width.
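A minimal sketch of that handling (the helper name is hypothetical): pages sometimes yield the literal string "undefined" for an image's width, and coercing it straight to int raises, so anything non-numeric is treated as missing:

```python
from typing import Optional

def parse_width(value) -> Optional[int]:
    # Treat the literal "undefined" (and any other non-numeric value)
    # as a missing width instead of raising during coercion.
    if value is None:
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```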

Add parameters to the CrawlResult.__init__ method to provide type hints for the fields, enabling better type checking.

Fix database query retry handling for the case where the database table is missing when using the SQLite database. This prevents the crawler from failing when the table is not present and allows it to continue crawling.

This also prevents test failures after a test that drops the database table.
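The retry idea can be sketched with the stdlib sqlite3 module (the function and `create_sql` parameter are illustrative, not crawl4ai's actual API):

```python
import sqlite3
from typing import Optional

def execute_with_retry(conn: sqlite3.Connection, query: str, params=(), *,
                       create_sql: Optional[str] = None):
    # If the table has been dropped (e.g. by a previous test), recreate
    # it and retry once instead of failing the whole crawl.
    try:
        return conn.execute(query, params)
    except sqlite3.OperationalError as exc:
        if "no such table" in str(exc) and create_sql:
            conn.execute(create_sql)
            return conn.execute(query, params)
        raise
```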

Remove commit call for queries which don't need it.

Sync the table definition in legacy code to prevent mixed legacy and new code tests from conflicting with each other.

Fix the caching of the markdown field in the DB / files, which stored only a single value and caused failures when using cached results.

Export the markdown field in StringCompatibleMarkdown so a private field is no longer needed to serialise the value correctly.
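The shape of such a string-compatible wrapper might look like this (a sketch under the assumption that it is a `str` subclass carrying extra fields such as `fit_markdown`; not the actual crawl4ai implementation):

```python
class StringCompatibleMarkdown(str):
    # Behaves as the raw markdown string everywhere a str is expected,
    # while exposing the richer result as a public attribute so it can
    # be serialised without reaching into a private field.
    def __new__(cls, raw_markdown: str, fit_markdown: str = ""):
        obj = super().__new__(cls, raw_markdown)
        obj.fit_markdown = fit_markdown
        return obj
```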

Correctly initialise BrowserConfig and CrawlerRunConfig from kwargs to ensure legacy parameters work as expected.

Add ruff settings to pyproject.toml to prevent deprecation warnings beyond our control and to disable formatting for now to avoid the noise it would generate.

How Has This Been Tested?

This has been tested both separately with the updated tests and in combination with the wider tests as part of #891.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

stevenh changed the title fix: caching, file downloads and js_code handling fix: caching, downloads, js handling and browser reuse Apr 10, 2025
stevenh changed the title fix: caching, downloads, js handling and browser reuse fix: caching, downloads, js_code and browser reuse Apr 10, 2025
stevenh changed the title fix: caching, downloads, js_code and browser reuse fix: caching, downloads, js_code and browser manager Apr 10, 2025
stevenh force-pushed the fix/caching-and-downloads branch 2 times, most recently from a026f1b to 2a80bd3 on April 10, 2025 14:30
stevenh added 2 commits April 10, 2025 15:32
Add relevant tests for the following:
- markdown generator
- http crawler strategy
Add ruff settings to pyproject.toml to prevent deprecation warnings
beyond our control and to disable formatting for now to avoid the noise
it would generate.
stevenh force-pushed the fix/caching-and-downloads branch from 2a80bd3 to 712b033 on April 10, 2025 14:32
stevenh marked this pull request as ready for review April 10, 2025 14:37
stevenh mentioned this pull request Apr 10, 2025
stevenh commented Apr 10, 2025

While this removes the browser manager singleton optimisation, we might want to revisit that if it aligns with a desired API use case. However, that would require more careful design and discussion, so I went down the correct-vs-performant route for now.

stevenh commented Aug 18, 2025

Closing as this never got any traction, and we've since moved away from crawl4ai.

If someone wants to pick up the branch and reuse it, feel free.

@stevenh stevenh closed this Aug 18, 2025