fix: caching, downloads, js_code and browser manager #969
Closed
Conversation
Force-pushed from a026f1b to 2a80bd3
Add relevant tests for the markdown generator and the http crawler strategy.
Add ruff settings to pyproject.toml to prevent deprecation warnings beyond our control and to disable formatting for now to avoid the noise it would generate.
Force-pushed from 2a80bd3 to 712b033
Author: While this removes the browser manager singleton optimisation, we may want to revisit that if it aligns with a desired API use case. However, that would require more careful design and discussion, hence I went down the correct-over-performant route for now.
Author: Closing as this never got any traction, so we've moved away from crawl4ai. If someone wants to pick up the branch and reuse it, feel free.
Summary
Fixes missing metadata in the page cache, incomplete file downloads, the handling of js_code, and Playwright browser reuse, alongside adjacent issues in the affected files; see below for details.
Details
Fix file download handling in AsyncPlaywrightCrawlerStrategy, which wasn't waiting for the download to complete before returning, resulting in race conditions and incomplete or missing downloads.
Fix incorrect field use for headers and user_agent in AsyncPlaywrightCrawlerStrategy.
Fix the logger type in AsyncPlaywrightCrawlerStrategy to the base class Logger, so the logger can be used correctly without type errors.
Add missing abstract methods to AsyncCrawlerStrategy and implement in its subclasses.
Fix the extraction of the error message when processing console error log in log_console.
Fix the handling of js_code in AsyncPlaywrightCrawlerStrategy so that scripts starting with a variable or constant declaration are executed correctly and their result is returned.
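The js_code fix above can be sketched as follows. This is a minimal illustration, not crawl4ai's actual internals: the function name and wrapping strategy are assumptions, on the premise that the script is ultimately run through something like Playwright's page.evaluate(), which wants an expression rather than a bare declaration.

```python
def wrap_js_code(script: str) -> str:
    # Hypothetical helper: a snippet that opens with a declaration
    # cannot be evaluated as a bare expression, so wrap it in an IIFE
    # so it executes and its value can be returned.
    stripped = script.lstrip()
    if stripped.startswith(("let ", "const ", "var ")):
        return f"(() => {{ {script} }})()"
    return script
```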
Remove unused time variables and commented-out code to improve readability.
Remove duplicate Exception handling.
Fix the handling of raw requests in AsyncHTTPCrawlerStrategy, which was previously using the parsed path, corrupting the request.
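To illustrate why the parsed path corrupts raw requests: urlparse(url).path silently drops any ? or # portion of the embedded document. A hypothetical helper (the function name and exact prefix handling are assumptions) that takes the body verbatim instead:

```python
def extract_raw_body(url: str) -> str:
    # Hypothetical helper: slice off the scheme verbatim instead of
    # using urlparse(url).path, which would strip anything after a
    # '?' or '#' in the embedded content and corrupt the request.
    for prefix in ("raw://", "raw:"):
        if url.startswith(prefix):
            return url[len(prefix):]
    raise ValueError(f"not a raw request: {url}")
```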
Fix the browser manager to ensure that it always returns a valid browser instance. This eliminates the class-level playwright instance, which wasn't being cleaned up on close, so subsequent callers received a closed instance and hit errors.
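The lifecycle problem can be sketched independently of Playwright. Both class names below are hypothetical; Handle stands in for the playwright instance (real code would call async_playwright().start() and .stop()), and the point is that the handle lives on the instance and is recreated after close, never shared at class level:

```python
class Handle:
    # Stand-in for a playwright instance.
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Manager:
    def __init__(self):
        self._handle = None  # per-instance state, never class-level

    def get(self):
        # Recreate the handle if it was never started or has been
        # closed, so a caller can never receive a stale, closed one.
        if self._handle is None or self._handle.closed:
            self._handle = Handle()
        return self._handle

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None
```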
Fix the storage of links, media and metadata to ensure that the correct values are stored and returned. This prevents incorrect results when using the cached results.
Use Field for default values in Media, Links and ScrapingResult pydantic models to prevent invalid results.
Handle the literal string "undefined" being passed to MediaItem width.
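A minimal sketch of that coercion; the function name is hypothetical, and the real model presumably does this inside a pydantic validator rather than a free function:

```python
def coerce_width(value):
    # Hypothetical coercion: pages sometimes report the literal string
    # "undefined" (or other junk) for an image width; treat anything
    # that isn't an integer as missing instead of raising.
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```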
Add parameters to the CrawlResult.__init__ method to provide type hints for the fields, enabling better type checking.
Fix database query retry handling for the case where a table is missing from the SQLite database. This prevents the crawler from failing when the table is not present, and stops tests from failing after one which drops the table.
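The retry shape can be sketched as below; the helper name, schema, and crawl_cache table name are assumptions for illustration, not the actual crawl4ai schema:

```python
import sqlite3

# Hypothetical schema for illustration only.
SCHEMA = "CREATE TABLE IF NOT EXISTS crawl_cache (url TEXT PRIMARY KEY, markdown TEXT)"

def execute_with_retry(conn, query, params=()):
    # If the table was dropped (e.g. by an earlier test), recreate it
    # and run the query once more instead of failing the crawl.
    try:
        return conn.execute(query, params)
    except sqlite3.OperationalError as exc:
        if "no such table" not in str(exc):
            raise
        conn.execute(SCHEMA)
        return conn.execute(query, params)
```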
Remove commit calls for queries which don't need them.
Sync the table definition in legacy code to prevent mixed legacy and new code tests from conflicting with each other.
Fix the caching of the markdown field in the DB / files, which was only storing a single value, causing failures when using cached results.
Export the markdown field in StringCompatibleMarkdown, so we don't need a private field to ensure the value is serialised correctly.
Correctly initialise BrowserConfig and CrawlerRunConfig from kwargs to ensure legacy parameters work as expected.
Add ruff settings to pyproject.toml to prevent deprecation warnings beyond our control and to disable formatting for now to avoid the noise it would generate.
How Has This Been Tested?
This has been tested both separately with the updated tests and in combination with the wider test suite as part of #891.
Checklist: