Commit e09474f

fix: core api including config and result handling
Fix a large number of issues with the core, including config serialisation and crawl result handling, all of which were identified during work to ensure that the existing tests pass consistently. This includes:

-------

fix: config serialisation

Fix config serialisation by creating a new Serialisable type and adding missing module imports for ScoringStats and Logger. This allows the config to be serialised and deserialised correctly.

Add missing initialisation for ScoringStats.

Add missing stats parameter to URLScorer and all its subclasses to ensure that the stats are serialisable.

-------

fix: parameter definitions, type hints and defaults

Fix parameter definitions by adding the missing Optional hint to those which default to None.

Add type hints to improve linting validation and IDE support.

Set default values for parameters which were missing them, ensuring type safety and preventing runtime errors.

-------

fix: various comment typos

Fix various typos in comments and doc strings to improve clarity.

-------

fix: BaseDispatcher missing abstract methods

Add missing abstract methods to BaseDispatcher and implement them in subclasses.

-------

fix: crawl result handling

Fix the handling of crawl results, which were using inconsistent types. This now uses CrawlResultContainer for all crawl results, unwrapping as needed when performing deep crawls.

This moves CrawlResultContainer into models, ensuring it can be imported where needed and avoiding circular imports.

Refactor CrawlResultContainer to subclass CrawlResult to provide type hinting in the single result case and ensure consistent handling of both synchronous and asynchronous results.

-------

feat: implement run_urls_stream for SemaphoreDispatcher

Implement run_urls_stream for SemaphoreDispatcher to allow streaming of URLs to be crawled.

-------

chore: translate non-English comments

Translate non-English comments to English to improve readability and maintainability for non-native speakers.

-------

chore: additional example for arun

Add examples for arun to demonstrate usage of streamed and batch processing modes, clarifying the impact of deep crawl on the results.

-------

fix: handling of CrawlerRunConfig

Fix the handling of CrawlerRunConfig to ensure that the config is correctly initialised from legacy kwargs.

-------

fix: invalid screenshot argument to aprocess_html

Fix a failure when generating a screenshot, caused by an unsupported argument being passed to aprocess_html.

-------

fix: structured content extraction

Fix structured content extraction not being run when NoExtractionStrategy is used. This ensures that the content is extracted correctly and returned in the crawl result.

-------

fix: aclear_cache

Fix aclear_cache. Previously it only performed clean-up, which just closed active connections. It now calls aclear_db to remove all entries from the cache table.

-------

fix: unused imports

Remove unused imports to improve readability and reduce clutter.

-------

fix: undefined filter_conf

Fix use of undefined filter_conf in crawl_cmd when the config is not passed in and neither markdown-fit nor md-fit is specified as an output option.

-------

fix: bs4 imports

Fix bs4 imports to use the correct module name and ensure that the correct classes are imported. This addresses linter errors.

-------

fix: BM25Okapi idf calculation

Fix the idf calculation in BM25Okapi to use the correct formula. This prevents missing results when using BM25Okapi caused by zero idf values.

Remove commented-out code to improve readability.

Eliminate unnecessary tag_weight calculation when score is 0.

-------

fix: relevant content filter

Fix the extraction of headers from the response. Previously this only handled h1 tags; it now handles all header tags.

Fix the boundary check for the relevant content filter to ensure that the content is not excluded when the end aligns with the 150-character limit.
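To illustrate how a zero idf can cause missing results in BM25, here is a minimal sketch. It compares the classic Okapi idf with a commonly used smoothed variant that adds 1 inside the logarithm. The function names are hypothetical, and this is not necessarily the formula the commit adopted; the commit message only says the calculation was corrected.

```python
import math


def idf_classic(n_docs: int, doc_freq: int) -> float:
    """Classic Okapi idf: zero when a term is in half the documents,
    negative when it is in more than half."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_smoothed(n_docs: int, doc_freq: int) -> float:
    """Smoothed variant (assumption, for illustration): the +1 keeps
    the value strictly positive for any doc_freq <= n_docs."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


if __name__ == "__main__":
    # A term found in 1 of 2 documents contributes nothing with the
    # classic formula, so a document matching only that term scores 0.
    print(idf_classic(2, 1), idf_smoothed(2, 1))
```

With small corpora, such as the chunks of a single page, a query term often appears in half or more of the chunks, which is exactly where the classic formula stops contributing to the score.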

-------

fix: return type for extract_text_chunks

Fix the return type for extract_text_chunks to include the right types and values, so consumers know what to expect.

-------

fix: invalid markdown parameter to ScrapingResult

Remove all references to markdown for ScrapingResult, as the source never returns markdown, so it's pointless to include it in the result.

-------

fix: closest parent description

Clean unnecessary white space from the description returned by find_closest_parent_with_useful_text so that the two different strategies return consistent results.

-------

fix: potential sources for element extraction

Fix the potential sources for element extraction to ensure that srcset and data-lazy-src are processed instead of srcsetdata-lazy-src, which resulted from a missing comma.

-------

fix: data validation in web scraping

Add missing data validation, ensuring that only set values of the correct types are processed.

-------

fix: missing json import for AmazonProductCrawler

Add missing json import for AmazonProductCrawler.

-------

fix: GoogleSearchCrawler checks and errors

Correct the result field used to report the error message when arun fails. This ensures that the error message is correctly reported.

Add missing check for result.js_execution_result before use.

Add missing check for blank cleaned_html.

Add validation of cleaned_html to ensure that the value is correctly converted from bytes to str.

-------

fix: abstract method definitions

Fix the abstract method definitions to ensure that the correct types are used for async generators and iterators. This fixes lint errors for incompatible overrides by their implementers, which are caused by the type being adjusted when a yield is present.

-------

fix: invalid use of infinity for int values

Fix the use of infinity for int values flagged by the linter. Instead we use -1 to indicate that the value is not set.

Fix use of unset url parameter, reported by the linter.
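The abstract method issue described above comes from how Python types `async def`. Without a `yield` in the body, a type checker treats `async def f() -> AsyncGenerator[...]` as a coroutine that returns a generator, but with a `yield` the function itself is the async generator. The sketch below shows one way to keep base and subclass signatures compatible. The class and method names are hypothetical and are not taken from the commit.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncGenerator, List


class BaseStreamer(ABC):
    """Hypothetical base class used only for illustration."""

    @abstractmethod
    def stream(self, urls: List[str]) -> AsyncGenerator[str, None]:
        # Declared with plain ``def``: calling it returns the async
        # generator directly, which is what implementers provide.
        ...


class EchoStreamer(BaseStreamer):
    async def stream(self, urls: List[str]) -> AsyncGenerator[str, None]:
        # The ``yield`` makes this an async generator function, so the
        # override is compatible with the base declaration.
        for url in urls:
            yield url


async def collect(urls: List[str]) -> List[str]:
    """Drain the stream into a list."""
    return [item async for item in EchoStreamer().stream(urls)]


if __name__ == "__main__":
    print(asyncio.run(collect(["https://example.com"])))
```

An alternative is to keep `async def` in the base class and put a `yield` in its body; either way the goal is that both declarations describe the same kind of callable.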

-------

chore: remove unneeded loop on batch results

Remove unnecessary duplicate loop on batch results in _arun_batch to calculate _pages_crawled.

-------

fix: invalid use of lambda to define async function

Replace the use of lambda to define an async function with a normal function definition.

-------

fix: validate use of deep_crawl_strategy before use

Validate the type of deep_crawl_strategy before use to ensure that the correct type is used, preventing lint errors on method calls that use it.

-------

fix: unprocessed FilterChain tasks

Ensure that all tasks in the FilterChain are either processed and returned or cancelled if not needed, preventing runtime warnings.

-------

feat: add support for transport to Crawl4aiDockerClient

Add the ability to provide a custom transport to Crawl4aiDockerClient, which allows easy testing.

Set base_url on the httpx.AsyncClient to avoid the need for a local variable and calculations, simplifying the code.

-------

fix: Crawl4aiDockerClient.crawl results on error

Correctly handle the async data returned by the crawl method in Crawl4aiDockerClient when the stream results in a non-200 response.

-------

fix: linter errors for lxml imports

Fix linter errors for lxml imports by importing the required methods directly.

-------

fix: use of unset llm_config

Fix the use of unset llm_config in generate_schema to ensure that we don't get a runtime error.

-------

fix: missing meta field for hub crawlers

Support both ways that metadata is defined by hub crawlers.

-------

fix: correct return type for get_cached_url

Correct the type and number of returned values for get_cached_url.

-------

fix: DocsManager generation

Fix the generation performed by DocsManager now that docs have a hierarchical structure.

-------

fix: linting errors in calculate_semaphore_count

Ensure we set a default for the resource values returned by os methods.
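The FilterChain item above describes a general asyncio pattern: when filters run concurrently and one rejects early, the remaining tasks should be cancelled and awaited so that none are left pending. The following is a minimal sketch of that pattern under stated assumptions. `passes_all`, `reject` and `slow_accept` are hypothetical names, and this is not the commit's actual FilterChain code.

```python
import asyncio
from typing import Awaitable, Callable, List

UrlFilter = Callable[[str], Awaitable[bool]]


async def passes_all(url: str, filters: List[UrlFilter]) -> bool:
    """Return False as soon as any filter rejects the URL."""
    tasks = [asyncio.ensure_future(f(url)) for f in filters]
    try:
        for finished in asyncio.as_completed(tasks):
            if not await finished:
                return False
        return True
    finally:
        # Cancel anything still pending (a no-op for finished tasks),
        # then wait for the cancellations to settle before returning.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def reject(url: str) -> bool:
    """Example filter that rejects immediately."""
    return False


async def slow_accept(url: str) -> bool:
    """Example filter that would accept after a long delay."""
    await asyncio.sleep(10)
    return True


if __name__ == "__main__":
    # Returns quickly: slow_accept is cancelled once reject finishes.
    print(asyncio.run(passes_all("https://example.com", [slow_accept, reject])))
```

Waiting on the cancelled tasks with `return_exceptions=True` is the step that ensures each task has actually finished by the time the function returns.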

-------

fix: port handling in get_base_domain

Maintain the port for non-default ports in get_base_domain to ensure that the correct domain is returned. This prevents local links being incorrectly classified as external.

-------

fix: get_title method type

Add the static method decorator to the get_title method to ensure that it is correctly identified as a static method and not an instance method. This prevents lint errors and ensures that the method is correctly called.

-------

chore: add project settings to improve developer experience

Add details to pyproject.toml to improve the developer experience, including:

* Configuring pytest test timeouts, ignoring external warnings and asyncio scope.
* Disabling ruff formatting
* Creating developer package targets: dev, docker and test
* Leveraging chaining to simplify maintenance of the all group

-------

fix: WebScrapingStrategy ascrap

Fix the ascrap command by calling scrap instead of _scrap, which misses much of the functionality.

-------

fix: scraping local sites

Fix the scraping of local sites by removing the check for a dot in the parsed network location.

-------

fix: monitor support on pseudo terminals

Fix monitor support on pseudo terminals by setting the default for enable_ui to the value of sys.stdin.isatty(). This ensures it works under pytest if not specifically set.

-------

fix: imports

Add missing imports and remove unused ones, and eliminate the use of wildcard imports.

-------

fix: dependencies

Add missing optional dependencies.

-------

chore: remove deprecated licence tags

Update setup.py and pyproject.toml to remove deprecated license tags, bumping the dependency to support the new method. This eliminates the TOML lint warning in pyproject.toml.
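The port handling change can be pictured with a small sketch built on the standard library. `base_domain_with_port` is a hypothetical helper, not the project's get_base_domain, and it deliberately skips anything else the real function may do (such as subdomain handling). It only shows why keeping a non-default port stops two local services from comparing as the same site.

```python
from urllib.parse import urlsplit

# Ports that are implied by the scheme and can be dropped safely.
DEFAULT_PORTS = {"http": 80, "https": 443}


def base_domain_with_port(url: str) -> str:
    """Return the host, plus the port when it is not the scheme default."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    port = parts.port
    if port is None or DEFAULT_PORTS.get(parts.scheme) == port:
        return host
    return f"{host}:{port}"


if __name__ == "__main__":
    page = "http://localhost:8080/index.html"
    link = "http://localhost:8080/about"
    # Same host and port, so the link is internal to the page's site.
    print(base_domain_with_port(page) == base_domain_with_port(link))
```

If the port were dropped on one side but kept on the other, a link back to the same local server would fail the equality check and be treated as external, which is the misclassification the commit message describes.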
1 parent 712b033 commit e09474f

48 files changed, +1360 -784 lines changed

Diff for: .gitignore

+5 -1
@@ -50,6 +50,7 @@ coverage.xml
 .hypothesis/
 .pytest_cache/
 cover/
+tests/async/output/

 # Translations
 *.mo
@@ -257,4 +258,7 @@ continue_config.json
 .private/

 CLAUDE_MONITOR.md
-CLAUDE.md
+CLAUDE.md
+
+# Test output
+logs/

Diff for: crawl4ai/__init__.py

+4
@@ -1,5 +1,6 @@
 # __init__.py
 import warnings
+from logging import Logger

 from .async_webcrawler import AsyncWebCrawler, CacheMode
 from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig
@@ -64,6 +65,7 @@
     DFSDeepCrawlStrategy,
     DeepCrawlDecorator,
 )
+from .deep_crawling.scorers import ScoringStats

 __all__ = [
     "AsyncLoggerBase",
@@ -121,6 +123,8 @@
     "Crawl4aiDockerClient",
     "ProxyRotationStrategy",
     "RoundRobinProxyStrategy",
+    "ScoringStats",
+    "Logger",  # Required for serialization
 ]
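The diff above exports ScoringStats and Logger from the package with the note "Required for serialization". One plausible reading, stated here as an assumption and not confirmed by the diff, is that the config serialiser stores a type name and later resolves it by looking that name up on the package. The sketch below models that idea with stand-in objects. `to_dict`, `from_dict`, the stand-in `ScoringStats` fields and the two namespaces are all hypothetical.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Dict


@dataclass
class ScoringStats:
    """Stand-in for the real class; these fields are illustrative only."""
    urls_scored: int = 0
    total_score: float = 0.0


def to_dict(obj: Any) -> Dict[str, Any]:
    """Record the type name alongside the instance attributes."""
    return {"type": type(obj).__name__, "params": dict(vars(obj))}


def from_dict(data: Dict[str, Any], namespace: Any) -> Any:
    """Resolve the type by name on the given namespace and rebuild it."""
    cls = getattr(namespace, data["type"], None)
    if cls is None:
        raise LookupError(f"{data['type']} is not exported by the package")
    return cls(**data["params"])


# Two stand-ins for a package namespace: before and after the export.
without_export = SimpleNamespace()
with_export = SimpleNamespace(ScoringStats=ScoringStats)

if __name__ == "__main__":
    payload = to_dict(ScoringStats(urls_scored=3, total_score=1.5))
    print(from_dict(payload, with_export))
```

Under this assumption, a type that is serialised by name but not exported can be written out but not read back, which would explain why the export list grew in this commit.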
