Commit e09474f
fix: core api including config and result handling
Fix a large number of issues with the core including config
serialisation and crawl result handling, which were all identified
during work to ensure that all the existing tests pass consistently.
This includes:
-------
fix: config serialisation
Fix config serialisation by creating a new Serialisable type and adding
missing module imports for ScoringStats and Logger.
This allows the config to be serialised and deserialised correctly.
Add missing initialisation for ScoringStats.
Add missing stats parameter to URLScorer and all its subclasses to
ensure that the stats are serialisable.
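The pattern above can be sketched as follows; `KeywordScorer` is a hypothetical stand-in for URLScorer's subclasses, and `ScoringStats` is reduced to two illustrative counters so the round-trip is visible:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ScoringStats:
    # illustrative fields only; the real class tracks more
    urls_scored: int = 0
    total_score: float = 0.0

@dataclass
class KeywordScorer:
    """Hypothetical scorer showing the serialisable-stats pattern."""
    weight: float = 1.0
    stats: ScoringStats = field(default_factory=ScoringStats)

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "KeywordScorer":
        data = dict(data)
        # rebuild the nested stats object from its serialised form
        data["stats"] = ScoringStats(**data.get("stats", {}))
        return cls(**data)
```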
-------
fix: parameter definitions, type hints and defaults
Fix parameter definitions by adding the missing Optional hint to those
which default to None.
Add type hints to improve linting validation and IDE support.
Set default values for parameters which were missing them, ensuring
type safety and preventing runtime errors.
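A minimal illustration of the convention (the function and parameter names here are invented for the example):

```python
from typing import Optional

# A parameter defaulting to None must be typed Optional; a parameter that
# previously had no default gets an explicit safe one.
def fetch_page(url: str, timeout: Optional[float] = None, retries: int = 3) -> str:
    effective_timeout = 30.0 if timeout is None else timeout
    return f"GET {url} (timeout={effective_timeout}, retries={retries})"
```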
-------
fix: various comment typos
Fix various typos in comments and doc strings to improve clarity.
-------
fix: BaseDispatcher missing abstract methods
Add missing abstract methods to BaseDispatcher and implement in
subclasses.
-------
fix: crawl result handling
Fix the handling of crawl results, which were using inconsistent types.
This now uses CrawlResultContainer for all crawl results, unwrapping as
needed when performing deep crawls.
This moves CrawlResultContainer into models, ensuring it can be imported
where needed and avoiding circular imports.
Refactor CrawlResultContainer to subclass CrawlResult to provide type
hinting in the single result case and ensure consistent handling of
both synchronous and asynchronous results.
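A minimal sketch of the subclassing pattern described above; the real CrawlResult carries many more fields than shown here:

```python
class CrawlResult:
    def __init__(self, url: str, success: bool = True):
        self.url = url
        self.success = success

class CrawlResultContainer(CrawlResult):
    """Wraps one or more results. Subclassing CrawlResult means the
    common single-result case type-checks as a plain result, while deep
    crawls can iterate the wrapped list."""
    def __init__(self, *results: CrawlResult):
        self._results = list(results)
        first = self._results[0]
        # delegate the single-result attributes to the first entry
        super().__init__(first.url, first.success)

    def __iter__(self):
        return iter(self._results)

    def __len__(self):
        return len(self._results)
```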
-------
feat: implement run_urls_stream for SemaphoreDispatcher
Implement run_urls_stream for SemaphoreDispatcher to allow streaming
of URLs to be crawled.
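A hedged sketch (not the actual SemaphoreDispatcher code) of what a semaphore-bounded streaming dispatcher looks like: results are yielded as soon as each crawl finishes, with at most `limit` crawls in flight:

```python
import asyncio
from typing import AsyncIterator, Sequence

async def run_urls_stream(urls: Sequence[str], limit: int = 2) -> AsyncIterator[str]:
    sem = asyncio.Semaphore(limit)

    async def crawl(url: str) -> str:
        async with sem:  # bound concurrency with the semaphore
            await asyncio.sleep(0)  # stand-in for the real crawl
            return f"crawled {url}"

    tasks = [asyncio.create_task(crawl(u)) for u in urls]
    # yield each result as it completes rather than waiting for the batch
    for future in asyncio.as_completed(tasks):
        yield await future
```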
-------
chore: translate non-English comments
Translate non-English comments to English to improve readability and
maintainability for non-native speakers.
-------
chore: additional example for arun
Add examples for arun to demonstrate usage of streamed and batch
processing modes, clarifying the impact of deep crawl on the results.
-------
fix: handling of CrawlerRunConfig
Fix the handling of CrawlerRunConfig to ensure that the config is
correctly initialised from legacy kwargs.
-------
fix: invalid screenshot argument to aprocess_html
Fix a failure when generating a screenshot, caused by an unsupported
argument being passed to aprocess_html.
-------
fix: structured content extraction
Fix structured content extraction not being run when NoExtractionStrategy
is used. This ensures that the content is extracted correctly and
returned in the crawl result.
-------
fix: aclear_cache
Fix aclear_cache; previously it only performed cleanup, which just
closed active connections. It now calls aclear_db to remove all entries
from the cache table.
-------
fix: unused imports
Remove unused imports to improve readability and reduce clutter.
-------
fix: undefined filter_conf
Fix use of undefined filter_conf in crawl_cmd when the config is not
passed in and neither markdown-fix nor md-fit are specified as output
options.
-------
fix: bs4 imports
Fix bs4 imports to use the correct module name and ensure that the
correct classes are imported. This addresses linter errors.
-------
fix: BM25Okapi idf calculation
Fix the idf calculation in BM25Okapi to use the correct formula and
ensure that the idf is calculated correctly. This prevents missing
results when using BM25Okapi caused by zero idf values.
Removed commented out code to improve readability.
Eliminate unnecessary tag_weight calculation when score is 0.
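The commit does not quote the exact formula, so as a hedged illustration, here is one standard IDF variant that stays strictly positive and therefore cannot produce the zero-idf dropouts described:

```python
import math

# "+1" inside the log keeps the IDF strictly positive even for terms that
# appear in every document; this is a sketch, not crawl4ai's exact code.
def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
```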
-------
fix: relevant content filter
Fix the extraction of headers from the response; previously this only
handled h1 tags, now it handles all header tags.
Fix the boundary check for the relevant content filter to ensure that
content is not excluded when its end aligns with the 150 character limit.
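The off-by-one class of bug described above can be illustrated as follows (the function and limit handling are hypothetical, not the actual filter code):

```python
LIMIT = 150

def keep_chunk(start: int, end: int) -> bool:
    # `<` here would wrongly drop chunks whose end lands exactly on the limit
    return end - start <= LIMIT
```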
-------
fix: return type for extract_text_chunks
Fix the return type for extract_text_chunks to include the right types
and values, so consumers know what to expect.
-------
fix: invalid markdown parameter to ScrapingResult
Remove all references to markdown for ScrapingResult as the source never
returns markdown, so it's pointless to include it in the result.
-------
fix: closest parent description
Clean unnecessary white space from the description returned by
find_closest_parent_with_useful_text so that the two different
strategies return consistent results.
-------
fix: potential sources for element extraction
Fix the potential sources for element extraction to ensure that srcset
and data-lazy-src are processed instead of srcsetdata-lazy-src, which
resulted from a missing comma.
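This is a classic Python pitfall worth a two-line illustration: adjacent string literals are implicitly concatenated, so a missing comma silently merges two attribute names into one:

```python
# Missing comma: "srcset" and "data-lazy-src" fuse into one bogus name.
broken = ("src", "srcset" "data-lazy-src")
# With the comma restored, all three attribute names are distinct.
fixed = ("src", "srcset", "data-lazy-src")
```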
-------
fix: data validation in web scraping
Add missing data validation, ensuring that the correct types and only
set values are processed.
-------
fix: missing json import for AmazonProductCrawler
Add missing json import for AmazonProductCrawler.
-------
fix: GoogleSearchCrawler checks and errors
Correct the result field used to report the error message when arun
fails. This ensures that the error message is correctly reported.
Add missing check for result.js_execution_result before use.
Add missing check for blank cleaned_html.
Add validation of cleaned_html to ensure that the value is correctly
converted from bytes to str.
-------
fix: abstract method definitions
Fix the abstract method definitions to ensure that the correct types are
used for async generators and iterators. This fixes lint errors for
incompatible overrides in their implementers, caused by the return type
being adjusted when a yield is present.
-------
fix: invalid use of infinity for int values
Fix the use of infinity for int values flagged by the linter. Instead
we use -1 to indicate that the value is not set.
Fix use of an unset url parameter, reported by the linter.
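The sentinel change above can be sketched as follows (names are illustrative): `float("inf")` is not a valid value for an int-typed parameter, so -1 stands in for "not set":

```python
UNSET = -1  # sentinel replacing float("inf") for int-typed limits

def effective_limit(max_pages: int = UNSET) -> int:
    # fall back to a concrete ceiling when the caller left the limit unset
    return 10_000 if max_pages == UNSET else max_pages
```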
-------
chore: remove unneeded loop on batch results
Remove unnecessary duplicate loop on batch results in _arun_batch to
calculate _pages_crawled.
-------
fix: invalid use of lambda to define async function
Replace the use of lambda to define an async function with a normal
function definition.
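The reason for this replacement is a language constraint: a `lambda` body cannot contain `await`, so an async callback has to be a real `async def` (names here are illustrative):

```python
import asyncio

# Invalid: callback = lambda url: await process(url)   # SyntaxError
async def on_result(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for real async work
    return f"done: {url}"

callback = on_result  # the named coroutine function is passed instead
```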
-------
fix: validate use of deep_crawl_strategy before use
Validate the type of deep_crawl_strategy before use to ensure that the
correct type is used, preventing lint errors on method calls using it.
-------
fix: unprocessed FilterChain tasks
Ensure that all tasks in the FilterChain are either processed and
returned or cancelled if not needed, preventing runtime warnings.
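A minimal sketch of the cleanup pattern (illustrative names, not crawl4ai's FilterChain API): when the chain short-circuits, any still-pending filter tasks are cancelled and awaited so no "task was destroyed but it is pending" warnings are emitted:

```python
import asyncio

async def run_chain(filters, url: str) -> bool:
    tasks = [asyncio.create_task(f(url)) for f in filters]
    try:
        for task in tasks:
            if not await task:  # first rejection short-circuits the chain
                return False
        return True
    finally:
        # cancel and reap whatever has not finished, silencing warnings
        for task in tasks:
            if not task.done():
                task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```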
-------
feat: add support for transport to Crawl4aiDockerClient
Add the ability to provide a custom transport to Crawl4aiDockerClient
which allows easy testing.
Set base_url on the httpx.AsyncClient to avoid need for local variable
and calculations, simplifying the code.
-------
fix: Crawl4aiDockerClient.crawl results on error
Correctly handle the async data returned by the crawl method in
Crawl4aiDockerClient when the stream results in a non-200 response.
-------
fix: linter errors for lxml imports
Fix linter errors for lxml imports by importing the required functions
directly.
-------
fix: use of unset llm_config
Fix the use of unset llm_config in generate_schema to ensure that we
don't get a runtime error.
-------
fix: missing meta field for hub crawlers
Support both ways that metadata is defined by hub crawlers.
-------
fix: correct return type for get_cached_url
Correct the type and number of returned values for get_cached_url.
-------
fix: DocsManager generation
Fix the generation performed by DocsManager now that docs have a
hierarchical structure.
-------
fix: linting errors in calculate_semaphore_count
Ensure we set a default for the resource values returned by os methods.
-------
fix: port handling in get_base_domain
Maintain the port for non default ports in get_base_domain to ensure
that the correct domain is returned. This prevents local links being
incorrectly classified as external.
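A sketch of the behaviour described above (the real get_base_domain handles more cases): the netloc from urlparse keeps any explicit port, so `localhost:8000` links compare equal to their own base domain:

```python
from urllib.parse import urlparse

def get_base_domain(url: str) -> str:
    netloc = urlparse(url).netloc.lower()
    # strip a leading "www." but keep any explicit, non-default port
    return netloc[4:] if netloc.startswith("www.") else netloc
```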
-------
fix: get_title method type
Add static method decorator to get_title method to ensure that the
method is correctly identified as a static method and not an instance
method. This prevents lint errors and ensures that the method is
correctly called.
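The fix is a one-line decorator; without it, calling the method via the class passes the first argument as `self`. A sketch with a simplified body (the class name is illustrative):

```python
class Scraper:
    @staticmethod
    def get_title(html: str) -> str:
        # naive title extraction, just to give the method a body
        start = html.find("<title>") + len("<title>")
        end = html.find("</title>")
        return html[start:end] if start > 6 and end != -1 else ""
```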
-------
chore: add project settings to improve developer experience
Add details to pyproject.toml to improve the developer experience
including:
* Configuring pytest test timeouts, ignoring external warnings and
asyncio scope.
* Disabling ruff formatting
* Creating developer package targets: dev, docker and test
* Leverage chaining to simplify maintenance of the all group
-------
fix: WebScrapingStrategy ascrap
Fix the ascrap method by calling scrap instead of _scrap, which skips
much of the functionality.
-------
fix: scraping local sites
Fix the scraping of local sites by removing the check for dot in parsed
network location.
-------
fix: monitor support on pseudo terminals
Fix monitor support on pseudo terminals by setting the default for
enable_ui to the value of sys.stdin.isatty(), this ensures it works
under pytest if not specifically set.
-------
fix: imports
Add missing and remove unused imports as well as eliminating the use of
wildcard imports.
-------
fix: dependencies
Add missing optional dependencies.
-------
chore: remove deprecated licence tags
Update setup.py and pyproject.toml to remove deprecated license tags,
bumping the dependency to support the new method. This eliminates the
TOML lint warning in pyproject.toml.
-------
1 parent: 712b033
File tree (48 files changed, +1360 −784 lines changed)
- crawl4ai
  - browser
  - components
  - crawlers
    - amazon_product
    - google_search
  - deep_crawling
  - legacy
- deploy/docker
- docs
  - examples
  - md_v2/core
  - releases_review
- tests
  - 20241401
  - async