Fix: Wrong URL variable used for extraction of raw html #1447
base: develop
Important: Review skipped
Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI.
Walkthrough
The extraction call in crawl4ai/async_webcrawler.py's aprocess_html now passes _url instead of url to extraction_strategy.run, ensuring raw HTML inputs use a placeholder ("Raw HTML") while normal URL flows remain unchanged. No other logic, signatures, or control flow were modified.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Caller
participant Crawler as AsyncWebCrawler
participant Strategy as ExtractionStrategy
Caller->>Crawler: aprocess_html(input, url, is_raw_html)
alt is_raw_html
note right of Crawler: Set _url = "Raw HTML"
else not raw
note right of Crawler: Set _url = url
end
Crawler->>Strategy: run(preferred_content, _url)
Strategy-->>Crawler: extraction result
Crawler-->>Caller: result
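The branch in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: `resolve_extraction_url` and `RAW_HTML_PLACEHOLDER` are hypothetical names introduced here, and only the "Raw HTML" placeholder and the `is_raw_html` flag come from the walkthrough.

```python
# Placeholder string named in the walkthrough; the constant name is an assumption.
RAW_HTML_PLACEHOLDER = "Raw HTML"


def resolve_extraction_url(url: str, is_raw_html: bool) -> str:
    """Hypothetical helper: pick the identifier handed to the extraction strategy.

    For raw HTML input, `url` carries the markup itself, so a short
    placeholder is returned instead. Otherwise the real URL is passed through.
    """
    return RAW_HTML_PLACEHOLDER if is_raw_html else url
```

Keeping the choice in one place means every downstream consumer (logging, extraction, result metadata) sees the same identifier.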
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
crawl4ai/async_webcrawler.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
crawl4ai/async_webcrawler.py (1)
crawl4ai/extraction_strategy.py (5)
run (79-95), run (109-113), run (461-473), run (726-781), run (1047-1066)
🔇 Additional comments (1)
crawl4ai/async_webcrawler.py (1)
618-618: Fix is correct: pass _url to extraction strategy for raw HTML.
This prevents leaking the full raw HTML via the url param into extraction strategies while preserving the normal URL path. Matches the PR objective.
c2bcfb5 to 9c3cacd
Fix: Wrong URL variable used for extraction of raw html
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing
9c3cacd to edd0b57
This is a very important PR. It makes it possible to send raw HTML for extraction. Please merge.
Summary
Fixes #1116, #1178. Fixed the incorrect URL variable used for extraction when processing raw HTML content. The extraction strategy was receiving the full HTML content as the URL parameter instead of the properly formatted URL identifier.
List of files changed and why
crawl4ai/async_webcrawler.py - Fixed line 610 to use _url instead of url in the extraction strategy call. The _url variable handles raw HTML correctly: it holds the placeholder "Raw HTML", whereas url would pass the entire HTML content as the URL parameter.
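One way to check the behaviour described above is with a recording stand-in for the extraction strategy. The sketch below is self-contained and does not import crawl4ai: `RecordingStrategy` and `process_html` are hypothetical names, and the `run(url, sections)` parameter names are an assumption, so the call uses keyword arguments to stay independent of argument order.

```python
class RecordingStrategy:
    """Hypothetical stand-in for an extraction strategy; records each url it receives."""

    def __init__(self):
        self.seen_urls = []

    def run(self, url, sections):
        # Remember the identifier so a test can inspect it afterwards.
        self.seen_urls.append(url)
        return [{"index": i, "content": s} for i, s in enumerate(sections)]


def process_html(strategy, url, html, is_raw_html=False):
    """Simplified model of the fixed call site (not the real aprocess_html)."""
    # For raw HTML, `url` holds the markup itself, so use the placeholder.
    _url = "Raw HTML" if is_raw_html else url
    # The fix: pass _url, not url, to the strategy.
    return strategy.run(url=_url, sections=[html])


strategy = RecordingStrategy()
html = "<html><body><p>hello</p></body></html>"

process_html(strategy, url=html, html=html, is_raw_html=True)
process_html(strategy, url="https://example.com", html=html)

print(strategy.seen_urls)
```

With the pre-fix behaviour (passing `url` directly), the first recorded entry would be the full markup rather than the placeholder.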
How Has This Been Tested?
Local Testing:
Checklist: