Open
Labels
🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)
Description
crawl4ai version
0.7.4
Expected Behavior
When an HTML document contains a `<base>` tag, relative links should be resolved against the URL specified in the `<base>` tag's `href` attribute, as the HTML standard specifies.
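For reference, the expected resolution can be sketched with the standard library alone. This is a minimal illustration, not crawl4ai code: the names `BaseHrefFinder` and `resolve_link` are hypothetical, and the sample HTML and `example.com` URLs are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Hypothetical helper: records the href of the first <base> element that has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Only the first <base> with an href counts; later ones are ignored.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def resolve_link(html, page_url, link):
    """Resolve `link` against the document's <base> href, falling back to the page URL."""
    finder = BaseHrefFinder()
    finder.feed(html)
    # A <base> href may itself be relative, so resolve it against the page URL first.
    base = urljoin(page_url, finder.base_href) if finder.base_href else page_url
    return urljoin(base, link)


# Made-up document whose <base> points at the site root.
sample_html = (
    '<html><head><base href="https://example.com/"></head>'
    '<body><a href="files/doc.pdf">doc</a></body></html>'
)
resolved = resolve_link(sample_html, "https://example.com/index.php/page.html", "files/doc.pdf")
print(resolved)
```

With a `<base>` present, the link resolves under the site root; without one, the same helper falls back to the page URL.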
Current Behavior
Relative links are resolved against the `base_url` passed to `generate_markdown` (usually the page URL), ignoring the `<base>` tag present in the HTML content. This leads to incorrect URLs when the `<base>` tag modifies the base path for relative links.
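The incorrect URL in this report is what plain resolution against the page URL produces. The following stdlib illustration is not crawl4ai's code; the relative href and the `<base>` value of `https://www.philippsburg.de/` are assumptions inferred from the two URLs in the reproduction script, not values read from the page.

```python
from urllib.parse import urljoin

page_url = "https://www.philippsburg.de/index.php/oeffentliche-bekanntmachungen.html"

# Assumption: the link is written relative to the site root in the page source.
relative_href = (
    "files/philippsburg/Oeffentliche%20Bekanntgaben/2025/"
    "Neufassung%20Hundesteuersatzung.pdf"
)

# Assumption: the page declares <base href="https://www.philippsburg.de/">.
assumed_base_href = "https://www.philippsburg.de/"

# Resolving against the page URL keeps the /index.php/ directory in the path.
against_page = urljoin(page_url, relative_href)

# Resolving against the assumed <base> href drops it.
against_base = urljoin(assumed_base_href, relative_href)

print(against_page)
print(against_base)
```

Under these assumptions, the first result matches the incorrect URL and the second matches the correct one.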
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
1. Create an HTML string with a `<base>` tag pointing to a different directory/root than the page URL.
2. Add a relative link in the HTML.
3. Use `DefaultMarkdownGenerator` to convert the HTML to Markdown, passing the original page URL as `base_url`.
4. Observe that the link in the Markdown is incorrect (resolved against the page URL instead of the `<base>` tag URL).
Code snippets
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        # Use css_selector="#main" to simulate extracting a specific element,
        # which might strip <head> (and with it the <base> tag).
        result = await crawler.arun(
            url="https://www.philippsburg.de/index.php/oeffentliche-bekanntmachungen.html",
            css_selector="#main",  # also fails if this is not present
        )

    correct_url = "https://www.philippsburg.de/files/philippsburg/Oeffentliche%20Bekanntgaben/2025/Neufassung%20Hundesteuersatzung.pdf"
    incorrect_url = "https://www.philippsburg.de/index.php/files/philippsburg/Oeffentliche%20Bekanntgaben/2025/Neufassung%20Hundesteuersatzung.pdf"

    found_correct = correct_url in result.markdown
    found_incorrect = incorrect_url in result.markdown

    if found_correct:
        print("SUCCESS: Found correct URL.")
    if found_incorrect:
        print("ERROR: Found incorrect URL (relative link not resolved correctly).")


if __name__ == "__main__":
    asyncio.run(main())
OS
Windows
Python version
3.14.0
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response