[Bug]: html2text ignores <base> tag when resolving relative links #1680

@Joorrit

crawl4ai version

0.7.4

Expected Behavior

When an HTML document contains a <base> tag, relative links should be resolved against the URL specified in the <base> tag's href attribute, as per HTML standards.

Current Behavior

Relative links are resolved against the base_url passed to generate_markdown (usually the page URL), ignoring the <base> tag present in the HTML content. This leads to incorrect URLs when the <base> tag modifies the base path for relative links.
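The difference between the two behaviors can be illustrated with the standard library alone. This is a minimal sketch, not crawl4ai code. It assumes the page from the reproduction below declares `<base href="https://www.philippsburg.de/">`, which the report does not state explicitly but which is consistent with the correct and incorrect URLs it lists.

```python
from urllib.parse import urljoin

# Page URL taken from the reproduction script in this report.
page_url = "https://www.philippsburg.de/index.php/oeffentliche-bekanntmachungen.html"

# Assumption: the value of the page's <base href="..."> attribute.
base_href = "https://www.philippsburg.de/"

# Relative link as it would appear in the page's HTML.
relative_link = (
    "files/philippsburg/Oeffentliche%20Bekanntgaben/2025/"
    "Neufassung%20Hundesteuersatzung.pdf"
)

# Current behavior: resolve against the page URL, ignoring <base>.
resolved_against_page = urljoin(page_url, relative_link)

# Expected behavior: resolve the <base> href against the page URL first
# (it may itself be relative), then resolve the link against the result.
resolved_against_base = urljoin(urljoin(page_url, base_href), relative_link)

print("page URL as base:", resolved_against_page)
print("<base> as base:  ", resolved_against_base)
```

Under that assumption, the first result contains the spurious `/index.php/` segment and the second matches the URL the reporter expects.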

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

1. Create an HTML string with a `<base>` tag pointing to a different directory/root than the page URL.
2. Add a relative link in the HTML.
3. Use `DefaultMarkdownGenerator` to convert the HTML to Markdown, passing the original page URL as `base_url`.
4. Observe that the link in the Markdown is incorrect: it is resolved against the page URL instead of the URL given in the `<base>` tag.
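The steps above describe what the resolver should do. As a rough sketch of that logic using only the standard library, the helper below finds the first `<base href>` in an HTML string and returns the URL that relative links should be resolved against. The names `effective_base_url` and `_BaseHrefFinder` and the `example.com` URLs are hypothetical and are not part of crawl4ai.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> element that has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Only the first <base> with an href counts; later ones are ignored.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def effective_base_url(html, page_url):
    """Return the URL that relative links in `html` should resolve against."""
    finder = _BaseHrefFinder()
    finder.feed(html)
    if finder.base_href is None:
        return page_url
    # The <base> href may itself be relative to the page URL.
    return urljoin(page_url, finder.base_href.strip())


# Hypothetical document: <base> points at the site root, while the page
# itself lives under /index.php/.
html = """<html><head><base href="https://example.com/"></head>
<body><a href="files/report.pdf">Report</a></body></html>"""
page_url = "https://example.com/index.php/news.html"

base = effective_base_url(html, page_url)
print(urljoin(base, "files/report.pdf"))
```

Passing the returned value as `base_url` to `generate_markdown` may serve as a workaround until the library honors `<base>` itself; that is an untested assumption. The helper needs the full document, since `<base>` lives in `<head>` and would be absent from a fragment selected with `css_selector`.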

Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # css_selector="#main" restricts extraction to a single element, which
        # drops <head> and with it the <base> tag. The bug also reproduces
        # when css_selector is omitted.
        result = await crawler.arun(
            url="https://www.philippsburg.de/index.php/oeffentliche-bekanntmachungen.html",
            css_selector="#main",
        )
        
        correct_url = "https://www.philippsburg.de/files/philippsburg/Oeffentliche%20Bekanntgaben/2025/Neufassung%20Hundesteuersatzung.pdf"
        incorrect_url = "https://www.philippsburg.de/index.php/files/philippsburg/Oeffentliche%20Bekanntgaben/2025/Neufassung%20Hundesteuersatzung.pdf"

        found_correct = correct_url in result.markdown
        found_incorrect = incorrect_url in result.markdown
        
        if found_correct:
            print("SUCCESS: Found correct URL.")
            
        if found_incorrect:
            print("ERROR: Found incorrect URL (relative link not resolved correctly).")

if __name__ == "__main__":
    asyncio.run(main())

OS

Windows

Python version

3.14.0

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response


Labels

🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)
