[Bug]: Crawler won't extract table content #1278

Open
@tropxy

Description

crawl4ai version

0.6.3

Expected Behavior

The crawler should extract the table content requested in the LLM instruction.

Current Behavior

The crawler extracts the high-level Markdown of the page, but not the table content specified in the instruction.

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Despite many attempts at changing the extraction approach, including using schemas, I was not able to get the crawler to return the right result.
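For reference, one of the schema-based attempts could be expressed as a plain JSON-schema dict (no Pydantic model needed) passed as `schema=` with `extraction_type="schema"`. This is a hypothetical sketch: the field names are guesses taken from the extraction instruction in the snippet below, not from the actual `ChargerStatusDetails` model.

```python
# Hypothetical JSON schema for extraction_type="schema".
# Field names are inferred from the instruction text (Status, errorCode,
# vendorErrorCode, overall station status) and may not match the real model.
charger_status_schema = {
    "type": "object",
    "properties": {
        "command": {"type": "string"},          # e.g. "StatusNotification"
        "status": {"type": "string"},
        "errorCode": {"type": "string"},
        "vendorErrorCode": {"type": "string"},
        "stationStatus": {"type": "string"},
    },
    "required": ["command", "status", "errorCode"],
}

print(sorted(charger_status_schema["properties"]))
```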

Code snippets

I used this code:


import json
import logging
import os

# Assumed to be the standard crawl4ai top-level exports.
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    LLMConfig,
    LLMExtractionStrategy,
)

logger = logging.getLogger(__name__)
# `settings` comes from the surrounding application's config module.


async def get_charger_status_llm(charger_url: str) -> str:
    os.environ["OPENAI_API_KEY"] = settings.LLM_API_KEY

    # 1. Define the LLM extraction strategy
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")),
        # schema=ChargerStatusDetails.model_json_schema(),
        # extraction_type="schema",
        extraction_type="block",
        instruction=(
            "Look for the OCPP Log contained in a table that has this HTML "
            "'<div class='tab ocpp-log relationship-tab' label='OCPP log'>' "
            "and extract the most recent (the one on the top) Status Notification "
            "COMMAND, which is under the OCPP Log section, and its details, "
            "including Status, errorCode, vendorErrorCode, and the overall "
            "station status."
        ),
        chunk_token_threshold=4096,
        apply_chunking=True,
        input_format="markdown",  # or "html", "fit_markdown"
    )

    # 2. Build the crawler config
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
    )

    # 3. Create a browser config if needed
    browser_config = BrowserConfig(
        headless=settings.HEADLESS_BROWSER,
        verbose=True,
        extra_args=[
            "--disable-gpu",
            "--no-sandbox",
            "--disable-dev-shm-usage",
            "--disable-setuid-sandbox",
            "--disable-images",
            "--disable-fonts",
        ],
        storage_state=settings.COOKIES_FILE_PATH,
        java_script_enabled=True,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # 4. Crawl a single page
        result = await crawler.arun(url=charger_url, config=crawl_config)

        if result.success:
            # 5. The extracted content is a JSON string
            data = json.loads(result.extracted_content)
            logger.debug(f"Extracted items: {data}")

            # 6. Show usage stats (prints token usage)
            llm_strategy.show_usage()
            return data
        else:
            # logging takes one message string, not print-style varargs
            logger.error(f"Error: {result.error_message}")
            return ""
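Since the OCPP log sits inside a tab (`<div class="tab ocpp-log relationship-tab" ...>`), it may simply be absent from the Markdown the LLM receives. A quick way to check whether the table content is recoverable at all is to parse the saved HTML attachment directly with the standard library, bypassing the LLM entirely. The sample HTML below is a hypothetical, simplified stand-in for the attached ampeco_ocpp_log_table.html.txt; the real markup may differ.

```python
from html.parser import HTMLParser


class OcppTableParser(HTMLParser):
    """Collects the cell text of every <table> row as a list of lists."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row currently open, or None
        self._cell = None   # text fragments of the cell currently open, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append(" ".join(c for c in self._cell if c))
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None


# Hypothetical, simplified stand-in for the attached OCPP log HTML.
sample_html = """
<div class="tab ocpp-log relationship-tab" label="OCPP log">
  <table>
    <tr><th>Command</th><th>Status</th><th>errorCode</th></tr>
    <tr><td>StatusNotification</td><td>Faulted</td><td>OtherError</td></tr>
  </table>
</div>
"""

parser = OcppTableParser()
parser.feed(sample_html)

# Pair the header row with the most recent entry (the row on top).
latest = dict(zip(parser.rows[0], parser.rows[1]))
print(latest)  # → {'Command': 'StatusNotification', 'Status': 'Faulted', 'errorCode': 'OtherError'}
```

If this works on the real attachment but the LLM path still fails, that would point at the tab content never reaching the Markdown conversion, rather than at the extraction instruction itself.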

OS

macOS

Python version

3.11

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

html_ampeco.html.txt

ampeco_ocpp_log_table.html.txt

Metadata

    Labels

    🐞 Bug (Something isn't working) · 🩺 Needs Triage (Needs attention of maintainers)
