Description
crawl4ai version
0.6.3
Expected Behavior
The crawler should extract the table content specified in the LLM instruction.
Current Behavior
It extracts the page's high-level Markdown but not the table content specified in the instruction.
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Despite many attempts at changing how the content is extracted, including using schemas, I was not able to get the crawler to return the right result.
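For reference, the schema-based variant (the commented-out lines in the snippet below) used a Pydantic model roughly like this; the field names are only my reading of the OCPP table columns, not confirmed against the page:

from pydantic import BaseModel, Field

class ChargerStatusDetails(BaseModel):
    # Field names mirror the columns of the OCPP log table (best guess).
    status: str = Field(description="Connector status from the latest StatusNotification")
    errorCode: str = Field(description="errorCode reported by the charger")
    vendorErrorCode: str = Field(default="", description="Vendor-specific error code, if present")
    station_status: str = Field(default="", description="Overall station status")

This model was passed as schema=ChargerStatusDetails.model_json_schema() with extraction_type="schema", as in the commented-out lines in the code snippet below.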
Code snippets
I have used this code:
import json
import os

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# `settings` and `logger` come from the surrounding application and are not shown here.

async def get_charger_status_llm(charger_url: str) -> str:
    os.environ["OPENAI_API_KEY"] = settings.LLM_API_KEY

    # 1. Define the LLM extraction strategy
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")),
        # schema=ChargerStatusDetails.model_json_schema(),
        # extraction_type="schema",
        extraction_type="block",
        instruction=(
            "Look for the OCPP Log contained in a table that has this HTML "
            "'<div class='tab ocpp-log relationship-tab' label='OCPP log'>' "
            "and extract the most recent (the one on the top) Status Notification COMMAND, "
            "which is under the OCPP Log section, and its details, including Status, "
            "errorCode, vendorErrorCode, and the overall station status."
        ),
        chunk_token_threshold=4096,
        apply_chunking=True,
        input_format="markdown",  # or "html", "fit_markdown"
    )

    # 2. Build the crawler config
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
    )

    # 3. Create a browser config if needed
    browser_config = BrowserConfig(
        headless=settings.HEADLESS_BROWSER,
        verbose=True,
        extra_args=[
            "--disable-gpu",
            "--no-sandbox",
            "--disable-dev-shm-usage",
            "--disable-setuid-sandbox",
            "--disable-images",
            "--disable-fonts",
        ],
        storage_state=settings.COOKIES_FILE_PATH,
        java_script_enabled=True,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # 4. Crawl a single page
        result = await crawler.arun(url=charger_url, config=crawl_config)

        if result.success:
            # 5. The extracted content is a JSON string
            data = json.loads(result.extracted_content)
            logger.debug(f"Extracted items: {data}")

            # 6. Show usage stats
            llm_strategy.show_usage()  # prints token usage
            return data
        else:
            logger.error(f"Error: {result.error_message}")
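For completeness, here is a rough sketch of the other route I looked at: a CSS-based schema (JsonCssExtractionStrategy) pointed directly at the OCPP log tab. The row/cell selectors are placeholders, since I only know the wrapper div quoted in the instruction, not the actual table markup:

from crawl4ai import CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Placeholder selectors: the real row/cell structure must be taken from the page source.
ocpp_schema = {
    "name": "OCPP log entries",
    "baseSelector": "div.tab.ocpp-log.relationship-tab table tr",
    "fields": [
        {"name": "command", "selector": "td:nth-child(1)", "type": "text"},
        {"name": "status", "selector": "td:nth-child(2)", "type": "text"},
        {"name": "errorCode", "selector": "td:nth-child(3)", "type": "text"},
        {"name": "vendorErrorCode", "selector": "td:nth-child(4)", "type": "text"},
    ],
}

css_config = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(ocpp_schema),
    cache_mode=CacheMode.BYPASS,
)
# css_config would be used in place of crawl_config in the arun() call above.

This variant also did not return the table rows, which is why I suspect the content is not reaching the extraction step in the first place.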
OS
macOS
Python version
3.11
Browser
No response
Browser version
No response