Closed
Labels
⚙ Done (Bug fix, enhancement, FR that's completed pending release) · 🐞 Bug (Something isn't working) · 📌 Root caused (identified the root cause of bug)
Description
crawl4ai version
0.5.0.post8
Expected Behavior
If the rate limit is hit, the user should be informed.
Current Behavior
When the rate limit is still exceeded after all retries, perform_completion_with_backoff returns a list, which LLMExtractionStrategy.extract does not handle: it then tries to access the response's usage field, which does not exist on a list, so extracted_content ends up as:
```json
[
    {
        "index": 0,
        "error": true,
        "tags": [
            "error"
        ],
        "content": "'list' object has no attribute 'usage'"
    }
]
```

Is this reproducible?
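For illustration, a minimal sketch of the failure mode described above. The function names here are hypothetical stand-ins, not the actual crawl4ai internals: when the retry loop gives up, the payload is a plain list, and attribute access on it raises the AttributeError that surfaces as the string in extracted_content.

```python
def perform_completion_with_backoff_exhausted() -> list[dict]:
    # Stand-in for the helper after exhausting retries: it returns an
    # error list instead of a response object carrying a `usage` attribute.
    return [{"index": 0, "error": True, "tags": ["error"]}]


def extract_usage(response: object) -> str:
    # Stand-in for the unguarded access in LLMExtractionStrategy.extract.
    try:
        return str(response.usage)  # type: ignore[attr-defined]
    except AttributeError as exc:
        # This message is what ends up in extracted_content.
        return str(exc)


print(extract_usage(perform_completion_with_backoff_exhausted()))
# → 'list' object has no attribute 'usage'
```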
Yes
Inputs Causing the Bug
Any request that hits the rate limit for more than 2 retries.

Steps to Reproduce
Perform a crawl using LLMExtractionStrategy.

Code snippets
"""Test LLM extraction strategy for job postings."""
import json
import logging
import os
import sys
from typing import TYPE_CHECKING, Any
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.async_configs import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
import pytest
if TYPE_CHECKING:
from crawl4ai.models import CrawlResult
_LOGGER: logging.Logger = logging.getLogger(__name__)
class JobRequirement(BaseModel):
"""Schema for job requirements."""
category: str = Field(
description="Category of the requirement (e.g., Technical, Soft Skills)",
)
items: list[str] = Field(
description="List of specific requirements in this category",
)
priority: str = Field(
description="Priority level (Required/Preferred) based on the HTML class or context",
)
class JobPosting(BaseModel):
"""Schema for job postings."""
title: str = Field(description="Job title")
department: str = Field(description="Department or team")
location: str = Field(description="Job location, including remote options")
salary_range: str | None = Field(description="Salary range if specified")
requirements: list[JobRequirement] = Field(
description="Categorized job requirements",
)
application_deadline: str | None = Field(
description="Application deadline if specified",
)
contact_info: dict | None = Field(
description="Contact information from footer or contact section",
)
@pytest.mark.asyncio
async def test_llm_extraction() -> None:
"""Crawl job postings and extract details."""
api_key: str | None = os.environ.get("OPENAI_API_KEY")
if not api_key:
msg: str = "OPENAI_API_KEY environment variable not set"
raise ValueError(msg)
browser_config: BrowserConfig = BrowserConfig(
verbose=False,
extra_args=[
"--disable-gpu",
"--disable-dev-shm-usage",
"--no-sandbox",
],
)
extraction_strategy: LLMExtractionStrategy = LLMExtractionStrategy(
llm_config=LLMConfig(
provider="openai/gpt-4o",
api_token=api_key,
),
schema=JobPosting.model_json_schema(),
extraction_type="schema",
instruction="""
Extract job posting details, using HTML structure to:
1. Identify requirement priorities from CSS classes (e.g., 'required' vs 'preferred')
2. Extract contact info from the page footer or dedicated contact section
3. Parse salary information from specially formatted elements
4. Determine application deadline from timestamp or date elements
Use HTML attributes and classes to enhance extraction accuracy.
""",
input_format="html",
# chunk_token_threshold=chunk_token_threshold,
)
config: CrawlerRunConfig = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
stream=True,
extraction_strategy=extraction_strategy,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result: CrawlResult
async for result in await crawler.arun_many(
urls=[
"https://www.rocketscience.gg/careers/c77fbdec-fce6-44f1-a05e-8cd76325a1a0/",
],
config=config,
):
assert result.success
assert result.extracted_content
extracted_content: list[dict[str, Any]] = json.loads(result.extracted_content)
assert len(extracted_content) == 1
if __name__ == "__main__":
import subprocess
sys.exit(subprocess.call(["pytest", *sys.argv[1:], sys.argv[0]])) # noqa: S603, S607OS
macOS
Python version
3.12.9
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
```
platform darwin -- Python 3.12.9, pytest-8.3.5, pluggy-1.5.0
rootdir: xxx
configfile: pyproject.toml
plugins: anyio-4.9.0, logfire-3.12.0, pytest_httpserver-1.1.3, asyncio-0.26.0, mock-3.14.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 1 item
tests/test_extraction.py [FETCH]... ↓ http://localhost:51368/engineering-manager... | Status: True | Time: 0.74s
[SCRAPE].. ◆ http://localhost:51368/engineering-manager... | Time: 0.096s
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.
Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
Waiting for 2 seconds before retrying...
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.
Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
Waiting for 4 seconds before retrying...
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.
Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
[EXTRACT]. ■ Completed for http://localhost:51368/engineering-manager... | Time: 14.378983207978308s
[COMPLETE] ● http://localhost:51368/engineering-manager... | Status: True | Total: 15.22s
```
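One possible direction for a fix, sketched here with a hypothetical helper name (this is not the project's actual patch): detect the error-list shape before any attribute access, and surface the rate-limit failure explicitly instead of letting `response.usage` raise downstream.

```python
from typing import Any


def handle_completion(response: Any) -> Any:
    """Guard against the error-list shape returned when backoff is exhausted.

    Hypothetical sketch, not the actual crawl4ai code path.
    """
    if isinstance(response, list):
        # Backoff exhausted: raise with the rate-limit details instead of
        # attempting `response.usage` on a list.
        raise RuntimeError(f"LLM completion failed after retries: {response}")
    return response.usage
```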