[Bug]: LLMExtractionStrategy ratelimit results in no attribute usage #989

@stevenh

Description

crawl4ai version

0.5.0.post8

Expected Behavior

If rate limit is hit the user should be informed

Current Behavior

When the rate limit is still being hit after all retries, perform_completion_with_backoff returns a list, which LLMExtractionStrategy.extract does not handle. extract then tries to access the usage field, which does not exist on a list, so extracted_content ends up as:

[
    {
        "index": 0,
        "error": true,
        "tags": [
            "error"
        ],
        "content": "'list' object has no attribute 'usage'"
    }
]
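The failure mode above could be avoided with a guard before touching usage. A minimal sketch (the function name and error type here are hypothetical; the real fix belongs inside crawl4ai's extraction path):

```python
from typing import Any


def usage_or_error(response: Any) -> dict[str, Any]:
    """Guard sketch: only read .usage when the completion actually succeeded.

    perform_completion_with_backoff normally returns a response object with a
    .usage attribute, but after exhausting its retries on a rate limit it
    returns a list instead, so callers must check before accessing the field.
    """
    if isinstance(response, list):
        # Retries exhausted: surface a clear rate-limit error instead of the
        # opaque AttributeError ("'list' object has no attribute 'usage'").
        raise RuntimeError("LLM completion failed: rate limit retries exhausted")
    return {"usage": response.usage}
```

With a check like this the user would see an explicit rate-limit failure rather than an error block buried in extracted_content.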

Is this reproducible?

Yes

Inputs Causing the Bug

Any request that is rate limited for more than 2 retries.

Steps to Reproduce

Perform a crawl using LLMExtractionStrategy.

Code snippets

"""Test LLM extraction strategy for job postings."""

import json
import logging
import os
import sys
from typing import TYPE_CHECKING, Any

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.async_configs import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
import pytest

if TYPE_CHECKING:
    from crawl4ai.models import CrawlResult

_LOGGER: logging.Logger = logging.getLogger(__name__)


class JobRequirement(BaseModel):
    """Schema for job requirements."""

    category: str = Field(
        description="Category of the requirement (e.g., Technical, Soft Skills)",
    )
    items: list[str] = Field(
        description="List of specific requirements in this category",
    )
    priority: str = Field(
        description="Priority level (Required/Preferred) based on the HTML class or context",
    )


class JobPosting(BaseModel):
    """Schema for job postings."""

    title: str = Field(description="Job title")
    department: str = Field(description="Department or team")
    location: str = Field(description="Job location, including remote options")
    salary_range: str | None = Field(description="Salary range if specified")
    requirements: list[JobRequirement] = Field(
        description="Categorized job requirements",
    )
    application_deadline: str | None = Field(
        description="Application deadline if specified",
    )
    contact_info: dict | None = Field(
        description="Contact information from footer or contact section",
    )


@pytest.mark.asyncio
async def test_llm_extraction() -> None:
    """Crawl job postings and extract details."""
    api_key: str | None = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        msg: str = "OPENAI_API_KEY environment variable not set"
        raise ValueError(msg)

    browser_config: BrowserConfig = BrowserConfig(
        verbose=False,
        extra_args=[
            "--disable-gpu",
            "--disable-dev-shm-usage",
            "--no-sandbox",
        ],
    )

    extraction_strategy: LLMExtractionStrategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o",
            api_token=api_key,
        ),
        schema=JobPosting.model_json_schema(),
        extraction_type="schema",
        instruction="""
        Extract job posting details, using HTML structure to:
        1. Identify requirement priorities from CSS classes (e.g., 'required' vs 'preferred')
        2. Extract contact info from the page footer or dedicated contact section
        3. Parse salary information from specially formatted elements
        4. Determine application deadline from timestamp or date elements

        Use HTML attributes and classes to enhance extraction accuracy.
        """,
        input_format="html",
        # chunk_token_threshold=chunk_token_threshold,
    )

    config: CrawlerRunConfig = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True,
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result: CrawlResult
        async for result in await crawler.arun_many(
            urls=[
                "https://www.rocketscience.gg/careers/c77fbdec-fce6-44f1-a05e-8cd76325a1a0/",
            ],
            config=config,
        ):
            assert result.success
            assert result.extracted_content
            extracted_content: list[dict[str, Any]] = json.loads(result.extracted_content)
            assert len(extracted_content) == 1


if __name__ == "__main__":
    import subprocess

    sys.exit(subprocess.call(["pytest", *sys.argv[1:], sys.argv[0]]))  # noqa: S603, S607
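Until the strategy raises on failure, callers can at least detect the inline error blocks before treating extracted_content as valid data. A hedged sketch of such a check (the helper name is made up for illustration; the JSON shape matches the payload shown above):

```python
import json


def has_extraction_error(extracted_content: str) -> bool:
    """Return True if any extracted block is flagged with error=true.

    LLMExtractionStrategy currently reports failures inline in the JSON it
    returns (see the payload earlier in this issue) rather than raising, so
    callers must inspect each block themselves.
    """
    blocks = json.loads(extracted_content)
    return any(block.get("error") for block in blocks)
```

In the test above, this check could run right after `json.loads(result.extracted_content)` to fail fast instead of asserting on an error payload.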

OS

macOS

Python version

3.12.9

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

platform darwin -- Python 3.12.9, pytest-8.3.5, pluggy-1.5.0
rootdir: xxx
configfile: pyproject.toml
plugins: anyio-4.9.0, logfire-3.12.0, pytest_httpserver-1.1.3, asyncio-0.26.0, mock-3.14.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 1 item

tests/test_extraction.py [FETCH]... ↓ http://localhost:51368/engineering-manager... | Status: True | Time: 0.74s
[SCRAPE].. ◆ http://localhost:51368/engineering-manager... | Time: 0.096s

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
Waiting for 2 seconds before retrying...

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
Waiting for 4 seconds before retrying...

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
[EXTRACT]. ■ Completed for http://localhost:51368/engineering-manager... | Time: 14.378983207978308s
[COMPLETE] ● http://localhost:51368/engineering-manager... | Status: True | Total: 15.22s

Metadata

Labels

⚙ Done — Bug fix, enhancement, FR that's completed pending release
🐞 Bug — Something isn't working
📌 Root caused — identified the root cause of bug
