
Conversation

@Akeemkabiru Akeemkabiru commented Sep 17, 2025

This PR introduces Firecrawl as an optional backend for crawl4ai.

Updates

  • Added FirecrawlBackend wrapper around Firecrawl’s SDK.
  • Extended CLI with --backend option (default | firecrawl).
  • Enabled output in multiple formats (json, markdown, markdown-fit).
  • Added a standalone script (firecrawl_backend.py) showing how to use Firecrawl programmatically.
  • Updated README with installation and usage instructions for Firecrawl.

Example usage: `python -m crawl4ai.cli crawl https://docs.firecrawl.dev --backend firecrawl --output markdown`


coderabbitai bot commented Sep 17, 2025

Walkthrough

Adds a Firecrawl backend across docs and CLI: README documents Firecrawl and pre-release flows; CLI gains a --backend option with Firecrawl-specific result handling; introduces a FirecrawlBackend wrapper; and adds a demo script. The CLI and default entry points propagate the backend parameter. API keys are hard-coded in examples.

Changes

  • Documentation updates (README.md): Documents Firecrawl backend and usage; expands release notes; adds pre-release install and diagnostics; updates mission/roadmap; introduces “Text Attribution”; formatting and sample edits.
  • CLI integration (crawl4ai/cli.py): Adds a --backend selector (default, firecrawl); updates the default and crawl_cmd signatures to accept and forward backend; integrates the FirecrawlBackend path with extractor and output modes (json/markdown/markdown-fit); early-return flow for Firecrawl; imports the backend.
  • Backend implementation (crawl4ai/firecrawl_backend.py): New FirecrawlBackend wrapper exposing crawl, scrape, and search, delegating to the Firecrawl client.
  • Demo script (firecrawl_demo.py): Example script instantiating FirecrawlBackend with an API key, scraping a sample URL, and printing the results.
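The wrapper-plus-demo pattern summarized above can be sketched with an injectable client, which also makes it testable without network access or the Firecrawl SDK. Note this is purely illustrative: `FirecrawlBackendSketch` and `FakeClient` are hypothetical names, and the real `FirecrawlBackend` delegates to the SDK's own client rather than accepting one.

```python
import os


class FirecrawlBackendSketch:
    """Illustrative stand-in for crawl4ai's FirecrawlBackend wrapper."""

    def __init__(self, client, api_key=None):
        # In the real wrapper, the client is built from the Firecrawl SDK
        # with the API key; accepting a pre-built client keeps this sketch
        # runnable offline.
        self.api_key = api_key or os.getenv("FIRECRAWL_API_KEY", "")
        self.client = client

    def crawl(self, url, limit=10):
        # Thin delegation, mirroring the wrapper's crawl method.
        return self.client.crawl(url=url, limit=limit)

    def scrape(self, url, formats=None):
        # Default formats mirror the ["markdown", "html"] default noted
        # in the review below.
        return self.client.scrape(url=url, formats=formats or ["markdown", "html"])


class FakeClient:
    """Offline stand-in that echoes the calls the wrapper would make."""

    def crawl(self, url, limit):
        return [{"title": "Home", "content": f"crawled {url} (limit={limit})"}]

    def scrape(self, url, formats):
        return {"url": url, "formats": formats}


backend = FirecrawlBackendSketch(client=FakeClient())
print(backend.scrape("https://docs.firecrawl.dev"))
```

Injecting the client like this is what makes the wrapper unit-testable; the real class could gain the same property by making the SDK client an optional constructor argument.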

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant C as CLI (crawl4ai/cli.py)
  participant S as Backend Selector
  participant F as FirecrawlBackend
  participant A as Firecrawl API
  participant O as Output Handler

  U->>C: crawl --url ... --backend firecrawl --output ...
  C->>S: Select backend
  alt backend = firecrawl
    S->>F: init(api_key)
    C->>F: crawl(url)
    F->>A: crawl(url, limit)
    A-->>F: documents
    F-->>C: documents
    C->>O: format (json/markdown/md-fit)
    O-->>U: print/return
  else backend = default
    C->>C: run standard crawling flow
    C->>O: format and output
    O-->>U: print/return
  end
  note over C,O: Firecrawl path may early-return after output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I twitch my whiskers—new trails in sight,
Firecrawl footprints by moonlit byte,
A toggle flips, two paths now run,
JSON crumbs, markdown sun.
I nibble docs with careful cheer—
Hop, crawl, print—our warren’s clear! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check — ⚠️ Warning. The PR description contains a short "Description" and usage example but does not follow the repository's required template: it is missing the "Summary" section (with linked issue numbers if applicable), the "List of files changed and why", the "How Has This Been Tested?" section detailing tests and results, and the Checklist with completed checkboxes. Resolution: update the description to match the template — add a concise Summary (linking any related issues), a file-by-file list explaining why each change was made, a "How Has This Been Tested?" section listing commands and test results or logs, and complete the Checklist items or explain why any are not applicable. Include relevant test output or CI results and mention whether unit or integration tests were added or updated.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 22.22%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check — ✅ Passed. The title directly and accurately summarizes the primary change — adding Firecrawl backend support to the crawler/CLI. It is concise and focused on the main change rather than listing files or unrelated details. The "feat/add" prefix with a slash is a minor stylistic quirk but does not make the title misleading or unclear.
✨ Finishing touches
  • 📝 Generate Docstrings
  • 🧪 Generate unit tests
      • Create PR with unit tests
      • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

839-856: Attribution language conflicts with Apache-2.0

Apache-2.0 does not require attribution badges. “Must include” is inconsistent with “attribution is recommended.” Reword to “recommended/optional.”

-This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](...)
+This project is licensed under the Apache License 2.0. Attribution is appreciated (see badges below). See the [Apache 2.0 License](...)
@@
-When using Crawl4AI, you must include one of the following attribution methods:
+When using Crawl4AI, you may include one of the following attribution methods (optional, appreciated):
🧹 Nitpick comments (9)
README.md (3)

548-559: Document API key setup and fix code fence

Add how to supply the Firecrawl API key and use a triple‑backtick fence. Also keep CLI style consistent.

-### Firecrawl Backend Support
-
-A new backend has been added to allow crawling and scraping via [Firecrawl](https://firecrawl.dev).
-
-#### CLI Usage
-
-You can now select the Firecrawl backend with the `--backend firecrawl` option:
-
-````bash
-crwl crawl https://docs.firecrawl.dev --backend firecrawl --output markdown
-````
+### Firecrawl Backend Support
+
+A new backend allows crawling/scraping via [Firecrawl](https://firecrawl.dev).
+
+Before using it, set your API key (or configure via `crwl config`):
+
+```bash
+export FIRECRAWL_API_KEY="your-key"
+```
+
+#### CLI Usage
+
+Select the Firecrawl backend with:
+
+```bash
+crwl crawl https://docs.firecrawl.dev --backend firecrawl --output markdown
+```

834-838: Remove stray authoring text from README

These lines look like internal/editor notes and should not be in the README.

-I'll help modify the license section with badges. For the halftone effect, here's a version with it:
-
-Here's the updated license section:

932-936: Add language to fenced code block (mdlint MD040)

Specify a language to satisfy linters and improve rendering.

-```
+```text
 UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software].
 GitHub. https://github.com/unclecode/crawl4ai

crawl4ai/firecrawl_backend.py (3)

1-1: Guard optional dependency import with a clear error

Provide a helpful error if the `firecrawl` package isn’t installed.



-from firecrawl import Firecrawl
+try:
+    from firecrawl import Firecrawl
+except Exception as e:
+    raise ImportError(
+        "Missing optional dependency 'firecrawl'. Install with: pip install firecrawl"
+    ) from e

4-6: Allow env fallback for API key and validate

Small ergonomics improvement and safer defaults.

-class FirecrawlBackend:
-    def __init__(self, api_key: str):
-        self.client = Firecrawl(api_key=api_key)
+class FirecrawlBackend:
+    def __init__(self, api_key: str | None = None):
+        import os
+        api_key = api_key or os.getenv("FIRECRAWL_API_KEY")
+        if not api_key:
+            raise ValueError("Firecrawl API key is required. Set FIRECRAWL_API_KEY or pass api_key=")
+        self.client = Firecrawl(api_key=api_key)
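The fallback-and-validate idea in the suggestion above can also be factored into a small standalone helper. `resolve_api_key` is a hypothetical name, shown only to illustrate the precedence: an explicit argument wins, then the environment, otherwise fail fast.

```python
import os


def resolve_api_key(explicit=None, env_var="FIRECRAWL_API_KEY"):
    """Prefer an explicitly passed key, then the environment; fail fast otherwise."""
    api_key = explicit or os.getenv(env_var)
    if not api_key:
        raise ValueError(
            f"Firecrawl API key is required. Set {env_var} or pass api_key="
        )
    return api_key
```

Letting an explicit argument override the environment keeps tests and one-off scripts able to supply their own key without touching global configuration.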

10-12: Expose formats as a parameter (preserve current default)

Makes the wrapper flexible without changing behavior.

-    def scrape(self, url: str):
-        return self.client.scrape(url=url, formats=["markdown", "html"])
+    def scrape(self, url: str, formats: list[str] | None = None):
+        return self.client.scrape(url=url, formats=formats or ["markdown", "html"])
crawl4ai/cli.py (3)

1006-1023: Optional: Add a flag to pass the Firecrawl API key via CLI (env still preferred)

Keeps secrets out of code and supports CI usage.

 @cli.command("crawl")
 @click.argument("url", required=True)
 @click.option(
     "--backend",
     type=click.Choice(["default", "firecrawl"]),
     default="default",
     help="Choose crawling backend"
 )
+@click.option(
+    "--firecrawl-api-key",
+    envvar="FIRECRAWL_API_KEY",
+    help="Firecrawl API key (or set FIRECRAWL_API_KEY env var)"
+)
@@
-def crawl_cmd(url: str, browser_config: str, crawler_config: str, filter_config: str, 
-           extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
-           output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str, deep_crawl: str, max_pages: int,  backend: str,):
+def crawl_cmd(url: str, browser_config: str, crawler_config: str, filter_config: str, 
+           extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
+           output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str,
+           deep_crawl: str, max_pages: int, backend: str, firecrawl_api_key: Optional[str] = None):

And inside the Firecrawl block above, prefer `api_key = firecrawl_api_key or os.getenv("FIRECRAWL_API_KEY")`.


1261-1267: Firecrawl path doesn’t support -q/--question yet

If intended, document that Q&A is only supported for the default backend; otherwise, wire the Firecrawl result’s markdown into stream_llm_response.


1211-1211: Non-Firecrawl path forces LXML scraping; confirm compatibility

Setting crawler_cfg.scraping_strategy = LXMLWebScrapingStrategy() unconditionally may undo user defaults/config. Consider honoring config unless explicitly overridden.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e651e04 and eb6892c.

📒 Files selected for processing (4)
  • README.md (28 hunks)
  • crawl4ai/cli.py (5 hunks)
  • crawl4ai/firecrawl_backend.py (1 hunks)
  • firecrawl_demo.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
firecrawl_demo.py (1)
crawl4ai/firecrawl_backend.py (2)
  • FirecrawlBackend (3-14)
  • scrape (10-11)
crawl4ai/cli.py (1)
crawl4ai/firecrawl_backend.py (2)
  • FirecrawlBackend (3-14)
  • crawl (7-8)
🪛 Gitleaks (8.27.2)
firecrawl_demo.py

[high] 5-5: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

crawl4ai/cli.py

[high] 1040-1040: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🪛 markdownlint-cli2 (0.17.2)
README.md

933-933: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (2)
crawl4ai/cli.py (2)

1008-1012: Good: backend selector added to CLI

The --backend {default,firecrawl} option is clear and scoped.

Consider also adding this option to the default entrypoint (the command with name="") if you intend to support crwl <url> --backend firecrawl. Otherwise, only crwl crawl ... --backend firecrawl will work (which is fine if intentional).


1510-1511: Good: backend propagated from default entrypoint

Forwarding the backend keeps behavior consistent when invoking via the default route.

If you want crwl <url> --backend firecrawl to work, add the --backend option to the default command too (currently only the crawl subcommand defines it).

Comment on lines +1038 to +1078
    # Firecrawl
    if  backend == "firecrawl":
       firecrawl_client = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
    docs = firecrawl_client.crawl(url)

    if not docs:
        click.echo("No documents returned by Firecrawl")
        return

    def extract_item(item):
        if isinstance(item, dict):
            return item.get("title", ""), item.get("content", "")
        elif isinstance(item, (tuple, list)) and len(item) == 2:
            return item[0], item[1]
        else:
            return "", str(item)

    if output in ["all", "json"]:
        import json
        click.echo(json.dumps(docs, indent=2))

    elif output in ["markdown", "md"]:
        markdown_text = ""
        for item in docs:
            title, content = extract_item(item)
            markdown_text += f"# {title}\n\n{content}\n\n"
        click.echo(markdown_text)

    elif output in ["markdown-fit", "md-fit"]:
        markdown_text = ""
        max_chars = 2000
        for item in docs:
            title, content = extract_item(item)
            combined = f"# {title}\n\n{content}\n\n"
            if len(markdown_text) + len(combined) > max_chars:
                break
            markdown_text += combined
        click.echo(markdown_text)

        return
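The `extract_item` normalization and the `markdown-fit` character budget in the quoted block are pure functions of their inputs, so they can be exercised without running a crawl. The sketch below restates them standalone for illustration; `render_markdown_fit` is a hypothetical helper name, and `MAX_CHARS` mirrors the block's `max_chars = 2000`.

```python
MAX_CHARS = 2000  # same budget the CLI's markdown-fit branch uses


def extract_item(item):
    """Normalize a Firecrawl result item to a (title, content) pair."""
    if isinstance(item, dict):
        return item.get("title", ""), item.get("content", "")
    elif isinstance(item, (tuple, list)) and len(item) == 2:
        return item[0], item[1]
    return "", str(item)


def render_markdown_fit(docs, max_chars=MAX_CHARS):
    """Concatenate '# title' sections, stopping before the budget is exceeded."""
    markdown_text = ""
    for item in docs:
        title, content = extract_item(item)
        combined = f"# {title}\n\n{content}\n\n"
        if len(markdown_text) + len(combined) > max_chars:
            break
        markdown_text += combined
    return markdown_text


# A short doc fits; a 5000-char doc would blow the budget and is dropped.
docs = [{"title": "A", "content": "x" * 50}, ("B", "y" * 5000)]
out = render_markdown_fit(docs)
```

Isolating the logic this way also makes the truncation behavior (whole sections are dropped, never split mid-section) easy to verify in a unit test.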

⚠️ Potential issue

Firecrawl block: hard-coded API key, NameError risk, output flow bypass, and missing limit propagation

  • Hard-coded secret.
  • docs = firecrawl_client.crawl(url) is unindented (executes even when backend != firecrawl) → NameError.
  • Firecrawl path ignores output_file and doesn’t return for most formats, so the default crawler still runs.
  • Doesn’t pass max_pages to Firecrawl limit.
- # Firecrawl
-    if  backend == "firecrawl":
-       firecrawl_client = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
-    docs = firecrawl_client.crawl(url)
-
-    if not docs:
-        click.echo("No documents returned by Firecrawl")
-        return
+    # Firecrawl
+    if backend == "firecrawl":
+        import os
+        api_key = os.getenv("FIRECRAWL_API_KEY")
+        if not api_key:
+            raise click.ClickException(
+                "Missing FIRECRAWL_API_KEY. Set the env var or run: crwl config set FIRECRAWL_API_KEY <key>"
+            )
+        firecrawl_client = FirecrawlBackend(api_key=api_key)
+        # Propagate max_pages as crawl limit when provided
+        docs = firecrawl_client.crawl(url, limit=max_pages or 10)
+        if not docs:
+            click.echo("No documents returned by Firecrawl")
+            return
+
+        # Normalize return shape to a list of (title, content)
         def extract_item(item):
             if isinstance(item, dict):
-            return item.get("title", ""), item.get("content", "")
+                title = item.get("title") or item.get("metadata", {}).get("title") or ""
+                content = item.get("content") or item.get("markdown") or item.get("html") or ""
+                return title, content
             elif isinstance(item, (tuple, list)) and len(item) == 2:
                 return item[0], item[1]
             else:
                 return "", str(item)
 
-    if output in ["all", "json"]:
-        import json
-        click.echo(json.dumps(docs, indent=2))
+        # If Firecrawl returns an object with 'data', unwrap it
+        items = docs.get("data", docs) if isinstance(docs, dict) else docs
+
+        if output in ["all", "json"]:
+            payload = docs if output == "all" else items
+            text = json.dumps(payload, indent=2)
+            if output_file:
+                Path(output_file).write_text(text, encoding="utf-8")
+            else:
+                click.echo(text)
+            return
 
-    elif output in ["markdown", "md"]:
-        markdown_text = ""
-        for item in docs:
-            title, content = extract_item(item)
-            markdown_text += f"# {title}\n\n{content}\n\n"
-        click.echo(markdown_text)
+        elif output in ["markdown", "md"]:
+            markdown_text = ""
+            for item in items:
+                title, content = extract_item(item)
+                if title:
+                    markdown_text += f"# {title}\n\n{content}\n\n"
+                else:
+                    markdown_text += f"{content}\n\n"
+            if output_file:
+                Path(output_file).write_text(markdown_text, encoding="utf-8")
+            else:
+                click.echo(markdown_text)
+            return
 
-    elif output in ["markdown-fit", "md-fit"]:
-        markdown_text = ""
-        max_chars = 2000
-        for item in docs:
-            title, content = extract_item(item)
-            combined = f"# {title}\n\n{content}\n\n"
-            if len(markdown_text) + len(combined) > max_chars:
-                break
-            markdown_text += combined
-        click.echo(markdown_text)
-
-        return
+        elif output in ["markdown-fit", "md-fit"]:
+            markdown_text = ""
+            max_chars = 2000
+            for item in items:
+                title, content = extract_item(item)
+                combined = (f"# {title}\n\n{content}\n\n" if title else f"{content}\n\n")
+                if len(markdown_text) + len(combined) > max_chars:
+                    break
+                markdown_text += combined
+            if output_file:
+                Path(output_file).write_text(markdown_text, encoding="utf-8")
+            else:
+                click.echo(markdown_text)
+            return

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Gitleaks (8.27.2)

[high] 1040-1040: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents
In crawl4ai/cli.py around lines 1038-1078, the Firecrawl block currently
hard-codes an API key, instantiates/uses firecrawl_client outside the backend
check (risking NameError), doesn't pass max_pages to the crawl limit, and fails
to respect output_file or return after handling outputs so execution falls
through to the default crawler; fix by removing the hard-coded secret and read
the API key from configuration or env var, ensure firecrawl_client is created
and docs = firecrawl_client.crawl(url, limit=max_pages) is executed only inside
the if backend == "firecrawl" block (proper indentation), after producing output
handle writing to output_file when provided (open+write) and always return after
processing any output format to prevent the rest of the function from running.

Comment on lines +5 to +7
    backend = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
    docs = backend.scrape("https://docs.firecrawl.dev")
    print(docs)

@coderabbitai coderabbitai bot Sep 17, 2025


⚠️ Potential issue

Hard-coded API key committed (secret leak) — remove and rotate immediately

Real-looking key present. Replace with env/config; revoke/rotate the exposed key now.

-from crawl4ai.firecrawl_backend import FirecrawlBackend
+import os
+from crawl4ai.firecrawl_backend import FirecrawlBackend
@@
 def main():
-    backend = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
+    api_key = os.getenv("FIRECRAWL_API_KEY")
+    if not api_key:
+        raise SystemExit("FIRECRAWL_API_KEY is not set")
+    backend = FirecrawlBackend(api_key=api_key)
     docs = backend.scrape("https://docs.firecrawl.dev")
     print(docs)

Action items:

  • Revoke/rotate the leaked key at the Firecrawl provider.
  • Force-push a commit removing the key from git history if necessary (or use GitHub secret scanning remediation).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    backend = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
-    docs = backend.scrape("https://docs.firecrawl.dev")
-    print(docs)
+import os
+from crawl4ai.firecrawl_backend import FirecrawlBackend
+
+def main():
+    api_key = os.getenv("FIRECRAWL_API_KEY")
+    if not api_key:
+        raise SystemExit("FIRECRAWL_API_KEY is not set")
+    backend = FirecrawlBackend(api_key=api_key)
+    docs = backend.scrape("https://docs.firecrawl.dev")
+    print(docs)
🧰 Tools
🪛 Gitleaks (8.27.2)

[high] 5-5: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents
In firecrawl_demo.py around lines 5 to 7, a real API key is hard-coded which
leaks secrets; remove the literal key, rotate/revoke it with the Firecrawl
provider immediately, and replace usage with a secure retrieval (e.g., read from
an environment variable or config secret manager); update code to read
os.environ["FIRECRAWL_API_KEY"] (or equivalent) and fail fast with a clear error
if missing, then purge the key from git history or force-push a commit after
rotating the key to ensure it is not stored in the repository.

yikes @Akeemkabiru

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

