feat/add Firecrawl backend support to crawler #1497
base: main
Conversation
Walkthrough

Adds a Firecrawl backend across docs and CLI: the README documents Firecrawl and pre-release flows; the CLI gains a --backend option with Firecrawl-specific result handling; and a FirecrawlBackend wrapper (crawl4ai/firecrawl_backend.py) plus a demo script (firecrawl_demo.py) are introduced.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor U as User
    participant C as CLI (crawl4ai/cli.py)
    participant S as Backend Selector
    participant F as FirecrawlBackend
    participant A as Firecrawl API
    participant O as Output Handler
    U->>C: crawl --url ... --backend firecrawl --output ...
    C->>S: Select backend
    alt backend = firecrawl
        S->>F: init(api_key)
        C->>F: crawl(url)
        F->>A: crawl(url, limit)
        A-->>F: documents
        F-->>C: documents
        C->>O: format (json/markdown/md-fit)
        O-->>U: print/return
    else backend = default
        C->>C: run standard crawling flow
        C->>O: format and output
        O-->>U: print/return
    end
    note over C,O: Firecrawl path may early-return after output
```
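For orientation, the wrapper the diagram refers to looks roughly like the sketch below, reconstructed from the review annotations further down (FirecrawlBackend at lines 3-14 of crawl4ai/firecrawl_backend.py, crawl at 7-8, scrape at 10-11); the exact signatures, particularly the limit parameter, are assumptions to verify against the PR diff.

```python
# Reconstructed sketch of crawl4ai/firecrawl_backend.py, not the authoritative diff.
# The `limit` parameter mirrors the sequence diagram and the suggested fix below;
# the committed code may differ.
from firecrawl import Firecrawl


class FirecrawlBackend:
    def __init__(self, api_key: str):
        self.client = Firecrawl(api_key=api_key)

    def crawl(self, url: str, limit: int = 10):
        # Multi-page crawl via the Firecrawl API.
        return self.client.crawl(url=url, limit=limit)

    def scrape(self, url: str):
        # Single-page scrape returning markdown and HTML renditions.
        return self.client.scrape(url=url, formats=["markdown", "html"])
```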
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
README.md (1)
839-856: Attribution language conflicts with Apache-2.0

Apache-2.0 does not require attribution badges. “Must include” is inconsistent with “attribution is recommended.” Reword to “recommended/optional.”

```diff
-This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](...)
+This project is licensed under the Apache License 2.0. Attribution is appreciated (see badges below). See the [Apache 2.0 License](...)
@@
-When using Crawl4AI, you must include one of the following attribution methods:
+When using Crawl4AI, you may include one of the following attribution methods (optional, appreciated):
```
🧹 Nitpick comments (9)
README.md (3)
548-559: Document API key setup and fix code fence

Add how to supply the Firecrawl API key and use a triple‑backtick fence. Also keep CLI style consistent.

`````diff
-### Firecrawl Backend Support
-
-A new backend has been added to allow crawling and scraping via [Firecrawl](https://firecrawl.dev).
-
-#### CLI Usage
-
-You can now select the Firecrawl backend with the `--backend firecrawl` option:
-
-````bash
-crwl crawl https://docs.firecrawl.dev --backend firecrawl --output markdown
-````
+### Firecrawl Backend Support
+
+A new backend allows crawling/scraping via [Firecrawl](https://firecrawl.dev).
+
+Before using it, set your API key (or configure via `crwl config`):
+
+```bash
+export FIRECRAWL_API_KEY="your-key"
+```
+
+#### CLI Usage
+
+Select the Firecrawl backend with:
+
+```bash
+crwl crawl https://docs.firecrawl.dev --backend firecrawl --output markdown
+```
`````
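As a follow-up, the key could also be stored through the CLI config rather than the shell profile; the command form below mirrors the error-message text suggested later in this review and should be checked against `crwl config --help`:

```bash
# Alternative to exporting the env var (command form assumed from the review text)
crwl config set FIRECRAWL_API_KEY your-key
```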
834-838: Remove stray authoring text from README

These lines look like internal/editor notes and should not be in the README.

```diff
-I'll help modify the license section with badges. For the halftone effect, here's a version with it:
-
-Here's the updated license section:
```
932-936: Add language to fenced code block (mdlint MD040)

Specify a language to satisfy linters and improve rendering.

````diff
-```
+```text
 UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software]. GitHub. https://github.com/unclecode/crawl4ai
````

crawl4ai/firecrawl_backend.py (3)

1-1: Guard optional dependency import with a clear error

Provide a helpful error if the `firecrawl` package isn’t installed.

```diff
-from firecrawl import Firecrawl
+try:
+    from firecrawl import Firecrawl
+except Exception as e:
+    raise ImportError(
+        "Missing optional dependency 'firecrawl'. Install with: pip install firecrawl"
+    ) from e
```
4-6: Allow env fallback for API key and validate

Small ergonomics improvement and safer defaults.

```diff
-class FirecrawlBackend:
-    def __init__(self, api_key: str):
-        self.client = Firecrawl(api_key=api_key)
+class FirecrawlBackend:
+    def __init__(self, api_key: str | None = None):
+        import os
+        api_key = api_key or os.getenv("FIRECRAWL_API_KEY")
+        if not api_key:
+            raise ValueError("Firecrawl API key is required. Set FIRECRAWL_API_KEY or pass api_key=")
+        self.client = Firecrawl(api_key=api_key)
```
10-12: Expose formats as a parameter (preserve current default)

Makes the wrapper flexible without changing behavior.

```diff
-    def scrape(self, url: str):
-        return self.client.scrape(url=url, formats=["markdown", "html"])
+    def scrape(self, url: str, formats: list[str] | None = None):
+        return self.client.scrape(url=url, formats=formats or ["markdown", "html"])
```
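If both nitpicks above land (env-var fallback in `__init__` plus the optional `formats` argument), usage would look roughly like this sketch:

```python
# Sketch assuming the proposed env-fallback constructor and `formats` parameter;
# requires FIRECRAWL_API_KEY to be set in the environment.
from crawl4ai.firecrawl_backend import FirecrawlBackend

backend = FirecrawlBackend()  # picks up FIRECRAWL_API_KEY
page = backend.scrape("https://docs.firecrawl.dev", formats=["markdown"])
print(page)
```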
crawl4ai/cli.py (3)

1006-1023: Optional: Add a flag to pass the Firecrawl API key via CLI (env still preferred)

Keeps secrets out of code and supports CI usage.

```diff
 @cli.command("crawl")
 @click.argument("url", required=True)
 @click.option(
     "--backend",
     type=click.Choice(["default", "firecrawl"]),
     default="default",
     help="Choose crawling backend"
 )
+@click.option(
+    "--firecrawl-api-key",
+    envvar="FIRECRAWL_API_KEY",
+    help="Firecrawl API key (or set FIRECRAWL_API_KEY env var)"
+)
@@
-def crawl_cmd(url: str, browser_config: str, crawler_config: str, filter_config: str,
-              extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
-              output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str, deep_crawl: str, max_pages: int, backend: str,):
+def crawl_cmd(url: str, browser_config: str, crawler_config: str, filter_config: str,
+              extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
+              output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str,
+              deep_crawl: str, max_pages: int, backend: str, firecrawl_api_key: Optional[str] = None):
```

And inside the Firecrawl block above, prefer `api_key = firecrawl_api_key or os.getenv("FIRECRAWL_API_KEY")`.
1261-1267: Firecrawl path doesn’t support `-q`/`--question` yet

If intended, document that Q&A is only supported for the default backend; otherwise, wire the Firecrawl result’s markdown into `stream_llm_response`.
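If Q&A support is wanted, the wiring could be as small as the sketch below; the `stream_llm_response` signature shown is an assumption and must be checked against cli.py before adopting it.

```python
# Hypothetical wiring inside the Firecrawl branch; assumes stream_llm_response
# accepts (content, question) -- verify the real signature in crawl4ai/cli.py.
if question:
    content = "\n\n".join(
        (item.get("markdown") or item.get("content") or "") if isinstance(item, dict) else str(item)
        for item in docs
    )
    stream_llm_response(content, question)
    return
```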
1211-1211: Non-Firecrawl path forces LXML scraping; confirm compatibility

Setting `crawler_cfg.scraping_strategy = LXMLWebScrapingStrategy()` unconditionally may undo user defaults/config. Consider honoring the config unless explicitly overridden.
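One possible shape for honoring user config, as a sketch; it assumes an unset strategy is represented as None, which should be confirmed against CrawlerRunConfig's defaults:

```python
# Only force LXML when the user/config has not already chosen a scraping strategy.
# Assumption: an unset strategy is None; adjust if the config uses another sentinel.
if getattr(crawler_cfg, "scraping_strategy", None) is None:
    crawler_cfg.scraping_strategy = LXMLWebScrapingStrategy()
```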
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- README.md (28 hunks)
- crawl4ai/cli.py (5 hunks)
- crawl4ai/firecrawl_backend.py (1 hunks)
- firecrawl_demo.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
firecrawl_demo.py (1)
crawl4ai/firecrawl_backend.py (2)
FirecrawlBackend (3-14), scrape (10-11)
crawl4ai/cli.py (1)
crawl4ai/firecrawl_backend.py (2)
FirecrawlBackend (3-14), crawl (7-8)
🪛 Gitleaks (8.27.2)
firecrawl_demo.py
[high] 5-5: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
crawl4ai/cli.py
[high] 1040-1040: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🪛 markdownlint-cli2 (0.17.2)
README.md
933-933: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (2)
crawl4ai/cli.py (2)
1008-1012: Good: backend selector added to CLI

The `--backend {default,firecrawl}` option is clear and scoped. Consider also adding this option to the default entrypoint (the command with `name=""`) if you intend to support `crwl <url> --backend firecrawl`. Otherwise, only `crwl crawl ... --backend firecrawl` will work (which is fine if intentional).
1510-1511: Good: backend propagated from default entrypoint

Forwarding the backend keeps behavior consistent when invoking via the default route. If you want `crwl <url> --backend firecrawl` to work, add the `--backend` option to the default command too (currently only the `crawl` subcommand defines it).
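A sketch of what mirroring the option might look like; the command and handler names below are placeholders, not the actual cli.py symbols:

```python
# Illustrative only: expose --backend on the default entrypoint so that
# `crwl <url> --backend firecrawl` behaves like `crwl crawl <url> --backend firecrawl`.
@cli.command(name="")
@click.argument("url", required=True)
@click.option(
    "--backend",
    type=click.Choice(["default", "firecrawl"]),
    default="default",
    help="Choose crawling backend",
)
def default_entry(url: str, backend: str):
    # Delegate to the same logic the `crawl` subcommand uses, forwarding `backend`.
    ...
```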
```python
    # Firecrawl
    if backend == "firecrawl":
        firecrawl_client = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
    docs = firecrawl_client.crawl(url)

    if not docs:
        click.echo("No documents returned by Firecrawl")
        return

    def extract_item(item):
        if isinstance(item, dict):
            return item.get("title", ""), item.get("content", "")
        elif isinstance(item, (tuple, list)) and len(item) == 2:
            return item[0], item[1]
        else:
            return "", str(item)

    if output in ["all", "json"]:
        import json
        click.echo(json.dumps(docs, indent=2))

    elif output in ["markdown", "md"]:
        markdown_text = ""
        for item in docs:
            title, content = extract_item(item)
            markdown_text += f"# {title}\n\n{content}\n\n"
        click.echo(markdown_text)

    elif output in ["markdown-fit", "md-fit"]:
        markdown_text = ""
        max_chars = 2000
        for item in docs:
            title, content = extract_item(item)
            combined = f"# {title}\n\n{content}\n\n"
            if len(markdown_text) + len(combined) > max_chars:
                break
            markdown_text += combined
        click.echo(markdown_text)

        return
```
Firecrawl block: hard-coded API key, NameError risk, output flow bypass, and missing limit propagation
- Hard-coded secret.
- `docs = firecrawl_client.crawl(url)` is unindented (executes even when backend != firecrawl) → NameError.
- Firecrawl path ignores `output_file` and doesn’t `return` for most formats, so the default crawler still runs.
- Doesn’t pass `max_pages` to Firecrawl `limit`.
```diff
- # Firecrawl
- if backend == "firecrawl":
- firecrawl_client = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
- docs = firecrawl_client.crawl(url)
-
- if not docs:
- click.echo("No documents returned by Firecrawl")
- return
+ # Firecrawl
+ if backend == "firecrawl":
+ import os
+ api_key = os.getenv("FIRECRAWL_API_KEY")
+ if not api_key:
+ raise click.ClickException(
+ "Missing FIRECRAWL_API_KEY. Set the env var or run: crwl config set FIRECRAWL_API_KEY <key>"
+ )
+ firecrawl_client = FirecrawlBackend(api_key=api_key)
+ # Propagate max_pages as crawl limit when provided
+ docs = firecrawl_client.crawl(url, limit=max_pages or 10)
+ if not docs:
+ click.echo("No documents returned by Firecrawl")
+ return
+
+ # Normalize return shape to a list of (title, content)
def extract_item(item):
if isinstance(item, dict):
- return item.get("title", ""), item.get("content", "")
+ title = item.get("title") or item.get("metadata", {}).get("title") or ""
+ content = item.get("content") or item.get("markdown") or item.get("html") or ""
+ return title, content
elif isinstance(item, (tuple, list)) and len(item) == 2:
return item[0], item[1]
else:
return "", str(item)
- if output in ["all", "json"]:
- import json
- click.echo(json.dumps(docs, indent=2))
+ # If Firecrawl returns an object with 'data', unwrap it
+ items = docs.get("data", docs) if isinstance(docs, dict) else docs
+
+ if output in ["all", "json"]:
+ payload = docs if output == "all" else items
+ text = json.dumps(payload, indent=2)
+ if output_file:
+ Path(output_file).write_text(text, encoding="utf-8")
+ else:
+ click.echo(text)
+ return
- elif output in ["markdown", "md"]:
- markdown_text = ""
- for item in docs:
- title, content = extract_item(item)
- markdown_text += f"# {title}\n\n{content}\n\n"
- click.echo(markdown_text)
+ elif output in ["markdown", "md"]:
+ markdown_text = ""
+ for item in items:
+ title, content = extract_item(item)
+ if title:
+ markdown_text += f"# {title}\n\n{content}\n\n"
+ else:
+ markdown_text += f"{content}\n\n"
+ if output_file:
+ Path(output_file).write_text(markdown_text, encoding="utf-8")
+ else:
+ click.echo(markdown_text)
+ return
- elif output in ["markdown-fit", "md-fit"]:
- markdown_text = ""
- max_chars = 2000
- for item in docs:
- title, content = extract_item(item)
- combined = f"# {title}\n\n{content}\n\n"
- if len(markdown_text) + len(combined) > max_chars:
- break
- markdown_text += combined
- click.echo(markdown_text)
-
- return
+ elif output in ["markdown-fit", "md-fit"]:
+ markdown_text = ""
+ max_chars = 2000
+ for item in items:
+ title, content = extract_item(item)
+ combined = (f"# {title}\n\n{content}\n\n" if title else f"{content}\n\n")
+ if len(markdown_text) + len(combined) > max_chars:
+ break
+ markdown_text += combined
+ if output_file:
+ Path(output_file).write_text(markdown_text, encoding="utf-8")
+ else:
+ click.echo(markdown_text)
+ return
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Gitleaks (8.27.2)
[high] 1040-1040: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🤖 Prompt for AI Agents
In crawl4ai/cli.py around lines 1038-1078, the Firecrawl block currently
hard-codes an API key, instantiates/uses firecrawl_client outside the backend
check (risking NameError), doesn't pass max_pages to the crawl limit, and fails
to respect output_file or return after handling outputs so execution falls
through to the default crawler; fix by removing the hard-coded secret and read
the API key from configuration or env var, ensure firecrawl_client is created
and docs = firecrawl_client.crawl(url, limit=max_pages) is executed only inside
the if backend == "firecrawl" block (proper indentation), after producing output
handle writing to output_file when provided (open+write) and always return after
processing any output format to prevent the rest of the function from running.
```python
backend = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
docs = backend.scrape("https://docs.firecrawl.dev")
print(docs)
```
Hard-coded API key committed (secret leak) — remove and rotate immediately
Real-looking key present. Replace with env/config; revoke/rotate the exposed key now.
```diff
-from crawl4ai.firecrawl_backend import FirecrawlBackend
+import os
+from crawl4ai.firecrawl_backend import FirecrawlBackend
@@
def main():
- backend = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
+ api_key = os.getenv("FIRECRAWL_API_KEY")
+ if not api_key:
+ raise SystemExit("FIRECRAWL_API_KEY is not set")
+ backend = FirecrawlBackend(api_key=api_key)
docs = backend.scrape("https://docs.firecrawl.dev")
    print(docs)
```

Action items:
- Revoke/rotate the leaked key at the Firecrawl provider.
- Force-push a commit removing the key from git history if necessary (or use GitHub secret scanning remediation).
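For the history cleanup, one common approach is a `git filter-repo` text replacement (sketch; assumes git-filter-repo is installed and the key has already been rotated):

```bash
# replacements.txt contains the leaked literal, one rule per line, e.g.:
#   fc-fa43e06d8c1348b58200a39911a4ae9c==>REMOVED
git filter-repo --replace-text replacements.txt
# Force-push the rewritten history (coordinate with other contributors first)
git push --force-with-lease origin main
```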
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-backend = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
-docs = backend.scrape("https://docs.firecrawl.dev")
-print(docs)
+import os
+from crawl4ai.firecrawl_backend import FirecrawlBackend
+
+def main():
+    api_key = os.getenv("FIRECRAWL_API_KEY")
+    if not api_key:
+        raise SystemExit("FIRECRAWL_API_KEY is not set")
+    backend = FirecrawlBackend(api_key=api_key)
+    docs = backend.scrape("https://docs.firecrawl.dev")
+    print(docs)
```
🧰 Tools
🪛 Gitleaks (8.27.2)
[high] 5-5: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🤖 Prompt for AI Agents
In firecrawl_demo.py around lines 5 to 7, a real API key is hard-coded which
leaks secrets; remove the literal key, rotate/revoke it with the Firecrawl
provider immediately, and replace usage with a secure retrieval (e.g., read from
an environment variable or config secret manager); update code to read
os.environ["FIRECRAWL_API_KEY"] (or equivalent) and fail fast with a clear error
if missing, then purge the key from git history or force-push a commit after
rotating the key to ensure it is not stored in the repository.
yikes @Akeemkabiru
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

This PR introduces Firecrawl as an optional backend for crawl4ai. It updates the README and the CLI to document and expose the new backend.

Example usage: `python -m crawl4ai.cli crawl https://docs.firecrawl.dev --backend firecrawl --output markdown`
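Assuming the review feedback lands and the key comes from the environment rather than source, the end-to-end flow would look roughly like:

```bash
# Sketch: key supplied via env var, never committed
export FIRECRAWL_API_KEY="<your-key>"
python -m crawl4ai.cli crawl https://docs.firecrawl.dev --backend firecrawl --output markdown
```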