feat/add Firecrawl backend support to crawler #1497
base: main
Conversation
Walkthrough

Adds a Firecrawl backend across docs and CLI: the README documents Firecrawl and pre-release flows; the CLI gains a --backend option with Firecrawl-specific result handling; and a FirecrawlBackend wrapper (crawl4ai/firecrawl_backend.py) plus a demo script (firecrawl_demo.py) are introduced.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor U as User
    participant C as CLI (crawl4ai/cli.py)
    participant S as Backend Selector
    participant F as FirecrawlBackend
    participant A as Firecrawl API
    participant O as Output Handler
    U->>C: crawl --url ... --backend firecrawl --output ...
    C->>S: Select backend
    alt backend = firecrawl
        S->>F: init(api_key)
        C->>F: crawl(url)
        F->>A: crawl(url, limit)
        A-->>F: documents
        F-->>C: documents
        C->>O: format (json/markdown/md-fit)
        O-->>U: print/return
    else backend = default
        C->>C: run standard crawling flow
        C->>O: format and output
        O-->>U: print/return
    end
    note over C,O: Firecrawl path may early-return after output
```
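For orientation, the wrapper the diagram refers to looks roughly like the sketch below, reconstructed from the review annotations further down (FirecrawlBackend at lines 3-14 of crawl4ai/firecrawl_backend.py, crawl at 7-8, scrape at 10-11); the exact signatures, particularly the limit parameter, are assumptions to verify against the PR diff.

```python
# Reconstructed sketch of crawl4ai/firecrawl_backend.py, not the authoritative diff.
# The `limit` parameter mirrors the sequence diagram and the suggested fix below;
# the committed code may differ.
from firecrawl import Firecrawl


class FirecrawlBackend:
    def __init__(self, api_key: str):
        self.client = Firecrawl(api_key=api_key)

    def crawl(self, url: str, limit: int = 10):
        # Multi-page crawl via the Firecrawl API.
        return self.client.crawl(url=url, limit=limit)

    def scrape(self, url: str):
        # Single-page scrape returning markdown and HTML renditions.
        return self.client.scrape(url=url, formats=["markdown", "html"])
```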
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
README.md (1)
839-856: Attribution language conflicts with Apache-2.0

Apache-2.0 does not require attribution badges. “Must include” is inconsistent with “attribution is recommended.” Reword to “recommended/optional.”

```diff
-This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](...)
+This project is licensed under the Apache License 2.0. Attribution is appreciated (see badges below). See the [Apache 2.0 License](...)
@@
-When using Crawl4AI, you must include one of the following attribution methods:
+When using Crawl4AI, you may include one of the following attribution methods (optional, appreciated):
```
🧹 Nitpick comments (9)
README.md (3)
548-559: Document API key setup and fix code fence

Add how to supply the Firecrawl API key and use a triple‑backtick fence. Also keep CLI style consistent.

`````diff
-### Firecrawl Backend Support
-
-A new backend has been added to allow crawling and scraping via [Firecrawl](https://firecrawl.dev).
-
-#### CLI Usage
-
-You can now select the Firecrawl backend with the `--backend firecrawl` option:
-
-````bash
-crwl crawl https://docs.firecrawl.dev --backend firecrawl --output markdown
-````
+### Firecrawl Backend Support
+
+A new backend allows crawling/scraping via [Firecrawl](https://firecrawl.dev).
+
+Before using it, set your API key (or configure via `crwl config`):
+
+```bash
+export FIRECRAWL_API_KEY="your-key"
+```
+
+#### CLI Usage
+
+Select the Firecrawl backend with:
+
+```bash
+crwl crawl https://docs.firecrawl.dev --backend firecrawl --output markdown
+```
`````
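As a follow-up, the key could also be stored through the CLI config rather than the shell profile; the command form below mirrors the error-message text suggested later in this review and should be checked against `crwl config --help`:

```bash
# Alternative to exporting the env var (command form assumed from the review text)
crwl config set FIRECRAWL_API_KEY your-key
```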
834-838: Remove stray authoring text from README

These lines look like internal/editor notes and should not be in the README.

```diff
-I'll help modify the license section with badges. For the halftone effect, here's a version with it:
-
-Here's the updated license section:
```
932-936: Add language to fenced code block (mdlint MD040)

Specify a language to satisfy linters and improve rendering.

````diff
-```
+```text
 UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software]. GitHub. https://github.com/unclecode/crawl4ai
````

crawl4ai/firecrawl_backend.py (3)

1-1: Guard optional dependency import with a clear error

Provide a helpful error if the `firecrawl` package isn’t installed.

```diff
-from firecrawl import Firecrawl
+try:
+    from firecrawl import Firecrawl
+except Exception as e:
+    raise ImportError(
+        "Missing optional dependency 'firecrawl'. Install with: pip install firecrawl"
+    ) from e
```
4-6: Allow env fallback for API key and validate

Small ergonomics improvement and safer defaults.

```diff
-class FirecrawlBackend:
-    def __init__(self, api_key: str):
-        self.client = Firecrawl(api_key=api_key)
+class FirecrawlBackend:
+    def __init__(self, api_key: str | None = None):
+        import os
+        api_key = api_key or os.getenv("FIRECRAWL_API_KEY")
+        if not api_key:
+            raise ValueError("Firecrawl API key is required. Set FIRECRAWL_API_KEY or pass api_key=")
+        self.client = Firecrawl(api_key=api_key)
```
10-12: Expose formats as a parameter (preserve current default)

Makes the wrapper flexible without changing behavior.

```diff
-    def scrape(self, url: str):
-        return self.client.scrape(url=url, formats=["markdown", "html"])
+    def scrape(self, url: str, formats: list[str] | None = None):
+        return self.client.scrape(url=url, formats=formats or ["markdown", "html"])
```
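If both nitpicks above land (env-var fallback in `__init__` plus the optional `formats` argument), usage would look roughly like this sketch:

```python
# Sketch assuming the proposed env-fallback constructor and `formats` parameter;
# requires FIRECRAWL_API_KEY to be set in the environment.
from crawl4ai.firecrawl_backend import FirecrawlBackend

backend = FirecrawlBackend()  # picks up FIRECRAWL_API_KEY
page = backend.scrape("https://docs.firecrawl.dev", formats=["markdown"])
print(page)
```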
crawl4ai/cli.py (3)

1006-1023: Optional: Add a flag to pass the Firecrawl API key via CLI (env still preferred)

Keeps secrets out of code and supports CI usage.

```diff
 @cli.command("crawl")
 @click.argument("url", required=True)
 @click.option(
     "--backend",
     type=click.Choice(["default", "firecrawl"]),
     default="default",
     help="Choose crawling backend"
 )
+@click.option(
+    "--firecrawl-api-key",
+    envvar="FIRECRAWL_API_KEY",
+    help="Firecrawl API key (or set FIRECRAWL_API_KEY env var)"
+)
@@
-def crawl_cmd(url: str, browser_config: str, crawler_config: str, filter_config: str,
-              extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
-              output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str, deep_crawl: str, max_pages: int, backend: str,):
+def crawl_cmd(url: str, browser_config: str, crawler_config: str, filter_config: str,
+              extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
+              output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str,
+              deep_crawl: str, max_pages: int, backend: str, firecrawl_api_key: Optional[str] = None):
```

And inside the Firecrawl block above, prefer `api_key = firecrawl_api_key or os.getenv("FIRECRAWL_API_KEY")`.
1261-1267: Firecrawl path doesn’t support `-q`/`--question` yet

If intended, document that Q&A is only supported for the default backend; otherwise, wire the Firecrawl result’s markdown into `stream_llm_response`.
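If Q&A support is wanted, the wiring could be as small as the sketch below; the `stream_llm_response` signature shown is an assumption and must be checked against cli.py before adopting it.

```python
# Hypothetical wiring inside the Firecrawl branch; assumes stream_llm_response
# accepts (content, question) -- verify the real signature in crawl4ai/cli.py.
if question:
    content = "\n\n".join(
        (item.get("markdown") or item.get("content") or "") if isinstance(item, dict) else str(item)
        for item in docs
    )
    stream_llm_response(content, question)
    return
```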
1211-1211: Non-Firecrawl path forces LXML scraping; confirm compatibility

Setting `crawler_cfg.scraping_strategy = LXMLWebScrapingStrategy()` unconditionally may undo user defaults/config. Consider honoring the config unless explicitly overridden.
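One possible shape for honoring user config, as a sketch; it assumes an unset strategy is represented as None, which should be confirmed against CrawlerRunConfig's defaults:

```python
# Only force LXML when the user/config has not already chosen a scraping strategy.
# Assumption: an unset strategy is None; adjust if the config uses another sentinel.
if getattr(crawler_cfg, "scraping_strategy", None) is None:
    crawler_cfg.scraping_strategy = LXMLWebScrapingStrategy()
```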
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- README.md (28 hunks)
- crawl4ai/cli.py (5 hunks)
- crawl4ai/firecrawl_backend.py (1 hunks)
- firecrawl_demo.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
firecrawl_demo.py (1)
crawl4ai/firecrawl_backend.py (2)
FirecrawlBackend (3-14), scrape (10-11)
crawl4ai/cli.py (1)
crawl4ai/firecrawl_backend.py (2)
FirecrawlBackend (3-14), crawl (7-8)
🪛 Gitleaks (8.27.2)
firecrawl_demo.py
[high] 5-5: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
crawl4ai/cli.py
[high] 1040-1040: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🪛 markdownlint-cli2 (0.17.2)
README.md
933-933: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (2)
crawl4ai/cli.py (2)
1008-1012: Good: backend selector added to CLI

The `--backend {default,firecrawl}` option is clear and scoped. Consider also adding this option to the default entrypoint (the command with `name=""`) if you intend to support `crwl <url> --backend firecrawl`. Otherwise, only `crwl crawl ... --backend firecrawl` will work (which is fine if intentional).
1510-1511: Good: backend propagated from default entrypoint

Forwarding the backend keeps behavior consistent when invoking via the default route. If you want `crwl <url> --backend firecrawl` to work, add the `--backend` option to the default command too (currently only the `crawl` subcommand defines it).
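A sketch of what mirroring the option might look like; the command and handler names below are placeholders, not the actual cli.py symbols:

```python
# Illustrative only: expose --backend on the default entrypoint so that
# `crwl <url> --backend firecrawl` behaves like `crwl crawl <url> --backend firecrawl`.
@cli.command(name="")
@click.argument("url", required=True)
@click.option(
    "--backend",
    type=click.Choice(["default", "firecrawl"]),
    default="default",
    help="Choose crawling backend",
)
def default_entry(url: str, backend: str):
    # Delegate to the same logic the `crawl` subcommand uses, forwarding `backend`.
    ...
```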
```python
    # Firecrawl
    if backend == "firecrawl":
        firecrawl_client = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
    docs = firecrawl_client.crawl(url)

    if not docs:
        click.echo("No documents returned by Firecrawl")
        return

    def extract_item(item):
        if isinstance(item, dict):
            return item.get("title", ""), item.get("content", "")
        elif isinstance(item, (tuple, list)) and len(item) == 2:
            return item[0], item[1]
        else:
            return "", str(item)

    if output in ["all", "json"]:
        import json
        click.echo(json.dumps(docs, indent=2))

    elif output in ["markdown", "md"]:
        markdown_text = ""
        for item in docs:
            title, content = extract_item(item)
            markdown_text += f"# {title}\n\n{content}\n\n"
        click.echo(markdown_text)

    elif output in ["markdown-fit", "md-fit"]:
        markdown_text = ""
        max_chars = 2000
        for item in docs:
            title, content = extract_item(item)
            combined = f"# {title}\n\n{content}\n\n"
            if len(markdown_text) + len(combined) > max_chars:
                break
            markdown_text += combined
        click.echo(markdown_text)

        return
```
Firecrawl block: hard-coded API key, NameError risk, output flow bypass, and missing limit propagation
- Hard-coded secret.
- `docs = firecrawl_client.crawl(url)` is unindented (executes even when backend != firecrawl) → NameError.
- Firecrawl path ignores `output_file` and doesn’t `return` for most formats, so the default crawler still runs.
- Doesn’t pass `max_pages` to Firecrawl `limit`.
```diff
- # Firecrawl
- if backend == "firecrawl":
- firecrawl_client = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
- docs = firecrawl_client.crawl(url)
-
- if not docs:
- click.echo("No documents returned by Firecrawl")
- return
+ # Firecrawl
+ if backend == "firecrawl":
+ import os
+ api_key = os.getenv("FIRECRAWL_API_KEY")
+ if not api_key:
+ raise click.ClickException(
+ "Missing FIRECRAWL_API_KEY. Set the env var or run: crwl config set FIRECRAWL_API_KEY <key>"
+ )
+ firecrawl_client = FirecrawlBackend(api_key=api_key)
+ # Propagate max_pages as crawl limit when provided
+ docs = firecrawl_client.crawl(url, limit=max_pages or 10)
+ if not docs:
+ click.echo("No documents returned by Firecrawl")
+ return
+
+ # Normalize return shape to a list of (title, content)
def extract_item(item):
if isinstance(item, dict):
- return item.get("title", ""), item.get("content", "")
+ title = item.get("title") or item.get("metadata", {}).get("title") or ""
+ content = item.get("content") or item.get("markdown") or item.get("html") or ""
+ return title, content
elif isinstance(item, (tuple, list)) and len(item) == 2:
return item[0], item[1]
else:
return "", str(item)
- if output in ["all", "json"]:
- import json
- click.echo(json.dumps(docs, indent=2))
+ # If Firecrawl returns an object with 'data', unwrap it
+ items = docs.get("data", docs) if isinstance(docs, dict) else docs
+
+ if output in ["all", "json"]:
+ payload = docs if output == "all" else items
+ text = json.dumps(payload, indent=2)
+ if output_file:
+ Path(output_file).write_text(text, encoding="utf-8")
+ else:
+ click.echo(text)
+ return
- elif output in ["markdown", "md"]:
- markdown_text = ""
- for item in docs:
- title, content = extract_item(item)
- markdown_text += f"# {title}\n\n{content}\n\n"
- click.echo(markdown_text)
+ elif output in ["markdown", "md"]:
+ markdown_text = ""
+ for item in items:
+ title, content = extract_item(item)
+ if title:
+ markdown_text += f"# {title}\n\n{content}\n\n"
+ else:
+ markdown_text += f"{content}\n\n"
+ if output_file:
+ Path(output_file).write_text(markdown_text, encoding="utf-8")
+ else:
+ click.echo(markdown_text)
+ return
- elif output in ["markdown-fit", "md-fit"]:
- markdown_text = ""
- max_chars = 2000
- for item in docs:
- title, content = extract_item(item)
- combined = f"# {title}\n\n{content}\n\n"
- if len(markdown_text) + len(combined) > max_chars:
- break
- markdown_text += combined
- click.echo(markdown_text)
-
- return
+ elif output in ["markdown-fit", "md-fit"]:
+ markdown_text = ""
+ max_chars = 2000
+ for item in items:
+ title, content = extract_item(item)
+ combined = (f"# {title}\n\n{content}\n\n" if title else f"{content}\n\n")
+ if len(markdown_text) + len(combined) > max_chars:
+ break
+ markdown_text += combined
+ if output_file:
+ Path(output_file).write_text(markdown_text, encoding="utf-8")
+ else:
+ click.echo(markdown_text)
+ return
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Gitleaks (8.27.2)
[high] 1040-1040: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🤖 Prompt for AI Agents
In crawl4ai/cli.py around lines 1038-1078, the Firecrawl block currently
hard-codes an API key, instantiates/uses firecrawl_client outside the backend
check (risking NameError), doesn't pass max_pages to the crawl limit, and fails
to respect output_file or return after handling outputs so execution falls
through to the default crawler; fix by removing the hard-coded secret and read
the API key from configuration or env var, ensure firecrawl_client is created
and docs = firecrawl_client.crawl(url, limit=max_pages) is executed only inside
the if backend == "firecrawl" block (proper indentation), after producing output
handle writing to output_file when provided (open+write) and always return after
processing any output format to prevent the rest of the function from running.
```python
backend = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
docs = backend.scrape("https://docs.firecrawl.dev")
print(docs)
```
Hard-coded API key committed (secret leak) — remove and rotate immediately
Real-looking key present. Replace with env/config; revoke/rotate the exposed key now.
```diff
-from crawl4ai.firecrawl_backend import FirecrawlBackend
+import os
+from crawl4ai.firecrawl_backend import FirecrawlBackend
@@
def main():
- backend = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
+ api_key = os.getenv("FIRECRAWL_API_KEY")
+ if not api_key:
+ raise SystemExit("FIRECRAWL_API_KEY is not set")
+ backend = FirecrawlBackend(api_key=api_key)
docs = backend.scrape("https://docs.firecrawl.dev")
    print(docs)
```

Action items:
- Revoke/rotate the leaked key at the Firecrawl provider.
- Force-push a commit removing the key from git history if necessary (or use GitHub secret scanning remediation).
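For the history cleanup, one common approach is a `git filter-repo` text replacement (sketch; assumes git-filter-repo is installed and the key has already been rotated):

```bash
# replacements.txt contains the leaked literal, one rule per line, e.g.:
#   fc-fa43e06d8c1348b58200a39911a4ae9c==>REMOVED
git filter-repo --replace-text replacements.txt
# Force-push the rewritten history (coordinate with other contributors first)
git push --force-with-lease origin main
```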
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-backend = FirecrawlBackend(api_key="fc-fa43e06d8c1348b58200a39911a4ae9c")
-docs = backend.scrape("https://docs.firecrawl.dev")
-print(docs)
+import os
+from crawl4ai.firecrawl_backend import FirecrawlBackend
+
+def main():
+    api_key = os.getenv("FIRECRAWL_API_KEY")
+    if not api_key:
+        raise SystemExit("FIRECRAWL_API_KEY is not set")
+    backend = FirecrawlBackend(api_key=api_key)
+    docs = backend.scrape("https://docs.firecrawl.dev")
+    print(docs)
```
🧰 Tools
🪛 Gitleaks (8.27.2)
[high] 5-5: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🤖 Prompt for AI Agents
In firecrawl_demo.py around lines 5 to 7, a real API key is hard-coded which
leaks secrets; remove the literal key, rotate/revoke it with the Firecrawl
provider immediately, and replace usage with a secure retrieval (e.g., read from
an environment variable or config secret manager); update code to read
os.environ["FIRECRAWL_API_KEY"] (or equivalent) and fail fast with a clear error
if missing, then purge the key from git history or force-push a commit after
rotating the key to ensure it is not stored in the repository.
yikes @Akeemkabiru
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

This PR introduces Firecrawl as an optional backend for crawl4ai. It updates the README and the CLI to document and expose the new backend.

Example usage: `python -m crawl4ai.cli crawl https://docs.firecrawl.dev --backend firecrawl --output markdown`
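Assuming the review feedback lands and the key comes from the environment rather than source, the end-to-end flow would look roughly like:

```bash
# Sketch: key supplied via env var, never committed
export FIRECRAWL_API_KEY="<your-key>"
python -m crawl4ai.cli crawl https://docs.firecrawl.dev --backend firecrawl --output markdown
```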