-
Notifications
You must be signed in to change notification settings - Fork 18.5k
feat: add maxDiscoveryDepth and crawlEntireDomain for Firecrawl #28337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add maxDiscoveryDepth and crawlEntireDomain for Firecrawl #28337
Conversation
Signed-off-by: NeatGuyCoding <[email protected]>
…in templates Signed-off-by: NeatGuyCoding <[email protected]>
Summary of ChangesHello @NeatGuyCoding, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Firecrawl web crawling capabilities by integrating Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds support for maxDiscoveryDepth and crawlEntireDomain parameters for the Firecrawl crawler, which is a great enhancement. The changes on the frontend and in the pipeline configurations look good.
I've left a couple of comments for improvement:
- In the backend, the logic for adding Firecrawl parameters can be improved to better reflect the API's behavior, especially regarding the relationship between
crawlEntireDomainand other crawling options. - On the frontend, the user experience could be enhanced by disabling irrelevant options when
Crawl entire domainis selected, making the interface more intuitive.
Overall, this is a good contribution. Addressing these points will improve the code's correctness and usability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for two new Firecrawl API v2 parameters: maxDiscoveryDepth and crawlEntireDomain. The implementation includes backend service changes, frontend UI updates, i18n translations, and RAG pipeline configuration updates.
Key Changes
- Added
crawl_entire_domainboolean field to theCrawlOptionsdataclass and TypeScript type - Implemented mapping of
max_depthto Firecrawl'smaxDiscoveryDepthparameter - Implemented mapping of
crawl_entire_domainto Firecrawl'scrawlEntireDomainparameter - Added UI checkbox control and translations for the "Crawl entire domain" option
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| api/services/website_service.py | Added crawl_entire_domain field to CrawlOptions and parameter mappings to Firecrawl API |
| web/models/datasets.ts | Added crawl_entire_domain boolean field to CrawlOptions type |
| web/app/components/datasets/create/index.tsx | Added default value false for crawl_entire_domain in DEFAULT_CRAWL_OPTIONS |
| web/app/components/datasets/create/website/firecrawl/options.tsx | Added checkbox UI control for "Crawl entire domain" option |
| web/i18n/en-US/dataset-creation.ts | Added English translation for "Crawl entire domain" |
| web/i18n/zh-Hans/dataset-creation.ts | Added Chinese translation for "爬取整个域名" |
| api/services/rag_pipeline/transform/website-crawl-parentchild.yml | Added crawl_entire_domain parameter to workflow configuration |
| api/services/rag_pipeline/transform/website-crawl-general-high-quality.yml | Added crawl_entire_domain parameter to workflow configuration |
| api/services/rag_pipeline/transform/website-crawl-general-economy.yml | Added crawl_entire_domain parameter to workflow configuration |
| api/constants/pipeline_templates.json | Added crawl_entire_domain variable references to pipeline templates |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
api/services/rag_pipeline/transform/website-crawl-parentchild.yml
Outdated
Show resolved
Hide resolved
api/services/rag_pipeline/transform/website-crawl-general-high-quality.yml
Outdated
Show resolved
Hide resolved
api/services/rag_pipeline/transform/website-crawl-general-economy.yml
Outdated
Show resolved
Hide resolved
…omy.yml Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
…-quality.yml Co-authored-by: Copilot <[email protected]>
|
Please link an existing issue or create a new one in the description :) Thank you very much. |
Important
Fixes #<issue number>.Summary
This PR adds support for
maxDiscoveryDepthandcrawlEntireDomainparameters in the Firecrawl crawler integration, aligning with the official Firecrawl API v2 specification.Changes:
crawl_entire_domainoption toCrawlOptionsdataclassmaxDiscoveryDepthparameter mapping from existingmax_depthfieldcrawlEntireDomainparameter support when the option is enabledcrawl_entire_domain: falseTechnical Details:
website_service.pyto passmaxDiscoveryDepthandcrawlEntireDomainto Firecrawl API when appropriateCrawlOptionstype, added UI component, and updated default valuesScreenshots
max_depthnot passed to Firecrawl APImax_depthnow correctly mapped tomaxDiscoveryDepthChecklist
This change requires a documentation update, included: Dify Document
I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
I've updated the documentation accordingly.
I ran
dev/reformat(backend) andcd web && npx lint-staged(frontend) to appease the lint gods