@NeatGuyCoding
Contributor

Important

  1. Make sure you have read our contribution guidelines
  2. Ensure there is an associated issue and you have been assigned to it
  3. Use the correct syntax to link this PR: Fixes #<issue number>.

Summary

This PR adds support for maxDiscoveryDepth and crawlEntireDomain parameters in the Firecrawl crawler integration, aligning with the official Firecrawl API v2 specification.

Changes:

  • Added crawl_entire_domain option to CrawlOptions dataclass
  • Implemented maxDiscoveryDepth parameter mapping from existing max_depth field
  • Added crawlEntireDomain parameter support when the option is enabled
  • Added UI checkbox control for "Crawl entire domain" option in Firecrawl settings
  • Added i18n translations (English and Chinese) for the new option
  • Updated default crawl options to include crawl_entire_domain: false

Technical Details:

  • Backend: Modified website_service.py to pass maxDiscoveryDepth and crawlEntireDomain to Firecrawl API when appropriate
  • Frontend: Updated CrawlOptions type, added UI component, and updated default values
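
As a rough illustration of the backend mapping described above, here is a minimal sketch. It is not the code from `website_service.py`: the helper name `build_firecrawl_params` and the `limit` field are assumptions made for the example, and only `max_depth`, `crawl_entire_domain`, `maxDiscoveryDepth`, and `crawlEntireDomain` come from this PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class CrawlOptions:
    # Only max_depth and crawl_entire_domain are named in this PR;
    # limit is an illustrative assumption.
    limit: int = 1
    max_depth: Optional[int] = None
    crawl_entire_domain: bool = False


def build_firecrawl_params(options: CrawlOptions) -> Dict[str, Any]:
    """Hypothetical helper: translate CrawlOptions into Firecrawl request params."""
    params: Dict[str, Any] = {"limit": options.limit}
    # Send maxDiscoveryDepth only when a depth was actually configured.
    if options.max_depth is not None:
        params["maxDiscoveryDepth"] = options.max_depth
    # Send crawlEntireDomain only when the option is enabled.
    if options.crawl_entire_domain:
        params["crawlEntireDomain"] = True
    return params
```

Calling `build_firecrawl_params(CrawlOptions(max_depth=2, crawl_entire_domain=True))` would yield a payload containing both new keys, while the default options leave them out so existing crawls are sent unchanged.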

Screenshots

| Before | After |
| --- | --- |
| No option for crawling entire domain | Added "Crawl entire domain" checkbox in Firecrawl options |
| max_depth not passed to Firecrawl API | max_depth now correctly mapped to maxDiscoveryDepth |

Checklist

  • This change requires a documentation update, included: Dify Document

  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)

  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.

  • I've updated the documentation accordingly.

  • I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

Copilot AI review requested due to automatic review settings November 18, 2025 09:50
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. 💪 enhancement New feature or request labels Nov 18, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @NeatGuyCoding, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request extends the Firecrawl integration with the maxDiscoveryDepth and crawlEntireDomain parameters from the Firecrawl API v2. Users can now set how deep the crawler follows links and whether it crawls the entire domain, which gives them more control over data ingestion. The changes touch the backend service logic, the frontend user interface, and the pipeline configuration files.

Highlights

  • Firecrawl Parameter Support: Added support for maxDiscoveryDepth and crawlEntireDomain parameters in the Firecrawl integration, aligning with the official Firecrawl API v2 specification for more granular control over crawling.
  • Backend Logic Update: Modified api/services/website_service.py to include crawl_entire_domain in CrawlOptions and to correctly map max_depth to maxDiscoveryDepth and crawl_entire_domain to crawlEntireDomain when making calls to the Firecrawl API.
  • Frontend UI Integration: Introduced a new 'Crawl entire domain' checkbox in the Firecrawl settings user interface, along with corresponding internationalization (i18n) translations for both English and Chinese.
  • Configuration Updates: Updated default crawl options in web/app/components/datasets/create/index.tsx and adjusted pipeline templates in api/constants/pipeline_templates.json and api/services/rag_pipeline/transform/*.yml to incorporate the new crawl_entire_domain setting.
  • Type Definition Enhancement: Extended the CrawlOptions TypeScript type in web/models/datasets.ts to include the new crawl_entire_domain property, ensuring type safety and consistency across the frontend.
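
To make the default handling concrete, the following is a small sketch of how an options payload that predates this change could be merged with the new default. The constant and helper names are hypothetical and chosen for illustration; the only detail taken from this PR is that `crawl_entire_domain` defaults to false.

```python
from typing import Any, Dict, Mapping

# Mirrors the default described in this PR (crawl_entire_domain: false).
# The constant name and merge strategy are assumptions, not the PR's code.
DEFAULT_CRAWL_OPTIONS: Dict[str, Any] = {"crawl_entire_domain": False}


def with_crawl_defaults(raw: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: fill in missing crawl options with defaults."""
    # Keys present in raw win; missing keys fall back to the defaults.
    return {**DEFAULT_CRAWL_OPTIONS, **raw}
```

With this approach a saved configuration that has no `crawl_entire_domain` key behaves exactly as before, since the missing key resolves to `False`.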

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for maxDiscoveryDepth and crawlEntireDomain parameters for the Firecrawl crawler, which is a great enhancement. The changes on the frontend and in the pipeline configurations look good.

I've left a couple of comments for improvement:

  1. In the backend, the logic for adding Firecrawl parameters can be improved to better reflect the API's behavior, especially regarding the relationship between crawlEntireDomain and other crawling options.
  2. On the frontend, the user experience could be enhanced by disabling irrelevant options when Crawl entire domain is selected, making the interface more intuitive.

Overall, this is a good contribution. Addressing these points will improve the code's correctness and usability.

Copilot finished reviewing on behalf of NeatGuyCoding November 18, 2025 09:54
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds support for two new Firecrawl API v2 parameters: maxDiscoveryDepth and crawlEntireDomain. The implementation includes backend service changes, frontend UI updates, i18n translations, and RAG pipeline configuration updates.

Key Changes

  • Added crawl_entire_domain boolean field to the CrawlOptions dataclass and TypeScript type
  • Implemented mapping of max_depth to Firecrawl's maxDiscoveryDepth parameter
  • Implemented mapping of crawl_entire_domain to Firecrawl's crawlEntireDomain parameter
  • Added UI checkbox control and translations for the "Crawl entire domain" option

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| api/services/website_service.py | Added crawl_entire_domain field to CrawlOptions and parameter mappings to Firecrawl API |
| web/models/datasets.ts | Added crawl_entire_domain boolean field to CrawlOptions type |
| web/app/components/datasets/create/index.tsx | Added default value false for crawl_entire_domain in DEFAULT_CRAWL_OPTIONS |
| web/app/components/datasets/create/website/firecrawl/options.tsx | Added checkbox UI control for "Crawl entire domain" option |
| web/i18n/en-US/dataset-creation.ts | Added English translation for "Crawl entire domain" |
| web/i18n/zh-Hans/dataset-creation.ts | Added Chinese translation for "爬取整个域名" |
| api/services/rag_pipeline/transform/website-crawl-parentchild.yml | Added crawl_entire_domain parameter to workflow configuration |
| api/services/rag_pipeline/transform/website-crawl-general-high-quality.yml | Added crawl_entire_domain parameter to workflow configuration |
| api/services/rag_pipeline/transform/website-crawl-general-economy.yml | Added crawl_entire_domain parameter to workflow configuration |
| api/constants/pipeline_templates.json | Added crawl_entire_domain variable references to pipeline templates |

@crazywoola
Member

Please link an existing issue or create a new one in the description :) Thank you very much.
