feat: Added a way to pass both BrowserConfig and CrawlerConfig through /crawler api endpoint #1266
Conversation
""" WalkthroughThe changes add new Pydantic models Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Server
    participant Subprocess
    Client->>Server: GET /scripts
    Server->>Server: List files in /app/scripts
    Server-->>Client: Return list of script filenames
    Client->>Server: POST /scripts/run (script_name)
    Server->>Server: Validate script_name and existence
    Server->>Subprocess: Run script asynchronously
    Subprocess-->>Server: Return stdout/stderr
    alt Output or error present
        Server-->>Client: HTTP 400 with output/error
    else No output or error
        Server-->>Client: HTTP 200 with output
    end
```
Actionable comments posted: 2
🧹 Nitpick comments (2)
deploy/docker/schemas.py (1)
7-9: Fix formatting issue: add a blank line before the class definition.

Static analysis correctly identified a PEP 8 formatting issue.

```diff
+
 class CrawlConfigs(BaseModel):
     browser_config: Optional[Dict] = Field(default_factory=dict)
     crawler_config: Optional[Dict] = Field(default_factory=dict)
```
deploy/docker/server.py (1)
441-446: Consider extracting config extraction logic to reduce duplication.

The same config extraction logic is duplicated between both endpoints. Consider creating a helper function to improve maintainability.

```diff
+def extract_crawl_configs(crawl_request: CrawlRequest) -> tuple[dict, dict]:
+    """Extract browser and crawler configs from request."""
+    if crawl_request.configs:
+        return crawl_request.configs.browser_config, crawl_request.configs.crawler_config
+    return {}, {}
```

Then use it in both endpoints:

```diff
-    browser_config = {}
-    crawler_config = {}
-    if crawl_request.configs:
-        browser_config = crawl_request.configs.browser_config if crawl_request.configs.browser_config else {}
-        crawler_config = crawl_request.configs.crawler_config if crawl_request.configs.crawler_config else {}
+    browser_config, crawler_config = extract_crawl_configs(crawl_request)
```

Also applies to: 466-471
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deploy/docker/schemas.py (1 hunks)
- deploy/docker/server.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
deploy/docker/server.py (1)
deploy/docker/api.py (2)
- handle_crawl_request (393-476)
- handle_stream_crawl_request (478-527)
🪛 Flake8 (7.2.0)
deploy/docker/schemas.py
[error] 11-11: expected 2 blank lines, found 1
(E302)
🔇 Additional comments (1)
deploy/docker/schemas.py (1)
11-13: LGTM! Well-structured schema design.

The nested configuration approach allows flexible API usage while maintaining backward compatibility. The use of `default_factory=dict` for mutable defaults is appropriate.
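For illustration only, a minimal sketch (outside this PR, assuming Pydantic v2; the `"headless"` key is just a placeholder) of why `default_factory` is the right choice for mutable defaults:

```python
from typing import Dict, Optional
from pydantic import BaseModel, Field


class CrawlConfigs(BaseModel):
    # default_factory builds a fresh dict per instance, so instances
    # never share (and never accidentally mutate) a single default object
    browser_config: Optional[Dict] = Field(default_factory=dict)
    crawler_config: Optional[Dict] = Field(default_factory=dict)


a, b = CrawlConfigs(), CrawlConfigs()
a.browser_config["headless"] = True  # placeholder key for demonstration
print(b.browser_config)  # {} -- each instance got its own dict
```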
Actionable comments posted: 1
🧹 Nitpick comments (1)
deploy/docker/schemas.py (1)
7-7: Fix formatting: remove extra blank lines.

The static analysis tool flagged too many blank lines before the class definition.

```diff
-
-
+
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deploy/docker/schemas.py (1 hunks)
- deploy/docker/server.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deploy/docker/server.py
🧰 Additional context used
🪛 Flake8 (7.2.0)
deploy/docker/schemas.py
[error] 8-8: too many blank lines (3)
(E303)
[error] 12-12: expected 2 blank lines, found 1
(E302)
🔇 Additional comments (1)
deploy/docker/schemas.py (1)
8-10: Well-designed configuration grouping.

The `CrawlConfigs` class effectively groups related configuration options and uses proper Pydantic patterns:

- `Field(default_factory=dict)` correctly handles mutable defaults
- Optional typing allows flexibility in providing configurations
- Clear separation of browser vs crawler configurations
```python
class CrawlRequest(BaseModel):
    urls: List[str] = Field(min_length=1, max_length=100)
    configs: Optional[CrawlConfigs] = None
```
💡 Verification agent
🧩 Analysis chain
Fix formatting and verify backward compatibility.
The modification to `CrawlRequest` looks good functionally, but there's a formatting issue and a potential compatibility concern.

```diff
-class CrawlRequest(BaseModel):
+
+class CrawlRequest(BaseModel):
```

Please verify that existing clients can still use the API without providing the `configs` field, and that the server properly handles requests with the old structure if any exist:
🏁 Script executed:

```shell
#!/bin/bash
# Description: Check how CrawlRequest is used in server endpoints to ensure backward compatibility

# Search for CrawlRequest usage in server files
rg -A 10 -B 5 "CrawlRequest" --type py

# Look for direct access to browser_config or crawler_config fields that might break
ast-grep --pattern 'request.browser_config'
ast-grep --pattern 'request.crawler_config'
```
Length of output: 4032
Default `configs` to maintain backward compatibility

The server code assumes `crawl_request.configs` is always a `CrawlConfigs` instance; if a client omits `configs`, accessing `.browser_config` or `.crawler_config` will raise an error. To fix:

• In deploy/docker/schemas.py, replace the optional `configs` field with a default factory so that missing configs still yield an empty `CrawlConfigs`.
• You can drop the `Optional` import once this is updated.
```diff
--- deploy/docker/schemas.py
@@
-class CrawlRequest(BaseModel):
-    urls: List[str] = Field(min_length=1, max_length=100)
-    configs: Optional[CrawlConfigs] = None
+class CrawlRequest(BaseModel):
+    urls: List[str] = Field(min_length=1, max_length=100)
+    configs: CrawlConfigs = Field(default_factory=CrawlConfigs)
```

No changes are needed in server.py once `configs` always defaults to an empty model.
Ensure you keep two blank lines between top-level classes per PEP8 for consistent formatting.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
class CrawlRequest(BaseModel):
    urls: List[str] = Field(min_length=1, max_length=100)
    configs: CrawlConfigs = Field(default_factory=CrawlConfigs)
```
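For illustration only, a minimal sketch (assuming Pydantic v2, mirroring the suggestion above) showing that a request which omits `configs` still validates and yields an empty `CrawlConfigs`:

```python
from typing import Dict, List, Optional
from pydantic import BaseModel, Field


class CrawlConfigs(BaseModel):
    browser_config: Optional[Dict] = Field(default_factory=dict)
    crawler_config: Optional[Dict] = Field(default_factory=dict)


class CrawlRequest(BaseModel):
    urls: List[str] = Field(min_length=1, max_length=100)
    configs: CrawlConfigs = Field(default_factory=CrawlConfigs)


# An old-style body without "configs" still validates...
req = CrawlRequest(urls=["https://example.com"])
# ...and the server can read both configs without None checks.
print(req.configs.browser_config, req.configs.crawler_config)  # {} {}
```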
🧰 Tools
🪛 Flake8 (7.2.0)
[error] 12-12: expected 2 blank lines, found 1
(E302)
🤖 Prompt for AI Agents
In deploy/docker/schemas.py around lines 12 to 14, the configs field in
CrawlRequest is optional but the server expects it always to be a CrawlConfigs
instance, causing errors if omitted. Fix this by removing Optional from configs
and setting its default using a default_factory that returns an empty
CrawlConfigs instance. Also, remove the now unused Optional import and ensure
there are two blank lines between top-level classes to comply with PEP8
formatting.
I have to change tests as well.
Actionable comments posted: 5
♻️ Duplicate comments (2)
deploy/docker/server.py (2)
443-444: Simplify the conditional logic for config extraction.

The current logic has redundant conditional checks since the schema should already provide defaults.

```diff
-        browser_config = crawl_request.configs.browser_config if crawl_request.configs.browser_config else {}
-        crawler_config = crawl_request.configs.crawler_config if crawl_request.configs.crawler_config else {}
+        browser_config = crawl_request.configs.browser_config if crawl_request.configs else {}
+        crawler_config = crawl_request.configs.crawler_config if crawl_request.configs else {}
```
465-466: Simplify the conditional logic for config extraction.

Same redundant conditional checks as in the /crawl endpoint.

```diff
-        browser_config = crawl_request.configs.browser_config if crawl_request.configs.browser_config else {}
-        crawler_config = crawl_request.configs.crawler_config if crawl_request.configs.crawler_config else {}
+        browser_config = crawl_request.configs.browser_config if crawl_request.configs else {}
+        crawler_config = crawl_request.configs.crawler_config if crawl_request.configs else {}
```
🧹 Nitpick comments (4)
docker-compose.yml (2)
42-43: CPU limit seems restrictive for script execution.

The CPU limit of 0.1 cores may be insufficient for script execution, especially if scripts perform intensive operations. Consider adjusting based on expected script workload.

49-49: Fix the missing newline at end of file.

```diff
+
```
deploy/docker/server.py (2)
19-19: Remove duplicate import.

The `os` module is imported twice (lines 19 and 40), which is redundant.

```diff
-import os
```

40-40: Clean up import statement.

This line redefines the `os` import and also imports `subprocess`. Remove the `os` part since it's already imported.

```diff
-import os, subprocess
+import subprocess
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- deploy/docker/schemas.py (2 hunks)
- deploy/docker/server.py (5 hunks)
- docker-compose.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deploy/docker/schemas.py
🧰 Additional context used
🪛 YAMLlint (1.37.1)
docker-compose.yml
[error] 49-49: no new line character at the end of file
(new-line-at-end-of-file)
🪛 Ruff (0.11.9)
deploy/docker/server.py
38-38: `utils.FilterType` imported but unused. Remove unused import: `utils.FilterType` (F401)

40-40: Redefinition of unused `os` from line 19. Remove definition: `os` (F811)

499-499: Do not perform function call `Depends` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable (B008)

534-534: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
🔇 Additional comments (3)
docker-compose.yml (1)
26-26: Verify the necessity of privileged mode.

The `privileged: true` setting grants the container extensive host system access. Ensure this is required for the script execution functionality, as it significantly reduces container isolation. Consider if specific capabilities would be sufficient instead of full privileged access.
deploy/docker/server.py (2)
499-499: Fix function call in argument defaults.

The `Depends(token_dep)` call in the function signature should be avoided in argument defaults.

```diff
-async def scripts_run(
-    request: Request,
-    body: ScriptRequest,
-    _td: Dict = Depends(token_dep),
-):
+async def scripts_run(
+    request: Request,
+    body: ScriptRequest,
+    _td: Dict = Depends(token_dep),
+):
```

Note: This appears to be a false positive from the static analysis tool, as this is the correct way to use FastAPI dependencies.
Likely an incorrect or invalid review comment.
484-492: Security concern: Directory traversal vulnerability.

The /scripts endpoint lists files without proper validation, which could potentially expose sensitive files if the scripts directory is compromised or misconfigured. Consider adding proper validation and restricting file extensions:

```diff
-    return JSONResponse([file for file in os.listdir(script_folder) if os.path.isfile(os.path.join(script_folder, file))])
+    python_files = [file for file in os.listdir(script_folder)
+                    if file.endswith('.py') and os.path.isfile(os.path.join(script_folder, file))]
+    return JSONResponse(python_files)
```

Likely an incorrect or invalid review comment.
```yaml
    volumes:
      - ~/scripts:/app/scripts:rw
```
🛠️ Refactor suggestion
Review the security implications of the scripts volume mount.
The volume mount `~/scripts:/app/scripts:rw` with read-write permissions allows the container to modify host files. Consider if read-only access would be sufficient for script execution, or implement proper access controls to prevent potential security issues.

Consider using a read-only mount if scripts don't need to be modified:

```diff
-      - ~/scripts:/app/scripts:rw
+      - ~/scripts:/app/scripts:ro
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
    volumes:
      - ~/scripts:/app/scripts:ro
```
🤖 Prompt for AI Agents
In docker-compose.yml at lines 30-31, the volume mount
'~/scripts:/app/scripts:rw' grants read-write access, allowing the container to
modify host files. To improve security, change the mount to read-only by
replacing ':rw' with ':ro' if the container only needs to execute scripts
without modifying them. Alternatively, implement stricter access controls to
limit potential risks.
```python
    except subprocess.CalledProcessError as e:
        # Return the command execution error details
        response_data = {
            "Status": "Failed to execute the script",
            "Details": str(e)
        }
        raise HTTPException(400, response_data)
```
🛠️ Refactor suggestion
Fix exception handling with proper chaining.
The exception handling should preserve the original exception chain.
```diff
-    except subprocess.CalledProcessError as e:
-        # Return the command execution error details
-        response_data = {
-            "Status": "Failed to execute the script",
-            "Details": str(e)
-        }
-        raise HTTPException(400, response_data)
+    except subprocess.CalledProcessError as e:
+        response_data = {
+            "Status": "Failed to execute the script",
+            "Details": str(e)
+        }
+        raise HTTPException(400, response_data) from e
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    except subprocess.CalledProcessError as e:
        response_data = {
            "Status": "Failed to execute the script",
            "Details": str(e)
        }
        raise HTTPException(400, response_data) from e
```
🧰 Tools
🪛 Ruff (0.11.9)
534-534: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
🤖 Prompt for AI Agents
In deploy/docker/server.py around lines 528 to 534, the exception handling for
subprocess.CalledProcessError currently raises an HTTPException without
preserving the original exception context. Modify the raise statement to include
"from e" to properly chain the original exception, ensuring the traceback and
error context are preserved for debugging.
@app.post("/scripts/run") | ||
@limiter.limit(config["rate_limiting"]["default_limit"]) | ||
async def scripts_run( | ||
request: Request, | ||
body: ScriptRequest, | ||
_td: Dict = Depends(token_dep), | ||
): | ||
scripts_directory = '/app/scripts' # Update this path as needed | ||
|
||
if not body or not body.script_name: | ||
raise HTTPException(400, "You have to provide an script name") | ||
|
||
if body.script_name.lower().endswith('.py'): | ||
raise HTTPException(400, "Provide only the name, without the extension") | ||
|
||
script_name = f'{body.script_name}.py' | ||
script_path = os.path.join(scripts_directory, script_name) | ||
|
||
if not os.path.isfile(script_path): | ||
raise HTTPException(400, "No script found with that name. Check available in /scripts") | ||
|
||
try: | ||
result = await subprocess.run(['python', script_path], check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) | ||
output = result.stdout.decode('utf-8') | ||
error_output = result.stderr.decode('utf-8') | ||
|
||
if output or error_output: | ||
response_data = { | ||
"Output": f"{output}{error_output}", | ||
"Status": "Script executed with output" | ||
} | ||
raise HTTPException(400, response_data) | ||
|
||
return JSONResponse(output) | ||
except subprocess.CalledProcessError as e: | ||
# Return the command execution error details | ||
response_data = { | ||
"Status": "Failed to execute the script", | ||
"Details": str(e) | ||
} | ||
raise HTTPException(400, response_data) | ||
|
Critical security vulnerability: Arbitrary script execution.
The /scripts/run endpoint executes arbitrary Python files without proper validation or sandboxing. This poses a significant security risk, as it allows execution of any Python code in the scripts directory.
Consider implementing proper security measures:
- Validate script content before execution
- Use a sandboxed environment
- Implement proper input validation
- Add logging for security monitoring
- Consider using a task queue for async execution
```diff
+    # Add proper validation and sandboxing
+    if not re.match(r'^[a-zA-Z0-9_-]+$', body.script_name):
+        raise HTTPException(400, "Invalid script name format")
```
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.11.9)
499-499: Do not perform function call `Depends` in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable (B008)

534-534: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
🤖 Prompt for AI Agents
In deploy/docker/server.py around lines 494 to 535, the /scripts/run endpoint
executes arbitrary Python scripts from the scripts directory without sufficient
validation or sandboxing, creating a critical security risk. To fix this,
implement strict validation to allow only a predefined whitelist of script
names, avoid executing scripts directly from user input, and run scripts in a
sandboxed or isolated environment. Add input validation to reject any unexpected
or malicious input, include logging for all script execution attempts for
security monitoring, and consider offloading script execution to a controlled
async task queue to limit resource abuse and improve security.
"Output": f"{output}{error_output}", | ||
"Status": "Script executed with output" | ||
} | ||
raise HTTPException(400, response_data) |
🛠️ Refactor suggestion
Improve exception handling.
The current exception handling raises HTTPException with output, which may not be the intended behavior. Also, the exception chain should be preserved.
```diff
-            raise HTTPException(400, response_data)
+            return JSONResponse(response_data, status_code=400)
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In deploy/docker/server.py at line 525, the HTTPException is raised without
preserving the original exception context, which can obscure the root cause.
Modify the exception handling to include the original exception as the cause by
using the "from" keyword when raising HTTPException, ensuring the exception
chain is preserved for better debugging.
raise HTTPException(400, "No script found with that name. Check available in /scripts") | ||
|
||
try: | ||
result = await subprocess.run(['python', script_path], check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) |
🛠️ Refactor suggestion
Fix subprocess usage for async context.
Using `subprocess.run` in an async function blocks the event loop. Use `asyncio.create_subprocess_exec` instead.
```diff
-        result = await subprocess.run(['python', script_path], check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+        proc = await asyncio.create_subprocess_exec(
+            'python', script_path,
+            stdout=asyncio.subprocess.PIPE,
+            stderr=asyncio.subprocess.PIPE
+        )
+        stdout, stderr = await proc.communicate()
+        if proc.returncode != 0:
+            raise subprocess.CalledProcessError(proc.returncode, ['python', script_path])
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        proc = await asyncio.create_subprocess_exec(
            'python', script_path,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE
        )
        stdout, stderr = await proc.communicate()
        if proc.returncode != 0:
            raise subprocess.CalledProcessError(proc.returncode, ['python', script_path])
```
🤖 Prompt for AI Agents
In deploy/docker/server.py at line 516, the use of subprocess.run in an async
function blocks the event loop. Replace subprocess.run with
asyncio.create_subprocess_exec to run the subprocess asynchronously. Adjust the
code to await the subprocess completion and capture stdout and stderr using the
appropriate asyncio subprocess methods.
I'm closing this PR due to new ideas I wish to implement.
Summary
Considering the following scenario:
A page to be crawled may require a prior login. The login information is (usually) stored in cookies, which can be provided in BrowserConfig, and that works like a charm. But if one needs to filter some tags by using excluded_tags, target_elements or css_selector (which are provided in CrawlerRunConfig), that is not possible.
That can be managed through scripting, of course, in a custom file, but not if one is using the crawler through Docker on the default endpoint.
This PR focuses on changing that behaviour: now you can pass both configs in the request body.
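For illustration, a rough sketch of a request body that exercises the new `configs` field (a sketch only: the host/port, authentication, cookie values, and selector values are placeholders, not part of this PR):

```python
import requests

payload = {
    "urls": ["https://example.com/account"],
    "configs": {
        # forwarded as BrowserConfig, e.g. cookies from a prior login (placeholder values)
        "browser_config": {
            "cookies": [{"name": "session", "value": "<token>", "url": "https://example.com"}]
        },
        # forwarded as CrawlerRunConfig, e.g. tag filtering (placeholder selectors)
        "crawler_config": {"excluded_tags": ["nav", "footer"], "css_selector": "main"},
    },
}

# assumed local Docker deployment; adjust host, port, and auth as needed
resp = requests.post("http://localhost:11235/crawl", json=payload, timeout=120)
print(resp.status_code, resp.json())
```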
List of files changed and why
docker/server.py - Changed the /crawl and /crawl/stream API endpoints to handle cases where no .configs is passed in the body, as well as cases where browser_config and/or crawler_config are missing inside configs.
docker/schemas.py - Changed type to fit new schema
How Has This Been Tested?
Tested calling the API using Postman. Cases:
Checklist: