-
Notifications
You must be signed in to change notification settings - Fork 1
Feature: add copy mode support for HTTP and S3 #37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @kebe7jun, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new syncMode
option for HTTP and S3 dataset sources, allowing users to specify whether data synchronization should perform a full sync
(default, removing absent files) or a copy
(only adding/updating files without deletion). This enhancement provides greater flexibility in managing how external data sources are ingested and maintained within the system.
Highlights
- New
syncMode
option for HTTP and S3: Introduced asyncMode
option for HTTP and S3 dataset sources, allowing users to choose between 'sync' (default, removes files in destination that don't exist in source) and 'copy' (only adds/updates files without removing existing ones). This provides greater control over data synchronization behavior. - API and CRD updates: The
DatasetSource
API definition (api/dataset/v1alpha1/dataset_types.go
) and its corresponding Custom Resource Definition (CRD) schema (config/crd/bases/dataset.baizeai.io_datasets.yaml
) have been updated to include and document the newsyncMode
option for S3 and HTTP sources. - Dynamic
rclone
command execution: The underlyingrclone
commands executed by the HTTP and S3 data loaders (internal/pkg/datasources/datasource_http.go
,internal/pkg/datasources/datasource_s3.go
) now dynamically use the specifiedsyncMode
('sync' or 'copy') instead of a hardcoded 'sync' operation. - Tool version upgrade: The
CONTROLLER_TOOLS_VERSION
in theMakefile
has been upgraded fromv0.17.1
tov0.18.0
, which necessitated regenerating the CRDs to reflect the new tool version.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a copy
mode for HTTP and S3 data sources, in addition to the existing sync
mode. The changes look good overall, but I've found a critical bug in the error handling for HTTP data sources where an invalid URI might not be caught. I've also pointed out some opportunities to improve code clarity and update the API documentation to reflect the new syncMode
option for HTTP sources.
_, err := url.Parse(options.URI) | ||
if err != nil { | ||
return nil, fmt.Errorf("failed to parse uri %s: %w", options.URI, err) | ||
} | ||
|
||
h.httpOptions.basicAuthUsername = secrets.Username |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error from url.Parse
is not being checked. The err
variable is declared on line 31, but its value is overwritten on line 37 before being checked. This means an invalid URI will not be caught when creating the loader, leading to a failure much later during the Sync
operation.
It's best practice to validate inputs as early as possible ("fail-fast"). The URI validation should be performed here in NewHTTPLoader
. Please restore the validation.
_, err := url.Parse(options.URI) | |
if err != nil { | |
return nil, fmt.Errorf("failed to parse uri %s: %w", options.URI, err) | |
} | |
h.httpOptions.basicAuthUsername = secrets.Username | |
_, err := url.Parse(options.URI) | |
if err != nil { | |
return nil, fmt.Errorf("failed to parse uri %s: %w", options.URI, err) | |
} |
// - S3: region, endpoint, provider, syncMode | ||
// - HTTP: any key-value pair will be passed to the underlying http client as http headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for HTTP options is now incomplete. This pull request adds syncMode
as a valid option for HTTP data sources, but the comment here was not updated to reflect that. Please add syncMode
to the list of supported keys for HTTP.
// - S3: region, endpoint, provider, syncMode | |
// - HTTP: any key-value pair will be passed to the underlying http client as http headers | |
// - S3: region, endpoint, provider, syncMode | |
// - HTTP: any key-value pair will be passed to the underlying http client as http headers, syncMode |
- S3: region, endpoint, provider, syncMode | ||
- HTTP: any key-value pair will be passed to the underlying http client as http headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for HTTP options in the CRD is now incomplete. This pull request adds syncMode
as a valid option for HTTP data sources, but the description here was not updated to reflect that. Please update the list of supported keys for HTTP to include syncMode
.
- S3: region, endpoint, provider, syncMode
- HTTP: any key-value pair will be passed to the underlying http client as http headers, syncMode
if options.SyncMode != "" && options.SyncMode != "sync" && options.SyncMode != "copy" { | ||
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check options.SyncMode != ""
is redundant. On line 35, lo.CoalesceOrEmpty
ensures that SyncMode
is never an empty string; it defaults to "sync" if not provided. You can simplify the condition.
if options.SyncMode != "" && options.SyncMode != "sync" && options.SyncMode != "copy" { | |
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode) | |
} | |
if options.SyncMode != "sync" && options.SyncMode != "copy" { | |
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode) | |
} |
if options.SyncMode != "" && options.SyncMode != "sync" && options.SyncMode != "copy" { | ||
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check options.SyncMode != ""
is redundant. On line 38, lo.CoalesceOrEmpty
ensures that SyncMode
is never an empty string; it defaults to "sync" if not provided. You can simplify the condition.
if options.SyncMode != "" && options.SyncMode != "sync" && options.SyncMode != "copy" { | |
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode) | |
} | |
if options.SyncMode != "sync" && options.SyncMode != "copy" { | |
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode) | |
} |
d09e1a2
to
1533b38
Compare
Signed-off-by: Kebe <[email protected]> Co-authored-by: Ji Li <[email protected]>
1533b38
to
455da04
Compare
No description provided.