Skip to content

Feature: add copy mode support for HTTP and S3 #37

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 14, 2025

Conversation

kebe7jun
Copy link
Contributor

No description provided.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @kebe7jun, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new syncMode option for HTTP and S3 dataset sources, allowing users to specify whether data synchronization should perform a full sync (default, removing absent files) or a copy (only adding/updating files without deletion). This enhancement provides greater flexibility in managing how external data sources are ingested and maintained within the system.

Highlights

  • New syncMode option for HTTP and S3: Introduced a syncMode option for HTTP and S3 dataset sources, allowing users to choose between 'sync' (default, removes files in destination that don't exist in source) and 'copy' (only adds/updates files without removing existing ones). This provides greater control over data synchronization behavior.
  • API and CRD updates: The DatasetSource API definition (api/dataset/v1alpha1/dataset_types.go) and its corresponding Custom Resource Definition (CRD) schema (config/crd/bases/dataset.baizeai.io_datasets.yaml) have been updated to include and document the new syncMode option for S3 and HTTP sources.
  • Dynamic rclone command execution: The underlying rclone commands executed by the HTTP and S3 data loaders (internal/pkg/datasources/datasource_http.go, internal/pkg/datasources/datasource_s3.go) now dynamically use the specified syncMode ('sync' or 'copy') instead of a hardcoded 'sync' operation.
  • Tool version upgrade: The CONTROLLER_TOOLS_VERSION in the Makefile has been upgraded from v0.17.1 to v0.18.0, which necessitated regenerating the CRDs to reflect the new tool version.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a copy mode for HTTP and S3 data sources, in addition to the existing sync mode. The changes look good overall, but I've found a critical bug in the error handling for HTTP data sources where an invalid URI might not be caught. I've also pointed out some opportunities to improve code clarity and update the API documentation to reflect the new syncMode option for HTTP sources.

Comment on lines 31 to 36
_, err := url.Parse(options.URI)
if err != nil {
return nil, fmt.Errorf("failed to parse uri %s: %w", options.URI, err)
}

h.httpOptions.basicAuthUsername = secrets.Username

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The error from url.Parse is not being checked. The err variable is declared on line 31, but its value is overwritten on line 37 before being checked. This means an invalid URI will not be caught when creating the loader, leading to a failure much later during the Sync operation.

It's best practice to validate inputs as early as possible ("fail-fast"). The URI validation should be performed here in NewHTTPLoader. Please restore the validation.

Suggested change
_, err := url.Parse(options.URI)
if err != nil {
return nil, fmt.Errorf("failed to parse uri %s: %w", options.URI, err)
}
h.httpOptions.basicAuthUsername = secrets.Username
_, err := url.Parse(options.URI)
if err != nil {
return nil, fmt.Errorf("failed to parse uri %s: %w", options.URI, err)
}

Comment on lines +74 to 75
// - S3: region, endpoint, provider, syncMode
// - HTTP: any key-value pair will be passed to the underlying http client as http headers

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The documentation for HTTP options is now incomplete. This pull request adds syncMode as a valid option for HTTP data sources, but the comment here was not updated to reflect that. Please add syncMode to the list of supported keys for HTTP.

Suggested change
// - S3: region, endpoint, provider, syncMode
// - HTTP: any key-value pair will be passed to the underlying http client as http headers
// - S3: region, endpoint, provider, syncMode
// - HTTP: any key-value pair will be passed to the underlying http client as http headers, syncMode

Comment on lines +160 to 161
- S3: region, endpoint, provider, syncMode
- HTTP: any key-value pair will be passed to the underlying http client as http headers

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The documentation for HTTP options in the CRD is now incomplete. This pull request adds syncMode as a valid option for HTTP data sources, but the description here was not updated to reflect that. Please update the list of supported keys for HTTP to include syncMode.

                      - S3: region, endpoint, provider, syncMode
                      - HTTP: any key-value pair will be passed to the underlying http client as http headers, syncMode

Comment on lines +55 to +60
if options.SyncMode != "" && options.SyncMode != "sync" && options.SyncMode != "copy" {
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check options.SyncMode != "" is redundant. On line 35, lo.CoalesceOrEmpty ensures that SyncMode is never an empty string; it defaults to "sync" if not provided. You can simplify the condition.

Suggested change
if options.SyncMode != "" && options.SyncMode != "sync" && options.SyncMode != "copy" {
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode)
}
if options.SyncMode != "sync" && options.SyncMode != "copy" {
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode)
}

Comment on lines +78 to +80
if options.SyncMode != "" && options.SyncMode != "sync" && options.SyncMode != "copy" {
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check options.SyncMode != "" is redundant. On line 38, lo.CoalesceOrEmpty ensures that SyncMode is never an empty string; it defaults to "sync" if not provided. You can simplify the condition.

Suggested change
if options.SyncMode != "" && options.SyncMode != "sync" && options.SyncMode != "copy" {
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode)
}
if options.SyncMode != "sync" && options.SyncMode != "copy" {
return fmt.Errorf("invalid syncMode '%s', must be 'sync' or 'copy'", options.SyncMode)
}

@kebe7jun kebe7jun force-pushed the fix/support-rclone-copy-mode branch from d09e1a2 to 1533b38 Compare July 11, 2025 02:48
@kebe7jun kebe7jun force-pushed the fix/support-rclone-copy-mode branch from 1533b38 to 455da04 Compare July 11, 2025 08:27
@kebe7jun kebe7jun merged commit 158f93f into main Jul 14, 2025
3 checks passed
@kebe7jun kebe7jun deleted the fix/support-rclone-copy-mode branch July 14, 2025 01:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant