
Add bandwidth limit feature for HTTP and S3 datasources #36


Open · wants to merge 5 commits into main

Conversation


@Copilot Copilot AI commented Jul 9, 2025

This PR adds a bandwidth-limiting capability to prevent data downloads from consuming all available bandwidth and affecting other services in production environments.

Changes

Core Implementation

  • HTTP Datasource: Added a bandwidthLimit option that adds the --bwlimit flag to rclone commands
  • S3 Datasource: Added a bandwidthLimit option that adds the --bwlimit flag to rclone commands
  • Documentation: Updated API documentation to include the new option for both HTTP and S3 sources

Usage

Users can now specify a bandwidth limit in their Dataset configurations:

apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: my-dataset
spec:
  source:
    type: HTTP  # or S3
    uri: http://example.com/data/
    options:
      bandwidthLimit: "10M"  # Limit to 10 MiB/s

Bandwidth Limit Format

The bandwidth limit follows rclone's format:

  • Plain numbers are interpreted as KiB/s
  • Suffixes can be used: B, K, M, G, T, P
  • Examples: "1M" (1 MiB/s), "500K" (500 KiB/s), "2G" (2 GiB/s)

Testing

  • Added comprehensive tests for both HTTP and S3 datasources with bandwidth limits
  • Verified that the --bwlimit flag is correctly added to rclone commands
  • Created sample configuration files demonstrating usage

Implementation Details

The feature leverages rclone's built-in --bwlimit flag for bandwidth throttling:

  1. Parse the bandwidthLimit option from Dataset configuration
  2. Store it in the respective loader options struct
  3. Add --bwlimit <value> to rclone command arguments when present
  4. rclone handles the actual bandwidth limiting during download

This approach is minimal, surgical, and leverages existing rclone capabilities for reliability.
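
As a minimal sketch of steps 2 and 3 in Go (HTTPOptions and rcloneCopyCmd are hypothetical names for illustration, not this repo's actual loader structs):

package main

import "os/exec"

// HTTPOptions is a hypothetical stand-in for the loader options struct.
type HTTPOptions struct {
    URI            string
    BandwidthLimit string // rclone format, e.g. "10M" for 10 MiB/s
}

// rcloneCopyCmd builds the rclone invocation, appending --bwlimit
// only when a limit was configured.
func rcloneCopyCmd(opts HTTPOptions, dst string) *exec.Cmd {
    args := []string{"copy", opts.URI, dst}
    if opts.BandwidthLimit != "" {
        args = append(args, "--bwlimit", opts.BandwidthLimit)
    }
    return exec.Command("rclone", args...)
}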

Fixes #3.



@Copilot Copilot AI changed the title from "[WIP] Add a download limit feature" to "Add bandwidth limit feature for HTTP and S3 datasources" Jul 9, 2025
@Copilot Copilot AI requested a review from kebe7jun July 9, 2025 09:32
Copilot finished work on behalf of kebe7jun July 9, 2025 09:32
@kebe7jun
Contributor

kebe7jun commented Jul 9, 2025

Do you have any solutions for other types of Datasets?

// - PVC:
// - NFS:
// - CONDA: requirements.txt, environment.yaml
// - REFERENCE:
// - HUGGING_FACE: repo, repoType, endpoint, include, exclude, revision
// - MODEL_SCOPE: repo, repoType, include, exclude, revision
// bandwidthLimit: Bandwidth limit for downloads in KiB/s, or use suffix B|K|M|G|T|P (e.g. "1M" for 1 MiB/s, "10M" for 10 MiB/s)
Contributor

Do you have any solutions for other types of Datasets?

Author

Same as above - bandwidth limiting can be extended to other dataset types beyond HTTP and S3, though implementation complexity varies. The current approach leverages rclone's built-in --bwlimit flag, which is why it's straightforward for HTTP and S3. Other types would require external bandwidth limiting tools like trickle or traffic control.

region: "us-east-1"
provider: "AWS"
bandwidthLimit: "5M" # Limit download bandwidth to 5 MiB/s
secretRef: s3-credentials
Member

Suggested change:
secretRef: s3-credentials


Copilot AI commented Jul 9, 2025

Do you have any solutions for other types of Datasets?

Yes, bandwidth limiting can be extended to other dataset types, though the implementation complexity varies:

Currently supported: HTTP and S3 (leveraging rclone's built-in --bwlimit)

Possible future extensions:

  • Git: Could use external tools like trickle or tc (traffic control) to wrap git commands, though Git doesn't have native bandwidth limiting
  • HuggingFace/ModelScope: Could similarly use external bandwidth limiting tools to wrap the CLI downloads
  • Conda: Package manager downloads could be wrapped with external tools

Not applicable: PVC, NFS, REFERENCE (no network downloads involved)

The HTTP/S3 implementation is straightforward because rclone has built-in bandwidth limiting. For other types, we'd need external tools which adds complexity and dependencies. Would you like me to implement any specific dataset type?
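
For illustration, a minimal Go sketch of that wrapping approach (the helper name is hypothetical; trickle's -s flag runs it standalone and -d sets the download cap in KB/s):

package main

import (
    "os/exec"
    "strconv"
)

// wrapWithTrickle prefixes a download command with trickle when a
// positive limit (in KB/s) is given, and runs it unwrapped otherwise.
func wrapWithTrickle(limitKBs int, name string, args ...string) *exec.Cmd {
    if limitKBs <= 0 {
        return exec.Command(name, args...)
    }
    trickleArgs := append([]string{"-s", "-d", strconv.Itoa(limitKBs), name}, args...)
    return exec.Command("trickle", trickleArgs...)
}

For example, wrapWithTrickle(5120, "git", "clone", repoURL) would cap a clone at roughly 5 MiB/s.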

@Copilot Copilot AI requested a review from nekomeowww July 9, 2025 09:58
Copilot finished work on behalf of nekomeowww July 9, 2025 09:58
@kebe7jun
Contributor

kebe7jun commented Jul 9, 2025

Do you have any solutions for other types of Datasets?

Yes, bandwidth limiting can be extended to other dataset types, though the implementation complexity varies:
...

Can you help me add bandwidth limiting using trickle in this PR?

…ope datasets

- Add bandwidth limiting support using `trickle` for Git, HuggingFace, and ModelScope dataset types
- Create bandwidth conversion utility to convert rclone format to trickle format
- Wrap network commands with trickle when bandwidthLimit option is specified
- Update API documentation to include bandwidthLimit support for new dataset types
- Add comprehensive tests for bandwidth limiting functionality
- Create sample configuration files showing bandwidth limiting usage

The implementation leverages trickle's ability to wrap any command with bandwidth throttling,
providing a consistent approach across different dataset types that don't have native
bandwidth limiting capabilities like HTTP/S3 with rclone.

Co-authored-by: kebe7jun <[email protected]>
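
A sketch of what the conversion utility mentioned in the commit above might look like, assuming the rclone conventions described earlier (plain numbers are KiB/s; B|K|M|G|T|P suffixes) and trickle's plain KB/s integer; the function name is hypothetical:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// toTrickleKBs converts an rclone-style limit ("10M", "500K", or a
// plain KiB/s number) to the integer KB/s value trickle expects.
func toTrickleKBs(limit string) (int, error) {
    limit = strings.TrimSpace(limit)
    if limit == "" {
        return 0, fmt.Errorf("empty bandwidth limit")
    }
    // multipliers relative to KiB/s, rclone's default unit
    mult := map[byte]float64{
        'B': 1.0 / 1024, 'K': 1, 'M': 1024,
        'G': 1024 * 1024, 'T': 1 << 30, 'P': 1 << 40,
    }
    if m, ok := mult[limit[len(limit)-1]]; ok {
        n, err := strconv.ParseFloat(limit[:len(limit)-1], 64)
        if err != nil {
            return 0, err
        }
        return int(n * m), nil
    }
    // plain numbers are already KiB/s in rclone's format
    return strconv.Atoi(limit)
}

So toTrickleKBs("10M") returns 10240.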

Copilot AI commented Jul 9, 2025

Do you have any solutions for other types of Datasets?

Yes, bandwidth limiting can be extended to other dataset types, though the implementation complexity varies:
...

I've implemented bandwidth limiting using trickle for Git, HuggingFace, and ModelScope dataset types!

The implementation:

  • Git: Wraps git clone and git pull commands with trickle
  • HuggingFace: Wraps huggingface-cli download commands with trickle
  • ModelScope: Wraps modelscope download commands with trickle

The bandwidth limit format is consistent with the existing HTTP/S3 implementation (KiB/s with B|K|M|G|T|P suffixes) and automatically converts to trickle's format.

Usage example:

spec:
  source:
    type: GIT
    uri: https://github.com/user/repo.git
    options:
      bandwidthLimit: "5M"  # Limit to 5 MiB/s

The feature is fully tested and includes sample configuration files. Commit: ebdb19d

@Copilot Copilot AI requested a review from kebe7jun July 9, 2025 10:16
Copilot finished work on behalf of kebe7jun July 9, 2025 10:16
@kebe7jun
Contributor

Is it possible to integrate trickle directly into the entrypoint of the data-loader Docker image? That way, there would be no need to wrap each command when executing it; the speed limit value could be passed through an environment variable.


Copilot AI commented Jul 10, 2025

Is it possible to integrate trickle directly into the entrypoint of the data-loader Docker image? In this way, there is no need to add wrap when executing the command. The speed limit value can be pas...

Yes! I've successfully integrated trickle directly into the data-loader Docker entrypoint. The implementation:

  • Dockerfile: Now installs trickle and uses a new entrypoint script
  • Entrypoint script: Checks for the BANDWIDTH_LIMIT environment variable and wraps the data-loader with trickle when it is set
  • Controller: Passes the bandwidthLimit option as an environment variable instead of a command argument
  • Datasources: Removed the per-command wrapping logic for cleaner code

The bandwidth limit value is automatically converted from rclone format (KiB/s with B|K|M|G|T|P suffixes) to trickle's KB/s format. Users still specify bandwidth limits the same way in their Dataset configurations - the change is transparent.
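
A sketch of the controller side under this design, using the standard Kubernetes corev1 types; the helper name and options map are hypothetical, while BANDWIDTH_LIMIT is the variable named above:

package main

import corev1 "k8s.io/api/core/v1"

// buildEnv forwards a configured bandwidthLimit option to the
// data-loader container as BANDWIDTH_LIMIT; the image entrypoint then
// decides whether to exec the loader through trickle.
func buildEnv(options map[string]string) []corev1.EnvVar {
    var env []corev1.EnvVar
    if limit, ok := options["bandwidthLimit"]; ok && limit != "" {
        env = append(env, corev1.EnvVar{Name: "BANDWIDTH_LIMIT", Value: limit})
    }
    return env
}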

Benefits:

  • Simpler, cleaner implementation
  • More reliable (trickle controls all container network traffic)
  • Consistent behavior across all datasource types
  • No dependencies on external bandwidth limiting tools in datasource code

Commit: 333b1ef

Copilot finished work on behalf of kebe7jun July 10, 2025 01:52
@kebe7jun kebe7jun marked this pull request as ready for review July 10, 2025 06:25