
feat: s3 multipart import file upload #1521


Open · wants to merge 1 commit into main from multipart-import

Conversation

@strophy (Contributor) commented Jul 17, 2025

This PR introduces support for large file (multipart) imports in AppFlowy-Cloud in order to work around the 5GB limitation in AWS S3 on single PUT operations.

It unifies the logic for small and large file uploads, updating both API and internal handling. The changes ensure that files larger than 5GB are uploaded using a multipart protocol compatible with S3 and the AppFlowy worker, while maintaining backward compatibility for small files.
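
For orientation, a minimal sketch of the size-based dispatch the PR describes is shown below. The constant name follows the summary further down; the function signature, loading the small file fully into memory, and the error placeholder for the large path are simplifications of mine, not the PR's actual code.

```rust
use anyhow::Result;
use tokio::fs;

// S3 rejects a single PUT larger than 5 GB, so anything above this limit has to
// take the multipart path.
const S3_SINGLE_PUT_LIMIT: u64 = 5 * 1024 * 1024 * 1024;

async fn upload_import_file(file_path: &str, presigned_url: &str) -> Result<()> {
  let file_size = fs::metadata(file_path).await?.len();
  if file_size <= S3_SINGLE_PUT_LIMIT {
    // Small files keep the existing behaviour: a single PUT to the presigned URL.
    // (Loaded into memory here for brevity; a real client would stream the body.)
    let bytes = fs::read(file_path).await?;
    reqwest::Client::new()
      .put(presigned_url)
      .body(bytes)
      .send()
      .await?
      .error_for_status()?;
    Ok(())
  } else {
    // Large files go through the multipart endpoints; that path is sketched
    // after the sequence diagram in the review below.
    anyhow::bail!("placeholder: multipart path is sketched after the sequence diagram")
  }
}
```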

I have successfully used this to import a 12.2GB Notion export on AWS EC2 with S3 storage and an external proxy. I have a corresponding PR modifying AppFlowy-Web here. Some further changes to proxy template defaults may be needed? I also haven't tested or modified the desktop AppFlowy client because I don't use it, but changes are probably needed there too to support this feature. There shouldn't be any breaking changes though, so the old desktop client should still work.

I'm looking forward to edits and CI results; hopefully my approach wasn't too naive, as my understanding of AppFlowy is still quite limited.

Summary by Sourcery

Implement S3-compatible multipart import file upload for files over 5GB and unify the import upload workflow between small and large files

New Features:

  • Support multipart upload for large import files in client-api with upload_large_import_file
  • Extend create_import endpoint to choose between presigned URL and multipart upload and return upload_type and workspace_id in the response

Enhancements:

  • Unify upload_import_file method to delegate to small or large upload logic based on file size

Build:

  • Add tempfile as a dev-dependency for creating temporary test files

Tests:

  • Add tests for small single-part and large multipart import file uploads
  • Update import test helper to supply workspace_id for multipart uploads
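
On the large-file test: a fixture above the 5GB threshold can be created without actually writing 5GB of data. Whether the PR's test does this is not stated; the sketch below is one way to do it with tempfile, and the 6GB size and helper name are my own choices.

```rust
use std::io::Write;
use tempfile::NamedTempFile;

// Build a fixture whose reported size is above the 5 GB threshold without
// writing 5 GB: set_len produces a sparse file on most filesystems, which is
// enough to exercise the size-based dispatch.
fn large_import_fixture() -> std::io::Result<NamedTempFile> {
  let mut file = NamedTempFile::new()?;
  file.write_all(b"notion export placeholder")?;   // a little real content at the start
  file.as_file().set_len(6 * 1024 * 1024 * 1024)?; // extend to a 6 GB logical size
  Ok(file)
}
```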

sourcery-ai bot commented Jul 17, 2025

Reviewer's Guide

Adds multipart upload support for files over 5GB by unifying small and large file upload logic in the client API and server create_import endpoint, extends the import task response DTO, and updates tests and dependencies accordingly.

Sequence diagram for unified import file upload (small vs large files)

```mermaid
sequenceDiagram
    actor User
    participant Client as Client API
    participant Server as AppFlowy Server
    participant S3 as AWS S3

    User->>Client: upload_import_file(file_path, url, workspace_id)
    Client->>Client: Check file size
    alt file_size <= 5GB
        Client->>Server: create_import (returns presigned_url)
        Client->>S3: PUT file to presigned_url
        S3-->>Client: 200 OK
    else file_size > 5GB
        Client->>Server: create_import (returns upload_type: multipart, workspace_id)
        Client->>Server: create_upload (multipart session)
        loop For each chunk
            Client->>Server: upload_part(chunk)
        end
        Client->>Server: complete_upload
    end
```
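
The large-file branch of the diagram boils down to a chunking loop over three calls. The sketch below abstracts those calls behind a trait rather than guessing the real client-api signatures, which are assumptions here; only the loop structure mirrors the flow above.

```rust
use anyhow::Result;
use tokio::fs::File;
use tokio::io::AsyncReadExt;

// The three multipart operations named in the diagram. The method shapes are
// assumptions; the real client-api methods may differ.
trait ImportUploader {
  async fn create_upload(&self, workspace_id: &str, key: &str) -> Result<String>; // -> upload_id
  async fn upload_part(&self, upload_id: &str, part_number: i32, data: Vec<u8>) -> Result<String>; // -> ETag
  async fn complete_upload(&self, upload_id: &str, parts: Vec<(i32, String)>) -> Result<()>;
}

// 100 MB parts sit comfortably above S3's 5 MiB minimum part size and keep even
// multi-terabyte files well under the 10,000-part limit.
const CHUNK_SIZE: usize = 100 * 1024 * 1024;

async fn upload_large_import_file(
  uploader: &impl ImportUploader,
  file_path: &str,
  workspace_id: &str,
  key: &str,
) -> Result<()> {
  let upload_id = uploader.create_upload(workspace_id, key).await?;
  let mut file = File::open(file_path).await?;
  let mut part_number = 1;
  let mut parts = Vec::new();
  loop {
    let mut chunk = vec![0u8; CHUNK_SIZE];
    // Note: `read` may return fewer bytes than requested before EOF; a production
    // version should fill the buffer so that only the final part is undersized.
    let bytes_read = file.read(&mut chunk).await?;
    if bytes_read == 0 {
      break; // end of file
    }
    chunk.truncate(bytes_read);
    let etag = uploader.upload_part(&upload_id, part_number, chunk).await?;
    parts.push((part_number, etag));
    part_number += 1;
  }
  uploader.complete_upload(&upload_id, parts).await?;
  Ok(())
}
```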

Class diagram for updated CreateImportTaskResponse DTO

```mermaid
classDiagram
    class CreateImportTaskResponse {
        +String task_id
        +Option~String~ presigned_url
        +String upload_type
        +Option~String~ workspace_id
    }
```
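
In Rust terms the DTO might look roughly like this; the field names come from the class diagram, while the serde attributes are assumptions of mine rather than the PR's actual annotations.

```rust
use serde::{Deserialize, Serialize};

// Small imports get a presigned_url and no workspace_id; large imports get
// upload_type = "multipart" plus the workspace_id needed for the multipart calls.
#[derive(Debug, Serialize, Deserialize)]
pub struct CreateImportTaskResponse {
  pub task_id: String,
  #[serde(default, skip_serializing_if = "Option::is_none")]
  pub presigned_url: Option<String>,
  pub upload_type: String,
  #[serde(default, skip_serializing_if = "Option::is_none")]
  pub workspace_id: Option<String>,
}
```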

File-Level Changes

Unify file upload entrypoint to route small vs large uploads (libs/client-api/src/http_file.rs)
  • Add S3_SINGLE_PUT_LIMIT constant and file size check
  • Modify upload_import_file signature to accept workspace_id
  • Delegate to upload_small_import_file or upload_large_import_file based on size

Implement multipart upload for large files (libs/client-api/src/http_file.rs)
  • Create an upload session via create_upload
  • Read the file in 100MB chunks and upload each with upload_part
  • Collect ETags and finish with complete_upload

Enhance create_import_handler to support multipart logic (src/api/data_import.rs)
  • Branch on content_length >= 5GB to choose upload_type
  • Generate the S3 key with the workspace import prefix
  • Return upload_type and workspace_id for large uploads
  • Continue returning presigned_url for small uploads

Extend the import task response DTO for multipart (libs/database-entity/src/dto.rs)
  • Make presigned_url optional
  • Add upload_type field
  • Add optional workspace_id field

Add tests for small and large file upload workflows (tests/workspace/import_test.rs)
  • Introduce test_large_file_multipart_upload using NamedTempFile
  • Introduce test_small_file_single_upload with presigned URL
  • Update the upload_file helper to supply workspace_id

Include tempfile crate for file-based tests (Cargo.toml, libs/client-api/Cargo.toml)
  • Add tempfile to dev-dependencies in the root Cargo.toml
  • Add tempfile to dev-dependencies in libs/client-api/Cargo.toml
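
The handler-side branching described above could look roughly like the following. The key layout, the "presigned_url"/"multipart" strings, and the presign_put helper are illustrative assumptions, not the handler's actual code; the struct repeats the DTO fields sketched earlier with the serde derives omitted.

```rust
use uuid::Uuid;

// Same fields as the CreateImportTaskResponse sketch above (serde attributes omitted).
struct CreateImportTaskResponse {
  task_id: String,
  presigned_url: Option<String>,
  upload_type: String,
  workspace_id: Option<String>,
}

const S3_SINGLE_PUT_LIMIT: u64 = 5 * 1024 * 1024 * 1024; // 5 GB

// Sketch of the create_import_handler branching: small imports keep the
// presigned-URL flow, large imports are told to use the multipart endpoints.
fn build_import_response(
  content_length: u64,
  workspace_id: Uuid,
  presign_put: impl Fn(&str) -> String, // stands in for generating an S3 presigned PUT URL
) -> CreateImportTaskResponse {
  let task_id = Uuid::new_v4();
  // Illustrative key layout carrying the workspace import prefix; the multipart
  // path ultimately targets the same key via the upload session.
  let s3_key = format!("import/{workspace_id}/{task_id}.zip");
  if content_length >= S3_SINGLE_PUT_LIMIT {
    CreateImportTaskResponse {
      task_id: task_id.to_string(),
      presigned_url: None,
      upload_type: "multipart".to_string(),
      workspace_id: Some(workspace_id.to_string()),
    }
  } else {
    CreateImportTaskResponse {
      task_id: task_id.to_string(),
      presigned_url: Some(presign_put(&s3_key)),
      upload_type: "presigned_url".to_string(),
      workspace_id: None,
    }
  }
}
```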

Possibly linked issues

  • License issue #1: The PR adds multipart S3 upload for large files, directly addressing the S3 error in Notion import.


@sourcery-ai bot left a comment
Hey @strophy - I've reviewed your changes - here's some feedback:

  • Rather than generating a random legacy_task_id in upload_import_file for large uploads, propagate the actual task_id (and workspace_id) returned from create_import so that the client uses the same s3_key path as the server expects.
  • The multipart upload tests print errors but don’t assert on behavior—add explicit assertions for expected outcomes (e.g., matching upload_type or error variants) so CI can reliably catch regressions.
  • Consider replacing the string-based upload_type field in CreateImportTaskResponse with a typed enum for better compile-time safety and to avoid magic string inconsistencies.
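
Picking up the last point, a typed upload_type could look like the sketch below; the enum name, variant names, and serde renames are assumptions, not part of the PR.

```rust
use serde::{Deserialize, Serialize};

// Replaces the free-form String in CreateImportTaskResponse so the client can
// match on a variant instead of comparing magic strings.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ImportUploadType {
  PresignedUrl, // serializes as "presigned_url"
  Multipart,    // serializes as "multipart"
}
```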
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Rather than generating a random legacy_task_id in upload_import_file for large uploads, propagate the actual task_id (and workspace_id) returned from create_import so that the client uses the same s3_key path as the server expects.
- The multipart upload tests print errors but don’t assert on behavior—add explicit assertions for expected outcomes (e.g., matching upload_type or error variants) so CI can reliably catch regressions.
- Consider replacing the string-based upload_type field in CreateImportTaskResponse with a typed enum for better compile-time safety and to avoid magic string inconsistencies.

## Individual Comments

### Comment 1
<location> `libs/client-api/src/http_file.rs:280` </location>
<code_context>
+    trace!("created multipart upload session: {}", upload_response.upload_id);
+
+    // Step 2: Upload file in chunks
+    const CHUNK_SIZE: usize = 100 * 1024 * 1024; // 100MB chunks
+    let mut file = File::open(file_path).await?;
+    let mut part_number = 1;
+    let mut parts = Vec::new();
</code_context>

<issue_to_address>
Chunk size for multipart upload is hardcoded and may not be optimal for all environments.

Consider making the chunk size configurable or documenting why 100MB was chosen, as different S3 providers have varying part size and number limits.

Suggested implementation:

```rust
    // Step 2: Upload file in chunks
    // Default chunk size is 100MB, which is a common value for S3 multipart uploads.
    // S3 requires parts to be at least 5MB (except the last), and has a maximum of 10,000 parts.
    // Make this configurable to support different environments and S3 providers.
    let chunk_size = chunk_size.unwrap_or(100 * 1024 * 1024); // 100MB default
    let mut file = File::open(file_path).await?;
    let mut part_number = 1;
    let mut parts = Vec::new();

```

```rust
    loop {
      let mut chunk = vec![0u8; chunk_size];
      let bytes_read = file.read(&mut chunk).await?;

```

- You will need to add a `chunk_size: Option<usize>` parameter to the containing function's signature.
- When calling this function, pass `None` to use the default, or `Some(desired_size)` to override.
- If this function is part of a struct, consider making `chunk_size` a field of the struct instead.
</issue_to_address>
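
Beyond making the size configurable, the chosen value could be validated against S3's documented limits (5 MiB minimum part size except the last part, 10,000 parts maximum). A sketch of such a check; the helper is hypothetical and not part of this PR:

```rust
const MIN_PART_SIZE: u64 = 5 * 1024 * 1024; // S3 minimum part size (except the last part)
const MAX_PARTS: u64 = 10_000;              // S3 maximum number of parts per upload

// Pick a part size that satisfies both limits for the given file size,
// starting from the requested (or default 100 MB) value.
fn effective_chunk_size(requested: Option<u64>, file_size: u64) -> u64 {
  let requested = requested.unwrap_or(100 * 1024 * 1024);
  // Smallest part size that keeps the part count at or below 10,000.
  let required = file_size.div_ceil(MAX_PARTS);
  requested.max(required).max(MIN_PART_SIZE)
}
```

At the default 100MB, the 12.2GB export mentioned in the PR description comes to roughly 125 parts, far below the 10,000-part ceiling.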


@strophy force-pushed the multipart-import branch from 92e8a01 to 017e84e on July 17, 2025 11:11
@khorshuheng (Collaborator) commented Jul 18, 2025

If I understand correctly, with this approach the server will need sufficient disk space / memory to handle the file upload, since the file is uploaded to S3 indirectly via the server instead of going directly to S3 with a presigned URL.

This is fine (and a good way to get around the large file limitation imposed by S3) for self-hosted use cases, as the server will typically have sufficient disk / memory for a single person.

But when there is a large number of users, the server will require quite a lot of resources, and it may crash if multiple users try to upload files at the same time.

Hence, we will likely need to handle this on the client's end, i.e. the client sending files directly to S3.
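
One possible shape for that client-direct approach (not part of this PR): the server creates the multipart upload and hands the client presigned UploadPart URLs, the client PUTs each part straight to S3, and only the (part number, ETag) list goes back to the server to complete the upload. A sketch of the client half, assuming a hypothetical server endpoint that returns one presigned URL per part:

```rust
use anyhow::Result;
use reqwest::Client;

// Hypothetical response from the server: one presigned PUT URL per part of a
// multipart upload it created on the client's behalf.
struct PresignedPart {
  part_number: i32,
  url: String,
}

// Upload the already-chunked parts directly to S3 and return (part_number, ETag)
// pairs for the server to pass to CompleteMultipartUpload.
async fn put_parts_directly(
  client: &Client,
  parts: Vec<(PresignedPart, Vec<u8>)>,
) -> Result<Vec<(i32, String)>> {
  let mut completed = Vec::new();
  for (part, data) in parts {
    let resp = client.put(&part.url).body(data).send().await?.error_for_status()?;
    // S3 returns the part's ETag in a response header; it is required to complete the upload.
    let etag = resp
      .headers()
      .get("ETag")
      .and_then(|v| v.to_str().ok())
      .unwrap_or_default()
      .to_string();
    completed.push((part.part_number, etag));
  }
  Ok(completed)
}
```

The server would still own CreateMultipartUpload and CompleteMultipartUpload, so the 5GB single-PUT limit is avoided without the file data ever passing through AppFlowy-Cloud.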
