feat: s3 multipart import file upload #1521
base: main
Conversation
Reviewer's Guide

Adds multipart upload support for files over 5GB by unifying small and large file upload logic in the client API and server create_import endpoint, extends the import task response DTO, and updates tests and dependencies accordingly.

Sequence diagram for unified import file upload (small vs large files):

sequenceDiagram
actor User
participant Client as Client API
participant Server as AppFlowy Server
participant S3 as AWS S3
User->>Client: upload_import_file(file_path, url, workspace_id)
Client->>Client: Check file size
alt file_size <= 5GB
Client->>Server: create_import (returns presigned_url)
Client->>S3: PUT file to presigned_url
S3-->>Client: 200 OK
else file_size > 5GB
Client->>Server: create_import (returns upload_type: multipart, workspace_id)
Client->>Server: create_upload (multipart session)
loop For each chunk
Client->>Server: upload_part(chunk)
end
Client->>Server: complete_upload
end
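
To make the branching above concrete, here is a minimal Rust sketch of the size check, assuming stand-in names for the client API calls; it illustrates the flow in the diagram rather than the exact code in this PR.

```rust
// Minimal sketch of the size-based branching shown in the diagram.
// `put_to_presigned_url` and `multipart_upload` are hypothetical stand-ins
// for the client API calls; the 5GB threshold is the S3 single-PUT limit.
use std::path::Path;

const SINGLE_PUT_LIMIT: u64 = 5 * 1024 * 1024 * 1024; // 5GB

pub fn upload_import_file(file_path: &Path) -> std::io::Result<()> {
    let file_size = std::fs::metadata(file_path)?.len();
    if file_size <= SINGLE_PUT_LIMIT {
        // Small files: create_import returns a presigned URL and the client
        // PUTs the whole file to it in one request.
        put_to_presigned_url(file_path)
    } else {
        // Large files: create_import signals multipart; the client opens a
        // session (create_upload), uploads parts, then calls complete_upload.
        multipart_upload(file_path)
    }
}

fn put_to_presigned_url(_file_path: &Path) -> std::io::Result<()> {
    Ok(()) // placeholder for the single-PUT branch
}

fn multipart_upload(_file_path: &Path) -> std::io::Result<()> {
    Ok(()) // placeholder for create_upload -> upload_part* -> complete_upload
}
```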
Class diagram for updated CreateImportTaskResponse DTO:

classDiagram
class CreateImportTaskResponse {
+String task_id
+Option<String> presigned_url
+String upload_type
+Option<String> workspace_id
}
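
A hedged sketch of what that DTO could look like in Rust with serde; the field names follow the diagram, while the derives and comments are assumptions rather than the exact definition in this PR.

```rust
// Sketch of the response DTO described by the class diagram above.
// The serde derives and doc comments are illustrative assumptions.
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CreateImportTaskResponse {
    pub task_id: String,
    /// Set for small files that are uploaded with a single presigned PUT.
    pub presigned_url: Option<String>,
    /// Currently a plain string (the diagram shows "multipart" for large
    /// files); the review below suggests replacing it with a typed enum.
    pub upload_type: String,
    /// Set when the client needs to drive a multipart upload itself.
    pub workspace_id: Option<String>,
}
```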
Hey @strophy - I've reviewed your changes - here's some feedback:
- Rather than generating a random legacy_task_id in upload_import_file for large uploads, propagate the actual task_id (and workspace_id) returned from create_import so that the client uses the same s3_key path as the server expects.
- The multipart upload tests print errors but don’t assert on behavior—add explicit assertions for expected outcomes (e.g., matching upload_type or error variants) so CI can reliably catch regressions.
- Consider replacing the string-based upload_type field in CreateImportTaskResponse with a typed enum for better compile-time safety and to avoid magic string inconsistencies.
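
For the last point, a typed field could look something like the sketch below; the variant names and serde wire format are assumptions, not what the PR currently ships.

```rust
// Hypothetical replacement for the string-based `upload_type` field.
// Variant names and the snake_case rename are illustrative assumptions.
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ImportUploadType {
    /// Single presigned PUT for files under the S3 single-request limit.
    PresignedUrl,
    /// S3-compatible multipart upload for larger files.
    Multipart,
}
```

Matching on such an enum lets the compiler flag unhandled upload types instead of relying on string comparisons.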
## Individual Comments
### Comment 1
<location> `libs/client-api/src/http_file.rs:280` </location>
<code_context>
+ trace!("created multipart upload session: {}", upload_response.upload_id);
+
+ // Step 2: Upload file in chunks
+ const CHUNK_SIZE: usize = 100 * 1024 * 1024; // 100MB chunks
+ let mut file = File::open(file_path).await?;
+ let mut part_number = 1;
+ let mut parts = Vec::new();
</code_context>
<issue_to_address>
Chunk size for multipart upload is hardcoded and may not be optimal for all environments.
Consider making the chunk size configurable or documenting why 100MB was chosen, as different S3 providers have varying part size and number limits.
Suggested implementation:
```rust
// Step 2: Upload file in chunks
// Default chunk size is 100MB, which is a common value for S3 multipart uploads.
// S3 requires parts to be at least 5MB (except the last), and has a maximum of 10,000 parts.
// Make this configurable to support different environments and S3 providers.
let chunk_size = chunk_size.unwrap_or(100 * 1024 * 1024); // 100MB default
let mut file = File::open(file_path).await?;
let mut part_number = 1;
let mut parts = Vec::new();
```
```rust
loop {
let mut chunk = vec![0u8; chunk_size];
let bytes_read = file.read(&mut chunk).await?;
```
- You will need to add a `chunk_size: Option<usize>` parameter to the containing function's signature.
- When calling this function, pass `None` to use the default, or `Some(desired_size)` to override.
- If this function is part of a struct, consider making `chunk_size` a field of the struct instead.
</issue_to_address>
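
The part-size and part-count limits mentioned above could be enforced with a small helper along these lines; this is a hypothetical addition for illustration, not code from the PR or the review.

```rust
// Illustrative helper that clamps a requested chunk size to S3's documented
// multipart limits: parts must be at least 5MB (except the last) and an
// upload may have at most 10,000 parts. Not part of this PR.
const MIN_PART_SIZE: u64 = 5 * 1024 * 1024;
const MAX_PARTS: u64 = 10_000;
const DEFAULT_CHUNK_SIZE: u64 = 100 * 1024 * 1024;

fn effective_chunk_size(file_size: u64, requested: Option<u64>) -> u64 {
    let mut size = requested.unwrap_or(DEFAULT_CHUNK_SIZE).max(MIN_PART_SIZE);
    // If the file would need more than 10,000 parts at this size, grow the
    // part size until it fits within the limit.
    let min_for_part_count = file_size.div_ceil(MAX_PARTS);
    if size < min_for_part_count {
        size = min_for_part_count;
    }
    size
}
```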
If I understand correctly, with this approach the server needs sufficient disk space / memory to handle the file upload, since the file is uploaded to S3 indirectly via the server instead of directly through a presigned URL. This is fine (and a good way to get around the large-file limitation imposed by S3) for self-hosted use cases, as the server will typically have enough disk / memory for a single person. But when there is a large number of users, the server will require quite a bit of resource and may crash if multiple users try to upload files at the same time. Hence, we will likely need to handle this on the client's end, i.e. the client sending files directly to S3.
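
One way to keep the server out of the data path, as the comment above suggests, is for it to presign one UploadPart URL per part and have the client PUT each chunk straight to S3, collecting the ETags for the final complete call. The sketch below assumes a presigned_part_urls list produced by some new server endpoint; the names and protocol are assumptions for illustration, not an existing AppFlowy-Cloud API.

```rust
// Hypothetical client-side loop for uploading parts directly to S3 via
// presigned UploadPart URLs. The presigned_part_urls input and the protocol
// around it are assumed here, not taken from this PR.
use std::path::Path;
use tokio::fs::File;
use tokio::io::AsyncReadExt;

pub async fn upload_parts_direct(
    file_path: &Path,
    presigned_part_urls: &[String], // one presigned URL per part, in order
    chunk_size: usize,
) -> Result<Vec<(i32, String)>, Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let mut file = File::open(file_path).await?;
    let mut etags = Vec::with_capacity(presigned_part_urls.len());

    for (idx, url) in presigned_part_urls.iter().enumerate() {
        // A production version should loop until the buffer is full or EOF;
        // a single read may return fewer bytes than requested.
        let mut chunk = vec![0u8; chunk_size];
        let n = file.read(&mut chunk).await?;
        chunk.truncate(n);

        let resp = client
            .put(url.as_str())
            .body(chunk)
            .send()
            .await?
            .error_for_status()?;

        // S3 returns the part's ETag, which is needed to complete the upload.
        let etag = resp
            .headers()
            .get("ETag")
            .and_then(|v| v.to_str().ok())
            .unwrap_or_default()
            .to_string();
        etags.push((idx as i32 + 1, etag)); // S3 part numbers are 1-based
    }
    Ok(etags)
}
```

Under this scheme the server would still own the CreateMultipartUpload and CompleteMultipartUpload calls, but it would never buffer file data itself.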
This PR introduces support for large file (multipart) imports in AppFlowy-Cloud in order to work around the 5GB limitation in AWS S3 on single PUT operations.
It unifies the logic for small and large file uploads, updating both API and internal handling. The changes ensure that files larger than 5GB are uploaded using a multipart protocol compatible with S3 and the AppFlowy worker, while maintaining backward compatibility for small files.
I have successfully used this to import a 12.2GB Notion export in AWS EC2 with S3 storage and an external proxy. I have a corresponding PR modifying AppFlowy-Web here. Some further changes to proxy template defaults may be needed? I also haven't tested or modified the desktop AppFlowy client because I don't use it, but I guess changes are needed here too to support this function. There shouldn't be any breaking changes though, so the old desktop client should still work.
I'm looking forward to edits and CI results; hopefully my approach wasn't too naive here, as my understanding of AppFlowy is still very limited.
Summary by Sourcery
Implement S3-compatible multipart import file upload for files over 5GB and unify the import upload workflow between small and large files
New Features:
Enhancements:
Build:
Tests: