Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TUS temporary file optimization (mv instead of read on same filesystem) #1690

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

JeffersonBledsoe
Copy link
Member

@JeffersonBledsoe JeffersonBledsoe commented Aug 24, 2023

  • this is an optimisation that bypasses reading and writing the TUS upload if it's on the same file system as the tmp dir used for new blobs. This speeds up the last part of a TUS upload considerably.
  • it also bypasses DX validation of the blob as that seems to read this into memory for some reason.
    • instead we should work out how to still validate but not read the data unless we really have to

Plan

@netlify
Copy link

netlify bot commented Aug 24, 2023

Deploy Preview for plone-restapi ready!

Name Link
🔨 Latest commit 0792e47
🔍 Latest deploy log https://app.netlify.com/sites/plone-restapi/deploys/668a43840674b2000804626d
😎 Deploy Preview https://deploy-preview-1690--plone-restapi.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mister-roboto
Copy link

@JeffersonBledsoe thanks for creating this Pull Request and helping to improve Plone!

TL;DR: Finish pushing changes, pass all other checks, then paste a comment:

@jenkins-plone-org please run jobs

To ensure that these changes do not break other parts of Plone, the Plone test suite matrix needs to pass, but it takes 30-60 min. Other CI checks are usually much faster and the Plone Jenkins resources are limited, so when done pushing changes and all other checks pass either start all Jenkins PR jobs yourself, or simply add the comment above in this PR to start all the jobs automatically.

Happy hacking!

@davisagli
Copy link
Member

@JeffersonBledsoe NamedBlobFile uses IStorage adapters to process the value that is passed in (https://github.com/plone/plone.namedfile/blob/master/plone/namedfile/storages.py). It might make sense to implement this as an IStorage adapter for TUSUpload.

it also bypasses DX validation of the blob as that seems to read this into memory for some reason. instead we should work out how to still validate but not read the data unless we really have to

Yeah, I don't think we can bypass validation entirely. If the entire file is read during validation that's surprising, have you identified where that is happening?

@djay
Copy link
Member

djay commented Nov 23, 2023

@davisagli I've put in the storage adapter and that works. I've also put in a test to show it works.

But the validation does actually read the whole file into memory (also shown in the test).

The cause is

  • part of validating a field is to validate all the attributes of the field against it's schema
  • so data property in the namedblobFile instance is being validated against INamedBlobFile and schema.Bytes field default validation.
  • but data is a dynamic property which will read the entire file into memory
  • and it does this for no reason since all it actually checks is that it has a min length of 0
    • it maybe checks it;s of type bytes but doesn't a file have to be anyway?
    • but it actually doesn't matter what the validation. it will read the data in regardless - get_schema_validation_errors - as long as Bytes field is validatable the data is read even if no validation is performed

Note this is a problem for NamedImageBlob too. and this is likely a problem for any editing of a File object or other reasons files might be validated

Possible solutions?

  • we skip default validation in NamedBlobFile field and instead do our own sanity check that the NamedBlobFile value is what it says it is?
  • maybe use something other than a bytes field in the INamedBlobFile schema taht doesn't do validation against file contents itself?
    • I removed the bytes schema definition from IFile and the test passed so the file wasn't read into memory.

Not really sure the right way to solve this yet. ideas?

@djay
Copy link
Member

djay commented Nov 23, 2023

I also suspect that some of the other storages like FileUploadStorage might also be be already buffered into a local file at least some of the time. In that case, the file can be moved rather than read again improving performance. I haven't looked into this yet however.

@djay
Copy link
Member

djay commented Nov 29, 2023

@davisagli I had a go at changing the validation to avoid reading in the whole file in plone/plone.namedfile#155

@djay djay requested a review from davisagli December 13, 2023 09:41
@djay djay changed the title WIP: TUS temporary file optimization TUS temporary file optimization Dec 13, 2023

def store(self, data, blob):
if not isinstance(data, TUSUpload):
raise NotStorable('Could not store data (not of "FileUpload").')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise NotStorable('Could not store data (not of "FileUpload").')
raise NotStorable('Could not store data (not of type "TUSUpload").')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@djay Now that I'm looking at things more carefully, I'm wondering if this special storage adapter is really needed.

Previously, an open file was passed into the NamedBlobFile constructor. This should end up using the FileDescriptorStorable which uses blob.consumeFile, which is already using rename_or_copy_blob.

self.assertEqual(blob_read, 0, "Validation is reading the whole blob in memory")
# TODO: would be better test to patch file read instead and ensure its not called?

ZODB.blob.Blob.open = old_open
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go in a finally block in case there's an error during the test.

@djay djay changed the title TUS temporary file optimization TUS temporary file optimization (mv instead of read on same filesystem) Apr 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants