Skip to content

Conversation

mys007
Copy link

@mys007 mys007 commented Sep 2, 2025

Tracking issue

Why are the changes needed?

Persisting FlyteDirectory doesn't work correctly when there are multiple files with the same name in the directory.

Let's have two files:

  • /home/user/my_dir/file.txt
  • /home/user/my_dir/subdir/file.txt (with identical content in both files)

And a trivial workflow:

@task()
def first_task(fd: fl.FlyteDirectory) -> None:
    fd.download()

@fl.workflow
def my_workflow(fd: fl.FlyteDirectory) -> None:
   first_task(fd)

Running with pyflyte run ... --fd /home/user/my_dir results in only file.txt being uploaded in the flyte S3 bucket. The file subdir/file.txt is skipped. This is of course not as expected.

In addition, note that if the files have a different content, this raises an exception in get_upload_signed_url!

USER:EntityAlreadyExists: error=None, cause=<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.ALREADY_EXISTS
        details = "file already exists at location [s3://flyte/flytesnacks/development/ABSCCKRIMI35Z5XNGNQTO7FZIQ======/file.txt], specify a matching hash if you wish to rewrite"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_status:6, grpc_message:"file already exists at location [s3://flyte/flytesnacks/development/ABSCCKRIMI35Z5XNGNQTO7FZIQ======/file.txt], specify a matching hash if you wish to rewrite"}"

Our current workaround is to pack a directory into an archive and use FlyteFile.

What changes were proposed in this pull request?

This bugfix is based on the observation that get_upload_signed_url is only provided file basename and file content hash to determine the URL. This leads to the both files ending up in the same location (the directory is flatten). To fix that, I add the path relative to the local root to the filename. This seems to work and both files are uploaded.

Disclaimer: I have no idea if there are any side effects of this fix. It can be that some other functionality gets broken. Or there might be a better fix. I would appreciate a review from the maintainers:).

How was this patch tested?

Empirical test: For the minimal example above this works, the call to super()._put in FlyteFS._put() returns two objects: ['s3://flyte/flytesnacks/development/QLGCX23Y7HX3FI2LIEGVF5OW7Q======/subdir/file.txt', 's3://flyte/flytesnacks/development/QLGCX23Y7HX3FI2LIEGVF5OW7Q======/file.txt']. This works both when the file content is identical and when it's different.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This pull request fixes a bug in the FlyteDirectory persistence mechanism that prevented the upload of multiple files with the same name and content. By including the relative path of files in the upload process, it enhances the Flyte framework's ability to manage directories with duplicate files effectively.

@flyte-bot
Copy link
Contributor

Bito Automatic Review Failed - Technical Failure

Bito encountered technical difficulties while generating code feedback . To retry, type /review in a comment and save. If the issue persists, contact [email protected] and provide the following details:

Agent Run ID: c11d12ad-43fc-4e4e-8926-11ca82d446d0

@mys007 mys007 force-pushed the fix-upload-rel-filenames2 branch from b824f2d to 8241a82 Compare September 3, 2025 07:29
@mys007 mys007 changed the title Fix: FlyteDirectory persistance loses copies of identical files Fix: FlyteDirectory fails with multiple files with the same name Sep 3, 2025
@mys007 mys007 changed the title Fix: FlyteDirectory fails with multiple files with the same name Fix: FlyteDirectory fails for multiple files with the same name Sep 3, 2025
@flyte-bot
Copy link
Contributor

Bito Automatic Review Failed - Technical Failure

Bito encountered technical difficulties while generating code feedback . To retry, type /review in a comment and save. If the issue persists, contact [email protected] and provide the following details:

Agent Run ID: f1aa79e7-8a74-4d65-85f4-69c4726bf53c

Signed-off-by: Martin Simonovsky <[email protected]>
Signed-off-by: Martin Simonovsky <[email protected]>
@mys007 mys007 force-pushed the fix-upload-rel-filenames2 branch from bde858a to 3d4e564 Compare September 3, 2025 22:14
@flyte-bot
Copy link
Contributor

Bito Automatic Review Failed - Technical Failure

Bito encountered technical difficulties while generating code feedback . To retry, type /review in a comment and save. If the issue persists, contact [email protected] and provide the following details:

Agent Run ID: 5d0c7d53-4c07-4184-886c-62f22316e875

Copy link

codecov bot commented Sep 5, 2025

Codecov Report

❌ Patch coverage is 11.11111% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.76%. Comparing base (eb5a67f) to head (3d4e564).
⚠️ Report is 21 commits behind head on master.

Files with missing lines Patch % Lines
flytekit/remote/remote_fs.py 11.11% 8 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (eb5a67f) and HEAD (3d4e564). Click for more details.

HEAD has 98 uploads less than BASE
Flag BASE (eb5a67f) HEAD (3d4e564)
100 2
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3325      +/-   ##
==========================================
- Coverage   85.26%   75.76%   -9.50%     
==========================================
  Files         386      215     -171     
  Lines       30276    22504    -7772     
  Branches     2969     2954      -15     
==========================================
- Hits        25814    17051    -8763     
- Misses       3615     4583     +968     
- Partials      847      870      +23     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mys007
Copy link
Author

mys007 commented Sep 5, 2025

I'm not sure why the single test of attr_access_sd in integration_test_codecov is failing. Insights would be appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants