@wild-endeavor wild-endeavor commented Nov 3, 2025

Tracking issue

Internal issue: https://linear.app/unionai/issue/BB-6143/missing-outputs-when-syncing-sbworkflow-in-flyteremote-execution

Why are the changes needed?

FlyteRemote was failing with the following error:

failed to fetch object: rpc error: code = NotFound desc = rpc error: code = NotFound desc = request failed with status code 404. Body: {"code":5, "message":"object 's3://union-bucket/metadata/propeller/flytesnacks-development-a78lg8z5tpxznp2nf79j/n0/data/0/outputs.pb' not found

This happens because propeller writes the outputs URI as soon as the dynamic task itself is done, without waiting for the dynamic subworkflow to complete. Admin checks that the output URI is set before it tries to fetch, but since propeller has already set it, the check passes and the fetch just returns a 404.

Notably, FlyteRemote falls into the 'normal-node' section of sync_node_execution because the node execution metadata for the dynamic node isn't set initially (i.e. is_parent_node is False and is_dynamic is False). The flags are only set later, so I think the race is between the node getting these flags set and the task execution for the dynamic task itself getting marked as succeeded.
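To make the branch selection described above concrete, here is a minimal sketch. It is not the flytekit implementation: `NodeMetadata`, `choose_sync_branch`, the branch labels other than 'normal-node', and the order of the checks are all hypothetical stand-ins. Only the two flag names and their initial False values come from the description above.

```python
from dataclasses import dataclass


@dataclass
class NodeMetadata:
    """Hypothetical stand-in for the node execution metadata flags."""

    is_parent_node: bool = False
    is_dynamic: bool = False


def choose_sync_branch(metadata: NodeMetadata) -> str:
    """Pick a sync branch from the flags (assumed ordering, for illustration only)."""
    if metadata.is_dynamic:
        return "dynamic-node"
    if metadata.is_parent_node:
        return "parent-node"
    # With both flags still False, a dynamic node is indistinguishable
    # from an ordinary task node and lands here.
    return "normal-node"


# Early in the node's life the flags are not yet set.
early = NodeMetadata()
# Later, once the flags have been populated for the dynamic node.
late = NodeMetadata(is_parent_node=True, is_dynamic=True)

print(choose_sync_branch(early))
print(choose_sync_branch(late))
```

A sync that runs while the flags are still unset treats the dynamic node as a normal node, which is the window in which the premature output fetch can happen.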

What changes were proposed in this pull request?

Delay fetching task execution data until the node is complete.
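The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `Phase`, `NodeExecution`, `sync_outputs`, and the `fetch_task_data` callback are hypothetical names, and treating "complete" as "in a terminal phase" is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Phase(Enum):
    """Hypothetical, simplified set of node execution phases."""

    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    ABORTED = "ABORTED"


# Assumption: "complete" means the node has reached a terminal phase.
TERMINAL_PHASES = frozenset({Phase.SUCCEEDED, Phase.FAILED, Phase.ABORTED})


@dataclass
class NodeExecution:
    """Hypothetical stand-in for a synced node execution."""

    node_id: str
    phase: Phase
    outputs: Optional[dict] = None


def sync_outputs(
    node: NodeExecution, fetch_task_data: Callable[[str], dict]
) -> NodeExecution:
    """Fetch task execution data only once the node itself is complete."""
    if node.phase not in TERMINAL_PHASES:
        # Node still running: skip the fetch and let a later sync retry.
        return node
    node.outputs = fetch_task_data(node.node_id)
    return node


def fake_fetch(node_id: str) -> dict:
    """Stand-in for the remote data fetch; returns canned outputs."""
    return {"o0": f"outputs-for-{node_id}"}


running = sync_outputs(NodeExecution("n0", Phase.RUNNING), fake_fetch)
done = sync_outputs(NodeExecution("n0", Phase.SUCCEEDED), fake_fetch)
print(running.outputs)
print(done.outputs)
```

Skipping the fetch rather than raising keeps repeated syncs safe to call: a caller polling the execution simply picks up the outputs on a later pass.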

How was this patch tested?

Made a dummy workflow that reproduced the error roughly half of the time, then added logic to fix it. Confirmed, using debugging logic that has since been removed, that the race condition was hit and ignored.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

  • This pull request addresses a race condition that caused 404 errors when attempting to access output data prematurely by delaying the fetching of task execution data until the associated node is complete.
  • The modifications ensure that the task execution data is only retrieved when it is guaranteed to be available, enhancing the reliability of the workflow execution process.
  • Overall, this change improves the FlyteRemote functionality by modifying the task execution data retrieval process to prevent errors.

@flyte-bot

Bito Automatic Review Skipped - Draft PR

Bito didn't auto-review because this pull request is in draft status.
No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.

Signed-off-by: Yee Hing Tong <[email protected]>
pingsutw previously approved these changes Nov 3, 2025
@wild-endeavor wild-endeavor changed the title from "don't fetch task execution data if the node is still running" to "FlyteRemote - Delay fetching task execution data until node complete" Nov 3, 2025
@wild-endeavor wild-endeavor marked this pull request as ready for review November 3, 2025 21:51

codecov bot commented Nov 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.69%. Comparing base (9f3ec96) to head (337a892).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3347       +/-   ##
===========================================
+ Coverage   44.28%   92.69%   +48.40%     
===========================================
  Files         305       51      -254     
  Lines       27188     2449    -24739     
  Branches     2970        0     -2970     
===========================================
- Hits        12040     2270     -9770     
+ Misses      15051      179    -14872     
+ Partials       97        0       -97     

@wild-endeavor wild-endeavor enabled auto-merge (squash) November 3, 2025 22:03
@wild-endeavor wild-endeavor merged commit 3aebd23 into master Nov 3, 2025
117 of 119 checks passed
