FlyteRemote - Delay fetching task execution data until node complete #3347
Tracking issue
Internal issue: https://linear.app/unionai/issue/BB-6143/missing-outputs-when-syncing-sbworkflow-in-flyteremote-execution
Why are the changes needed?
FlyteRemote was failing with
This happens because propeller writes the outputs URI as soon as the dynamic task itself is done, without waiting for the dynamic subworkflow to complete. Admin checks that the output URI is set before it tries to fetch, but since propeller has already set it before the outputs exist, the fetch just returns a 404.
Notably, FlyteRemote falls into the 'normal-node' section of sync_node_execution because the node execution metadata for the dynamic node isn't set initially (i.e. is_parent_node is False and is_dynamic is False). The flags are only set later, so I think the race is between the node getting these flags set and the task execution for the dynamic task itself getting marked as succeeded.
What changes were proposed in this pull request?
Delay fetching task execution data until node is complete.
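For illustration, the idea can be sketched roughly as below. This is a minimal, self-contained model and not the actual flytekit code: the names Phase, NodeExecution, sync_node and fetch_task_outputs are hypothetical stand-ins, and the real sync_node_execution handles many more cases.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class Phase(Enum):
    """Hypothetical, simplified execution phases."""
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    ABORTED = "ABORTED"


TERMINAL_PHASES = {Phase.SUCCEEDED, Phase.FAILED, Phase.ABORTED}


@dataclass
class NodeExecution:
    """Hypothetical stand-in for a node execution and its single task execution."""
    node_phase: Phase
    task_phase: Phase
    # For a dynamic node these flags may still be False early on (the race above).
    is_parent_node: bool = False
    is_dynamic: bool = False
    outputs: Optional[Dict[str, object]] = None


def sync_node(
    node: NodeExecution,
    fetch_task_outputs: Callable[[], Dict[str, object]],
) -> NodeExecution:
    """Fetch task execution data only once the node itself is complete."""
    if node.node_phase not in TERMINAL_PHASES:
        # The task may already be marked succeeded while the node is still
        # running (e.g. a dynamic subworkflow in flight). Skip the fetch so
        # we never ask for outputs that have not been written yet.
        return node
    if node.task_phase == Phase.SUCCEEDED:
        node.outputs = fetch_task_outputs()
    return node
```

With this guard, a sync that lands in the race window simply returns without outputs, and a later sync picks them up once the node reaches a terminal phase.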
How was this patch tested?
Made a dummy workflow that reproduced the error roughly half of the time, then added logic to fix it. Confirmed, using debugging logic that has since been removed, that the race condition was hit and ignored.