Fix: Make distributed error aggregation opt-in #6103
Merged
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #6103 +/- ##
==========================================
- Coverage 37.10% 36.99% -0.11%
==========================================
Files 1318 1318
Lines 132403 132415 +12
==========================================
- Hits 49122 48989 -133
- Misses 79008 79173 +165
+ Partials 4273 4253 -20
eapolinario (Contributor) approved these changes on Dec 12, 2024 and left a comment:
Thank you. This makes sense, please tag me on the docs PR as well.
Why are the changes needed?
For RFC #5598, flytepropeller was given the ability to list error files in the so-called raw output prefix bucket of an execution with the goal of identifying which worker pod in a failed distributed task experienced the first error.
In GCP, listing the error files requires the "storage.objects.list" permission, which so far wasn't given to propeller. I added this permission to the Flyte propeller custom role here.
That being said, because this feature is therefore not backwards compatible, I propose to make it opt-in.
If you agree with this, I'll make another PR to document this feature and how to activate it here and/or here.
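A rough sketch of what opt-in behavior means here, under stated assumptions: the `Config` field name, the `Store` interface, the `errorFilePaths` helper, and the `error.pb` file name are illustrative only and are not taken from the flytepropeller code or its config schema. The point is that the list call, and with it the need for the extra permission, is only reached when the flag is set.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the relevant part of the propeller
// config. The field name is an assumption, not the actual config key.
type Config struct {
	EnableDistributedErrorAggregation bool
}

// Store is a hypothetical minimal storage interface for this sketch.
type Store interface {
	List(prefix string) ([]string, error)
}

// errorFilePaths returns the error files to inspect for a failed task.
// With the flag off it returns a single assumed well-known path and never
// calls List, so no list permission is needed on the bucket.
func errorFilePaths(cfg Config, store Store, prefix string) ([]string, error) {
	if !cfg.EnableDistributedErrorAggregation {
		return []string{prefix + "/error.pb"}, nil
	}
	return store.List(prefix)
}

// countingStore is an in-memory fake that records how often List is called.
type countingStore struct {
	files     []string
	listCalls int
}

func (s *countingStore) List(prefix string) ([]string, error) {
	s.listCalls++
	return s.files, nil
}

func main() {
	store := &countingStore{files: []string{"out/worker-0/error.pb", "out/worker-1/error.pb"}}

	paths, _ := errorFilePaths(Config{}, store, "out")
	fmt.Println("flag off:", paths, "list calls:", store.listCalls)

	paths, _ = errorFilePaths(Config{EnableDistributedErrorAggregation: true}, store, "out")
	fmt.Println("flag on:", paths, "list calls:", store.listCalls)
}
```

Keeping the default path free of list calls is what preserves backwards compatibility for deployments whose propeller role has not been updated.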
What changes were proposed in this pull request?
Only search for multiple error files from the different workers of a distributed task, as proposed in RFC #5598, if this is explicitly enabled in the flytepropeller config, so that the "storage.objects.list" permission is not strictly required.
How was this patch tested?
Ran flytepropeller locally with and without the flag enabled against a GKE-based deployment, and adapted the unit tests.
Check all the applicable boxes