Fix: Make distributed error aggregation opt-in #6103

Merged
fg91 merged 1 commit into master from fg91/fix/opt-in-dist-error-aggregation
Dec 13, 2024

fg91 commented Dec 11, 2024

Why are the changes needed?

As part of RFC #5598, flytepropeller gained the ability to list the error files in the so-called raw output prefix bucket of an execution, so that it can identify which worker pod of a failed distributed task experienced the first error.
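
To illustrate the idea of finding the worker that failed first, here is a minimal sketch in Python. It is not propeller's implementation: the `WorkerError` record and its field names are hypothetical, and it assumes each worker's error file yields a timestamp that can be compared across workers.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WorkerError:
    # Hypothetical record for illustration; not Flyte's actual error schema.
    worker: str       # name of the worker pod that wrote the error file
    timestamp: float  # assumed: seconds since epoch at which the worker failed
    message: str


def first_error(errors: Iterable[WorkerError]) -> Optional[WorkerError]:
    """Return the error with the smallest timestamp, or None if there are none."""
    return min(errors, key=lambda e: e.timestamp, default=None)


errors = [
    WorkerError("worker-1", 1733900005.0, "NCCL timeout"),
    WorkerError("worker-0", 1733900001.5, "CUDA out of memory"),
    WorkerError("worker-2", 1733900007.2, "NCCL timeout"),
]
root_cause = first_error(errors)
```

In this made-up example, `worker-0` failed first, so its error would be surfaced rather than the follow-up timeouts on the other workers.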

On GCP, listing the error files requires the "storage.objects.list" permission, which propeller had not been granted so far. I added this permission to the Flyte propeller custom role here.

That being said, because the feature requires a new permission and is therefore not backwards compatible, I propose making it opt-in.

If you agree with this, I'll make another PR to document this feature and how to activate it here and/or here.

What changes were proposed in this pull request?

Only search for the multiple error files written by the different workers of a distributed task, as proposed in RFC #5598, if this is explicitly enabled in the flytepropeller config. This way, adding the "storage.objects.list" permission is not strictly required.
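
The opt-in behaviour can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the actual Go change: the flag's real name is not given in this description, so `aggregation_enabled` is a placeholder, and the `error.pb` default file name and the `error*.pb` naming of per-worker files are assumptions as well. The point is that the listing call, which needs "storage.objects.list", is only made when the flag is on.

```python
from typing import Callable, List

DEFAULT_ERROR_FILE = "error.pb"  # assumed name of the single error file


def error_files_to_read(
    raw_output_prefix: str,
    aggregation_enabled: bool,
    list_objects: Callable[[str], List[str]],
) -> List[str]:
    """Decide which error files to read for a failed task.

    list_objects stands in for a storage listing call; it is only invoked
    when aggregation is enabled, so no list permission is needed otherwise.
    """
    prefix = raw_output_prefix.rstrip("/")
    if not aggregation_enabled:
        # Default path: read one well-known file, no listing required.
        return [f"{prefix}/{DEFAULT_ERROR_FILE}"]
    # Opt-in path: list the prefix and keep files that look like error files.
    return sorted(
        path
        for path in list_objects(prefix)
        if path.rsplit("/", 1)[-1].startswith("error") and path.endswith(".pb")
    )


calls: List[str] = []


def fake_list(prefix: str) -> List[str]:
    # Test double that records each listing request.
    calls.append(prefix)
    return [
        f"{prefix}/error-worker-1.pb",
        f"{prefix}/outputs.pb",
        f"{prefix}/error-worker-0.pb",
    ]


disabled = error_files_to_read("gs://bucket/exec/", False, fake_list)
enabled = error_files_to_read("gs://bucket/exec/", True, fake_list)
```

With the flag off, the storage backend is never asked to list anything, which is what keeps existing deployments working without the additional permission.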

How was this patch tested?

Ran flytepropeller locally against a GKE-based deployment, with and without the flag enabled, and adapted the unit tests.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>

codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 36.99%. Comparing base (4a7f4c2) to head (c6e73c7).
Report is 128 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6103      +/-   ##
==========================================
- Coverage   37.10%   36.99%   -0.11%     
==========================================
  Files        1318     1318              
  Lines      132403   132415      +12     
==========================================
- Hits        49122    48989     -133     
- Misses      79008    79173     +165     
+ Partials     4273     4253      -20     
Flag Coverage Δ
unittests-datacatalog 51.58% <ø> (ø)
unittests-flyteadmin 54.10% <ø> (ø)
unittests-flytecopilot 30.99% <ø> (ø)
unittests-flytectl 62.29% <ø> (-0.05%) ⬇️
unittests-flyteidl 7.23% <ø> (ø)
unittests-flyteplugins 53.85% <100.00%> (+0.02%) ⬆️
unittests-flytepropeller 42.60% <ø> (ø)
unittests-flytestdlib 55.18% <ø> (-2.35%) ⬇️

fg91 self-assigned this Dec 11, 2024

eapolinario left a comment

Thank you. This makes sense, please tag me on the docs PR as well.

fg91 commented Jan 14, 2025

> Thank you. This makes sense, please tag me on the docs PR as well.

flyteorg/flytesnacks#1776
