Something marked all task instances as failed #36596
Unanswered
mat-pielat
asked this question in
Q&A
Replies: 1 comment 1 reply
-
|
Hmm.. Maybe it's this one #32684 |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
Something caused task states of all DAGs to become failed - including ones that have run successfully in the past. DAG states were unaffected, so one could observe successful DAG runs with all their tasks marked as failed.
This is the third time we observe such phenomenon. It happened in three different deployments but considering the size of our fleet, this is a very rare occurrence. For each of these deployments it's a one off and we don't know what are the conditions that can trigger the situation. It happened once with Airflow version 2.4.3 and twice with 2.5.3.
One thing is common for all three cases was that
clearcommand was used around the time of the occurrence. The audit log of the recently affected Airflow deployment contains these two entries:However, these were logged 1h 40m before point T, so this might not be the smoking gun we're looking for.
Can this situation be explained in any way?
clearcommand. Technically, this command only affect the specific {dagrun, task} pair but maybe a bug can cause a big spill somehow?Has anything like this been observed before?
Beta Was this translation helpful? Give feedback.
All reactions