Alarm on percentage of failing workers #313
Is there an ideal period for such an alarm that would be appropriate across systems with different average job durations? And in the statement “alarm when more than 75% of running jobs are failing”, would you want "failing" to include or exclude jobs that were later retried successfully?
For reference, here is the current alarm that should be replaced by a percentage: lines 73 to 76 and lines 673 to 688 in 7aa0e89.
I guess this depends on how many jobs finish per period. As seen above, the current alarm uses a period of 1 minute. We could use the same period here.
I actually think we'd want 100% of running jobs failing as the detection metric. This would show that there is a systemic error that makes all jobs fail.
For jobs that continuously fail and don't succeed through retries, there is the dead letter queue (and the alarm on it). The idea of this alarm is to get alerted faster when there is a systemic problem. One thing to note: we might have to tune this alarm so it doesn't trigger when only a very small number of tasks finished in the period (e.g. 1 task finished, it failed, alarm). Not sure if this is something to protect against.
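A low-volume guard like that could live in the metric math itself. The sketch below assumes the percentage is built from two metric-math series ids, `failed` (worker errors per period) and `completed` (jobs finished per period), as in the composition sketched further down; the minimum of 5 finished jobs is an arbitrary placeholder.

```python
# Sketch only: a CloudWatch metric math expression for the failure percentage
# that reports 0 whenever fewer than MIN_FINISHED jobs finished in the period,
# so a single failed job in a quiet minute doesn't trip the alarm.
# `failed` and `completed` are assumed to be the Ids of two other metric
# queries in the same alarm; MIN_FINISHED is a placeholder value to tune.
MIN_FINISHED = 5

guarded_failure_pct = (
    f"IF(FILL(completed, 0) >= {MIN_FINISHED}, "
    "100 * FILL(failed, 0) / completed, 0)"
)
```

FILL() treats periods with no reported data as zero, and the IF() comparison returns 0 instead of the ratio whenever fewer than MIN_FINISHED jobs finished, so the percentage is only meaningful when there was enough traffic.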
So pragmatically, what metrics do we compose to make this percentage? I think it would be:
Note that the worker errors metric wouldn't cover watcher failures, though my hunch is that those are really quite rare and maybe not what this alarm is trying to catch.
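For illustration only, here is one possible composition using CloudWatch metric math over two count metrics, one for worker errors and one for finished jobs. The namespace, metric names, and dimension below are placeholders rather than watchbot's actual metric names, and the 1-minute period mirrors the existing alarm referenced above.

```python
# Hypothetical metric names and namespace -- substitute the real watchbot metrics.
PERIOD_SECONDS = 60  # same 1-minute period as the existing worker-errors alarm

metric_queries = [
    {
        "Id": "failed",
        "MetricStat": {
            "Metric": {
                "Namespace": "Mapbox/ecs-watchbot",  # placeholder namespace
                "MetricName": "WorkerErrors",        # placeholder metric name
                "Dimensions": [{"Name": "Stack", "Value": "my-watchbot-stack"}],
            },
            "Period": PERIOD_SECONDS,
            "Stat": "Sum",
        },
        "ReturnData": False,  # only used as input to the expression
    },
    {
        # Assumed to count every job that ended (success or failure). If the
        # real metric only counts successes, the denominator in the expression
        # below should be `failed + completed` instead.
        "Id": "completed",
        "MetricStat": {
            "Metric": {
                "Namespace": "Mapbox/ecs-watchbot",
                "MetricName": "CompletedJobs",       # placeholder metric name
                "Dimensions": [{"Name": "Stack", "Value": "my-watchbot-stack"}],
            },
            "Period": PERIOD_SECONDS,
            "Stat": "Sum",
        },
        "ReturnData": False,
    },
    {
        # The percentage the alarm would evaluate. FILL() treats a quiet period
        # as zero errors rather than missing data.
        "Id": "failure_pct",
        "Expression": "100 * FILL(failed, 0) / completed",
        "Label": "PercentFailedJobs",
        "ReturnData": True,
    },
]
```

The guarded expression from the earlier comment could be swapped in for the plain ratio in the final query if low-volume periods turn out to be noisy.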
Background: For some services, you don’t need careful monitoring on worker errors if you have careful monitoring on the dead letter queue (DLQ): if worker errors don’t result in a DLQ status, that means they were successfully retried. For these services, the only worker-error monitoring we’d want is monitoring for widespread failure across all workers. However, if the number of workers running at a given time is variable, this isn’t achievable with watchbot’s current error alerting, which requires a static threshold.
Feature request: The ability to configure the error alarm with a percentage of failures would be great, e.g. “alarm when more than 75% of running jobs are failing.”
/cc @mapbox/platform
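To make the threshold configurable, a percentage alarm over metric queries like the ones sketched above might look roughly like this with boto3. The alarm name, default threshold, and SNS topic are illustrative, not anything watchbot ships today.

```python
import boto3


def put_failure_percentage_alarm(metric_queries, threshold_percent=75, sns_topic_arn=None):
    """Create or update a CloudWatch alarm that fires when the failure
    percentage (the ReturnData=True query in metric_queries) exceeds
    threshold_percent. Illustrative sketch only."""
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName="watchbot-percent-failing-jobs",  # placeholder name
        AlarmDescription=(
            f"More than {threshold_percent}% of finished jobs are failing"
        ),
        Metrics=metric_queries,          # e.g. the query list sketched above
        Threshold=threshold_percent,
        ComparisonOperator="GreaterThanThreshold",
        EvaluationPeriods=1,             # evaluate a single period to alert quickly; tune as needed
        # Quiet periods produce no datapoint for the expression; don't alarm on them.
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn] if sns_topic_arn else [],
    )
```

The same composition can presumably be expressed in watchbot’s template via the Metrics property of AWS::CloudWatch::Alarm, alongside the existing alarm this would replace.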