Skip to content

Commit

Permalink
Merge pull request #300 from jeremyandrews/co-info
Browse files Browse the repository at this point in the history
improve Coordinated Omission documentation and debug output
  • Loading branch information
jeremyandrews authored Jul 2, 2021
2 parents 0e1c391 + 56d0dd6 commit a15914e
Show file tree
Hide file tree
Showing 2 changed files with 62 additions and 16 deletions.
24 changes: 21 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -527,9 +527,27 @@ If something causes the response to a request to take abnormally long, raw Goose
### Mitigation
Goose attempts to mitigate Coordinated Omission by back-filling the metrics with the statistically expected requests. To do this, it tracks the normal "cadence" of each `GooseUser`, timing how long it takes to loop through all `GooseTasks` in the assigned `GooseTaskSet`. By default, Goose will trigger Coordinated Omission Mitigation if the time to loop through a `GooseTaskSet` takes more than twice as long as the average time of all previous loops. In this case, on the next loop through the `GooseTaskSet` when tracking the actual metrics for each subsequent request in all `GooseTasks` it will also add in statistically generated "requests" with a `response_time` starting at the unexpectedly long request time, then again with that `response_time` minus the normal "cadence", continuing to generate a metric then subtract the normal "cadence" until arriving at the expected "respone_time". In this way, Goose is able to estimate the actual effect of a slowdown.
Goose attempts to mitigate Coordinated Omission by back-filling the metrics with the statistically expected requests. To do this, it tracks the normal "cadence" of each `GooseUser`, timing how long it takes to loop through all `GooseTasks` in the assigned `GooseTaskSet`. By default, Goose will trigger Coordinated Omission Mitigation if the time to loop through a `GooseTaskSet` takes more than twice as long as the average time of all previous loops. In this case, on the next loop through the `GooseTaskSet` when tracking the actual metrics for each subsequent request in all `GooseTasks` it will also add in statistically generated "requests" with a `response_time` starting at the unexpectedly long request time, then again with that `response_time` minus the normal "cadence", continuing to generate a metric then subtract the normal "cadence" until arriving at the expected `response_time`. In this way, Goose is able to estimate the actual effect of a slowdown.
If the `--requests-file` is enabled, requests back-filled by Coordinated Omission Mitigation show up in the generated log file, even though they were not actually sent to the server. In the following example, Coordinated Omission Mitigation was triggered when the server took 11,965 milliseconds to loop through all requests, instead of the average cadence of 3,162 milliseconds. This causes it to backfill a block of requests that statistically should have happened, with a `response_time` decreasing by the expected request cadence.
When Coordinated Omission Mitigation detects an abnormally slow request, Goose will generate an INFO level message (which will be visible if Goose was started with the `-v` run time flag, or written to the log if started with the `-g` run time flag and `--log-file` is configured). For example:
```
10:10:02 [INFO] coordinated omission alert 6.957s into goose attack: "GET http://apache/node/8848" [200] took abnormally long (2932 ms), task name: "(Anon) node page"
10:10:02 [INFO] coordinated omission alert 7.019s into goose attack: "GET http://apache/node/1960" [200] took abnormally long (2873 ms), task name: "(Anon) node page"
10:10:02 [INFO] coordinated omission alert 7.314s into goose attack: "GET http://apache/node/1297" [200] took abnormally long (2578 ms), task name: "(Anon) node page"
```
If the `--requests-file` is enabled, you can get more details, in this case by looking for elapsed times matching the above messages, specifically 6957, 7019, and 7314 respectively:
```
{"coordinated_omission_cadence":1651,"coordinated_omission_elapsed":0,"elapsed":6957,"error":"","final_url":"http://apache/node/8848","method":"Get","name":"(Anon) node page","redirected":false,"response_time":2932,"status_code":200,"success":true,"update":false,"url":"http://apache/node/8848","user":2}
{"coordinated_omission_cadence":1439,"coordinated_omission_elapsed":0,"elapsed":7019,"error":"","final_url":"http://apache/node/1960","method":"Get","name":"(Anon) node page","redirected":false,"response_time":2873,"status_code":200,"success":true,"update":false,"url":"http://apache/node/1960","user":0}
{"coordinated_omission_cadence":1812,"coordinated_omission_elapsed":0,"elapsed":7314,"error":"","final_url":"http://apache/node/1297","method":"Get","name":"(Anon) node page","redirected":false,"response_time":2578,"status_code":200,"success":true,"update":false,"url":"http://apache/node/1297","user":3}
```
In the requests file, you can see that three different user threads triggered Coordinated Omission Mitigation, specifically threads 2, 0, and 3. All `GooseUser` threads were loading the same `GooseTask` as due to task weighting this is the task loaded the most frequently. Each `GooseUser` thread loops through all `GooseTasks` in a similar amount of time: thread 2 takes on average 1.651 seconds, thread 0 takes on average 1.439 seconds, and thread 3 takes on average 1.812 seconds.
Also if the `--requests-file` is enabled, requests back-filled by Coordinated Omission Mitigation show up in the generated log file, even though they were not actually sent to the server. In the following example, Coordinated Omission Mitigation was triggered when the server took 11,965 milliseconds to loop through all requests, instead of the average cadence of 3,162 milliseconds. This causes it to backfill a block of requests that statistically should have happened, with a `response_time` decreasing by the expected request cadence.
```json
{"coordinated_omission_cadence":3161,"coordinated_omission_elapsed":11965,"elapsed":185835,"error":"","final_url":"http://example.com/misc/jquery.js?v=1.4.4","method":"Get","name":"static asset","redirected":false,"response_time":11965,"status_code":200,"success":true,"update":false,"url":"http://example.com/misc/jquery.js?v=1.4.4","user":2}
Expand All @@ -538,7 +556,7 @@ If the `--requests-file` is enabled, requests back-filled by Coordinated Omissio
{"coordinated_omission_cadence":3161,"coordinated_omission_elapsed":11965,"elapsed":185835,"error":"","final_url":"http://example.com/misc/jquery.js?v=1.4.4","method":"Get","name":"static asset","redirected":false,"response_time":2482,"status_code":200,"success":true,"update":false,"url":"http://example.com/misc/jquery.js?v=1.4.4","user":2}
```
Normal requests not generated by Coordinated Omission Mitigation have a `coordinated_omission_cadence` and `coordinated_omission_elapsed` of 0.
Normal requests not generated by Coordinated Omission Mitigation have a `coordinated_omission_elapsed` of 0.
Coordinated Omission Mitigation can be disabled by setting `--co-mitigation disabled` when starting Goose. By default it uses the average cadence when backfilling, but it can also be configured to use the `minimum` or `maximum` cadence to allow for different server configuration and testing plans operating on different assumptions.
Expand Down
54 changes: 41 additions & 13 deletions src/goose.rs
Original file line number Diff line number Diff line change
Expand Up @@ -784,7 +784,7 @@ struct GooseRequestCadence {
/// If non-zero, the length of the server slowdown detected by the Goose Coordinated
/// Omission Mitigation in milliseconds.
coordinated_omission_mitigation: u64,
/// If non-zero, the expected cadence to loop through all GooseTasks.
/// The expected cadence to loop through all GooseTasks.
coordinated_omission_cadence: u64,
/// If -1 coordinated_omission_mitigation was never enabled. Otherwise is a counter of how
/// many times the mitigation triggered.
Expand Down Expand Up @@ -1466,19 +1466,20 @@ impl GooseUser {
}
};

// Send a copy of the raw request object to the parent process if
// we're tracking metrics.
if !self.config.no_metrics {
self.send_to_parent(GooseMetric::Request(request_metric.clone()))?;
}

// If enabled, track the cadence between each time the same request is made while
// this GooseUser is running. If requests are blocked by the upstream server, this
// allows Goose to backfill the requests that should have been made based on
// cadence statistics.
self.coordinated_omission_mitigation(&request_metric)
request_metric.coordinated_omission_cadence = self
.coordinated_omission_mitigation(&request_metric)
.await?;

// Send a copy of the raw request object to the parent process if
// we're tracking metrics.
if !self.config.no_metrics {
self.send_to_parent(GooseMetric::Request(request_metric.clone()))?;
}

Ok(GooseResponse::new(request_metric, response))
}

Expand Down Expand Up @@ -1551,11 +1552,11 @@ impl GooseUser {
);
request_cadence.coordinated_omission_counter += 1;
request_cadence.coordinated_omission_mitigation = elapsed;
request_cadence.coordinated_omission_cadence = cadence;
} else {
request_cadence.coordinated_omission_mitigation = 0;
request_cadence.coordinated_omission_cadence = 0;
}
// Always track the expected cadence.
request_cadence.coordinated_omission_cadence = cadence;
}
} else {
// Coordinated Omission Mitigation defaults to average.
Expand All @@ -1570,17 +1571,44 @@ impl GooseUser {
async fn coordinated_omission_mitigation(
&self,
request_metric: &GooseRequestMetric,
) -> Result<(), GooseTaskError> {
) -> Result<u64, GooseTaskError> {
if let Some(co_mitigation) = self.config.co_mitigation.as_ref() {
// Return immediately if coordinated omission mitigation is disabled.
if co_mitigation == &GooseCoordinatedOmissionMitigation::Disabled {
return Ok(());
return Ok(0);
}

// Grab a read-only copy of the `requests_cadence` object to view the current
// request cadence.
let request_cadence = self.request_cadence.read().await;

// Generate an info level alert if this specific request took longer than the normal
// cadence, as that means this specific request will likely trigger Coordinated
// Omission Mitigation.
if request_cadence.counter > 3
&& request_metric.response_time > request_cadence.coordinated_omission_cadence
{
let task_name = if !self.weighted_tasks.is_empty() {
let position = self.position.load(Ordering::SeqCst);
if !self.weighted_tasks[position].1.is_empty() {
format!(", task name: \"{}\"", self.weighted_tasks[position].1)
} else {
"".to_string()
}
} else {
"".to_string()
};
info!(
"coordinated omission alert {:.3}s into goose attack: \"{} {}\" [{}] took abnormally long ({} ms){}",
request_metric.elapsed as f64 / 1_000.0,
request_metric.method,
request_metric.url,
request_metric.status_code,
request_metric.response_time,
task_name,
);
}

// Check if Coordinated Omission Mitigation has been triggered.
if request_cadence.coordinated_omission_mitigation > 0 {
// Base our coordinated omission generated request metric on the actual
Expand All @@ -1595,11 +1623,11 @@ impl GooseUser {
// Send the coordinated omission mitigation generated metrics to the parent.
self.send_to_parent(GooseMetric::Request(coordinated_omission_request_metric))?;
}
Ok(request_cadence.coordinated_omission_cadence)
} else {
// A setting for coordinated omission mitigation is required, defaults to Average.
unreachable!();
}
Ok(())
}

fn send_to_parent(&self, metric: GooseMetric) -> GooseTaskResult {
Expand Down

0 comments on commit a15914e

Please sign in to comment.