-
Notifications
You must be signed in to change notification settings - Fork 444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ingest error code when having control plane errors #5722
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the debouncer we add the capability to write errors back to the barrier so that control plane errors can be shared back with all ingest routing requests that are waiting for shards.
pub fn record_get_or_create_open_shards_error( | ||
&mut self, | ||
subrequest_id: SubrequestId, | ||
open_shard_error: &ControlPlaneError, | ||
) { | ||
let last_failure = match open_shard_error { | ||
ControlPlaneError::Internal(_) => SubworkbenchFailure::Internal, | ||
ControlPlaneError::Timeout(_) => SubworkbenchFailure::ControlPlaneUnavailable, | ||
ControlPlaneError::TooManyRequests => SubworkbenchFailure::ControlPlaneUnavailable, | ||
ControlPlaneError::Unavailable(_) => SubworkbenchFailure::ControlPlaneUnavailable, | ||
ControlPlaneError::Metastore(metastore_error) => match metastore_error { | ||
MetastoreError::Timeout(_) => SubworkbenchFailure::ControlPlaneUnavailable, | ||
MetastoreError::TooManyRequests => SubworkbenchFailure::ControlPlaneUnavailable, | ||
MetastoreError::Unavailable(_) => SubworkbenchFailure::ControlPlaneUnavailable, | ||
// TODO: are there other metastore errors that can be considered temporary? | ||
_ => SubworkbenchFailure::Internal, | ||
}, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mapping determines which requests are going to be 500 or 503. In either case they will be retried internally (is_pending()
returns true for both)
@@ -68,6 +68,7 @@ enum IngestFailureReason { | |||
INGEST_FAILURE_REASON_ROUTER_LOAD_SHEDDING = 8; | |||
INGEST_FAILURE_REASON_LOAD_SHEDDING = 9; | |||
INGEST_FAILURE_REASON_CIRCUIT_BREAKER = 10; | |||
INGEST_FAILURE_REASON_UNAVAILABLE = 11; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
until all servers are upgraded, this will be converted to IngestServiceError::Internal
@@ -375,7 +423,9 @@ impl IngestRouter { | |||
let rate_limited_shards: &HashSet<ShardId> = &workbench.rate_limited_shards; | |||
let state_guard = self.state.lock().await; | |||
|
|||
for subrequest in pending_subrequests(&workbench.subworkbenches) { | |||
for subrequest in | |||
pending_subrequests_for_attempt(&workbench.subworkbenches, workbench.num_attempts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some logic to not process further during this retry attempt subrequests for which we already observed an error when trying to create the shards. Otherwise all control plane errors are overriden as "no shard available".
Description
Currently, control plane errors are returned as 429 by the ingest endpoints. In this PR control plane request errors are now recorded to the workbench so that they can properly be surfaced to the ingest requests. Transient control plane / metastore errors are surfaced as 503 (unavailable) and errors that can't be retried as 500 (internal)
How was this PR tested?
Unit tests