Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ingest error code when having control plane errors #5722

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

rdettai
Copy link
Collaborator

@rdettai rdettai commented Mar 25, 2025

Description

Currently, control plane errors are returned as 429 by the ingest endpoints. In this PR control plane request errors are now recorded to the workbench so that they can properly be surfaced to the ingest requests. Transient control plane / metastore errors are surfaced as 503 (unavailable) and errors that can't be retried as 500 (internal)

How was this PR tested?

Unit tests

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the debouncer we add the capability to write errors back to the barrier so that control plane errors can be shared back with all ingest routing requests that are waiting for shards.

Comment on lines +174 to +191
pub fn record_get_or_create_open_shards_error(
&mut self,
subrequest_id: SubrequestId,
open_shard_error: &ControlPlaneError,
) {
let last_failure = match open_shard_error {
ControlPlaneError::Internal(_) => SubworkbenchFailure::Internal,
ControlPlaneError::Timeout(_) => SubworkbenchFailure::ControlPlaneUnavailable,
ControlPlaneError::TooManyRequests => SubworkbenchFailure::ControlPlaneUnavailable,
ControlPlaneError::Unavailable(_) => SubworkbenchFailure::ControlPlaneUnavailable,
ControlPlaneError::Metastore(metastore_error) => match metastore_error {
MetastoreError::Timeout(_) => SubworkbenchFailure::ControlPlaneUnavailable,
MetastoreError::TooManyRequests => SubworkbenchFailure::ControlPlaneUnavailable,
MetastoreError::Unavailable(_) => SubworkbenchFailure::ControlPlaneUnavailable,
// TODO: are there other metastore errors that can be considered temporary?
_ => SubworkbenchFailure::Internal,
},
};
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mapping determines which requests are going to be 500 or 503. In either case they will be retried internally (is_pending() returns true for both)

@@ -68,6 +68,7 @@ enum IngestFailureReason {
INGEST_FAILURE_REASON_ROUTER_LOAD_SHEDDING = 8;
INGEST_FAILURE_REASON_LOAD_SHEDDING = 9;
INGEST_FAILURE_REASON_CIRCUIT_BREAKER = 10;
INGEST_FAILURE_REASON_UNAVAILABLE = 11;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

until all servers are upgraded, this will be converted to IngestServiceError::Internal

@@ -375,7 +423,9 @@ impl IngestRouter {
let rate_limited_shards: &HashSet<ShardId> = &workbench.rate_limited_shards;
let state_guard = self.state.lock().await;

for subrequest in pending_subrequests(&workbench.subworkbenches) {
for subrequest in
pending_subrequests_for_attempt(&workbench.subworkbenches, workbench.num_attempts)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some logic to not process further during this retry attempt subrequests for which we already observed an error when trying to create the shards. Otherwise all control plane errors are overriden as "no shard available".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant