Description
Image I'm using:
v1.0.0
Issue or Feature Request:
We've started using the helm chart on the develop branch, and we missed that the shadow chart had to be installed. When the operator started, it rightly wrote error messages because the API didn't exist. We didn't notice the errors because the controller still reported healthy.
We saw around 53.6 million error messages in a 7-minute window, with a peak of 130K messages/sec, while the bottlerocket shadow CRD was not installed.
I would have expected either that the controller would not report healthy, or that exponential backoff would be used when these kinds of errors are hit (the current behavior could potentially DoS the kube-api server too). Controllers I've written in the past used backoff to prevent log floods and hammering the API. I saw that at least one of the errors had a static backoff of 5 secs, but some others seem to have no backoff.
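To illustrate the kind of backoff I mean, here is a minimal std-only sketch. It is not taken from the brupop code base: `backoff_delay` and `retry_with_backoff` are hypothetical names, and a real async controller would presumably use an async sleep instead of `std::thread::sleep`.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
/// Hypothetical helper for illustration only.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing when `attempt` is large.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Calls `op` until it succeeds or `max_attempts` calls have failed,
/// sleeping with capped exponential backoff between failed calls.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                // Blocking sleep keeps the sketch dependency-free.
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

With a 1 second base and a 60 second cap, the delays would grow as 1s, 2s, 4s, 8s and so on, and then stay at 60s, instead of retrying immediately or at a fixed interval.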
- Please search for similar issues, including closed issues:
- https://github.com/bottlerocket-os/bottlerocket-update-operator/issues?q=is%3Aissue+is%3Aopen++exception.details%3A+BottlerocketShadowCreate
- https://github.com/bottlerocket-os/bottlerocket-update-operator/issues?q=is%3Aissue+is%3Aopen+%22%27ApiError%3A+%22%22404+page+not+found+%22
- https://github.com/bottlerocket-os/bottlerocket-update-operator/issues?q=is%3Aissue+backoff
- Please include details about the environment you're running in.
- EKS, 1.27, bottlerocket managed nodes. Benchmarked settings for CIS Level 1
- I can go through our account manager if more details are needed.
- Please include any error messages you received, with any required context.
- This is the pattern of the messages. It looks like the output from the controller includes ANSI color codes? I think this should be good enough to track down the spots:
Message
[[wildcard]*[/wildcard]:[wildcard][00-59][/wildcard]:[wildcard][00.018549-59.992987][/wildcard]Z�[0m �[33m WARN�[0m �[1;33mkube_client::client�[0m�[33m: �[33mUnsuccessful data error parse: 404 page not found
[0m
[2;3mat�[0m /src/.cargo/registry/src/github.com-1ecc6299db9ec823/kube-client-0.76.0/src/client/mod.rs:434
[2;3min�[0m [wildcard]*[/wildcard]::[wildcard]*[/wildcard]::�[[wildcard]*[/wildcard]�[0m
[2;3mat�[0m [wildcard]*[/wildcard][wildcard]/src/*[/wildcard]
"[[wildcard]*[/wildcard]:[wildcard][02-59][/wildcard]:[wildcard][00.593393-58.285542][/wildcard]Z�[0m �[32m INFO�[0m �[1;32magent::apiclient�[0m�[32m: �[32mAPI server busy, retrying later ...�[0m"
"[2;3min�[0m agent::agentclient::�[1mshadow_status_with_refreshed_system_matadata�[0m �[2;3mwith�[0m �[1mshadow_error_info�[0m: ShadowErrorInfo { crash_count: 0, state_transition_failure_timestamp: None }"
"[2;3min�[0m agent::agentclient::�[1mupdate_status_in_shadow�[0m �[2;3mwith�[0m �[1mbottlerocket_shadow�[0m: BottlerocketShadow { metadata: ObjectMeta { annotations: None, cluster_name: None, creation_timestamp: Some(Time([wildcard]yyyy-MM-ddTHH:mm:ssXXX[/wildcard])), deletion_grace_period_seconds: None, deletion_timestamp: None, finalizers: None, generate_name: None, generation: Some(1), labels: None, managed_fields: Some([ManagedFieldsEntry { api_version: Some(""brupop.bottlerocket.aws/v2""), fields_type: Some(""FieldsV1""), fields_v1: Some(FieldsV1(Object {""f:metadata"": Object {""f:ownerReferences"": Object {""."": Object {}, ""k:{\""uid\"":\""[wildcard]*[/wildcard]\""}"": Object {}}}, ""f:spec"": Object {""."": Object {}, ""f:state"": Object {}, ""f:state_transition_timestamp"": Object {}, ""f:version"": Object {}}})), manager: Some(""unknown""), operation: Some(""Update""), time: Some(Time([wildcard]yyyy-MM-ddTHH:mm:ssXXX[/wildcard])) }, ManagedFieldsEntry { api_version: Some(""brupop.bottlerocket.aws/v2""), fields_type: Some(""FieldsV1""), fields_v1: Some(FieldsV1(Object {""f:status"": Object {""."": Object {}, ""f:crash_count"": Object {}, ""f:current_state"": Object {}, ""f:current_version"": Object {}, ""f:target_version"": Object {}}})), manager: Some(""unknown""), operation: Some(""Update""), time: Some(Time([wildcard]yyyy-MM-ddTHH:mm:ssXXX[/wildcard])) }]), name: Some(""[wildcard]*[/wildcard].redacted.compute.internal""), namespace: Some(""brupop-bottlerocket-aws""), owner_references: Some([OwnerReference { api_version: ""v1"", block_owner_deletion: None, controller: None, kind: ""Node"", name: ""[wildcard]*[/wildcard].us-west-2.compute.internal"", uid: ""[wildcard]*[/wildcard]"" }]), resource_version: Some(""[wildcard][54829321-278377934][/wildcard]""), self_link: None, uid: Some(""[wildcard]*[/wildcard]"") }, spec: BottlerocketShadowSpec { state: Idle, state_transition_timestamp: None, version: None }, status: Some(BottlerocketShadowStatus { 
current_version: ""1.14.1"", target_version: ""1.14.1"", current_state: Idle, crash_count: 0, state_transition_failure_timestamp: None }) }, �[1mstate�[0m: Idle, �[1mshadow_error_info�[0m: ShadowErrorInfo { crash_count: 0, state_transition_failure_timestamp: None }"
"[[wildcard]*[/wildcard]:[wildcard][03-59][/wildcard]:[wildcard][04.738003-58.311911][/wildcard]Z�[0m �[32m INFO�[0m �[1;32mcontroller::controller�[0m�[32m: �[32mFound associated bottlerocketshadow name., �[1;32massociated_bottlerocketshadow_name�[0m�[32m: ""[wildcard]*[/wildcard].redacted.compute.internal""�[0m"
[[wildcard]*[/wildcard]:[wildcard][00-59][/wildcard]:[wildcard][01.302051-52.595513][/wildcard]Z�[0m �[32m INFO�[0m �[1;32mcontroller::controller�[0m�[32m: �[32mCalculating if current time is within update time window.�[0m
"[2;3min�[0m apiserver::telemetry::�[1mHTTP request�[0m �[2;3mwith�[0m �[1mhttp.method�[0m: POST, �[1mhttp.route�[0m: /bottlerocket-node-resource, �[1mhttp.flavor�[0m: 1.1, �[1mhttp.scheme�[0m: https, �[1mhttp.host�[0m: brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local, �[1mhttp.client_ip�[0m: [wildcard]10.52.XXX.XXX[/wildcard]:[wildcard]XXXX[/wildcard], �[1mhttp.user_agent�[0m: , �[1mhttp.target�[0m: /bottlerocket-node-resource, �[1motel.kind�[0m: ""server"", �[1mrequest_id�[0m: [wildcard]*[/wildcard], �[1mnode_name�[0m: ""[wildcard]*[/wildcard].redacted.compute.internal"""
"[2;3min�[0m apiserver::telemetry::�[1mHTTP request�[0m �[2;3mwith�[0m �[1mhttp.method�[0m: POST, �[1mhttp.route�[0m: /bottlerocket-node-resource, �[1mhttp.flavor�[0m: 1.1, �[1mhttp.scheme�[0m: https, �[1mhttp.host�[0m: brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local, �[1mhttp.client_ip�[0m: [wildcard]10.52.XXX.XXX[/wildcard]:[wildcard]XXXX[/wildcard], �[1mhttp.user_agent�[0m: , �[1mhttp.target�[0m: /bottlerocket-node-resource, �[1motel.kind�[0m: ""server"", �[1mrequest_id�[0m: [wildcard]*[/wildcard], �[1mnode_name�[0m: ""[wildcard]*[/wildcard].redacted.compute.internal"", �[1mexception.message�[0m: Error creating BottlerocketShadow: 'Unable to create BottlerocketShadow ([wildcard]*[/wildcard].us-west-2.compute.internal, [wildcard]*[/wildcard]): 'ApiError: ""404 page not found "": Failed to parse error data (ErrorResponse { status: ""404 Not Found"", message: ""\""404 page not found\\n\"""", reason: ""Failed to parse error data"", code: 404 })'', �[1mexception.details�[0m: BottlerocketShadowCreate { source: CreateBottlerocketShadow { source: Api(ErrorResponse { status: ""404 Not Found"", message: ""\""404 page not found\\n\"""", reason: ""Failed to parse error data"", code: 404 }), selector: BottlerocketShadowSelector { node_name: ""[wildcard]*[/wildcard].us-west-2.compute.internal"", node_uid: ""[wildcard]*[/wildcard]"" } } }, �[1mhttp.status_code�[0m: 500, �[1motel.status_code�[0m: ""ERROR"""
[2m2023-07-05T21:38:39.981472Z�[0m �[33m WARN�[0m �[1;33magent::agentclient�[0m�[33m: �[33mAn error occurred when try to create BottlerocketShadow. Restarting event loop.�[0m
"[2;3min�[0m models::node::client::�[1mcreate_node�[0m �[2;3mwith�[0m �[1mselector�[0m: BottlerocketShadowSelector { node_name: ""ip-10-52-96-90.redacted.compute.internal"", node_uid: ""340b1e0b-a34a-4557-99f9-4d2e55dfec7a"" }"
Slightly better view of logged message patterns: