[Bug]: The cert renewal will fail after encountering network connection issue #424

@peirulu

Description

Describe the expected outcome

The status of the certificate request should be Pending, so that the request can be retried once the network connection is back.

Describe the actual outcome

The status of the certificate request becomes Failed after 3 attempts, and the CertificateRequest is then ignored by the aws-privateca-issuer (upstream code: https://github.com/cert-manager/aws-privateca-issuer/blob/main/pkg/controllers/certificaterequest_controller.go#L96-L104).

Would it be possible to create a PR similar to #384 so that the status is set to Pending instead of Failed when a network issue is encountered?

Steps to reproduce

  1. Install cert-manager v1.16.4.
  2. Install aws-privateca-issuer v1.6.0.
  3. Make the AWS PCA service unreachable from the aws-privateca-issuer pods in the Kubernetes cluster (to simulate a network error) by:
  • Changing the CoreDNS ConfigMap so that the AWS PCA service is unreachable for the aws-privateca-issuer pods
  • Changing the security group of the VPC client that connects the aws-privateca-issuer pods and the AWS PCA service
  4. After 3 attempts, the certificate request status becomes Failed.
  5. Restore the network and wait for the next retry.
  6. The CertificateRequest is never retried because its status is Failed.

(We need to delete the failed certificate request and then run cmctl renew <certificate name> to retry.)

Relevant log output

The aws-privateca-issuer ignores the certificate request (logs captured with zap-log-level=9):

 k logs -n cert-manager aws-privateca-issuer-7678d777b9-k7vz2 -f | grep test-cert-20
{"level":"Level(-5)","ts":"2025-08-29T07:29:29Z","msg":"Reconciling","controller":"certificaterequest","controllerGroup":"cert-manager.io","controllerKind":"CertificateRequest","CertificateRequest":{"name":"test-cert-20","namespace":"cert-manager"},"namespace":"cert-manager","name":"test-cert-20","reconcileID":"eb8a3171-2e21-4dd7-b95d-649d76dfc69b"}
{"level":"Level(-4)","ts":"2025-08-29T07:29:29Z","logger":"controllers.CertificateRequest","msg":"CertificateRequest is Failed. Ignoring.","certificaterequest":{"name":"test-cert-20","namespace":"cert-manager"}}
{"level":"Level(-5)","ts":"2025-08-29T07:29:29Z","msg":"Reconcile successful","controller":"certificaterequest","controllerGroup":"cert-manager.io","controllerKind":"CertificateRequest","CertificateRequest":{"name":"test-cert-20","namespace":"cert-manager"},"namespace":"cert-manager","name":"test-cert-20","reconcileID":"eb8a3171-2e21-4dd7-b95d-649d76dfc69b"}
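The behavior in the log lines above can be modeled with a small Go sketch. This is a simplified illustration, not the upstream controller code: `condition`, `certificateRequest`, `isTerminallyFailed`, and `reconcile` are hypothetical stand-ins that only capture the idea that a request whose Ready condition has reason Failed is skipped on every later reconcile.

```go
package main

import "fmt"

// condition and certificateRequest are hypothetical stand-ins for the
// cert-manager API types; only the fields needed here are modeled.
type condition struct {
	Type   string
	Status string
	Reason string
}

type certificateRequest struct {
	Name       string
	Conditions []condition
}

// isTerminallyFailed reports whether the request carries a Ready=False
// condition with reason Failed.
func isTerminallyFailed(cr certificateRequest) bool {
	for _, c := range cr.Conditions {
		if c.Type == "Ready" && c.Status == "False" && c.Reason == "Failed" {
			return true
		}
	}
	return false
}

// reconcile describes what a controller following this model would do
// with the request.
func reconcile(cr certificateRequest) string {
	if isTerminallyFailed(cr) {
		return "ignored"
	}
	return "issuance attempted"
}

func main() {
	failed := certificateRequest{Name: "test-cert-20", Conditions: []condition{
		{Type: "Approved", Status: "True", Reason: "cert-manager.io"},
		{Type: "Ready", Status: "False", Reason: "Failed"},
	}}
	pending := certificateRequest{Name: "test-cert-21", Conditions: []condition{
		{Type: "Approved", Status: "True", Reason: "cert-manager.io"},
		{Type: "Ready", Status: "False", Reason: "Pending"},
	}}

	fmt.Println(failed.Name, reconcile(failed))   // test-cert-20 ignored
	fmt.Println(pending.Name, reconcile(pending)) // test-cert-21 issuance attempted
}
```

In this model a request left in Pending would still be picked up after connectivity returns, while a Failed one stays ignored until it is deleted and re-created.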


cert manager keeps backing off the retry:

I0829 16:21:28.058473       1 controller.go:152] "re-queuing item due to optimistic locking on resource" logger="cert-manager.controller" error="Operation cannot be fulfilled on certificates.cert-manager.io \"test-cert\": the object has been modified; please apply your changes to the latest version and try again"
W0829 16:21:28.074812       1 warnings.go:70] unknown field "status.failedIssuanceAttempts"
I0829 16:21:28.074833       1 trigger_controller.go:202] "Backing off from issuance due to previously failed issuance(s). Issuance will next be attempted at 2025-08-29 17:21:28.000000942 +0000 UTC m=+210011.565085580" logger="cert-manager.controller" key="cert-manager/test-cert"
I0829 16:21:28.105934       1 trigger_controller.go:202] "Backing off from issuance due to previously failed issuance(s). Issuance will next be attempted at 2025-08-29 17:21:28.000000756 +0000 UTC m=+210011.565085390" logger="cert-manager.controller" key="cert-manager/test-cert"
I0829 16:21:28.122107       1 controller.go:152] "re-queuing item due to optimistic locking on resource" logger="cert-manager.controller" error="Operation cannot be fulfilled on certificates.cert-manager.io \"test-cert\": the object has been modified; please apply your changes to the latest version and try again"
I0829 17:21:28.002067       1 trigger_controller.go:223] "Certificate must be re-issued" logger="cert-manager.controller" key="cert-manager/test-cert" reason="Renewing" message="Renewing certificate as renewal was scheduled at 2025-08-29 07:21:25 +0000 UTC"
I0829 17:21:28.002172       1 conditions.go:192] Found status change for Certificate "test-cert" condition "Issuing": "False" -> "True"; setting lastTransitionTime to 2025-08-29 17:21:28.002164224 +0000 UTC m=+210011.567248879
I0829 17:21:28.053873       1 conditions.go:192] Found status change for Certificate "test-cert" condition "Issuing": "True" -> "False"; setting lastTransitionTime to 2025-08-29 17:21:28.053862881 +0000 UTC m=+210011.618947526
I0829 17:21:28.061472       1 controller.go:152] "re-queuing item due to optimistic locking on resource" logger="cert-manager.controller" error="Operation cannot be fulfilled on certificates.cert-manager.io \"test-cert\": the object has been modified; please apply your changes to the latest version and try again"
W0829 17:21:28.073831       1 warnings.go:70] unknown field "status.failedIssuanceAttempts"
I0829 17:21:28.075364       1 trigger_controller.go:202] "Backing off from issuance due to previously failed issuance(s). Issuance will next be attempted at 2025-08-29 18:21:28.000000646 +0000 UTC m=+213611.565085280" logger="cert-manager.controller" key="cert-manager/test-cert"
I0829 17:21:28.104046       1 trigger_controller.go:202] "Backing off from issuance due to previously failed issuance(s). Issuance will next be attempted at 2025-08-29 18:21:28.000000441 +0000 UTC m=+213611.565085073" logger="cert-manager.controller" key="cert-manager/test-cert"


The result of `k describe -n cert-manager certificaterequests.cert-manager.io test-cert-20`

Name:         test-cert-20
Namespace:    cert-manager
Labels:       <none>
Annotations:  cert-manager.io/certificate-name: test-cert
              cert-manager.io/certificate-revision: 20
              cert-manager.io/private-key-secret-name: test-cert-576rm
API Version:  cert-manager.io/v1
Kind:         CertificateRequest
Metadata:
  Creation Timestamp:  2025-08-29T07:21:25Z
  Generation:          1
  Owner References:
    API Version:           cert-manager.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Certificate
    Name:                  test-cert
    UID:                   e128b7f7-ecf7-44cc-9a28-283f10d3a573
  Resource Version:        810583
  UID:                     38fac994-9da4-48b1-9287-a6069b021e07
Spec:
  Duration:  1h0m0s
  Extra:
    authentication.kubernetes.io/credential-id:
      JTI=2e04894e-03f3-4e7b-b839-9e6ef08cee2f
    authentication.kubernetes.io/node-name:
      test-certmanager-awspca-wk01-md-0-v7xvt-kqbkr
    authentication.kubernetes.io/node-uid:
      bfbb90dc-0d85-4f6a-a42b-568b9b94c561
    authentication.kubernetes.io/pod-name:
      generated-cert-manager-64996dc5c5-8xtrx
    authentication.kubernetes.io/pod-uid:
      3ffd1e1c-3bc2-42d5-b28f-f81670619d5a
  Groups:
    system:serviceaccounts
    system:serviceaccounts:cert-manager
    system:authenticated
  Issuer Ref:
    Group:   awspca.cert-manager.io
    Kind:    AWSPCAClusterIssuer
    Name:    test-aws-pca-issuer
  Request:   LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQ29qQ0NBWW9DQVFBd0Z<ignore here>LQo=
  UID:       ac1d2b9b-5335-43da-8fbf-1bd2dfc5c80b
  Username:  system:serviceaccount:cert-manager:generated-cert-manager
Status:
  Conditions:
    Last Transition Time:  2025-08-29T07:21:25Z
    Message:               Certificate request has been approved by cert-manager.io
    Reason:                cert-manager.io
    Status:                True
    Type:                  Approved
    Last Transition Time:  2025-08-29T07:21:28Z
    Message:               failed to request certificate from PCA: operation error ACM PCA: DescribeCertificateAuthority, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://vpce-01c10d5ae6bd7f703-7gzleefh.acm-pca.us-west-2.vpce.amazonaws.com/": dial tcp: lookup vpce-01c10d5ae6bd7f703-7gzleefh.acm-pca.us-west-2.vpce.amazonaws.com on 10.96.0.10:53: server misbehaving
    Reason:                Failed
    Status:                False
    Type:                  Ready
Events:                    <none>

Version

  • aws-privateca-issuer: v1.6.0
  • cert-manager: v1.16.4

Category

Supported Workflow Broken

Severity

Severity 2
