@lubronzhan
Contributor

@lubronzhan lubronzhan commented Jan 7, 2026

Thank you for contributing to Velero!

Please add a summary of your change

Report the pod's status.message when a pod fails.
For example, in the data mover pod failure below, only status.message contains useful debugging info:

---
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2026-01-07T21:13:09Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2026-01-07T21:13:28Z"
  finalizers:
  - lifecycle-controller/system.vmware.com
  labels:
    velero.io/data-upload: test-backup-1-c4sbs
    velero.io/exposer-pod-group: snapshot-exposer
  name: test-backup-1-c4sbs
  namespace: velero
  ownerReferences:
  - apiVersion: velero.io/v2alpha1
    controller: true
    kind: DataUpload
    name: test-backup-1-c4sbs
    uid: 4dd7dd90-4060-4d66-9adf-31a35d760d3d
  resourceVersion: "4248380"
  uid: 9deb52c1-3d56-4d32-8f56-115d78e38c4d
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2026-01-07T21:13:17Z"
    status: "True"
    type: PodScheduled
  - lastProbeTime: null
    lastTransitionTime: "2026-01-07T21:13:30Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2026-01-07T21:13:30Z"
    reason: UnknownContainerStatuses
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2026-01-07T21:13:30Z"
    reason: UnknownContainerStatuses
    status: "False"
    type: Ready
  message: |
    rpc error: code = Internal desc = Could not run pod: mountVolume.MountDevice for (4dd7dd90-4060-4d66-9adf-31a35d760d3d) failed: mount failed: exit status 32
    Mounting command: mount
    Mounting arguments: -t ext4 -o defaults /dev/disk/by-id/wwn-0x6000c296705bae7110bbd2dc558e349a /mnt/volumes/plugins/kubernetes.io/vsphere-volume/mounts/4352474f-46ec-4a6c-8a8e-874533bbcc48
    Output: mount: /mnt/volumes/plugins/kubernetes.io/vsphere-volume/mounts/4352474f-46ec-4a6c-8a8e-874533bbcc48: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
           dmesg(1) may have more information after failed mount system call.
  phase: Failed
  qosClass: BestEffort
  reason: ProviderFailed
---
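
As a rough illustration of the idea, here is a minimal sketch of a failure description that falls back on the pod-level fields. It is not the actual change in this PR: `podStatus` is a hypothetical, trimmed-down stand-in for the Kubernetes pod status, `describeFailure` is an invented helper name, and the sample message in `main` is an abbreviated copy of the one above.

```go
package main

import (
	"fmt"
	"strings"
)

// podStatus is a hypothetical stand-in for the Kubernetes pod status;
// only the fields used in this sketch are modeled.
type podStatus struct {
	Phase             string
	Reason            string
	Message           string
	ContainerMessages []string // termination messages, one per container
}

// describeFailure builds a one-line description of a failed pod. It keeps
// the container termination messages and appends the pod-level reason and
// message when they are set, so the result is not just "message []".
func describeFailure(s podStatus) string {
	parts := []string{}
	for _, m := range s.ContainerMessages {
		if m = strings.TrimSpace(m); m != "" {
			parts = append(parts, m)
		}
	}
	desc := fmt.Sprintf("Pod is in abnormal state [%s], message [%s]",
		s.Phase, strings.Join(parts, "/"))
	if s.Reason != "" {
		desc += fmt.Sprintf(", reason [%s]", s.Reason)
	}
	if msg := strings.TrimSpace(s.Message); msg != "" {
		desc += fmt.Sprintf(", status message [%s]", msg)
	}
	return desc
}

func main() {
	// Abbreviated from the pod status shown above; no container
	// statuses are available, so ContainerMessages stays empty.
	s := podStatus{
		Phase:   "Failed",
		Reason:  "ProviderFailed",
		Message: "rpc error: code = Internal desc = Could not run pod: mount failed: exit status 32\n",
	}
	fmt.Println(describeFailure(s))
}
```

With a status like the one above, the container-level part stays empty and the pod-level reason and message carry the mount error.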

Does your change fix a particular issue?

Fixes #(issue)

@lubronzhan
Contributor Author

/kind changelog-not-required

@github-actions github-actions bot added the kind/changelog-not-required label Jan 7, 2026
@lubronzhan lubronzhan force-pushed the topic/lubron/logging_when_pod_fails branch from ec9bf73 to 69b25f8 Compare January 7, 2026 21:29
@codecov

codecov bot commented Jan 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.53%. Comparing base (e446ce5) to head (69b25f8).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9477   +/-   ##
=======================================
  Coverage   60.52%   60.53%           
=======================================
  Files         386      386           
  Lines       36355    36357    +2     
=======================================
+ Hits        22005    22007    +2     
  Misses      12770    12770           
  Partials     1580     1580           

@Lyndon-Li
Contributor

There is data mover diagnostic info in the node-agent log, which includes the pod message.

@lubronzhan
Contributor Author

> There is data mover diagnostic info in the node-agent log, which includes the pod message.

Hmm, what I saw was:

level=error msg="Cancel du velero/test-backup-1-6jfmp because of expose error Pod is in abnormal state [Failed], message []" controller=dataupload dataupload=velero/test-backup-1-6jfmp logSource="pkg/controller/data_upload_controller.go:296"

@Lyndon-Li
Contributor

> > There is data mover diagnostic info in the node-agent log, which includes the pod message.
>
> Hmm, what I saw was:
>
> level=error msg="Cancel du velero/test-backup-1-6jfmp because of expose error Pod is in abnormal state [Failed], message []" controller=dataupload dataupload=velero/test-backup-1-6jfmp logSource="pkg/controller/data_upload_controller.go:296"

Not this one. You could search the node-agent log for "begin diagnose CSI exposer", which gives you more detailed info.

@lubronzhan
Contributor Author

> > > There is data mover diagnostic info in the node-agent log, which includes the pod message.
> >
> > Hmm, what I saw was:
> >
> > level=error msg="Cancel du velero/test-backup-1-6jfmp because of expose error Pod is in abnormal state [Failed], message []" controller=dataupload dataupload=velero/test-backup-1-6jfmp logSource="pkg/controller/data_upload_controller.go:296"
>
> Not this one. You could search the node-agent log for "begin diagnose CSI exposer", which gives you more detailed info.

I couldn't find this log line:

root@4231f35d9ba653ffa6ed48e34311695f [ ~ ]# k logs -n velero node-agent-sckxb  | grep "begin diagnose CSI exposer"
root@4231f35d9ba653ffa6ed48e34311695f [ ~ ]# k logs -n velero node-agent-sfw8p  | grep "begin diagnose CSI exposer"
root@4231f35d9ba653ffa6ed48e34311695f [ ~ ]# k logs -n velero node-agent-ws7hj  | grep "begin diagnose CSI exposer"
root@4231f35d9ba653ffa6ed48e34311695f [ ~ ]#

Do you mean this function prints that message?
https://github.com/vmware-tanzu/velero/blob/main/pkg/util/kube/pod.go#L271-L287

@Lyndon-Li
Contributor

Then please open an issue; any diagnostic info should go through the diagnostic mechanism. I will check why it didn't.

@lubronzhan
Contributor Author

> Then please open an issue; any diagnostic info should go through the diagnostic mechanism. I will check why it didn't.

OK, I created #9478.

@reasonerjt reasonerjt requested a review from Lyndon-Li January 8, 2026 08:24
Labels

has-unit-tests, kind/changelog-not-required
