Git Resolver - git binary not reaping zombie processes #8830

@aThorp96

When the git resolver switched to using the git binary, it introduced an issue where every git-based ResolutionRequest leaves an orphaned zombie process in the pod. This is caused by git remote-https forking to git-remote-https and orphaning the fork before it completes. git clone depends on this forking behavior to clone a repo, and the resolvers binary/image has no init process or zombie reaper, so the zombies accumulate until the resolver container runs out of PIDs and can no longer resolve git ResolutionRequests.

The only workaround to get the resolver working again is to restart the pod/container.

There are a few ways this can be solved, and I think they are worth discussing.

  • Option 1: Revert the switch from go-git to the git binary and accept the memory leak
    • If Option 1 is not chosen, then unless this can be fixed quickly, I believe we should at least put the git-binary implementation of the git resolver behind a feature flag in the next patch release.
  • Option 2: Use an init process such as tini in the resolvers image to reap the processes. This does not appear to be possible using ko.
  • Option 3: Modify the resolvers cmd so that it spawns or doubles as a zombie reaper.
    • The go-reaper README has an example of how to have the command reap zombies without interfering with its subprocesses.
  • Option 4: Include a check for this in the resolver's health check: if 4-5 child processes cannot be created simultaneously, the pod is unhealthy. (Since git resolution spawns 4-5 processes and only one of the grandchildren becomes a zombie, there will always be at least 3-4 PIDs available, so the check has to spawn half a dozen or so to detect exhaustion.)
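
To make Option 3 concrete, here is a minimal sketch of a reaper running inside the resolvers process. It is not the go-reaper implementation and not existing Tekton code; the names reapZombies and startReaper are hypothetical. It assumes a Unix-like platform where syscall.Wait4 is available. One known caveat: calling Wait4 with pid -1 can also collect direct children that os/exec is waiting on, which makes cmd.Wait return an error, so a real fix would need to coordinate with the code that runs git (the go-reaper README discusses this).

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
)

// reapZombies collects every child that has already exited, without
// blocking, and returns how many were reaped. (Hypothetical helper.)
func reapZombies() int {
	reaped := 0
	for {
		var status syscall.WaitStatus
		pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
		if err == syscall.EINTR {
			continue // interrupted by a signal, try again
		}
		if err != nil || pid <= 0 {
			// ECHILD means there are no children at all; pid == 0 means
			// children exist but none has exited yet.
			return reaped
		}
		reaped++
	}
}

// startReaper reaps exited children on every SIGCHLD until stop is closed.
// SIGCHLD deliveries may be coalesced, which is fine because reapZombies
// loops until nothing is left to collect.
func startReaper(stop <-chan struct{}) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)
	go func() {
		defer signal.Stop(sigs)
		for {
			select {
			case <-sigs:
				reapZombies()
			case <-stop:
				return
			}
		}
	}()
}
```

The resolvers command would call startReaper once at startup, passing a channel that is closed on shutdown. Because the process runs as PID 1 in the container, orphaned grandchildren are reparented to it and would be collected by the Wait4 loop.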

Expected Behavior

When a git-resolver ResolutionRequest is resolved, it should have no persistent side effects on the resolver container.

Actual Behavior

When a git-resolver ResolutionRequest is resolved, one orphaned zombie process is created. After a large number of these requests are made, the git resolver is unable to resolve any ResolutionRequests.
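
One way to detect this exhausted state, along the lines of Option 4, is a probe that tries to hold several child PIDs at once. The sketch below is an illustration only: canSpawn is a hypothetical name, and it assumes that PID exhaustion shows up as a failure to start a child process. Children are started before any of them is waited on, so each one keeps its PID (running or as a zombie) until the final loop collects it.

```go
package main

import (
	"fmt"
	"os/exec"
)

// canSpawn starts n copies of the given command before waiting on any of
// them, so n PIDs are held at once. It returns the first start error, or
// nil if all n children could be started. (Hypothetical helper.)
func canSpawn(n int, name string, args ...string) error {
	started := make([]*exec.Cmd, 0, n)
	var startErr error
	for i := 0; i < n; i++ {
		cmd := exec.Command(name, args...)
		if err := cmd.Start(); err != nil {
			startErr = fmt.Errorf("starting child %d of %d: %w", i+1, n, err)
			break
		}
		started = append(started, cmd)
	}
	// Always wait on what was started so the probe itself leaves no zombies.
	// Exit codes are ignored; only the ability to start matters here.
	for _, cmd := range started {
		_ = cmd.Wait()
	}
	return startErr
}
```

A health check could call something like canSpawn(6, "git", "version") and report the pod unhealthy on a non-nil error. The count of 6 follows the "half a dozen or so" estimate above, and the command should be one that exits quickly, since the probe waits for every child.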

Steps to Reproduce the Problem

  1. Have access to the nodes for a k8s cluster with Tekton running and the git-resolver enabled (a local kind cluster works)
  2. On the node that is running the resolvers container/pod, run ps afux (or ps o user,pgid,ppid,pid,command f U <user-id> if the user-id of the container runtime is known). The output should show the resolvers process with no children, e.g.:
65532     798458  0.1  0.3 2451296 126632 ?      Sl   Jun13   4:52              /ko-app/resolvers
  3. Use kubectl create to create a ResolutionRequest like this:
apiVersion: resolution.tekton.dev/v1beta1
kind: ResolutionRequest
metadata:
  labels:
    resolution.tekton.dev/type: git
  generateName: git-test-zombie-
  namespace: default
spec:
  params:
  - name: url
    value: https://github.com/tektoncd/catalog.git
  - name: revision
    value: main
  - name: pathInRepo
    value: task/git-clone/0.9/git-clone.yaml
  4. Use the ps command again to observe the resolvers app. Depending on the timing, you may see the child git processes while they are still running:
$  ps fu U 65532
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
65532      59727  0.5  0.4 2245120 137360 ?      Ssl  15:54   0:07 /ko-app/resolvers

65532      73989  2.7  0.0  21000  5692 ?        Sl   16:17   0:00  \_ git -C /tmp/catalog.git-3627028645 clone https://github.com/tektoncd/catalog.git /tmp/catalog.git-3627028645 --depth=1 --no-checkout
65532      73992  0.0  0.0  12804  4836 ?        S    16:17   0:00      \_ /usr/libexec/git-core/git remote-https origin https://github.com/tektoncd/catalog.git
65532      73994 11.1  0.0  88988 10676 ?        S    16:17   0:00      |   \_ /usr/libexec/git-core/git-remote-https origin https://github.com/tektoncd/catalog.git
65532      74047 16.4  0.0  14308  5908 ?        R    16:17   0:00      \_ /usr/libexec/git-core/git --shallow-file /tmp/catalog.git-3627028645/.git/shallow.lock index-pack --stdin --fix-thin --keep=fetch-pack 482 on tekton-pipelines-remote-resolvers-546c458b47-jqbch

However, once the resolution request is complete, you will see the zombie process:

$ ps fu U 65532
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
65532      59727  0.5  0.4 2245120 137360 ?      Ssl  15:54   0:07 /ko-app/resolvers
65532      73989  2.6  0.0  21000  5820 ?        S    16:17   0:00  \_ git -C /tmp/catalog.git-3627028645 clone https://github.com/tektoncd/catalog.git /tmp/catalog.git-3627028645 --depth=1 --no-checkout
65532      73992  0.0  0.0      0     0 ?        Z    16:17   0:00      \_ [git] <defunct>
65532      74047 20.2  0.0 440308  6676 ?        D    16:17   0:00      \_ /usr/libexec/git-core/git --shallow-file /tmp/catalog.git-3627028645/.git/shallow.lock index-pack --stdin --fix-thin --keep=fetch-pack 482 on tekton-pipelines-remote-resolvers-546c458b47-jqbch
  5. A short time later, note that the defunct process is adopted by the /ko-app/resolvers process, since it has PID 1 in the container, and remains there indefinitely.
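
The same check can be done programmatically instead of reading ps output by eye. The following is a rough sketch, assuming a Linux /proc filesystem; parseProcStat and countZombies are hypothetical names, not part of Tekton. Each /proc/<pid>/stat file begins with the PID, the command name in parentheses, the state character (Z for a zombie), and the parent PID.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// procStat holds the first four fields of a /proc/<pid>/stat line.
type procStat struct {
	PID   int
	Comm  string
	State byte
	PPID  int
}

// parseProcStat parses "pid (comm) state ppid ...". The command name may
// itself contain spaces or parentheses, so it is delimited by the first
// '(' and the last ')'.
func parseProcStat(line string) (procStat, error) {
	start := strings.IndexByte(line, '(')
	end := strings.LastIndexByte(line, ')')
	if start < 0 || end < start {
		return procStat{}, fmt.Errorf("malformed stat line: %q", line)
	}
	pid, err := strconv.Atoi(strings.TrimSpace(line[:start]))
	if err != nil {
		return procStat{}, fmt.Errorf("bad pid: %w", err)
	}
	rest := strings.Fields(line[end+1:])
	if len(rest) < 2 || len(rest[0]) != 1 {
		return procStat{}, fmt.Errorf("missing state or ppid: %q", line)
	}
	ppid, err := strconv.Atoi(rest[1])
	if err != nil {
		return procStat{}, fmt.Errorf("bad ppid: %w", err)
	}
	return procStat{PID: pid, Comm: line[start+1 : end], State: rest[0][0], PPID: ppid}, nil
}

// countZombies counts processes in state Z under procRoot (normally "/proc").
func countZombies(procRoot string) (int, error) {
	paths, err := filepath.Glob(filepath.Join(procRoot, "[0-9]*", "stat"))
	if err != nil {
		return 0, err
	}
	n := 0
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the process exited between listing and reading
		}
		if st, err := parseProcStat(string(data)); err == nil && st.State == 'Z' {
			n++
		}
	}
	return n, nil
}
```

Run inside the resolvers container, countZombies("/proc") only sees that container's PID namespace, so with this bug the count would be expected to grow by one per resolved git ResolutionRequest.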

Additional Info

  • Kubernetes version:

    Output of kubectl version:

$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

$ tkn version
Client version: 0.41.0
Pipeline version: v1.0.0
Dashboard version: v0.55.0

Labels

kind/bug: Categorizes issue or PR as related to a bug.
