Description
When the git resolver switched to using the git binary, it introduced an issue where every git-based ResolutionRequest results in an orphaned zombie process on the pod. This is caused by git remote-https forking to git-remote-https and orphaning the fork before it completes. Since git clone depends on this forking behavior to clone a repo, and the resolvers binary/image does not have any init process or zombie reaper, these zombies build up until the resolver container runs out of PIDs and is unable to resolve git resolution requests.
The only workaround to get the resolver working again is to restart the pod/container.
There are a couple ways this can be solved and I think it's worth discussing.
- Option 1: Revert the switch from go-git to the git binary and accept the memory leak.
  - If Option 1 is not chosen, and unless this can be fixed quite quickly, I believe we should at least put the git-binary git-resolver implementation behind a feature flag in the next patch release.
- Option 2: Use an init process such as tini in the resolvers image to reap the processes. This does not appear to be possible using ko.
- Option 3: Modify the resolvers cmd so that it spawns or doubles as a zombie reaper (see the sketch after this list).
  - go-reaper has an example in its README of how to have the command reap zombies without interfering with its subprocesses.
- Option 4: Include a check for this in the resolver's healthcheck: if 4-5 child processes cannot be created simultaneously, the pod is unhealthy. (Since git resolution spawns 4-5 processes and only one of the grandchildren becomes a zombie, there will always be at least 3-4 PIDs available, so you have to spawn half a dozen or so to check for exhaustion.)
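For Option 3, here is a minimal sketch (not the actual resolvers code; the wiring and names are hypothetical) of a reaper that only runs when the process is PID 1. Note the caveat that go-reaper's README pattern of re-exec'ing the workload as a child exists precisely because naively calling Wait4(-1, ...) in the same process can steal exit statuses from its own exec.Cmd.Wait() calls.

package main

import (
    "os"
    "os/signal"
    "syscall"
)

// reapZombies waits for SIGCHLD and then reaps every exited child that is
// currently waiting on us, without blocking when no children are ready.
func reapZombies() {
    sigs := make(chan os.Signal, 8)
    signal.Notify(sigs, syscall.SIGCHLD)
    for range sigs {
        for {
            var status syscall.WaitStatus
            pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
            if pid <= 0 || err != nil {
                break // no more exited children to reap right now
            }
        }
    }
}

func main() {
    // Hypothetical wiring: only reap when this process is the container's init.
    if os.Getpid() == 1 {
        go reapZombies()
    }
    // ... start the resolver controller as usual ...
}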
Expected Behavior
When a git-resolver ResolutionRequest is resolved, it should have no persistent side effects on the resolver container.
Actual Behavior
When a git-resolver ResolutionRequest is resolved, one orphaned zombie process is created. After a large number of these requests are made, the git resolver is unable to resolve any ResolutionRequests.
Steps to Reproduce the Problem
- Have access to the nodes for a k8s cluster with Tekton running and the git-resolver enabled (a local kind cluster works)
- On the node which is running the resolvers container/pod, running ps afux (or ps o user,pgid,ppid,pid,command f U <user-id> if the user-id of the container runtime is known) should show the resolvers process with no children. E.g.:
65532 798458 0.1 0.3 2451296 126632 ? Sl Jun13 4:52 /ko-app/resolvers
- Use kubectl create to create a ResolutionRequest like this:
apiVersion: resolution.tekton.dev/v1beta1
kind: ResolutionRequest
metadata:
  labels:
    resolution.tekton.dev/type: git
  generateName: git-test-zombie-
  namespace: default
spec:
  params:
    - name: url
      value: https://github.com/tektoncd/catalog.git
    - name: revision
      value: main
    - name: pathInRepo
      value: task/git-clone/0.9/git-clone.yaml
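For example, saving the manifest above as zombie-test.yaml (the filename is just an example) and creating it:
$ kubectl create -f zombie-test.yaml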
- Use the ps command again to observe the resolvers app. Depending on the timing, you may see the child git processes while they're in use:
$ ps fu U 65532
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
65532 59727 0.5 0.4 2245120 137360 ? Ssl 15:54 0:07 /ko-app/resolvers
65532 73989 2.7 0.0 21000 5692 ? Sl 16:17 0:00 \_ git -C /tmp/catalog.git-3627028645 clone https://github.com/tektoncd/catalog.git /tmp/catalog.git-3627028645 --depth=1 --no-checkout
65532 73992 0.0 0.0 12804 4836 ? S 16:17 0:00 \_ /usr/libexec/git-core/git remote-https origin https://github.com/tektoncd/catalog.git
65532 73994 11.1 0.0 88988 10676 ? S 16:17 0:00 | \_ /usr/libexec/git-core/git-remote-https origin https://github.com/tektoncd/catalog.git
65532 74047 16.4 0.0 14308 5908 ? R 16:17 0:00 \_ /usr/libexec/git-core/git --shallow-file /tmp/catalog.git-3627028645/.git/shallow.lock index-pack --stdin --fix-thin --keep=fetch-pack 482 on tekton-pipelines-remote-resolvers-546c458b47-jqbch
However, once the resolution request is complete, you will see the zombie process created:
$ ps fu U 65532
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
65532 59727 0.5 0.4 2245120 137360 ? Ssl 15:54 0:07 /ko-app/resolvers
65532 73989 2.6 0.0 21000 5820 ? S 16:17 0:00 \_ git -C /tmp/catalog.git-3627028645 clone https://github.com/tektoncd/catalog.git /tmp/catalog.git-3627028645 --depth=1 --no-checkout
65532 73992 0.0 0.0 0 0 ? Z 16:17 0:00 \_ [git] <defunct>
65532 74047 20.2 0.0 440308 6676 ? D 16:17 0:00 \_ /usr/libexec/git-core/git --shallow-file /tmp/catalog.git-3627028645/.git/shallow.lock index-pack --stdin --fix-thin --keep=fetch-pack 482 on tekton-pipelines-remote-resolvers-546c458b47-jqbch
- Note that a short time later the defunct process will be adopted by the /ko-app/resolvers process, since it is PID 1 in the container, and will remain there indefinitely.
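An illustrative view of that end state (not captured output; it reuses the PIDs from the listing above): the defunct git process now hangs directly off the resolvers process:
$ ps fu U 65532
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
65532 59727 0.5 0.4 2245120 137360 ? Ssl 15:54 0:07 /ko-app/resolvers
65532 73992 0.0 0.0 0 0 ? Z 16:17 0:00 \_ [git] <defunct>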
Additional Info
- Kubernetes version (output of kubectl version):
$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0
- Tekton Pipeline version (output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'):
$ tkn version
Client version: 0.41.0
Pipeline version: v1.0.0
Dashboard version: v0.55.0