Description
This ticket aims to give some feedback on the choices made for the implementation of the self-hosted runner (Chapters 3.7 and 3.8). Indeed, unlike the previous chapters, which are quite straightforward, there are various possible solutions, each with its own advantages and drawbacks. The solution to adopt depends heavily on the user's context.
The resulting ticket should also be useful to document not only the "how" but also the "why", and to give pointers to other implementations that might be more adequate in specific contexts (especially enterprise contexts with data privacy in mind). It could be used as a basis for a future "What's next?" or "What if...?" page.
For the guide, the solution adopted is Option 4, which requires two self-hosted pods but allows the firewall restrictions to be bypassed easily.
Goal
Create a self-hosted runner to have access to a GPU on a machine/cluster for training. Update the dvc.lock file remotely, and push the data to S3 from the k8s cluster (with an autocommit in the git repo).
Note: Kubernetes does not support running Docker directly inside a container due to security and architectural reasons (container in container).
GPU access from a container running inside another container might also be problematic (though this might not apply here).
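For illustration, a minimal sketch of how a self-hosted runner pod could request a GPU on the cluster, assuming the NVIDIA device plugin is installed; the pod name and image are placeholders:

```yaml
# Hypothetical runner pod requesting one GPU (assumes the NVIDIA device plugin is deployed on the cluster)
apiVersion: v1
kind: Pod
metadata:
  name: gh-runner-gpu                                      # illustrative name
spec:
  containers:
    - name: runner
      image: my-registry.example.com/github-runner:latest  # placeholder runner image
      resources:
        limits:
          nvidia.com/gpu: 1                                # request one GPU from the device plugin
```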
Option 1 (Starting point): 1 GH runner on GitHub
This runs on a simple GitHub-hosted VM. It (see the workflow sketch at the end of this option):
- gets data from S3 (dvc pull)
- checks the experiment is reproducible / "trains" the ML model with an up-to-date dvc.lock and dvc push (done locally)
- containerizes the model with bentoml
- logs in with docker and pushes the image to the artifact registry
- deploys the image on k8s
Cons:
- training the model remotely is not possible (the GitHub-hosted runner is slow, long jobs time out, Action minutes are limited, and the data would be on GitHub)
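For reference, a minimal sketch of what the Option 1 workflow could look like; secret names, image/registry names and paths are placeholders, and credentials handling for kubectl is omitted:

```yaml
# Hedged sketch of the Option 1 workflow on a GitHub-hosted runner (placeholders throughout).
name: train-and-deploy
on: push
jobs:
  all-in-one:
    runs-on: ubuntu-latest                  # GitHub-hosted VM
    steps:
      - uses: actions/checkout@v4
      - run: pip install "dvc[s3]" bentoml
      - run: dvc pull                       # get the data from S3
      - run: dvc repro                      # check the experiment is reproducible
      - run: bentoml containerize my_model:latest          # containerize the model (assumes the bento was built)
      - run: |                              # log in and push the image to the artifact registry
          echo "${{ secrets.REGISTRY_PASSWORD }}" | docker login my-registry.example.com -u "${{ secrets.REGISTRY_USER }}" --password-stdin
          docker tag my_model:latest my-registry.example.com/my_model:latest
          docker push my-registry.example.com/my_model:latest
      - run: kubectl apply -f kubernetes/deployment.yaml   # deploy on k8s (kubeconfig setup omitted)
```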
Option 2: 1 GH runner on k8s (GCP)
Pro: One self-hosted runner, no runner on GH. Data/model is not on GitHub.
Cons:
- docker login and push requires a nested container, which breaks: Kubernetes does not support running Docker directly inside a container due to security and architectural reasons. Docker containerization on k8s is therefore not directly possible and requires a workaround (see Options 2a-2c below).
- connectivity/firewall issue (see the Connection / Firewall section at the end)
Option 2a: Use GCP custom solution
Solution specifically developed to solve this issue.
Cons: vendor lock-in, not applicable to on-premise deployments, data privacy concerns.
Option 2b: mount docker socket
- mount the Docker socket (to access the host's Docker daemon; the docker CLI must be installed on the runner): works for now, but will stop working in future versions of k8s (sketched below).
Cons: Docker runs outside the pod, so containers can keep running indefinitely even after the pod is destroyed; bad practice unless one has total ownership of the machine; bad in terms of security; deprecated approach that will stop working soon (nodes that no longer use the Docker runtime do not expose /var/run/docker.sock).
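A minimal sketch of the socket-mount approach, kept here for documentation purposes only given the cons above; the pod name and image are placeholders:

```yaml
# Hedged sketch of Option 2b: mount the host's Docker socket into the runner pod (deprecated approach).
apiVersion: v1
kind: Pod
metadata:
  name: gh-runner-docker-socket                            # illustrative name
spec:
  containers:
    - name: runner
      image: my-registry.example.com/github-runner:latest  # placeholder image with docker-cli installed
      volumeMounts:
        - name: docker-sock
          mountPath: /var/run/docker.sock                  # the CLI talks to the host's Docker daemon
  volumes:
    - name: docker-sock
      hostPath:
        path: /var/run/docker.sock
        type: Socket
```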
Option 2c: using docker in docker
Mounting the socket is obsolete, but Docker-in-Docker (dind) should now work.
See https://hub.docker.com/_/docker
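A minimal sketch of a Docker-in-Docker setup, based on the official docker image linked above; the runner image is a placeholder and TLS is disabled to keep the sketch short (a real setup should enable it):

```yaml
# Hedged sketch of Option 2c: run a privileged dind sidecar next to the runner container.
apiVersion: v1
kind: Pod
metadata:
  name: gh-runner-dind                                     # illustrative name
spec:
  containers:
    - name: runner
      image: my-registry.example.com/github-runner:latest  # placeholder runner image with docker-cli
      env:
        - name: DOCKER_HOST
          value: tcp://localhost:2375                      # point the docker CLI to the sidecar daemon
    - name: dind
      image: docker:dind                                   # daemon image from https://hub.docker.com/_/docker
      securityContext:
        privileged: true                                   # required for Docker-in-Docker
      env:
        - name: DOCKER_TLS_CERTDIR
          value: ""                                        # disable TLS so the daemon listens on 2375 (sketch only)
```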
Option 3: set up GH runner on (GCP) VM + k8s
- the runner runs directly on the VM, where Docker is available, so containerization is possible
Cons:
- needs a VM in addition to the k8s cluster
Option 4: 2 GH runners (1 on GitHub, 1 on k8s)
- split the steps between the two runners depending on whether they need Docker (see the sketch after this option)
- use gcloud/kubectl on the k8s cluster (GCP)
Cons: data privacy issue: the model is containerized on GitHub's servers, so the data/model transits through GitHub.
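A minimal sketch of the split adopted in Option 4; job names, runner labels, image/registry names and paths are illustrative, and credentials/artifact handling between jobs is omitted:

```yaml
# Hedged sketch of the Option 4 split: Docker-dependent steps on GitHub, everything else on the k8s runner.
name: train-containerize-deploy
on: push
jobs:
  train:
    runs-on: self-hosted            # runner pod on the k8s cluster (GPU access, data stays on the cluster)
    steps:
      - uses: actions/checkout@v4
      - run: dvc pull               # get the data from S3
      - run: dvc repro              # (re)train / reproduce the experiment
      - run: dvc push               # push the updated data/model back to S3
      - run: git commit -am "update dvc.lock" && git push  # autocommit of dvc.lock (credentials omitted)
  containerize:
    needs: train
    runs-on: ubuntu-latest          # GitHub-hosted runner: Docker is available here
    steps:
      - uses: actions/checkout@v4
      - run: bentoml containerize my_model:latest                 # the model transits through GitHub (the con above)
      - run: docker push my-registry.example.com/my_model:latest  # docker login/tag omitted
  deploy:
    needs: containerize
    runs-on: self-hosted            # back on the cluster to run gcloud/kubectl
    steps:
      - run: kubectl apply -f kubernetes/deployment.yaml
```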
Option 5 (as 2e): k8s + KubeVirt
- run VMs on k8s, managed like pods (VM abstraction on top of k8s; see the sketch after this option)
Pros:
- best of both worlds (k8s orchestration with full VM isolation, so Docker can run inside the VM)
Cons:
- additional abstraction layers, complexity
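For illustration, a heavily hedged sketch of what the KubeVirt abstraction looks like; it follows the upstream example and uses the KubeVirt demo container disk, whereas a real runner would need a proper OS image with the GitHub runner installed and GPU passthrough configured:

```yaml
# Hedged sketch of a KubeVirt VirtualMachine (Option 5 / 2e); fields follow the upstream example.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gh-runner-vm                          # illustrative name
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: containerdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 2Gi
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/kubevirt/cirros-container-disk-demo   # demo image, placeholder only
```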
Option 6 (as 2f): k8s + kaniko
kaniko runs as a pod on k8s and builds Docker images without requiring a Docker daemon (see the sketch after this option).
See https://devopscube.com/build-docker-image-kubernetes-pod/
Cons:
- additional abstraction layers, complexity
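A minimal sketch of a kaniko build pod, roughly following the article linked above; the git context, destination and secret name are placeholders:

```yaml
# Hedged sketch of Option 6 / 2f: build and push an image with kaniko, no Docker daemon required.
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build                          # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - "--dockerfile=Dockerfile"
        - "--context=git://github.com/my-org/my-repo.git"          # placeholder build context
        - "--destination=my-registry.example.com/my_model:latest"  # where the image is pushed
      volumeMounts:
        - name: docker-config
          mountPath: /kaniko/.docker          # registry credentials for the push
  volumes:
    - name: docker-config
      secret:
        secretName: registry-credentials      # placeholder secret containing a docker config.json
        items:
          - key: .dockerconfigjson
            path: config.json
```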
Connection / Firewall issue
Reaching k8s from a GH Action:
- direct access requires a public IP
- can use polling, by running an instance on the self-hosted runner which "listens" to GitHub for jobs. This is the approach currently selected (see the sketch after this list).
- could instead make use of a VPN layer (WireGuard, see the GH docs) to avoid the listening pod.
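One possible way to run such a listening/polling runner on the cluster is the actions-runner-controller project; a heavily hedged sketch is below, noting that the exact CRDs and API group depend on the ARC version installed, and the repository name is a placeholder:

```yaml
# Hedged sketch: a self-hosted runner deployed with actions-runner-controller (legacy CRDs).
# The runner polls GitHub over an outbound connection, so no inbound firewall rule or public IP is needed.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gh-runner
spec:
  replicas: 1
  template:
    spec:
      repository: my-org/my-repo              # placeholder repository the runner registers with
```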