
Document self-hosted implementation choice #244

Open
@rmarquis

Description

This ticket aims to give some feedback on the choice made for the implementation of the self-hosted runner (Chapters 3.7 and 3.8). Unlike the previous chapters, which are quite straightforward, there are several possible solutions here, each with its own advantages and drawbacks. The solution to adopt depends heavily on the user's context.

The resulting ticket should also be useful to document not only the "how" but the "why", and to give pointers to alternative implementations that might be more appropriate in specific contexts (especially enterprise contexts with data privacy in mind). It could be used as a basis for a future "What's next?" or "What if...?" page.

For the guide, the solution adopted is Option 4, which requires two self-hosted pods but allows the firewall restrictions to be bypassed easily.


Goal

Create a self-hosted runner to get access to a GPU on a machine/cluster for training. Update the dvc.lock file remotely and push the data to S3 from the Kubernetes cluster (with an automatic commit to the Git repository).

Note: Kubernetes does not support running Docker directly inside a container due to security and architectural reasons (container in container).

GPU access from a container-in-container setup might also be problematic (though it might not apply here).

Option 1 (Starting point): 1 GH runner on GitHub

This runs on a simple VM. The runner (see the workflow sketch below):

  • gets data from S3 (dvc pull)
  • checks that the experiment is reproducible / "trains" the ML model with an up-to-date dvc.lock and dvc push (done locally)
  • containerizes the model with bentoml
  • logs in with docker and pushes the image to the artifact registry
  • deploys the image on k8s

Cons:

  • training the model on GitHub-hosted infrastructure is not practical (slow, timeouts for long jobs, limited Actions minutes, data ends up on GitHub)
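
A minimal sketch of what this single-runner workflow could look like, assuming a GitHub-hosted runner; repository layout, secret names, registry URL and image tags below are placeholders rather than the guide's actual configuration:

```yaml
# Hypothetical workflow for Option 1: everything runs on one GitHub-hosted runner.
name: train-and-deploy
on: push

jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dependencies (DVC, BentoML, ...)
        run: pip install -r requirements.txt

      - name: Pull data from S3
        run: dvc pull
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Reproduce the experiment and push the results
        run: |
          dvc repro   # "trains" the model and updates dvc.lock
          dvc push

      - name: Containerize the model with BentoML
        run: bentoml containerize my_model_bento:latest -t registry.example.com/my-model:latest

      - name: Log in to the registry and push the image
        run: |
          echo "${{ secrets.REGISTRY_PASSWORD }}" | docker login registry.example.com -u "${{ secrets.REGISTRY_USER }}" --password-stdin
          docker push registry.example.com/my-model:latest

      - name: Deploy on Kubernetes
        # kubectl installation and cluster credentials are omitted here
        run: kubectl apply -f kubernetes/deployment.yaml
```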

Option 2: 1 GH runner on k8s (GCP)

Pros: one self-hosted runner only, no runner on GitHub; the data/model does not end up on GitHub.

Cons:

  • docker login and docker push require a nested container, which breaks: Kubernetes does not support running Docker directly inside a container for security and architectural reasons. Containerization on Kubernetes is therefore not directly possible and requires a workaround (see the sub-options below).
  • connectivity/firewall issue (see the dedicated section below).

Option 2a: Use GCP custom solution

Solution specifically developed to solve this issue.

Cons: vendor lock-in, not applicable to on-premise deployments, data privacy concerns.

Option 2b: mount docker socket

  • mount the Docker socket (access the host's Docker daemon, install the Docker CLI on the runner): works for now but will stop working in future versions of Kubernetes.

Cons: Docker runs outside the pod, so containers can keep running indefinitely even after the pod is destroyed; bad practice unless one has total ownership of the machine, bad in terms of security, and a deprecated approach that will stop working soon.
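
For illustration, a runner pod mounting the node's Docker socket could look like the following minimal sketch; the runner image name is a placeholder, and this only works if the node actually runs a Docker daemon:

```yaml
# Hypothetical runner pod for Option 2b: the Docker CLI inside the pod talks
# to the node's Docker daemon through the mounted socket.
apiVersion: v1
kind: Pod
metadata:
  name: gh-runner-docker-socket
spec:
  containers:
    - name: runner
      image: my-registry/github-runner-with-docker-cli:latest   # placeholder image
      volumeMounts:
        - name: docker-sock
          mountPath: /var/run/docker.sock
  volumes:
    - name: docker-sock
      hostPath:
        path: /var/run/docker.sock
        type: Socket
```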

Option 2c: using docker in docker

Mounting the socket is obsolete, but Docker-in-Docker (DinD) should now work.
See https://hub.docker.com/_/docker
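
A minimal sketch of the DinD approach (the runner image name is a placeholder): the runner talks to a privileged docker:dind sidecar over TCP instead of touching the node's daemon.

```yaml
# Hypothetical runner pod for Option 2c: Docker runs in a privileged sidecar,
# entirely inside the pod, so nothing outlives the pod.
apiVersion: v1
kind: Pod
metadata:
  name: gh-runner-dind
spec:
  containers:
    - name: runner
      image: my-registry/github-runner:latest   # placeholder image
      env:
        - name: DOCKER_HOST
          value: tcp://localhost:2375
    - name: dind
      image: docker:dind
      securityContext:
        privileged: true                # required by docker:dind
      env:
        - name: DOCKER_TLS_CERTDIR      # empty value disables TLS so the
          value: ""                     # daemon listens on plain port 2375
```

The need for a privileged sidecar is the main security trade-off of this option.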

Option 3: set up GH runner on (GCP) VM + k8s

  • the runner runs on a VM with Docker installed, so containerization is possible

Cons:

  • a VM is needed in addition to the Kubernetes cluster

Option 4: 2 GH runners (1 on GitHub, 1 on k8s)

  • split the steps depending on whether Docker is needed (see the workflow sketch below)
  • use gcloud/kubectl on the GCP Kubernetes cluster

Cons: data privacy issue: the data is containerized on GitHub's servers.
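
A minimal two-job sketch of this split, with assumed runner labels, secret handling and image names: the training job runs on the self-hosted Kubernetes runner, while the Docker-related steps stay on a GitHub-hosted runner.

```yaml
# Hypothetical workflow for Option 4: training on the self-hosted runner,
# containerization and deployment on a GitHub-hosted runner.
name: train-then-deploy
on: push

jobs:
  train:
    runs-on: [self-hosted, gpu]   # runner pod on the Kubernetes cluster
    steps:
      - uses: actions/checkout@v4
      - run: dvc pull             # data goes to the cluster, not to GitHub
      - run: dvc repro            # trains on the cluster's GPU, updates dvc.lock
      - run: dvc push             # pushes data/model straight to S3
      - name: Autocommit the updated dvc.lock
        run: git commit -am "Update dvc.lock" && git push

  containerize-and-deploy:
    needs: train
    runs-on: ubuntu-latest        # GitHub-hosted runner, Docker available
    steps:
      - uses: actions/checkout@v4
      # model retrieval, bento build, registry login and kubectl setup omitted
      - run: bentoml containerize my_model_bento:latest -t registry.example.com/my-model:latest
      - run: docker push registry.example.com/my-model:latest
      - run: kubectl apply -f kubernetes/deployment.yaml
```

The trade-off stated above remains: the artefacts needed for containerization still pass through GitHub's infrastructure.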

Option 5 (as 2e): k8s + KubeVirt

  • abstraction of pods using VMs

Pros:

  • best of both worlds

Cons:

  • additional abstraction layers, complexity

Option 6 (as 2f): k8s + Kaniko

Kaniko runs as a pod and builds Docker images without requiring a Docker daemon.
See https://devopscube.com/build-docker-image-kubernetes-pod/

Cons:

  • additional abstraction layers, complexity
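
A minimal Kaniko sketch, assuming a Git build context and a registry credentials secret named regcred; all names and URLs are placeholders:

```yaml
# Hypothetical Kaniko pod: builds the image inside the cluster without any
# Docker daemon and pushes it to the registry.
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
spec:
  restartPolicy: Never
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --context=git://github.com/example/my-repo.git
        - --dockerfile=Dockerfile
        - --destination=registry.example.com/my-model:latest
      volumeMounts:
        - name: registry-credentials
          mountPath: /kaniko/.docker
  volumes:
    - name: registry-credentials
      secret:
        secretName: regcred
        items:
          - key: .dockerconfigjson
            path: config.json
```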

Connection / Firewall issue

Reaching k8s from GH action:

  • requires a public IP
  • can use polling, by running an instance on the self-hosted runner which "listens" to GitHub (see the deployment sketch below). This is the approach currently selected.
  • could instead make use of a VPN layer (WireGuard, see the GitHub docs) to avoid the listening pod.
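
For reference, the listening/polling runner could be deployed roughly as follows; this assumes the official actions-runner image layout, and the repository URL, labels and token secret are placeholders. Because the runner only opens outbound HTTPS connections to GitHub, no public IP or inbound firewall rule is needed.

```yaml
# Hypothetical Deployment of the polling self-hosted runner inside the cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gh-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gh-runner
  template:
    metadata:
      labels:
        app: gh-runner
    spec:
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          command: ["/bin/bash", "-c"]
          args:
            - >
              ./config.sh --unattended
              --url https://github.com/example/my-repo
              --token "$RUNNER_TOKEN"
              --labels self-hosted,gpu
              && ./run.sh
          env:
            - name: RUNNER_TOKEN          # short-lived registration token
              valueFrom:
                secretKeyRef:
                  name: gh-runner-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 1           # expose a GPU for training jobs
```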
