Persist /godata at static agent restart #111
While I personally have limited capacity, I'm generally open to doing something to improve your situation, although it'd need to stay backward compatible and thus retain the Deployment-by-default, similar to how other charts allow you to change the type from Deployment to StatefulSet. I'm not sure StatefulSet is quite the right abstraction though? Rollouts would be very slow due to the implicit ordered stop/start, whereas agents are independent (other than their identity), and you'd presumably still have coupling with the server configuration of which agent offers which resources and environments? I haven't really thought things through, but would it be an option to have the chart manage 'n' agent deployments itself? This is definitely going down the road of all the complexity the elastic agent profiles allow you to manage though, so I'm not sure if it's wise, and perhaps it's better for folks to just deploy 'n' instances of the existing Helm chart with different values overrides.
Sure, thank you for your consideration.
Rollouts will be fast; I'd use parallel mode.
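To illustrate what I mean by parallel mode, here is a minimal raw StatefulSet fragment (a sketch, not chart values; the name and image are placeholders, and note this setting governs how pods are created, deleted, and recreated rather than rolling-update ordering):

# Sketch (raw Kubernetes, not chart values): parallel pod management makes the
# StatefulSet create/delete pods concurrently instead of strictly in ordinal order.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gocd-agent              # illustrative name
spec:
  podManagementPolicy: Parallel # default is OrderedReady
  serviceName: gocd-agent
  replicas: 10
  selector:
    matchLabels:
      app: gocd-agent
  template:
    metadata:
      labels:
        app: gocd-agent
    spec:
      containers:
        - name: gocd-agent
          image: CUSTOM_IMAGE   # placeholder, as in the values further below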
I guess the underlying point is that the current non-persistent static agents are contrary to the way GoCD treats static agents.
Most cases are indeed covered that way; I don't know if I'm just a particularly big user or if I'm using GoCD wrong (keep in mind my agents used to be on physical instances, so Elastic was not an option until now). If none of this looks appealing to you, I'll reflect a bit on Elastic vs. forking the chart; nevertheless, I thank you for your time.
Ahh, nice, thanks! That didn't exist when I was last deep in Kubernetes (a while ago now); it seems it has since become enabled by default.
So how would the chart vary other aspects of the configuration for a StatefulSet for your 100 agents? In most people's cases, when managing so many different agents while seeking to vary the GoCD resources/environments by groups of those agents, they'd typically also need to vary things such as the specific agent container image used, or other aspects of the agent configuration: memory, CPU, space available. Basically, I am looking to be convinced that this isn't just an edge case. :) Or perhaps you would also intend to deploy the chart multiple times?
Yes, this is somewhat true. Up until Kubernetes…
No, it's not good, and I've had it on my personal list to add automatic cleanup of LostContact agents for years now, just no time/energy to get there as no-one else had complained about it :-) Indeed, there was a major performance problem on the dashboard caused by combinations of large numbers of agents linked to individual environments etc., which I fixed relatively recently.
Yes, agreed. FWIW, almost everything I know about the design philosophy I have inferred, as I was not part of the original dev team that worked on GoCD for Thoughtworks. So you're possibly as well trained as I!
It's a fair criticism that elastic agents have some problems with their inability to easily cache anything (e.g. build tools dynamically fetched for a given job with deterministic versions, large Git/repository clones, certain build tools' own caches, e.g. Docker or Gradle), and so with GoCD's lack of a first-class "cache" notion, falling back to static agents can seem attractive. Which Kubernetes variant are you deploying into, and which persistent storage variants do you have available to you?
Just curious, why do you think this would need more of a refactoring than what is required to do it with static agents? It feels like it's also quite complex to do it this way.
Yes, it's as you said. I value availability more than fast rollouts, and I especially don't want cases where a dying agent gets a job and no one finishes it.
To get into more details (I wanted to avoid making you feel like a consultant), I have ~10 clusters, for different clients and dev/test/prod, plus a special cluster for tooling at the center of the star. The clusters already provide all the variance I need for things like pod resources, and I can use a unified image (see below for the PVC size). So you can see it like deploying this chart 11 times: once with a server, the other 10 with 10 agents each.
I don't mind the ordering much; mostly I choose STS vs. Deployment based on the need for "personalized resources", which the StatefulSet's per-pod volume claims give me.
Adding persistence is for sure a lot easier than changing the agent model.
This project did a lot for me and my team; I'm sad to see it slowly fade.
Indeed, caching is the other reason I want to persist /godata.
Vanilla Kubernetes, 1.28 at the moment; going to 1.31 is planned shortly. My storage of choice for normal workloads is a big NFS server paired to a cluster (10x of these setups), accessed via PVCs through https://github.com/kubernetes-csi/csi-driver-nfs .
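For context, the nfs-csi class referenced in my values below is roughly this kind of StorageClass (a sketch; SERVER and the share path are placeholders for the per-cluster NFS compute instance):

# Sketch of the nfs-csi StorageClass backing the PVCs below.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: SERVER        # placeholder: NFS server address for this cluster
  share: /SHARE         # placeholder: exported path on the server
reclaimPolicy: Delete
volumeBindingMode: Immediate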
See above; it'd be a 10x10.
Sorry for the delayed reply, was a bit distracted getting security fixes out. 😬
I'm not sure that cases where a dying agent gets a job that no one finishes will be any more or less likely with a StatefulSet than with a Deployment. I'll probably need to play around with the parallel mode to see how it applies in any case.
Well, asking consulting-style questions is probably an occupational hazard in open source as well :-) if you're trying to figure out how widely reusable someone's idea is and considering upstreaming it. Hah. But thanks, that helps me understand a bit more how this might help you. I'll probably have to ponder a bit more to see whether this could be more widely useful. :-/
I think that's what I was getting at re: why not bother with elastic agents. For the raw mechanics of config, I'm not convinced that needing to remodel everything as 10 Helm chart deployments with different values files and such is easier than modelling it as 10 cluster profiles. However, if it comes down to trying to find a way to address file-system caching or performance on agents, that's probably fair.
Sorry, I meant AKS vs EKS vs GKE etc. But based on "big NFS server" it sounds like you are rolling your own on-prem. So is it vanilla Kubernetes? Or some other distro like OpenShift?
Yeah, OK. I've never really tried it; however, I believe there might be some creative ways to use EFS/NFS with elastic agents to give builds re-usable storage with some creative pod-spec templating. Maybe not worth going into here though, as it's not validated. Anyway, to summarise, we can probably give this a go. Let's see how it behaves?
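As a purely unvalidated sketch of that pod-spec templating idea (name, image, server, and paths are all placeholders), the elastic agent pod could mount a shared NFS export as a reusable cache:

# Unvalidated sketch: give elastic agent pods a shared NFS-backed cache volume.
apiVersion: v1
kind: Pod
metadata:
  name: gocd-elastic-agent     # placeholder; the plugin templates the real name
spec:
  containers:
    - name: gocd-agent
      image: CUSTOM_IMAGE
      volumeMounts:
        - name: build-cache
          mountPath: /cache    # placeholder path for tool/repo caches
  volumes:
    - name: build-cache
      nfs:
        server: SERVER         # placeholder NFS server
        path: /exports/gocd-cache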
Thanks, I'll get a PR rolling soon-ish and then you can give feedback on how the new customization points look.
I got back from holiday and opened a PR as discussed in #113. To answer your questions on my setup:
I'm mainly on Oracle Cloud, so I use OKE, but I also have envs on AKS@AWS, with vanilla Kubernetes 1.28 on top of them. I try to keep the Kubernetes deployments completely cloud-agnostic; the cloud providers have fancy distributed FS available as storage classes, but they didn't fit my needs.
So I've made my own NFS server (one per cluster) out of a normal compute instance, and I interact with it either via the in-tree NFS driver (to mount existing folders) or via https://github.com/kubernetes-csi/csi-driver-nfs to get an experience very similar to dynamic block storage provisioning in the cloud. Assuming #113 is merged, my deployments are as follows; the server deployment is base + server, the agent is base + agent:
# base (shared values)
rbac:
  create: false
serviceAccount:
  create: false
server:
  enabled: false
  shouldPreconfigure: false
  env:
    extraEnvVars:
      - name: GOCD_PLUGIN_INSTALL_vault-secret-plugin
        value: https://github.com/gocd/gocd-vault-secret-plugin/releases/download/v1.3.0-355/gocd-vault-secret-plugin-1.3.0-355.jar
  service:
    type: "ClusterIP"
  ingress:
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "0" # config import can be massive
  persistence:
    storageClass: nfs-csi
agent:
  controller:
    kind: "StatefulSet"
  image:
    repository: "CUSTOM_IMAGE"
  env:
    goServerUrl: "https://CUSTOM_URL/go"
    agentAutoRegisterKey: "CUSTOM_KEY"
    extraEnvVars:
      - name: ORDINAL_NUMBER
        valueFrom:
          fieldRef:
            fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
  persistence:
    claimTemplates:
      - metadata:
          name: godata
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
          storageClassName: nfs-csi
    extraVolumeMounts:
      - name: godata
        mountPath: /godata
  healthCheck:
    enabled: true

# server overlay (base + server)
server:
  enabled: true
  ingress:
    hosts:
      - CUSTOM_URL
    tls:
      - hosts:
          - CUSTOM_URL
agent:
  enabled: false

# agent overlay (base + agent)
agent:
  replicaCount: CUSTOM_NUMBER
  env:
    agentAutoRegisterHostname: "CUSTOM_NAME-$(ORDINAL_NUMBER)"
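For clarity, this is roughly the StatefulSet shape I expect the agent values above to render to, assuming #113 wires claimTemplates and extraVolumeMounts straight through (a sketch with illustrative names, not the chart's actual template output):

# Rough sketch of the expected rendered output: one PVC per agent pod backing /godata.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gocd-agent             # illustrative name
spec:
  serviceName: gocd-agent
  replicas: 10                 # CUSTOM_NUMBER
  selector:
    matchLabels:
      app: gocd-agent
  template:
    metadata:
      labels:
        app: gocd-agent
    spec:
      containers:
        - name: gocd-agent
          image: CUSTOM_IMAGE
          env:
            - name: ORDINAL_NUMBER
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
          volumeMounts:
            - name: godata
              mountPath: /godata   # agent identity and working data survive restarts
  volumeClaimTemplates:
    - metadata:
        name: godata
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: nfs-csi
        resources:
          requests:
            storage: 1Gi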
I am migrating a large preexisting configuration (1 server, 200+ environments, 100+ agents) into a handful of Kubernetes clusters.
To ease the migration burden I'd like to start with static agents, replicating my setup above, because moving to Elastic would necessitate a big refactoring.
This chart has been of great help, but now I've hit an issue: whenever I restart an agent container the content of /godata/config is regenerated, therefore the agent is seen as "new" and I lose the environment association. This cannot be readily solved by agentAutoRegisterEnvironments because I'd need to deploy dozens of different combinations.
Would you be open to a PR that lets the agents persist /godata across restarts? This would let each agent keep its identity after a restart, and also let me expand the storage space where repos are pulled, which right now is forced onto the Kubernetes node.
I'm a big user of this project and I can effectively contribute (e.g. gocd/gocd-vault-secret-plugin#160), lemme know.