
Persist /godata at static agent restart #111

Open
12345ieee opened this issue Dec 16, 2024 · 7 comments · May be fixed by #113
Labels
enhancement New feature or request

Comments

@12345ieee

I am migrating a large preexisting configuration (1 server, 200+ environments, 100+ agents) into a handful of Kubernetes clusters.

To ease the migration burden I'd like to start with static agents, replicating my setup above, because moving to Elastic would necessitate a big refactoring.

This chart has been a great help, but I've now hit an issue: whenever I restart an agent container, the content of /godata/config is regenerated, so the agent is seen as "new" and I lose the environment association.
This can't readily be solved with agentAutoRegisterEnvironments, because I'd need to deploy dozens of different combinations.

Would you be open to a PR that:

  • turns the static agent Deployment into a StatefulSet
  • adds an optional PVC template to it, mounted under /godata

This would let each agent keep its identity across restarts, and would also let me expand the storage space where repos are pulled, which right now is confined to the Kubernetes node's disk.
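
Roughly, the rendered agent workload would then look something like this (just a sketch of the shape; the image, replica count and sizes are placeholders, not the chart's actual templates):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gocd-agent
spec:
  serviceName: gocd-agent
  replicas: 10                          # placeholder
  selector:
    matchLabels:
      app: gocd-agent
  template:
    metadata:
      labels:
        app: gocd-agent
    spec:
      containers:
        - name: gocd-agent
          image: CUSTOM_AGENT_IMAGE     # placeholder
          volumeMounts:
            - name: godata
              mountPath: /godata        # guid.txt and the agent working dirs live here
  volumeClaimTemplates:                 # one PVC per pod, re-attached across restarts
    - metadata:
        name: godata
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi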

I'm a heavy user of this project and I can contribute effectively (e.g. gocd/gocd-vault-secret-plugin#160 ), let me know.

@chadlwilson
Member

While I personally have limited capacity, I'm generally open to doing something to improve your situation, although it'd need to stay backward compatible and thus retain the Deployment by default, similar to how other charts allow you to change the type from Deployment to StatefulSet.

I'm not sure StatefulSet is quite the right abstraction though? Rollouts would be very slow due to the implicit ordered stop/start, whereas agents are independent (other than their identity), and you'd presumably still have coupling with the server configuration of which agent offers which resources and environments?

I haven't really thought things through, but would it be an option to have the chart manage 'n' Deployments using the base agent values and then allow overrides in agent-group[n] with different numbers of replicas? This would allow other setups with different sets of auto-registered envs/resources, or different images, for folks who don't need the persistence, but also allow you to create single-replica deployments with persistence.
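
Something like this, purely as an illustration (the agentGroups key and the per-group persistence toggle are hypothetical, not existing chart values):

agent:                                  # base values shared by every group
  image:
    repository: CUSTOM_AGENT_IMAGE      # placeholder
  env:
    goServerUrl: "https://CUSTOM_URL/go"

agentGroups:                            # hypothetical key, just to sketch the idea
  - name: build
    replicaCount: 5
    env:
      agentAutoRegisterEnvironments: "build"
  - name: deploy
    replicaCount: 1
    persistence:
      enabled: true                     # the one group that wants to keep /godata
    env:
      agentAutoRegisterEnvironments: "prod-deploy"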

This is definitely going down the road of all the complexity the elastic agent profiles allow you to manage though, so I'm not sure if it's wise, and perhaps it's better for folks to just deploy 'n' instances of the existing helm chart with different values overrides.

@12345ieee
Author

12345ieee commented Dec 17, 2024

While I personally have limited capacity, I'm generally open to doing something to improve your situation, although it'd need to stay backward compatible and thus retain the Deployment by default, similar to how other charts allow you to change the type from Deployment to StatefulSet.

Sure, thank you for your consideration.
To give you a sense of scale, I have around 1200 pipelines, and each environment acts as a shared env container for 2-10 pipelines.

I'm not sure StatefulSet is quite the right abstraction though? Rollouts would be very slow due to the implicit ordered stop/start, whereas agents are independent (other than their identity), and you'd presumably still have coupling with the server configuration of which agent offers which resources and environments?

Rollouts will be fast, I'd use parallel mode.
StatefulSet is the neatest way to give each agent its own scratch space, not just for persistence, but also to allow me to download GBs of materials in a pipeline (which goes into godata as well) without running out of node space.
The coupling is indeed real, in my case it's a feature, because I have the delivery manager only interact with the GoCD UI, but I do understand your concern.
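
(For reference, "parallel mode" is the podManagementPolicy field on the StatefulSet spec; a fragment of what that looks like:)

apiVersion: apps/v1
kind: StatefulSet
spec:
  podManagementPolicy: Parallel         # create/delete pods in parallel when scaling; rolling updates still follow the update strategy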

I haven't really thought things through, but would it be an option to have the chart manage 'n' Deployments using the base agent values and then allow overrides in agent-group[n] with different numbers of replicas? This would allow other setups with different sets of auto-registered envs/resources, or different images, for folks who don't need the persistence, but also allow you to create single-replica deployments with persistence.

I guess the underlying point is that the current non-persistent static agents are contrary to the way GoCD treats static agents.
Even if I manage to get the env composition right, the AGENTS panel in the UI will rapidly become a graveyard of hundreds or thousands of agents in LostContact status, which can't really be good for performance.
I don't presume to know the design philosophy of GoCD, you surely know it better, but it seems to me static agents go to great lengths to remember who they are, and strongly assume that they do; otherwise the whole guid.txt mechanism would not exist at all.

This is definitely going down the road of all the complexity the elastic agent profiles allow you to manage though, so I'm not sure if it's wise, and perhaps it's better for folks to just deploy 'n' instances of the existing helm chart with different values overrides.

Most cases are indeed covered that way, I don't know if I'm just a particularly big user or if I'm using GoCD wrong (keep in mind agents used to be on physical instances, Elastic was not an option till now).

If none of this looks appealing to you, I'll reflect a bit on Elastic vs forking the chart, nevertheless I thank you for your time.

@chadlwilson
Member

chadlwilson commented Dec 17, 2024

Rollouts will be fast, I'd use parallel mode.

Ahh, nice, thanks! That didn't exist when I was last deep in Kubernetes (a while ago now; it seems it became enabled by default in 1.26?) - so that might address one of my reservations about the abstraction here. Although it does say "This option only affects the behavior for scaling operations. Updates are not affected.", which implies that updates to the agent pod spec will still cause slow sequential rollouts? Never used it myself.

StatefulSet is the neatest way to give each agent its own scratch space, not just for persistence, but also to allow me to download GBs of materials in a pipeline (which goes into godata as well) without running out of node space. The coupling is indeed real, in my case it's a feature, because I have the delivery manager only interact with the GoCD UI, but I do understand your concern.

So how would the chart vary other aspects of the configuration for a StatefulSet for your 100 agents? In most people's cases, managing so many different agents while seeking to vary the GoCD resources/environments by groups of those agents, they'd typically also need to vary things such as the specific agent container image used, or other aspects of the agent configuration - memory, CPU, space available. A StatefulSet would not allow that, would it, since the PodSpecs must be identical?

Basically, I am looking to be convinced that this isn't just an edge case. :)

Or perhaps you would also intend to deploy n Helm chart releases with this approach with different StatefulSet groups for different configurations?

I guess the underlying point is that the current non-persistent static agents are contrary to the way GoCD treats static agents.

Yes, this is somewhat true. Up until Kubernetes 1.26 a StatefulSet definitely wasn't an appropriate abstraction for this either though due to the ordering/sequence requirement, which probably explains why it was this way using Deployments earlier.

Even if I manage to get the env composition right, the AGENTS panel in the UI will rapidly become a graveyard of hundreds or thousands of agents in LostContact status, which can't really be good for performance.

No, it's not good, and adding automatic cleanup of LostContact agents has been on my personal list for years now, just no time/energy to get there as no-one else had complained about it :-) Indeed, there was a major performance problem on the dashboard caused by combinations of large numbers of agents linked to individual environments etc., which I fixed relatively recently.

I don't presume to know the design philosophy of GoCD, you surely know it better, but it seems to me static agents go to great lengths to remember who they are, and strongly assume that they do; otherwise the whole guid.txt mechanism would not exist at all.

Yes, agreed. FWIW, almost everything I know about the design philosophy I have inferred, as I was not part of the original dev team that worked on GoCD for Thoughtworks. So you're possibly as well trained as I!

This is definitely going down the road of all the complexity the elastic agent profiles allow you to manage though, so I'm not sure if it's wise, and perhaps it's better for folks to just deploy 'n' instances of the existing helm chart with different values overrides.

Most cases are indeed covered that way, I don't know if I'm just a particularly big user or if I'm using GoCD wrong (keep in mind agents used to be on physical instances, Elastic was not an option till now).

It's a fair criticism that elastic agents have some problems in their inability to easily cache anything (e.g. build tools dynamically fetched for a given job with deterministic versions, large git/repository clones, certain build tools' own caches, e.g. Docker or Gradle), and so, given GoCD's lack of a first-class "cache" notion, falling back to static agents can seem attractive.

Which Kubernetes variant are you deploying into, and which persistent storage variants do you have available to you?

moving to Elastic would necessitate a big refactoring.

Just curious, why do you think this would need more of a refactoring than what is required to do it with static agents? It feels like it's also quite complex to do it this StatefulSet way unless you really can have a single pod configuration handle all 100 static agents.

@chadlwilson chadlwilson added the enhancement New feature or request label Dec 17, 2024
@12345ieee
Author

Rollouts will be fast, I'd use parallel mode.

Ahh, nice, thanks! That didn't exist when I was last deep in Kubernetes (a while ago now; it seems it became enabled by default in 1.26?) - so that might address one of my reservations about the abstraction here. Although it does say "This option only affects the behavior for scaling operations. Updates are not affected.", which implies that updates to the agent pod spec will still cause slow sequential rollouts? Never used it myself.

Yes, it's as you said. I value availability more than fast rollouts, and I especially don't want cases where a dying agent gets a job and no one finishes it.

StatefulSet is the neatest way to give each agent its own scratch space, not just for persistence, but also to allow me to download GBs of materials in a pipeline (which goes into godata as well) without running out of node space. The coupling is indeed real, in my case it's a feature, because I have the delivery manager only interact with the GoCD UI, but I do understand your concern.

So how would the chart vary other aspects of the configuration for a StatefulSet for your 100 agents? In most people's cases, managing so many different agents while seeking to vary the GoCD resources/environments by groups of those agents, they'd typically also need to vary things such as the specific agent container image used, or other aspects of the agent configuration - memory, CPU, space available. A StatefulSet would not allow that, would it, since the PodSpecs must be identical?

Basically, I am looking to be convinced that this isn't just an edge case. :)

Or perhaps you would also intend to deploy n Helm chart releases with this approach with different StatefulSet groups for different configurations?

To get into more details (I wanted to avoid making you feel like a consultant), I have ~10 clusters, for different clients and dev/test/prod and a special cluster for tooling at the center of the star.
Which means I already have 10 deployments of the helm chart to play with, one of which will only have a server.

The clusters provide all the variance I need already for things like pod resources, I can use a unified image, see below for the pvc size.

So you can see it as deploying this chart 11 times: once with only a server, and the other 10 times with 10 agents each.

I guess the underlying point is that the current non-persistent static agents are contrary to the way GoCD treats static agents.

Yes, this is somewhat true. Up until Kubernetes 1.26 a StatefulSet definitely wasn't an appropriate abstraction for this either though due to the ordering/sequence requirement, which probably explains why it was this way using Deployments earlier.

I don't mind the ordering much; mostly I choose STS vs Deployment based on the need for "personalized resources", which guid.txt qualifies as.
I don't fancy myself as a Kubernetes guru though.

Even if I manage to get the env composition right, the AGENTS panel in the UI will rapidly become a graveyard of hundreds or thousands of agents in LostContact status, which can't really be good for performance.

No, it's not good, and adding automatic cleanup of LostContact agents has been on my personal list for years now, just no time/energy to get there as no-one else had complained about it :-) Indeed, there was a major performance problem on the dashboard caused by combinations of large numbers of agents linked to individual environments etc., which I fixed relatively recently.

Adding persistence is for sure a lot easier than changing the agent model.

I don't presume to know the design philosophy of GoCD, you surely know it better, but it seems to me static agents go to great lengths to remember who they are, and strongly assume that they do; otherwise the whole guid.txt mechanism would not exist at all.

Yes, agreed. FWIW, almost everything I know about the design philosophy I have inferred, as I was not part of the original dev team that worked on GoCD for Thoughtworks. So you're possibly as well trained as I!

This project did a lot for me and my team, I'm sad to see it slowly fade.

This is definitely going down the road of all the complexity the elastic agent profiles allow you to manage though, so I'm not sure if it's wise, and perhaps it's better for folks to just deploy 'n' instances of the existing helm chart with different values overrides.

Most cases are indeed covered that way, I don't know if I'm just a particularly big user or if I'm using GoCD wrong (keep in mind agents used to be on physical instances, Elastic was not an option till now).

It's a fair criticism that elastic agents have some problems in their inability to easily cache anything (e.g. build tools dynamically fetched for a given job with deterministic versions, large git/repository clones, certain build tools' own caches, e.g. Docker or Gradle), and so, given GoCD's lack of a first-class "cache" notion, falling back to static agents can seem attractive.

Indeed, caching is the other reason I want to persist /godata by agent, because a build from scratch could easily munch several GBs of downloads and as many build intermediates and it's not great to redo them all each time.

Which Kubernetes variant are you deploying into, and which persistent storage variants do you have available to you?

Vanilla Kubernetes, 1.28 at the moment; a move to 1.31 is planned shortly.

My storage of choice for normal workloads is a big NFS server paired to a cluster (10x of these setups), accessed via PVCs through https://github.com/kubernetes-csi/csi-driver-nfs .
Which means I have effectively infinite, scalable, zero-waste storage; otherwise I'd hit the pitfall of having to size the PVC for the biggest agent and waste a ton of space.
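
(For reference, a typical StorageClass for that driver, with placeholder server/share values:)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi                         # matches the storageClassName referenced in the values further down
provisioner: nfs.csi.k8s.io
parameters:
  server: NFS_SERVER_ADDRESS            # placeholder
  share: /exported/path                 # placeholder
reclaimPolicy: Delete
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.1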

moving to Elastic would necessitate a big refactoring.

Just curious, why do you think this would need more of a refactoring than what is required to do it with static agents? It feels like it's also quite complex to do it this StatefulSet way unless you really can have a single pod configuration handle all 100 static agents.

See above, it'd be a 10x10.
The refactoring would mostly go into getting my delivery manager to either manage envs from helm values, or refactor environments until I just have 10 or so, both things I really don't want to get into.

@chadlwilson
Member

Sorry for the delayed reply, was a bit distracted getting security fixes out. 😬

Rollouts will be fast, I'd use parallel mode.

Ahh, nice, thanks! That didn't exist when I was last deep in Kubernetes (a while ago now; it seems it became enabled by default in 1.26?) - so that might address one of my reservations about the abstraction here. Although it does say "This option only affects the behavior for scaling operations. Updates are not affected.", which implies that updates to the agent pod spec will still cause slow sequential rollouts? Never used it myself.

Yes, it's as you said. I value availability more than fast rollouts, and I especially don't want cases where a dying agent gets a job and no one finishes it.

I'm not sure that cases where a dying agent gets a job that no-one finishes will be any more or less likely with StatefulSets. In practice I don't believe GoCD agents shut down sufficiently cleanly that the server will deterministically know the agent is dead and re-schedule the job; and if they do, I don't really see how a StatefulSet will be any better than a Deployment - but I guess your wider preference holds.

I'll probably need to play around with the parallel mode to see how it applies in any case.

To get into more details (I wanted to avoid making you feel like a consultant), I have ~10 clusters, for different clients and dev/test/prod and a special cluster for tooling at the center of the star. Which means I already have 10 deployments of the helm chart to play with, one of which will only have a server.

The clusters provide all the variance I need already for things like pod resources, I can use a unified image, see below for the pvc size.

So you can see it as deploying this chart 11 times: once with only a server, and the other 10 times with 10 agents each.

Well, asking consulting-style questions is probably an occupational hazard in open source as well :-) - when you're trying to figure out how widely reusable someone's idea is and whether it's worth upstreaming. Hah.

But thanks, that helps me understand a bit more how this might help you. I'll probably have to ponder a bit more to see whether this could be more widely useful. :-/

Adding persistence is for sure a lot easier than changing the agent model.

Indeed, caching is the other reason I want to persist /godata by agent, because a build from scratch could easily munch several GBs of downloads and as many build intermediates and it's not great to redo them all each time.

I think that's what I was getting at re: why not bother with elastic agents. For the raw mechanics of config, I'm not convinced that needing to remodel everything as 10 Helm chart deployments with different values files and such is easier than modelling it as 10 cluster profiles. However, if it gets to trying to find a way to address file system caching or performance on agents, that's probably fair.

Which Kubernetes variant are you deploying into, and which persistent storage variants do you have available to you?

Vanilla Kubernetes, 1.28 at the moment; a move to 1.31 is planned shortly.

Sorry, I meant AKS vs EKS vs GKE etc. But based on "big NFS server" it sounds like you are rolling your own on-prem. So is it vanilla Kubernetes? Or some other distro like OpenShift?

My storage of choice for normal workloads is a big NFS server paired to a cluster (10x of these setups), accessed via PVCs through https://github.com/kubernetes-csi/csi-driver-nfs . Which means I have effectively infinite, scalable, zero-waste storage; otherwise I'd hit the pitfall of having to size the PVC for the biggest agent and waste a ton of space.

Yeah, OK. I've never really tried it; however, I believe there might be some creative ways to use EFS/NFS with elastic agents, via pod-spec templating, to give builds re-usable storage. Maybe not worth going into here though, as it's not validated.


Anyway, to summarise, we can probably give this a go. Let's see how it behaves?

@12345ieee
Author

Thanks, I'll get a PR rolling soon-ish and then you can give feedback on how the new customization points look.

@12345ieee
Author

12345ieee commented Jan 12, 2025

I got back from holiday and opened a PR as discussed: #113.

To answer your questions on my setup:

Sorry, I meant AKS vs EKS vs GKE etc. But based on "big NFS server" it sounds like you are rolling your own on-prem. So is it vanilla Kubernetes? Or some other distro like OpenShift?

I'm mainly on Oracle Cloud, so I use OKE, but I also have envs on AKS@AWS; I have vanilla Kubernetes 1.28 on top of them.

I try to keep the Kubernetes deployments completely cloud-agnostic; the cloud providers have fancy distributed filesystems available as storage classes, but they are:

  • completely vendor-locked
  • very pricy

So I've made my own NFS server (one per cluster) out of a normal compute instance, and I interact with it via either the in-tree NFS driver (to mount existing folders) or https://github.com/kubernetes-csi/csi-driver-nfs , to get an experience very similar to dynamic block storage provisioning in the cloud.

Assuming #113 is merged, my deployments are as follows (the server deployment is base+server, the agents are base+agent):

base.yaml:

rbac:
  create: false

serviceAccount:
  create: false

server:
  enabled: false
  shouldPreconfigure: false
  env:
    extraEnvVars:
    - name: GOCD_PLUGIN_INSTALL_vault-secret-plugin
      value: https://github.com/gocd/gocd-vault-secret-plugin/releases/download/v1.3.0-355/gocd-vault-secret-plugin-1.3.0-355.jar
  service:
    type: "ClusterIP"
  ingress:
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "0" # the imported config is massive
  persistence:
    storageClass: nfs-csi

agent:
  controller:
    kind: "StatefulSet"
  image:
    repository: "CUSTOM_IMAGE"
  env:
    goServerUrl: "https://CUSTOM_URL/go"
    agentAutoRegisterKey: "CUSTOM_KEY"
    extraEnvVars:
      - name: ORDINAL_NUMBER
        valueFrom:
          fieldRef:
            fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
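            # apps.kubernetes.io/pod-index carries the StatefulSet pod ordinal (available from Kubernetes 1.28)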
  persistence:
    claimTemplates:
      - metadata:
          name: godata
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
          storageClassName: nfs-csi
    extraVolumeMounts:
      - name: godata
        mountPath: /godata
  healthCheck:
    enabled: true

server.yaml (1 only):

server:
  enabled: true
  ingress:
    hosts:
    - CUSTOM_URL
    tls:
    - hosts:
      - CUSTOM_URL

agent:
  enabled: false

agent.yaml (one of these per cluster, around 10 at the moment):

agent:
  replicaCount: CUSTOM_NUMBER
  env:
    agentAutoRegisterHostname: "CUSTOM_NAME-$(ORDINAL_NUMBER)"
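
The intent is that pod N of each StatefulSet registers as CUSTOM_NAME-N and keeps its /godata PVC (and with it its guid.txt identity) across restarts.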
