Skip to content

Commit 7435675

Browse files
fix(agent): memory leaks and optimize resource usage with Datadog/Pyroscope integrations (#351)
* an attempt to fix memory leak * add pyroscope profiler support * use protobuf instead of application/json for api access and try to avoid retaining list items * fallback to json in case protobuf is not accepted * update pyroscope config * fix lint * use single content type for k8s config * improve retry watcher list and k8s config * try using watch with initial list without dedicated list * start with empty resource version to get synthetic list first * revert retry watcher changes * add some more grafana metrics and decrease watcher restart attempts * disable RetryListerWatcher and use raw RetryWatcher * revert parts of the objeststatusreporter logic and use original reporter for cli utils * `revert some changes and start from scratch` * `revert some more code` * `refactor: enhance RetryListerWatcher with stop logic and improve watcher handling while cleaning up unused metrics` * refactor: fix lint issues, clean up imports, and improve cache and reconciliation logic * fix: add buffer size to resultChan in RetryListerWatcher to prevent potential blocking * feat: add Datadog profiling integration to deployment-operator agent * chore(dependencies): update go module dependencies in go.sum * feat: update Datadog profiler to support custom agent address configuration * `feat(agent): add Datadog tracer initialization and refactor Datadog address to host configuration` * feat: add service name and environment configuration to Datadog tracer setup * `chore: update go module dependencies to the latest compatible versions` * refactor: update Datadog tracer and profiler initialization with improved address handling and unified configuration * fix(datadog): correct argument assignment for tracer setup in deployment-operator agent configuration * fix(datadog): correct argument assignment for tracer setup in deployment-operator agent configuration * feat: add support for configurable Datadog environment with new `datadog-env` flag * feat: integrate Kubernetes client tracing with DataDog and update dependencies * `refactor: move Kubernetes client trace setup closer to initialization for improved code clarity` * `fix: update cache expiration logic to refresh timestamps and retain entries with cleared specific fields` * `feat: add tracing to RetryListerWatcher for improved observability` * feat(retry-watcher): add context support, enhance synchronization, and improve cleanup handling * `ci: update golangci-lint-action to v7.0.0 and linter version to v2.3.4` * ci: downgrade golangci-lint-action to v2.1.2 in workflow configuration * fix(args): change default value of Datadog integration flag to false * `test: fix test initialization by moving Init call to BeforeAll setup` * fix(watcher): improve stop handling and add tracing in RetryListerWatcher * feat: add profiling period and CPU duration configuration to Datadog profiler setup * add prealloc linter * fix prealloc issues * add more linters * `fix: use thread-safe check for watcher started status` * test: move Init invocation to specific test cases in resource_cache_test * feat: add environment variable support for feature toggles in deployment operator flags * add more linters * refactor(agent): remove protobuf content type configuration from agent main setup * refactor: remove Datadog tracing spans from RetryListerWatcher implementation --------- Co-authored-by: Marcin Maciaszczyk <[email protected]>
1 parent 576bbbe commit 7435675

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

51 files changed

+662
-284
lines changed

.github/workflows/ci.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -40,9 +40,9 @@ jobs:
4040
with:
4141
go-version-file: go.mod
4242
check-latest: true
43-
- uses: golangci/golangci-lint-action@v6
43+
- uses: golangci/golangci-lint-action@v7.0.0
4444
with:
45-
version: v1.63.4
45+
version: v2.1.2
4646
build-image:
4747
name: Build image
4848
needs: [build, test]

.golangci.yml

Lines changed: 45 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -1,28 +1,45 @@
1-
run:
2-
modules-download-mode: readonly
3-
allow-parallel-runners: true
4-
timeout: 5m
5-
6-
linters:
7-
disable-all: true
8-
enable:
9-
# default linters
10-
- errcheck
11-
- gosimple
12-
- govet
13-
- ineffassign
14-
- staticcheck
15-
- typecheck
16-
- unused
17-
18-
# additional linters
19-
- errorlint
20-
- errname
21-
- gocyclo
22-
- goimports
23-
- misspell
24-
- gofmt
25-
- importas
26-
- goconst
27-
- gocritic
28-
- misspell
1+
version: "2"
2+
run:
3+
modules-download-mode: readonly
4+
allow-parallel-runners: true
5+
linters:
6+
default: none
7+
enable:
8+
- copyloopvar
9+
- errcheck
10+
- errname
11+
- errorlint
12+
- goconst
13+
- gocritic
14+
- gocyclo
15+
- govet
16+
- importas
17+
- ineffassign
18+
- misspell
19+
- prealloc
20+
- staticcheck
21+
- unused
22+
- usestdlibvars
23+
- wastedassign
24+
- whitespace
25+
exclusions:
26+
generated: lax
27+
presets:
28+
- comments
29+
- common-false-positives
30+
- legacy
31+
- std-error-handling
32+
paths:
33+
- third_party$
34+
- builtin$
35+
- examples$
36+
formatters:
37+
enable:
38+
- gofmt
39+
- goimports
40+
exclusions:
41+
generated: lax
42+
paths:
43+
- third_party$
44+
- builtin$
45+
- examples$

api/v1alpha1/pipelinegate_types.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -98,7 +98,7 @@ func (pgs *PipelineGateStatus) IsClosed() bool {
9898
}
9999

100100
func (pgs *PipelineGateStatus) HasJobRef() bool {
101-
return !(pgs.JobRef == nil || *pgs.JobRef == console.NamespacedName{})
101+
return pgs.JobRef != nil && *pgs.JobRef != console.NamespacedName{}
102102
}
103103

104104
func (pgs *PipelineGateStatus) GetConsoleGateState() *console.GateState {

cmd/agent/args/args.go

Lines changed: 38 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,12 @@ import (
1919
)
2020

2121
const (
22-
EnvDeployToken = "DEPLOY_TOKEN"
22+
EnvDeployToken = "DEPLOY_TOKEN"
23+
EnvDatadogEnabled = "DATADOG_ENABLED"
24+
EnvPyroscopeEnabled = "PYROSCOPE_ENABLED"
25+
EnvProfilerEnabled = "PROFILER_ENABLED"
26+
EnvResourceCacheEnabled = "RESOURCE_CACHE_ENABLED"
27+
EnvLocal = "LOCAL"
2328

2429
defaultProbeAddress = ":9001"
2530
defaultMetricsAddress = ":8000"
@@ -49,15 +54,21 @@ const (
4954

5055
defaultProfilerPath = "/debug/pprof/"
5156
defaultProfilerAddress = ":7777"
57+
58+
defaultPyroscopeAddress = "http://pyroscope.monitoring.svc.cluster.local:4040"
59+
defaultDatadogHost = "datadog-agent.monitoring.svc.cluster.local"
60+
defaultDatadogEnv = "plrl-dev-aws"
5261
)
5362

5463
var (
5564
argDisableHelmTemplateDryRunServer = flag.Bool("disable-helm-dry-run-server", false, "Disable helm template in dry-run=server mode.")
5665
argEnableHelmDependencyUpdate = flag.Bool("enable-helm-dependency-update", false, "Enable update Helm chart's dependencies.")
5766
argEnableLeaderElection = flag.Bool("leader-elect", false, "Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.")
58-
argLocal = flag.Bool("local", false, "Whether you're running the operator locally.")
59-
argProfiler = flag.Bool("profiler", false, "Enable pprof handler. By default it will be exposed on localhost:7777 under '/debug/pprof'")
60-
argDisableResourceCache = flag.Bool("disable-resource-cache", false, "Control whether resource cache should be enabled or not.")
67+
argLocal = flag.Bool("local", helpers.GetPluralEnvBool(EnvLocal, false), "Whether you're running the operator locally.")
68+
argProfiler = flag.Bool("profiler", helpers.GetPluralEnvBool(EnvProfilerEnabled, false), "Enable pprof handler. By default it will be exposed on localhost:7777 under '/debug/pprof'")
69+
argPyroscope = flag.Bool("pyroscope", helpers.GetPluralEnvBool(EnvPyroscopeEnabled, false), "Enable pyroscope integration for detailed application profiling. By default it will push to http://pyroscope.monitoring.svc.cluster.local:4040")
70+
argDatadog = flag.Bool("datadog", helpers.GetPluralEnvBool(EnvDatadogEnabled, false), "Enable datadog integration for detailed application profiling. By default it will push to http://datadog.monitoring.svc.cluster.local:8125")
71+
argDisableResourceCache = flag.Bool("disable-resource-cache", !helpers.GetPluralEnvBool(EnvResourceCacheEnabled, true), "Control whether resource cache should be enabled or not.")
6172
argEnableKubecostProxy = flag.Bool("enable-kubecost-proxy", false, "If set, will proxy a Kubecost API request through the K8s API server.")
6273

6374
argMaxConcurrentReconciles = flag.Int("max-concurrent-reconciles", 20, "Maximum number of concurrent reconciles which can be run.")
@@ -78,6 +89,9 @@ var (
7889
argControllerCacheTTL = flag.String("controller-cache-ttl", defaultControllerCacheTTL, "The time to live of console controller cache entries.")
7990
argRestoreNamespace = flag.String("restore-namespace", defaultRestoreNamespace, "The namespace where Velero restores are located.")
8091
argServices = flag.String("services", "", "A comma separated list of service ids to reconcile. Leave empty to reconcile all.")
92+
argPyroscopeAddress = flag.String("pyroscope-address", defaultPyroscopeAddress, "The address of the Pyroscope server.")
93+
argDatadogHost = flag.String("datadog-host", defaultDatadogHost, "The address of the Datadog server.")
94+
argDatadogEnv = flag.String("datadog-env", defaultDatadogEnv, "The environment of the Datadog server.")
8195

8296
serviceSet containers.Set[string]
8397
)
@@ -278,6 +292,26 @@ func ResourceCacheEnabled() bool {
278292
return !(*argDisableResourceCache)
279293
}
280294

295+
func PyroscopeEnabled() bool {
296+
return *argPyroscope
297+
}
298+
299+
func PyroscopeAddress() string {
300+
return *argPyroscopeAddress
301+
}
302+
303+
func DatadogEnabled() bool {
304+
return *argDatadog
305+
}
306+
307+
func DatadogHost() string {
308+
return *argDatadogHost
309+
}
310+
311+
func DatadogEnv() string {
312+
return *argDatadogEnv
313+
}
314+
281315
func ensureOrDie(argName string, arg *string) {
282316
if arg == nil || len(*arg) == 0 {
283317
pflag.PrintDefaults()

cmd/agent/args/datadog.go

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
package args
2+
3+
import (
4+
"fmt"
5+
"time"
6+
7+
"github.com/DataDog/dd-trace-go/v2/ddtrace/tracer"
8+
"github.com/DataDog/dd-trace-go/v2/profiler"
9+
"k8s.io/klog/v2"
10+
)
11+
12+
func InitDatadog() error {
13+
klog.Info("initializing datadog")
14+
15+
env := DatadogEnv()
16+
service := "deployment-operator"
17+
agentAddr := fmt.Sprintf("%s:%s", DatadogHost(), "8126")
18+
dogstatsdAddr := fmt.Sprintf("%s:%s", DatadogHost(), "8125")
19+
20+
if err := tracer.Start(
21+
tracer.WithRuntimeMetrics(),
22+
tracer.WithDogstatsdAddr(dogstatsdAddr),
23+
tracer.WithAgentAddr(agentAddr),
24+
tracer.WithService(service),
25+
tracer.WithEnv(env),
26+
); err != nil {
27+
return err
28+
}
29+
30+
return profiler.Start(
31+
profiler.WithService(service),
32+
profiler.WithEnv(env),
33+
profiler.WithTags(fmt.Sprintf("cluster_id:%s", ClusterId()), fmt.Sprintf("console_url:%s", ConsoleUrl())),
34+
profiler.WithAgentAddr(agentAddr),
35+
profiler.WithPeriod(30*time.Second),
36+
profiler.CPUDuration(30*time.Second),
37+
profiler.WithProfileTypes(
38+
profiler.CPUProfile,
39+
profiler.HeapProfile,
40+
// The profiles below are disabled by default to keep overhead
41+
// low, but can be enabled as needed.
42+
43+
// profiler.BlockProfile,
44+
profiler.MutexProfile,
45+
profiler.GoroutineProfile,
46+
),
47+
)
48+
}

cmd/agent/args/pyroscope.go

Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
package args
2+
3+
import (
4+
"os"
5+
"runtime"
6+
"time"
7+
8+
"github.com/grafana/pyroscope-go"
9+
"k8s.io/klog/v2"
10+
)
11+
12+
func InitPyroscope() (*pyroscope.Profiler, error) {
13+
klog.Info("initializing pyroscope")
14+
15+
runtime.SetMutexProfileFraction(5)
16+
runtime.SetBlockProfileRate(5)
17+
18+
return pyroscope.Start(pyroscope.Config{
19+
ApplicationName: "deployment-operator",
20+
21+
// replace this with the address of pyroscope server
22+
ServerAddress: PyroscopeAddress(),
23+
24+
// you can disable logging by setting this to nil
25+
Logger: nil,
26+
27+
// Push metrics once every 30 seconds
28+
UploadRate: 30 * time.Second,
29+
30+
// you can provide static tags via a map:
31+
Tags: map[string]string{"hostname": os.Getenv("HOSTNAME")},
32+
33+
ProfileTypes: []pyroscope.ProfileType{
34+
// these profile types are enabled by default:
35+
pyroscope.ProfileCPU,
36+
pyroscope.ProfileAllocObjects,
37+
pyroscope.ProfileAllocSpace,
38+
pyroscope.ProfileInuseObjects,
39+
pyroscope.ProfileInuseSpace,
40+
41+
// these profile types are optional:
42+
pyroscope.ProfileGoroutines,
43+
pyroscope.ProfileMutexCount,
44+
pyroscope.ProfileMutexDuration,
45+
pyroscope.ProfileBlockCount,
46+
pyroscope.ProfileBlockDuration,
47+
},
48+
})
49+
}

cmd/agent/console.go

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -4,18 +4,18 @@ import (
44
"os"
55
"time"
66

7+
"k8s.io/apimachinery/pkg/runtime"
8+
79
"github.com/pluralsh/deployment-operator/cmd/agent/args"
810
"github.com/pluralsh/deployment-operator/internal/utils"
911
"github.com/pluralsh/deployment-operator/pkg/client"
1012
consolectrl "github.com/pluralsh/deployment-operator/pkg/controller"
1113
"github.com/pluralsh/deployment-operator/pkg/controller/stacks"
1214
v1 "github.com/pluralsh/deployment-operator/pkg/controller/v1"
13-
"k8s.io/apimachinery/pkg/runtime"
1415

1516
"k8s.io/client-go/rest"
1617
ctrclient "sigs.k8s.io/controller-runtime/pkg/client"
1718

18-
"github.com/pluralsh/deployment-operator/pkg/controller"
1919
"github.com/pluralsh/deployment-operator/pkg/controller/namespaces"
2020
"github.com/pluralsh/deployment-operator/pkg/controller/pipelinegates"
2121
"github.com/pluralsh/deployment-operator/pkg/controller/restore"
@@ -47,7 +47,7 @@ const (
4747
)
4848

4949
func registerConsoleReconcilersOrDie(
50-
mgr *controller.Manager,
50+
mgr *consolectrl.Manager,
5151
config *rest.Config,
5252
k8sClient ctrclient.Client,
5353
scheme *runtime.Scheme,

cmd/agent/kubernetes.go

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -12,12 +12,6 @@ import (
1212
rolloutv1alpha1 "github.com/argoproj/argo-rollouts/pkg/apis/rollouts/v1alpha1"
1313
roclientset "github.com/argoproj/argo-rollouts/pkg/client/clientset/versioned"
1414
cmap "github.com/orcaman/concurrent-map/v2"
15-
"github.com/pluralsh/deployment-operator/cmd/agent/args"
16-
"github.com/pluralsh/deployment-operator/internal/controller"
17-
"github.com/pluralsh/deployment-operator/pkg/cache"
18-
consoleclient "github.com/pluralsh/deployment-operator/pkg/client"
19-
consolectrl "github.com/pluralsh/deployment-operator/pkg/controller"
20-
"github.com/pluralsh/deployment-operator/pkg/controller/service"
2115
"github.com/prometheus/client_golang/prometheus/promhttp"
2216
velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
2317
"k8s.io/apimachinery/pkg/runtime/schema"
@@ -31,6 +25,13 @@ import (
3125
"sigs.k8s.io/controller-runtime/pkg/healthz"
3226
"sigs.k8s.io/controller-runtime/pkg/manager"
3327
"sigs.k8s.io/controller-runtime/pkg/metrics/server"
28+
29+
"github.com/pluralsh/deployment-operator/cmd/agent/args"
30+
"github.com/pluralsh/deployment-operator/internal/controller"
31+
"github.com/pluralsh/deployment-operator/pkg/cache"
32+
consoleclient "github.com/pluralsh/deployment-operator/pkg/client"
33+
consolectrl "github.com/pluralsh/deployment-operator/pkg/controller"
34+
"github.com/pluralsh/deployment-operator/pkg/controller/service"
3435
)
3536

3637
const serviceIDCacheExpiry = 12 * time.Hour
@@ -108,7 +109,6 @@ func registerKubeReconcilersOrDie(
108109
discoveryClient discovery.DiscoveryInterface,
109110
enableKubecostProxy bool,
110111
) {
111-
112112
rolloutsClient, dynamicClient, kubeClient, metricsClient := initKubeClientsOrDie(config)
113113

114114
backupController := &controller.BackupReconciler{

cmd/agent/main.go

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,9 @@ import (
55
"os"
66
"time"
77

8+
kubernetestrace "github.com/DataDog/dd-trace-go/contrib/k8s.io/client-go/v2/kubernetes"
9+
datadogtracer "github.com/DataDog/dd-trace-go/v2/ddtrace/tracer"
10+
datadogprofiler "github.com/DataDog/dd-trace-go/v2/profiler"
811
trivy "github.com/aquasecurity/trivy-operator/pkg/apis/aquasecurity/v1alpha1"
912
rolloutv1alpha1 "github.com/argoproj/argo-rollouts/pkg/apis/rollouts/v1alpha1"
1013
certmanagerv1 "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
@@ -55,6 +58,33 @@ func main() {
5558
config := ctrl.GetConfigOrDie()
5659
ctx := ctrl.LoggerInto(ctrl.SetupSignalHandler(), setupLog)
5760

61+
if args.PyroscopeEnabled() {
62+
profiler, err := args.InitPyroscope()
63+
if err != nil {
64+
setupLog.Error(err, "unable to initialize pyroscope")
65+
os.Exit(1)
66+
}
67+
68+
defer func() {
69+
_ = profiler.Stop()
70+
}()
71+
}
72+
73+
if args.DatadogEnabled() {
74+
err := args.InitDatadog()
75+
if err != nil {
76+
panic("unable to initialize datadog")
77+
}
78+
79+
// Trace kubernetes client calls
80+
config.WrapTransport = kubernetestrace.WrapRoundTripper
81+
82+
defer func() {
83+
datadogtracer.Stop()
84+
datadogprofiler.Stop()
85+
}()
86+
}
87+
5888
extConsoleClient := client.New(args.ConsoleUrl(), args.DeployToken())
5989
discoveryClient := initDiscoveryClientOrDie(config)
6090
kubeManager := initKubeManagerOrDie(config)

0 commit comments

Comments
 (0)