
Fix: Prevent Collector Crash on Invalid Config Reload via SIGHUP (#11817) #13432


Open
wants to merge 4 commits into
base: main

gokulvootla

Which problem is this PR solving?

Description of the changes

  • Modified the reloadConfiguration function to validate the new configuration in a dry-run mode before applying it.

  • If the new config is invalid, the Collector:

    • Logs a clear error message.
    • Retains the existing running configuration.
    • Does not restart or crash.
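
The behavior described above can be illustrated with a minimal, standalone sketch. It is not the PR's code: `pipelineConfig`, `reloader`, and the "at least one exporter" rule are hypothetical names and rules invented for illustration, and the program sends SIGHUP to itself (Unix only) to stand in for `kill -HUP 1`.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// pipelineConfig is a hypothetical stand-in for the Collector's configuration.
type pipelineConfig struct {
	Exporters []string
}

// validate rejects configurations that could not run (illustrative rule only).
func (c pipelineConfig) validate() error {
	if len(c.Exporters) == 0 {
		return errors.New("at least one exporter is required")
	}
	return nil
}

// reloader holds the configuration that is currently being served.
type reloader struct {
	current pipelineConfig
}

// reload validates the candidate first and swaps it in only on success.
// On failure the running configuration is left untouched.
func (r *reloader) reload(candidate pipelineConfig) error {
	if err := candidate.validate(); err != nil {
		return fmt.Errorf("keeping the existing configuration: %w", err)
	}
	r.current = candidate
	return nil
}

func main() {
	r := &reloader{current: pipelineConfig{Exporters: []string{"debug"}}}

	// Register for SIGHUP before anything can send it.
	sighup := make(chan os.Signal, 1)
	signal.Notify(sighup, syscall.SIGHUP)
	defer signal.Stop(sighup)

	// Simulate `kill -HUP` by signalling our own process (Unix only).
	if err := syscall.Kill(os.Getpid(), syscall.SIGHUP); err != nil {
		fmt.Println("could not send SIGHUP:", err)
		return
	}
	<-sighup

	// Pretend the file that was just reloaded is broken (no exporters).
	if err := r.reload(pipelineConfig{}); err != nil {
		fmt.Println("reload rejected:", err)
	}
	fmt.Println("still serving with exporters:", r.current.Exporters)
}
```

The point of the design is that the swap happens only after validation succeeds, so a rejected candidate never touches the running state.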

How was this change tested?

Valid Changes

  1. Applied a valid config change to the Kubernetes ConfigMap.
  2. Triggered a reload via kill -HUP 1 in the Collector container.
  3. Verified the logs:
2025-07-18T18:58:16.716Z        info    extensions/extensions.go:41     Starting extensions...  {"resource": {"service.instance.id": "4fcb8dd7-fabf-4ce6-bb3e-c44864649628", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}}
2025-07-18T18:58:16.717Z        info    [email protected]/otlp.go:117       Starting GRPC server    {"resource": {"service.instance.id": "4fcb8dd7-fabf-4ce6-bb3e-c44864649628", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}, "otelcol.component.id": "otlp", "otelcol.component.kind": "receiver", "otelcol.signal": "traces", "endpoint": "localhost:4317"}
2025-07-18T18:58:16.717Z        info    [email protected]/service.go:280 Everything is ready. Begin running and processing data. {"resource": {"service.instance.id": "4fcb8dd7-fabf-4ce6-bb3e-c44864649628", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}}
[INFO] Configuration reloaded successfully.

Invalid Changes

  1. Introduced breaking changes to the ConfigMap.
  2. Triggered kill -HUP.
  3. Verified from the logs:
[ERROR] Failed to load new config: decoding failed...
'' has invalid keys: expor
[INFO] Keeping the existing configuration.

Final logs

2025-07-18T19:00:41.286Z        info    [email protected]/collector.go:403       Received signal from OS {"resource": {"service.instance.id": "8609bb65-b5de-4bed-bf80-183ef3b13a8e", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}, "signal": "hangup"}
[ERROR] Failed to load new config: cannot unmarshal the configuration: decoding failed due to the following error(s):

'' has invalid keys: expor
[INFO] Keeping the existing configuration.



2025-07-18T19:00:47.800Z        info    [email protected]/collector.go:403       Received signal from OS {"resource": {"service.instance.id": "8609bb65-b5de-4bed-bf80-183ef3b13a8e", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}, "signal": "hangup"}
2025-07-18T19:00:47.804Z        info    [email protected]/service.go:197 Setting up own telemetry...     {"resource": {"service.instance.id": "38f0b3fc-2eb6-4427-83c3-2cc761cda92b", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}}
2025-07-18T19:00:47.804Z        info    builders/builders.go:26 Development component. May change in the future.        {"resource": {"service.instance.id": "38f0b3fc-2eb6-4427-83c3-2cc761cda92b", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}, "otelcol.component.id": "debug", "otelcol.component.kind": "exporter", "otelcol.signal": "traces"}
2025-07-18T19:00:47.805Z        info    [email protected]/service.go:239 Skipped telemetry setup.        {"resource": {"service.instance.id": "38f0b3fc-2eb6-4427-83c3-2cc761cda92b", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}}
2025-07-18T19:00:47.805Z        info    [email protected]/service.go:322 Starting shutdown...    {"resource": {"service.instance.id": "8609bb65-b5de-4bed-bf80-183ef3b13a8e", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}}
2025-07-18T19:00:47.805Z        info    extensions/extensions.go:69     Stopping extensions...  {"resource": {"service.instance.id": "8609bb65-b5de-4bed-bf80-183ef3b13a8e", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}}
2025-07-18T19:00:47.805Z        info    [email protected]/service.go:336 Shutdown complete.      {"resource": {"service.instance.id": "8609bb65-b5de-4bed-bf80-183ef3b13a8e", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}}
2025-07-18T19:00:47.805Z        info    [email protected]/service.go:257 Starting otelcorecol... {"resource": {"service.instance.id": "38f0b3fc-2eb6-4427-83c3-2cc761cda92b", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}, "Version": "0.130.0-dev", "NumCPU": 2}
2025-07-18T19:00:47.805Z        info    extensions/extensions.go:41     Starting extensions...  {"resource": {"service.instance.id": "38f0b3fc-2eb6-4427-83c3-2cc761cda92b", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}}
2025-07-18T19:00:47.805Z        info    [email protected]/otlp.go:117       Starting GRPC server    {"resource": {"service.instance.id": "38f0b3fc-2eb6-4427-83c3-2cc761cda92b", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}, "otelcol.component.id": "otlp", "otelcol.component.kind": "receiver", "otelcol.signal": "traces", "endpoint": "localhost:4317"}
2025-07-18T19:00:47.806Z        info    [email protected]/service.go:280 Everything is ready. Begin running and processing data. {"resource": {"service.instance.id": "38f0b3fc-2eb6-4427-83c3-2cc761cda92b", "service.name": "otelcorecol", "service.version": "0.130.0-dev"}}
[INFO] Configuration reloaded successfully

Observations

  • Even when a configuration with syntax errors or invalid fields was applied:

    • The Collector rejected the new config, retained the current state, and continued serving.
    • Kubernetes did not restart the pod, and the container liveness/readiness probes remained healthy.

@gokulvootla gokulvootla requested a review from a team as a code owner July 18, 2025 20:04

linux-foundation-easycla bot commented Jul 18, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

Contributor

@jade-guiton-dd jade-guiton-dd left a comment

Thank you for tackling this thorny issue.

return nil
factories, err := col.set.Factories()
if err != nil {
fmt.Println("[ERROR] Failed to initialize factories:", err)
Contributor

It would be better to keep using col.service.Logger().Warn() for this.

Comment on lines 251 to 301
factories, err := col.set.Factories()
if err != nil {
	fmt.Println("[ERROR] Failed to initialize factories:", err)
	return nil
}

cfg, err := col.configProvider.Get(ctx, factories)
if err != nil {
	fmt.Println("[ERROR] Failed to load new config:", err)
	fmt.Println("[INFO] Keeping the existing configuration.")
	return nil
}

tempService, err := service.New(ctx, service.Settings{
	BuildInfo:           col.set.BuildInfo,
	ReceiversConfigs:    cfg.Receivers,
	ReceiversFactories:  factories.Receivers,
	ProcessorsConfigs:   cfg.Processors,
	ProcessorsFactories: factories.Processors,
	ExportersConfigs:    cfg.Exporters,
	ExportersFactories:  factories.Exporters,
	ConnectorsConfigs:   cfg.Connectors,
	ConnectorsFactories: factories.Connectors,
	ExtensionsConfigs:   cfg.Extensions,
	ExtensionsFactories: factories.Extensions,
}, service.Config{
	Extensions: cfg.Service.Extensions,
	Pipelines:  cfg.Service.Pipelines,
	Telemetry:  cfg.Service.Telemetry,
})
if err != nil {
	fmt.Println("[ERROR] New configuration is invalid:", err)
	fmt.Println("[INFO] Keeping the existing configuration.")
	return nil
}

col.setCollectorState(StateClosing)
if err := col.service.Shutdown(ctx); err != nil {
	fmt.Println("[ERROR] Failed to shutdown current service:", err)
	return err
}

col.service = tempService
if err := col.service.Start(ctx); err != nil {
	fmt.Println("[ERROR] Failed to start new service:", err)
	return err
}

col.setCollectorState(StateRunning)
fmt.Println("[INFO] Configuration reloaded successfully.")
return nil
Contributor

@jade-guiton-dd jade-guiton-dd Jul 21, 2025

I see a few issues with this approach:

  • This duplicates code from setupConfigurationComponents, and more importantly doesn't behave the same way (some fields in service.Settings are missing for example).
  • I'm not sure that calling service.New while the previous service is still running is safe. Maybe other approvers can opine on this.
  • I don't think checking for errors in service.New is a good way of validating the config. Some errors in service.New should probably be fatal, and there are configuration errors it won't catch.
  • Notably, xconfmap.Validate is never called, even though it's a very important part of validating the config.
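
To illustrate why an explicit validation pass matters separately from constructing the service, here is a much-simplified, hand-rolled sketch of the "call Validate() on every part of the config" idea. It is not how xconfmap.Validate is implemented; the `validator` interface, `exporterConfig`, and `validateAll` are hypothetical names used only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// validator is a hypothetical local interface for this sketch:
// any config section that can check itself implements it.
type validator interface {
	Validate() error
}

// exporterConfig is an illustrative component configuration.
type exporterConfig struct {
	Endpoint string
}

// Validate reports configuration mistakes without starting anything.
func (c exporterConfig) Validate() error {
	if c.Endpoint == "" {
		return errors.New("endpoint must not be empty")
	}
	return nil
}

// validateAll calls Validate on every component config and joins all
// failures, so that one reload attempt reports every problem at once.
// It returns nil when every component is valid.
func validateAll(components map[string]validator) error {
	var errs []error
	for id, c := range components {
		if err := c.Validate(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", id, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	err := validateAll(map[string]validator{
		"otlp":  exporterConfig{Endpoint: "localhost:4317"},
		"debug": exporterConfig{}, // missing endpoint, should be reported
	})
	fmt.Println("validation result:", err)
}
```

Because the checks live in Validate() methods, they can run against a candidate config without building or starting any component.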

I think it would be better to extract the parts of setupConfigurationComponents which load and validate the config (i.e. the calls to col.set.Factories(), col.configProvider.Get(), and xconfmap.Validate()) into their own loadConfiguration [name to be bikeshed] method.

Then, we can have Collector.Run:

  • call loadConfiguration and exit if it fails
  • call setupConfigurationComponents with the results to start the service

And reloadConfiguration can:

  • call loadConfiguration and log a warning if it fails
  • shut down the current service
  • call setupConfigurationComponents to start the new one

If there are config-related errors that only surface inside service.New and we want to avoid a crash when they occur, I think we should move those checks into the appropriate .Validate() method.
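
The split proposed above could look roughly like the following standalone sketch. The method names mirror the ones in this comment, but the `collector` and `config` types, the `source` field, and the "no exporters" check are simplified, hypothetical stand-ins rather than the real otelcol types.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// config is a hypothetical, minimal configuration.
type config struct {
	exporters []string
}

// collector is a simplified stand-in for the real Collector.
type collector struct {
	source  func() (config, error) // hypothetical config provider
	running *config                // nil when no service is running
}

// loadConfiguration only loads and validates; it has no side effects
// on the running service, so it is safe to call during a reload.
func (c *collector) loadConfiguration() (config, error) {
	cfg, err := c.source()
	if err != nil {
		return config{}, fmt.Errorf("failed to get config: %w", err)
	}
	if len(cfg.exporters) == 0 {
		return config{}, errors.New("invalid configuration: no exporters")
	}
	return cfg, nil
}

// setupConfigurationComponents starts a service from an already
// validated config, which it receives as an argument.
func (c *collector) setupConfigurationComponents(cfg config) error {
	c.running = &cfg
	return nil
}

// run treats a bad config as fatal: there is nothing to fall back to.
func (c *collector) run() error {
	cfg, err := c.loadConfiguration()
	if err != nil {
		return err
	}
	return c.setupConfigurationComponents(cfg)
}

// reloadConfiguration logs a warning and keeps the current service
// when the new config cannot be loaded or validated.
func (c *collector) reloadConfiguration() error {
	cfg, err := c.loadConfiguration()
	if err != nil {
		log.Printf("warning: config reload failed, keeping current service: %v", err)
		return nil
	}
	c.running = nil // stands in for shutting down the current service
	return c.setupConfigurationComponents(cfg)
}

func main() {
	c := &collector{source: func() (config, error) {
		return config{exporters: []string{"debug"}}, nil
	}}
	if err := c.run(); err != nil {
		log.Fatal(err)
	}

	// A later reload sees a broken config; the old service must survive.
	c.source = func() (config, error) { return config{}, nil }
	_ = c.reloadConfiguration()
	fmt.Println("still running with exporters:", c.running.exporters)
}
```

The same loadConfiguration call serves both paths; only the reaction to its error differs between startup and reload.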


codecov bot commented Jul 21, 2025

Codecov Report

Attention: Patch coverage is 52.17391% with 22 lines in your changes missing coverage. Please review.

Project coverage is 91.40%. Comparing base (be09659) to head (1096e1b).
Report is 16 commits behind head on main.

Files with missing lines | Patch % | Lines
otelcol/collector.go | 52.17% | 20 Missing and 2 partials ⚠️

❌ Your patch check has failed because the patch coverage (52.17%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13432      +/-   ##
==========================================
- Coverage   91.48%   91.40%   -0.09%     
==========================================
  Files         529      529              
  Lines       29508    29544      +36     
==========================================
+ Hits        26996    27004       +8     
- Misses       1985     2010      +25     
- Partials      527      530       +3     

@gokulvootla
Author

Hi @jade-guiton-dd

I’ve updated the PR based on your feedback. Please take a look when you get a chance. Thank you!

Contributor

@jade-guiton-dd jade-guiton-dd left a comment

Thank you for the update.

Comment on lines +113 to +114
config *Config
factories Factories
Contributor

I don't think there's a need for these fields. Passing the values through return / call arguments should be enough.

Comment on lines 178 to 185
factories, err := col.set.Factories()
if err != nil {
	return fmt.Errorf("failed to initialize factories: %w", err)
}
cfg, err := col.configProvider.Get(ctx, factories)
cfg, err = col.configProvider.Get(ctx, factories)
if err != nil {
	return fmt.Errorf("failed to get config: %w", err)
}
Contributor

No point in recreating the factories and config since they're passed as arguments (and same thing for the call to xconfmap.Validate() below).

return nil, Factories{}, fmt.Errorf("invalid configuration: %w", err)
}

return cfg, factories, nil
Contributor

There are multiple formatting and linter errors that you'll want to fix. Make sure you have a Go formatter set up in your IDE. You can run make golint locally to check your work.

}
cfg, factories, err := col.loadConfiguration(ctx)
if err != nil {
col.setCollectorState(StateClosed)
Contributor

We'll want to keep the newFallbackLogger logic in this function. To avoid writing it twice, you can write something like:

cfg, factories, err := col.loadConfiguration(ctx)
if err == nil {
	err = col.setupConfigurationComponents(ctx, cfg, factories)
}
if err != nil {
	col.setCollectorState(StateClosed)
	logger, loggerErr := newFallbackLogger(col.set.LoggingOptions)
	// etc.
}

}

col.setCollectorState(StateRunning)
Contributor

No need for that; this is already done in setupConfigurationComponents in the successful case.

Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hot reload of configuration changes should validate changes
2 participants