-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Fix: Prevent Collector Crash on Invalid Config Reload via SIGHUP (#11817) #13432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: gokulvootla <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for tackling this thorny issue.
otelcol/collector.go
Outdated
return nil | ||
factories, err := col.set.Factories() | ||
if err != nil { | ||
fmt.Println("[ERROR] Failed to initialize factories:", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to keep using col.service.Logger().Warn()
for this.
otelcol/collector.go
Outdated
factories, err := col.set.Factories() | ||
if err != nil { | ||
fmt.Println("[ERROR] Failed to initialize factories:", err) | ||
return nil | ||
} | ||
|
||
cfg, err := col.configProvider.Get(ctx, factories) | ||
if err != nil { | ||
fmt.Println("[ERROR] Failed to load new config:", err) | ||
fmt.Println("[INFO] Keeping the existing configuration.") | ||
return nil | ||
} | ||
|
||
tempService, err := service.New(ctx, service.Settings{ | ||
BuildInfo: col.set.BuildInfo, | ||
ReceiversConfigs: cfg.Receivers, | ||
ReceiversFactories: factories.Receivers, | ||
ProcessorsConfigs: cfg.Processors, | ||
ProcessorsFactories: factories.Processors, | ||
ExportersConfigs: cfg.Exporters, | ||
ExportersFactories: factories.Exporters, | ||
ConnectorsConfigs: cfg.Connectors, | ||
ConnectorsFactories: factories.Connectors, | ||
ExtensionsConfigs: cfg.Extensions, | ||
ExtensionsFactories: factories.Extensions, | ||
}, service.Config{ | ||
Extensions: cfg.Service.Extensions, | ||
Pipelines: cfg.Service.Pipelines, | ||
Telemetry: cfg.Service.Telemetry, | ||
}) | ||
if err != nil { | ||
fmt.Println("[ERROR] New configuration is invalid:", err) | ||
fmt.Println("[INFO] Keeping the existing configuration.") | ||
return nil | ||
} | ||
|
||
col.setCollectorState(StateClosing) | ||
if err := col.service.Shutdown(ctx); err != nil { | ||
fmt.Println("[ERROR] Failed to shutdown current service:", err) | ||
return err | ||
} | ||
|
||
col.service = tempService | ||
if err := col.service.Start(ctx); err != nil { | ||
fmt.Println("[ERROR] Failed to start new service:", err) | ||
return err | ||
} | ||
|
||
col.setCollectorState(StateRunning) | ||
fmt.Println("[INFO] Configuration reloaded successfully.") | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a few issues with this approach:
- This duplicates code from
setupConfigurationComponents
, and more importantly doesn't behave the same way (some fields inservice.Settings
are missing for example). - I'm not sure that calling
service.New
while the previous service is still running is safe. Maybe other approvers can opine on this. - I don't think checking for errors in
service.New
is a good way of validating the config. Some errors inservice.New
should probably be fatal, and there are configuration errors it won't catch. - Notably,
xconfmap.Validate
is never called, even though it's a very important part of validating the config.
I think it would be better to extract the parts of setupConfigurationComponents
which load and validate the config (ie. the calls to col.set.Factories()
, col.configProvider.Get()
, and xconfmap.Validate()
) into their own loadConfiguration
[name to be bikeshed] method.
Then, we can have Collector.Run
:
- call
loadConfiguration
and exit if it fails - call
setupConfigurationComponents
with the results to start the service
And reloadConfiguration
can:
- call
loadConfiguration
and log a warning if it fails - shut down the current service
- call
setupConfigurationComponents
to start the new one
If there are config-related errors that only surface inside service.New
and we want to avoid a crash when they occur, I think we should move those checks into the appropriate .Validate()
method.
Codecov ReportAttention: Patch coverage is
❌ Your patch check has failed because the patch coverage (52.17%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #13432 +/- ##
==========================================
- Coverage 91.48% 91.40% -0.09%
==========================================
Files 529 529
Lines 29508 29544 +36
==========================================
+ Hits 26996 27004 +8
- Misses 1985 2010 +25
- Partials 527 530 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
I’ve updated the PR based on your feedback. Please take a look when you get a chance. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the update.
config *Config | ||
factories Factories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's a need for these fields. Passing the values through return / call arguments should be enough.
factories, err := col.set.Factories() | ||
if err != nil { | ||
return fmt.Errorf("failed to initialize factories: %w", err) | ||
} | ||
cfg, err := col.configProvider.Get(ctx, factories) | ||
cfg, err = col.configProvider.Get(ctx, factories) | ||
if err != nil { | ||
return fmt.Errorf("failed to get config: %w", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No point in recreating the factories and config since they're passed as arguments (and same thing for the call to xconfmap.Validate()
below).
return nil, Factories{}, fmt.Errorf("invalid configuration: %w", err) | ||
} | ||
|
||
return cfg, factories, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple formatting and linter errors, you'll want to fix those. Make sure you have a Go formatter setup in your IDE. You can run make golint
locally to check your work.
} | ||
cfg, factories, err := col.loadConfiguration(ctx) | ||
if err != nil { | ||
col.setCollectorState(StateClosed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll want to keep the newFallbackLogger
logic in this function. To avoid writing it twice, you can write something like:
cfg, factories, err := col.loadConfiguration(ctx)
if err == nil {
err = col.setupConfigurationComponents(ctx, col.config, col.factories)
}
if err != nil {
col.setCollectorState(StateClosed)
logger, loggerErr := newFallbackLogger(col.set.LoggingOptions)
// etc.
}
} | ||
|
||
col.setCollectorState(StateRunning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for that, this is already done in setupConfigurationComponents
in the successful case.
Which problem is this PR solving?
Description of the changes
Modified the reloadConfiguration function to validate the new configuration in a dry-run mode before applying it.
If the new config is invalid, the Collector:
How was this change tested?
Valid Changes
InValid Changes
Final logs
Observations