-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Share host info between packages #5884
base: main
Are you sure you want to change the base?
Conversation
This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
|
|
046216e
to
b56c685
Compare
} | ||
|
||
func (s *threadSafeHost) CPUTime() (types.CPUTimes, error) { | ||
s.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my first reaction/thinking was that this thread-safety code most probably belongs to the go-sysinfo codebase?! but then looking again in the exposed functions we use, and from a quick look in go-sysinfo they do not mutate anything, so I am wondering do we need to hold a mutex to call them?! did I miss a mutation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go-sysinfo doesn't guarantee that these are thread-safe, so I don't have much of a choice here. We could try to make them so in the library, but I'm not sure it's worth it. These functions aren't called very often, so in practice there shouldn't be any contention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since these functions are not mutating anything I would propose to remove the mutex. Or again you may have spotted a mutation that I missed?! If not, I am not entirely convinced that losing the concurrency of calling these functions, given that this is now a shared instance across multiple places, is wise. But if you insist sure go with it 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functions are not mutating anything right now, but go-sysinfo doesn't guarantee this interface is thread-safe, so this is an implementation detail that may change. If we want to remove the locks, we should make the change in go-sysinfo, and I don't think that's worth the hassle. This PR is a bit borderline already in my opinion when it comes to complexity introduced vs performance gained.
At the very least, I'd have to see evidence of actual contention before looking into this further.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
With @blakerouse in PTO, probably better to assign someone else from the Control Plane team to this review. |
I already chatted with @swiatekm about this. This PR is not urgent so we are OK keeping the reviewers as is. In any case, @pkoutsovasilis is reviewing it, which is 👍 . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I still hold some reservations regarding the necessity of a thread-safe inner host in this context, given that these functions aren’t mutating any state and introducing a mutex could impact concurrent usage. However, since the usage frequency is low, I’m fine proceeding with this for now. We can revisit if contention becomes an issue. Nice work overall! 🙂
PS: The CI failures seem unrelated to this PR and I think that pulling from main will make the CI happy again
b56c685
to
ff85105
Compare
Quality Gate passedIssues Measures |
@@ -132,6 +132,7 @@ func ContextProviderBuilder(log *logger.Logger, c *config.Config, _ bool) (corec | |||
|
|||
func getHostInfo(log *logger.Logger) func() (map[string]interface{}, error) { | |||
return func() (map[string]interface{}, error) { | |||
// We don't use the shared host info from util here, as we explicitly want the latest host information on every call. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if we didn't have two different ways to get Host information where callers need to think about which one to use, or whether they actually need the parts of it that can change at runtime (FQDN for an obvious example).
Can the periodic updates be pushed into util.GetHost()
? They could be unconditional updates at a fixed rate, or it could behave as a rate limit where it updates at most every X seconds.
As a possible simplification for all of this, perhaps the network address information isn't valuable at all for containerized applications and we can add a way to filter it into https://github.com/elastic/go-sysinfo. There is an IsContainerized() already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some related context, add_kubernetes_metadata
stopped including the host IPs by default:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could just filter out the link local addresses (assuming those are the problem here too) if just not having the addresses at all doesn't work, or its too risky to think being in a container always means you are in k8s (it doesn't). elastic/integrations#6674 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like the periodicity of the host info checks in the provider should be a part of the provider's logic, as opposed to being hidden in the utils package. I could see adding a parameter to this API indicating if a cached value is fine, if that helps.
Honestly, what I'm really trying to do here is avoid recomputing the host info, which is only done at the start. All the other methods on this struct fetch the latest data anyway. Maybe the solution is to compute host info at call time, same as everything else? Then every component could have its own types.Host
(as creating one would be cheap), and we'd separately cache host info for the packages that want it at startup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all I really care about is that it's obvious when a call to get host info will potentially contain stale data. Keeping the periodicity of the host info check in the provider is fine if that is clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good however I’m less familiar with the underlying context, so I can’t fully judge the implementation’s accuracy.
What does this PR do?
Lets packages share the same host info struct. We add a thread-safe wrapper about
sysinfo.Host
and keep a singleton that callers can transparently request and use.The host provider is exempted from this change, as it explicitly wants to update this information periodically. I added a comment explaining why that is.
Why is it important?
There's several packages where we need the host information. Thus far, each of them just computed it on their own, which was wasted effort. Normally this wouldn't be a big deal, as it only happens once at startup. However, this call can be expensive in some circumstances - most notably when the host has a lot of network interfaces. See #5835 (comment) for an example in Kubernetes. The result is agent having unnecessarily high memory consumption after startup, which is annoying.
Checklist
./changelog/fragments
using the changelog toolHow to test this PR locally
Starting agent with any configuration and looking at the host information emitted in the logs is good enough. Unit testing this functionality is quite awkward unfortunately.
Related issues