
Share host info between packages #5884

Open · wants to merge 1 commit into main from feat/shared-hostinfo
Conversation

swiatekm (Contributor) commented Oct 30, 2024

What does this PR do?

Lets packages share the same host info struct. We add a thread-safe wrapper around sysinfo.Host and keep a singleton that callers can transparently request and use.

The host provider is exempted from this change, as it explicitly wants to update this information periodically. I added a comment explaining why that is.
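
For illustration, a minimal sketch of the shape of this change; threadSafeHost is the name used in the diff below, while GetHost, the package layout, and the method selection are assumptions rather than the PR's exact code:

// Minimal sketch of the approach, not the PR's exact code: threadSafeHost is
// the name used in the diff, while GetHost and the package location are
// assumptions. The real wrapper would cover the full types.Host interface.
package util

import (
	"sync"

	"github.com/elastic/go-sysinfo"
	"github.com/elastic/go-sysinfo/types"
)

// threadSafeHost serializes access to the wrapped host, because go-sysinfo
// does not guarantee that its types.Host implementations are thread-safe.
type threadSafeHost struct {
	sync.Mutex
	host types.Host
}

func (s *threadSafeHost) Info() types.HostInfo {
	s.Lock()
	defer s.Unlock()
	return s.host.Info()
}

func (s *threadSafeHost) CPUTime() (types.CPUTimes, error) {
	s.Lock()
	defer s.Unlock()
	return s.host.CPUTime()
}

var (
	hostOnce   sync.Once
	sharedHost *threadSafeHost
	hostErr    error
)

// GetHost lazily creates a process-wide host singleton, so the expensive
// host info computation (e.g. enumerating network interfaces) runs only once.
// On failure, the error is cached and returned to every caller.
func GetHost() (*threadSafeHost, error) {
	hostOnce.Do(func() {
		h, err := sysinfo.Host()
		if err != nil {
			hostErr = err
			return
		}
		sharedHost = &threadSafeHost{host: h}
	})
	return sharedHost, hostErr
}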

Why is it important?

There are several packages that need the host information. So far, each of them computed it on its own, which was wasted effort. Normally this wouldn't be a big deal, as it only happens once at startup. However, this call can be expensive in some circumstances, most notably when the host has a lot of network interfaces. See #5835 (comment) for an example in Kubernetes. The result is the agent having unnecessarily high memory consumption after startup, which is annoying.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added an entry in ./changelog/fragments using the changelog tool

How to test this PR locally

Starting the agent with any configuration and looking at the host information emitted in the logs is good enough. Unfortunately, unit testing this functionality is quite awkward.

Related issues

@swiatekm added the enhancement (New feature or request) label on Oct 30, 2024
mergify bot (Contributor) commented Oct 30, 2024

This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d is the label to automatically backport to the 8./d branch, where /d is the digit

mergify bot (Contributor) commented Oct 30, 2024

backport-v8.x has been added to help with the transition to the new 8.x branch.
If you don't need it, please use the backport-skip label and remove the backport-8.x label.

mergify bot added the backport-8.x (Automated backport to the 8.x branch with mergify) label on Oct 30, 2024
@swiatekm force-pushed the feat/shared-hostinfo branch 2 times, most recently from 046216e to b56c685, on October 30, 2024 at 15:20
@swiatekm marked this pull request as ready for review on October 30, 2024 at 16:34
@swiatekm requested a review from a team as a code owner on October 30, 2024 at 16:34
}

func (s *threadSafeHost) CPUTime() (types.CPUTimes, error) {
	s.Lock()
pkoutsovasilis (Contributor) commented Oct 30, 2024:

My first reaction was that this thread-safety code probably belongs in the go-sysinfo codebase. But looking again at the exposed functions we use, and from a quick look at go-sysinfo, they don't mutate anything. So I'm wondering: do we actually need to hold a mutex to call them? Did I miss a mutation?

swiatekm (Contributor, Author) replied:

go-sysinfo doesn't guarantee that these are thread-safe, so I don't have much of a choice here. We could try to make them so in the library, but I'm not sure it's worth it. These functions aren't called very often, so in practice there shouldn't be any contention.

pkoutsovasilis (Contributor) replied:

Since these functions don't mutate anything, I would propose removing the mutex. Or, again, you may have spotted a mutation that I missed. If not, I'm not entirely convinced that losing the concurrency of calling these functions is wise, given that this is now a shared instance used across multiple places. But if you insist, sure, go with it 🙂

swiatekm (Contributor, Author) replied:

The functions aren't mutating anything right now, but go-sysinfo doesn't guarantee that this interface is thread-safe, so that's an implementation detail that may change. If we want to remove the locks, we should make the change in go-sysinfo, and I don't think that's worth the hassle. In my opinion, this PR is already a bit borderline when it comes to complexity introduced vs. performance gained.

At the very least, I'd have to see evidence of actual contention before looking into this further.
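
As a hypothetical middle ground between these two positions (not something proposed in the PR), a sync.RWMutex would keep the safety net while letting the read-only calls run concurrently:

// Hypothetical middle ground, not code from the PR: a sync.RWMutex keeps the
// guard against future mutations inside go-sysinfo while letting read-only
// calls proceed concurrently.
package util

import (
	"sync"

	"github.com/elastic/go-sysinfo/types"
)

type rwSafeHost struct {
	mu   sync.RWMutex
	host types.Host
}

// Read-only accessors take the read lock, so concurrent readers never block
// each other; a future mutating method would take the write lock instead.
func (s *rwSafeHost) Info() types.HostInfo {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.host.Info()
}

func (s *rwSafeHost) CPUTime() (types.CPUTimes, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.host.CPUTime()
}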

@pierrehilbert added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label on Oct 30, 2024
elasticmachine (Contributor) commented:
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

pierrehilbert (Contributor) commented:
With @blakerouse on PTO, it's probably better to assign someone else from the Control Plane team to this review.
cc @ycombinator

ycombinator (Contributor) commented Oct 30, 2024:

> With @blakerouse on PTO, it's probably better to assign someone else from the Control Plane team to this review. cc @ycombinator

I already chatted with @swiatekm about this. This PR is not urgent, so we're OK keeping the reviewers as is.

In any case, @pkoutsovasilis is reviewing it, which is 👍.

pkoutsovasilis (Contributor) left a review comment:

LGTM. I still hold some reservations regarding the necessity of a thread-safe inner host in this context, given that these functions aren’t mutating any state and introducing a mutex could impact concurrent usage. However, since the usage frequency is low, I’m fine proceeding with this for now. We can revisit if contention becomes an issue. Nice work overall! 🙂

PS: The CI failures seem unrelated to this PR, and I think pulling from main will make the CI happy again.

@ycombinator requested review from faec and kaanyalti and removed the review requests for blakerouse, kaanyalti, and faec on October 31, 2024 at 12:38
@@ -132,6 +132,7 @@ func ContextProviderBuilder(log *logger.Logger, c *config.Config, _ bool) (corec

func getHostInfo(log *logger.Logger) func() (map[string]interface{}, error) {
	return func() (map[string]interface{}, error) {
		// We don't use the shared host info from util here, as we explicitly want the latest host information on every call.
Member commented:

I would prefer if we didn't have two different ways to get Host information where callers need to think about which one to use, or whether they actually need the parts of it that can change at runtime (FQDN for an obvious example).

Can the periodic updates be pushed into util.GetHost()? They could be unconditional updates at a fixed rate, or it could behave as a rate limit where it updates at most every X seconds.

As a possible simplification for all of this, perhaps the network address information isn't valuable at all for containerized applications and we can add a way to filter it into https://github.com/elastic/go-sysinfo. There is an IsContainerized() already.
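
A sketch of the rate-limited variant suggested above, under assumed names (GetHostInfo, refreshPeriod); the cached snapshot is recomputed at most once per period:

// Hypothetical sketch of the rate-limited variant: GetHostInfo returns a
// cached snapshot and recomputes it at most once per refreshPeriod, so
// callers never see data older than that. All names here are illustrative.
package util

import (
	"sync"
	"time"

	"github.com/elastic/go-sysinfo"
	"github.com/elastic/go-sysinfo/types"
)

const refreshPeriod = 5 * time.Minute // the "at most every X seconds" limit

var (
	infoMu      sync.Mutex
	cachedInfo  types.HostInfo
	lastRefresh time.Time
)

func GetHostInfo() (types.HostInfo, error) {
	infoMu.Lock()
	defer infoMu.Unlock()
	// Serve the cached copy while it is fresh enough.
	if !lastRefresh.IsZero() && time.Since(lastRefresh) < refreshPeriod {
		return cachedInfo, nil
	}
	host, err := sysinfo.Host()
	if err != nil {
		return types.HostInfo{}, err
	}
	cachedInfo = host.Info()
	lastRefresh = time.Now()
	return cachedInfo, nil
}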

cmacknz (Member) commented Oct 31, 2024:
For some related context, add_kubernetes_metadata stopped including the host IPs by default:

Member replied:

Perhaps we could just filter out the link-local addresses (assuming those are the problem here too) if not having the addresses at all doesn't work, or if it's too risky to assume that being in a container always means you are in k8s (it doesn't). elastic/integrations#6674 (comment)
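
A sketch of that filtering idea; filterLinkLocal is a hypothetical helper, not an existing go-sysinfo API:

// Sketch of the link-local filtering idea; filterLinkLocal is a hypothetical
// helper, not an existing go-sysinfo API. Entries may carry a CIDR suffix
// (e.g. "fe80::1/64"), so both plain and CIDR forms are parsed.
package util

import "net"

// filterLinkLocal drops link-local unicast and multicast addresses (e.g.
// 169.254.0.0/16 and fe80::/10) and keeps anything it cannot parse.
func filterLinkLocal(ips []string) []string {
	out := make([]string, 0, len(ips))
	for _, s := range ips {
		ip := net.ParseIP(s)
		if ip == nil {
			ip, _, _ = net.ParseCIDR(s)
		}
		if ip != nil && (ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast()) {
			continue
		}
		out = append(out, s)
	}
	return out
}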

swiatekm (Contributor, Author) replied:

I feel like the periodicity of the host info checks in the provider should be part of the provider's logic, as opposed to being hidden in the utils package. I could see adding a parameter to this API indicating whether a cached value is fine, if that helps.

Honestly, what I'm really trying to do here is avoid recomputing the host info, which is only done at the start. All the other methods on this struct fetch the latest data anyway. Maybe the solution is to compute host info at call time, same as everything else? Then every component could have its own types.Host (as creating one would be cheap), and we'd separately cache host info for the packages that want it at startup.
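
A sketch of the split described above, assuming Go 1.21+ for sync.OnceValues; all names are illustrative:

// Hypothetical sketch of that split: the startup snapshot is computed lazily
// and cached exactly once for packages that only need it once, while
// components wanting fresh data would create their own types.Host.
package util

import (
	"sync"

	"github.com/elastic/go-sysinfo"
	"github.com/elastic/go-sysinfo/types"
)

// CachedHostInfo returns the host info computed on first use; later calls
// return the same snapshot (and the same error, if one occurred).
var CachedHostInfo = sync.OnceValues(func() (types.HostInfo, error) {
	host, err := sysinfo.Host()
	if err != nil {
		return types.HostInfo{}, err
	}
	return host.Info(), nil
})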

Member replied:

I think all I really care about is that it's obvious when a call to get host info will potentially contain stale data. Keeping the periodicity of the host info check in the provider is fine if that is clear.

kaanyalti (Contributor) left a review comment:

Code looks good; however, I'm less familiar with the underlying context, so I can't fully judge the implementation's accuracy.

Labels
  • backport-8.x: Automated backport to the 8.x branch with mergify
  • enhancement: New feature or request
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team

7 participants