LX libaudit clients cause kernel memory usage to expand until exhausted #366

@arekinath

The Linux libaudit library uses NETLINK sockets to send audit events. It opens a netlink socket, then calls sendto() to write the event out (see https://github.com/linux-audit/audit-userspace/blob/master/lib/netlink.c#L239) and then it checks for an ACK from the kernel.

When checking for an ACK, libaudit looks specifically for an NLMSG_ERROR message: https://github.com/linux-audit/audit-userspace/blob/master/lib/netlink.c#L287. It calls recvfrom() with MSG_PEEK to look for the NLMSG_ERROR code, and only does a real read (without MSG_PEEK) once it has seen one. If any other message arrives instead, it will never read from the netlink socket again.

Unfortunately, in e.g. lx_netlink_au_um, we just call lx_netlink_reply, which sets up an NLMSG_DONE message (with NLM_F_MULTI in the header). This is not the kind of ACK which libaudit is expecting, and so libaudit never actually reads from its netlink socket on LX.
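To make the mismatch concrete, here is a minimal sketch of the kind of check described above. It is not libaudit's actual code: the `sk_` names are hypothetical, and the header layout and constants are copied from Linux's `<linux/netlink.h>` so the sketch builds without Linux headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as in Linux's <linux/netlink.h>, redefined for portability. */
#define	SK_NLMSG_ERROR	0x2
#define	SK_NLMSG_DONE	0x3
#define	SK_NLM_F_MULTI	0x2

/* Same field layout as Linux's struct nlmsghdr. */
struct sk_nlmsghdr {
	uint32_t nlmsg_len;
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

/*
 * Hypothetical stand-in for the client's ACK check: returns 1 if the
 * peeked message is an NLMSG_ERROR (which the client would then read
 * for real), and 0 for anything else (which stays on the socket).
 */
static int
sk_is_audit_ack(const void *buf, size_t len)
{
	struct sk_nlmsghdr hdr;

	assert(buf != NULL);
	if (len < sizeof (hdr))
		return (0);
	memcpy(&hdr, buf, sizeof (hdr));
	return (hdr.nlmsg_type == SK_NLMSG_ERROR);
}
```

Under this model, a reply with type `SK_NLMSG_DONE` and `SK_NLM_F_MULTI` set yields 0, so the client never consumes it, while a single-part `SK_NLMSG_ERROR` reply yields 1.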

When libaudit stops reading from the netlink socket, replies start to queue up. Normally this backpressure is handled by the layer calling the su_recv callback (e.g. the TCP/IP stack) -- you're meant to watch for ENOSPC from that socket upcall and set a flag to stop sucking in new messages until the downcall comes to tell you things are unblocked again.

Unfortunately, in lx_netlink_reply_sendup, after we call su_recv, we have:

	if (error != 0)
		lx_netlink_flowctrld++;

And that's it. End of function. We don't set any flags, we just increment a global counter (which is never read anywhere in the code). This means that we can accumulate replies on the socket queue of a netlink socket indefinitely.
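For comparison, a rough sketch of what honouring the backpressure could look like. This is a toy userland model, not the illumos socket upcall interface: every name here (`sk_sock`, `sk_upcall_recv`, `sk_reply_sendup`, `sk_rx_resume`) and the queue limit are made up for illustration, and the model assumes a full queue simply refuses the message with ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define	SK_RCVBUF_MAX	4	/* hypothetical receive queue capacity */

struct sk_sock {
	size_t queued;		/* replies sitting on the receive queue */
	int flowctrld;		/* set once the upcall reports ENOSPC */
	size_t refused;		/* replies not sent while flow controlled */
};

/* Toy receive upcall: queues one reply, or reports ENOSPC when full. */
static int
sk_upcall_recv(struct sk_sock *s)
{
	if (s->queued >= SK_RCVBUF_MAX)
		return (ENOSPC);
	s->queued++;
	return (0);
}

/* Send a reply upwards, remembering ENOSPC instead of only counting it. */
static int
sk_reply_sendup(struct sk_sock *s)
{
	int error;

	assert(s != NULL);
	if (s->flowctrld) {
		s->refused++;	/* nothing is allocated or queued */
		return (EAGAIN);
	}
	error = sk_upcall_recv(s);
	if (error == ENOSPC)
		s->flowctrld = 1;
	return (error);
}

/* Toy "unblocked" downcall: the reader drained `consumed` replies. */
static void
sk_rx_resume(struct sk_sock *s, size_t consumed)
{
	s->queued -= (consumed > s->queued) ? s->queued : consumed;
	s->flowctrld = 0;
}
```

The point of the sketch is only that the queue depth stays bounded by `SK_RCVBUF_MAX` no matter how many replies are generated. What to do with refused replies (drop them, or block the sender) is a separate design decision that this model does not settle.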

Now, this might not sound like a big deal: each netlink reply is ~20 bytes long, so you might say we can accumulate an awful lot of them before this becomes a critical issue. Alas, in lx_netlink_reply_msg we always call allocb() with lxns_bufsize, which is set to 4096. Because of the header on the front, this actually results in an allocation from the kmem_alloc_8192 cache. For each one of these replies on the queue, we are setting aside a bit over 8k of memory.

What's even better is that amongst libaudit's clients is the ever-wonderful systemd. It runs for a very long time, and it produces one of these audit events every time a unit (service) changes state.

On a machine with ~150 LX zones running, I am currently allocating a bit over 1GB per day of these buffers due to systemd alone, which will persist until the machine or the zones are rebooted. Eventually, the kernel memory usage expands and pushes out ARC, causes kmem_reap to kick in, and the machine grinds to a halt and never recovers.
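As a back-of-the-envelope check on those figures, the sketch below turns them into reply counts. The inputs are rounded assumptions taken from the numbers above (1 GiB per day, 8192 bytes per stranded reply, 150 zones, 20-byte replies), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define	SK_BUF_BYTES	8192ULL		/* one kmem_alloc_8192 buffer */
#define	SK_REPLY_BYTES	20ULL		/* approximate reply payload */
#define	SK_LEAK_PER_DAY	(1ULL << 30)	/* "a bit over 1GB", as 1 GiB */
#define	SK_ZONES	150ULL		/* approximate LX zone count */

/* Number of whole buffers that fit in the given number of bytes. */
static uint64_t
sk_stranded_buffers(uint64_t leak_bytes, uint64_t buf_bytes)
{
	assert(buf_bytes != 0);
	return (leak_bytes / buf_bytes);
}

/* Roughly how many bytes are set aside per byte of reply payload. */
static uint64_t
sk_overhead_factor(uint64_t buf_bytes, uint64_t reply_bytes)
{
	assert(reply_bytes != 0);
	return (buf_bytes / reply_bytes);
}
```

With these inputs, `sk_stranded_buffers(SK_LEAK_PER_DAY, SK_BUF_BYTES)` is 131072 replies per day across the machine, or about 873 per zone per day once divided by `SK_ZONES`, and `sk_overhead_factor(SK_BUF_BYTES, SK_REPLY_BYTES)` is 409. Since each buffer is really a bit over 8k, the true reply count is somewhat lower than this estimate.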

Netlink should be replying to audit requests with single-part NLMSG_ERROR responses to be compatible with real Linux, and the LX netlink code needs to correctly handle ENOSPC from su_recv.
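For reference, a minimal sketch of what a single-part NLMSG_ERROR acknowledgement looks like on the wire, assuming the Linux layout of `struct nlmsghdr` and `struct nlmsgerr` (an error code of 0 followed by a copy of the request header). This is illustrative userland code rather than a patch against the LX netlink code, and the `sk_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	SK_NLMSG_ERROR	0x2	/* as in Linux's <linux/netlink.h> */

/* Same field layout as Linux's struct nlmsghdr. */
struct sk_nlmsghdr {
	uint32_t nlmsg_len;
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

/* Same field layout as Linux's struct nlmsgerr. */
struct sk_nlmsgerr {
	int32_t error;			/* 0 means "ACK" */
	struct sk_nlmsghdr msg;		/* header of the request being acked */
};

/*
 * Write an ACK for `req` into `buf`. Returns the number of bytes
 * written, or 0 if the buffer is too small.
 */
static size_t
sk_build_ack(const struct sk_nlmsghdr *req, void *buf, size_t buflen)
{
	struct sk_nlmsghdr hdr;
	struct sk_nlmsgerr err;
	size_t total = sizeof (hdr) + sizeof (err);

	assert(req != NULL && buf != NULL);
	if (buflen < total)
		return (0);

	memset(&hdr, 0, sizeof (hdr));
	hdr.nlmsg_len = (uint32_t)total;
	hdr.nlmsg_type = SK_NLMSG_ERROR;
	hdr.nlmsg_flags = 0;		/* single-part: no NLM_F_MULTI */
	hdr.nlmsg_seq = req->nlmsg_seq;	/* echo the request's sequence */
	hdr.nlmsg_pid = req->nlmsg_pid;	/* addressed back to the sender */

	err.error = 0;
	err.msg = *req;

	memcpy(buf, &hdr, sizeof (hdr));
	memcpy((char *)buf + sizeof (hdr), &err, sizeof (err));
	return (total);
}
```

With this layout the whole ACK is 36 bytes (a 16-byte header plus a 20-byte `sk_nlmsgerr`).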
