Stop using bubblewrap and subuids for image builds
Over the past few years, we've accumulated a rather nasty set of workarounds
for various issues in bubblewrap:

- We contributed setpgid to util-linux and use it if available because
  bubblewrap does not support making its child process the foreground
  process.
- We added the innerpid logic to run() because bubblewrap does not forward
  signals to the separate child process it runs in the sandbox. As a result,
  that child process was getting SIGKILLed when we killed bubblewrap,
  preventing proper cleanup from happening.
- bubblewrap does not provide a proper way to detect whether the command
  was found in the sandbox or not, which meant we had to execute command -v
  within the sandbox separately to check whether the command exists or not.
- We had to add extra logic to make sure / was a mount in the initramfs to
  allow running mkosi in the initramfs as bubblewrap does not fall back to
  MS_MOVE if pivot_root() doesn't work.
- We had to stitch together shell invocations after bubblewrap but before
  executing the actual command we want to run to make sure directories had
  the correct mode as bubblewrap creates everything with mode 0700 which was
  too restrictive in many cases for us. This was fixed with new --perms and
  --chmod options in bubblewrap 0.5 but we had to keep compat with 0.4
  because that's what's shipped in CentOS Stream 9.
- Debugging all the above was made all the harder by the fact that bubblewrap's
  source code is full of tech debt from its history of being a setuid tool
  instead of using user namespaces. Getting any fixes into upstream is almost
  impossible as the tool is practically unmaintained.

Aside from bubblewrap, our other source of troubles has been newuidmap/newgidmap.
Running as a user within the subuid range configured in /etc/sub{u,g}id has
meant we're constantly fixing ownership and permissions issues where stuff needs
to be chowned and chmodded everywhere to make sure the current user and the
subuid user can access the proper files. Another unfortunate side effect is that
users end up with many files owned by the subuid root user in their home
directories when building images with mkosi.

Let's fix all these issues at once by getting rid of bubblewrap and
newuidmap/newgidmap.

bubblewrap is replaced with a new tool in mkosi/cage.py. It looks and behaves a
lot like bubblewrap, except it's much less code and much more flexible to fit
our needs, allowing us to get rid of all the hacks we built up over the years to
work around issues that didn't get fixed in bubblewrap.

To get rid of newuidmap/newgidmap, a rework of our user namespacing was needed.
The need to use newuidmap/newgidmap came from the assumption that we need a full
65k subuid range to do unprivileged image builds, as distributions ship packages
containing files and directories that are not owned by the root user. After some
investigation, it turns out that there are very few files and directories not owned
by root in distribution packages if you ignore /var. If we could temporarily
ignore the ownership on these files and directories until we can get distributions
to only ship root owned files in /usr and /etc of their packages, we could simply
map the current user to root in a user namespace and get rid of the subuid range
completely.
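
As a rough illustration of the single-user mapping described above (a sketch,
not the code in this commit), a process can map its own uid and gid to root in
a fresh user namespace without newuidmap/newgidmap. The helper names
single_id_map() and become_root_in_userns() are made up for this example, and
os.unshare() assumes Python 3.12 or newer:

```python
import os


def single_id_map(outside_id: int, inside_id: int = 0) -> str:
    """Return a uid_map/gid_map line that maps exactly one ID into the namespace."""
    # Format: "<ID inside the namespace> <ID outside the namespace> <count>".
    return f"{inside_id} {outside_id} 1\n"


def become_root_in_userns() -> None:
    """Unshare a user namespace in which the calling user appears as root."""
    # Read the IDs first: after unshare() they show up as the overflow IDs
    # until a mapping has been written.
    uid, gid = os.getuid(), os.getgid()
    os.unshare(os.CLONE_NEWUSER)
    # An unprivileged process has to deny setgroups() before it may write gid_map.
    with open("/proc/self/setgroups", "w") as f:
        f.write("deny")
    with open("/proc/self/uid_map", "w") as f:
        f.write(single_id_map(uid))
    with open("/proc/self/gid_map", "w") as f:
        f.write(single_id_map(gid))
```

With only one ID mapped, every other uid and gid is unrepresentable inside the
namespace, which is why chown() to a non-root owner cannot work there.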

Turns out that's possible with a seccomp filter. seccomp allows you to make all
chown() syscalls return success without actually doing anything, so the files and
directories end up owned by the root user instead. If we accept this and are OK with
instructing users to use tmpfiles to fix up the ownership on first boot if needed,
a seccomp filter like this is sufficient to allow us to get rid of doing image
builds within a subuid user namespace.
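
To make the idea concrete, here is a minimal sketch of such a filter as a
classic BPF program. It is not the filter added by this commit: the function
names are hypothetical, the syscall numbers are the x86_64 ones, and other
architectures are simply allowed through. The key part is the final
instruction, which returns SECCOMP_RET_ERRNO with errno 0, so the kernel skips
the syscall and reports success. Installing the program is left out; simulate()
only evaluates it in Python:

```python
import struct

# Classic BPF opcodes and seccomp return values from the kernel UAPI headers.
BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06      # BPF_RET | BPF_K
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000
AUDIT_ARCH_X86_64 = 0xC000003E

# Assumption: x86_64 syscall numbers only.
CHOWN_SYSCALLS = {"chown": 92, "fchown": 93, "lchown": 94, "fchownat": 260}


def build_chown_noop_filter(syscalls=CHOWN_SYSCALLS):
    """Return (code, jt, jf, k) tuples that turn the given syscalls into no-ops."""
    numbers = sorted(syscalls.values())
    n = len(numbers)
    prog = [
        (BPF_LD_W_ABS, 0, 0, 4),                   # load seccomp_data.arch
        (BPF_JEQ_K, 0, n + 1, AUDIT_ARCH_X86_64),  # other arch: jump to ALLOW
        (BPF_LD_W_ABS, 0, 0, 0),                   # load seccomp_data.nr
    ]
    for j, nr in enumerate(numbers):
        # On a match, jump over the remaining checks and ALLOW to the last instruction.
        prog.append((BPF_JEQ_K, n - j, 0, nr))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | 0))  # "fail" with errno 0
    return prog


def pack(prog):
    """Serialize the program as an array of struct sock_filter."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def simulate(prog, arch, nr):
    """Evaluate the filter for one syscall, like the kernel would."""
    data = {0: nr, 4: arch}
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = data[k]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
```

For example, simulate(build_chown_noop_filter(), AUDIT_ARCH_X86_64, 92) yields
SECCOMP_RET_ERRNO, while any syscall number not in the table yields
SECCOMP_RET_ALLOW.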

We can go one step further. For the majority of the image build, one doesn't
actually need to be the root user. Only package managers and systemd-repart need
the current user to be mapped to root to do their job correctly. The reason we did
the entire build mapped to root was that we need to do a few mounts as part of the
image build process, and until now I was under the assumption that you needed to be
root for that. It turns out that when you unshare a user namespace, you get a full
set of capabilities regardless of whether you're root or some other uid in the user
namespace. The only difference is that when you exec a subprocess as root, the
capabilities aren't lost, whereas they are when you exec a subprocess as a non-root
user. This can be avoided by adding the capabilities of the non-root user to the
inheritable and ambient sets. Once that's done, any subprocess exec'd by a non-root
user in the user namespace can set up as many bind and overlay mounts as it needs.
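
The exec behaviour described above follows from the transformation rules in
capabilities(7). The following is a small model of those rules for a binary
without file capabilities, not code from this commit. The names Caps,
raise_ambient() and after_exec() are invented for this sketch, and ALL_CAPS
assumes capability numbers 0 through 40:

```python
from dataclasses import dataclass, replace

# Assumption: capabilities 0..40 exist, as on recent kernels.
ALL_CAPS = (1 << 41) - 1
CAP_SYS_ADMIN = 1 << 21


@dataclass(frozen=True)
class Caps:
    permitted: int
    inheritable: int
    effective: int
    ambient: int
    bounding: int


# State right after unsharing a user namespace: full permitted and effective
# sets, empty inheritable and ambient sets.
FRESH_USERNS = Caps(ALL_CAPS, 0, ALL_CAPS, 0, ALL_CAPS)


def raise_ambient(caps: Caps, mask: int) -> Caps:
    """Add the permitted capabilities in mask to the inheritable and ambient sets."""
    inheritable = caps.inheritable | (mask & caps.permitted)
    # A capability can only be ambient if it is both permitted and inheritable.
    ambient = caps.ambient | (mask & caps.permitted & inheritable)
    return replace(caps, inheritable=inheritable, ambient=ambient)


def after_exec(caps: Caps, *, as_root: bool) -> Caps:
    """Capability sets after exec'ing a binary that has no file capabilities."""
    # For root, the file permitted and inheritable sets count as all ones and
    # the file effective bit as set. For other users they are empty and unset.
    file_caps = ALL_CAPS if as_root else 0
    permitted = (caps.inheritable & file_caps) | (file_caps & caps.bounding) | caps.ambient
    effective = permitted if as_root else caps.ambient
    return replace(caps, permitted=permitted, effective=effective)
```

In this model, after_exec(FRESH_USERNS, as_root=False) ends up with an empty
effective set, whereas raising the ambient set first keeps CAP_SYS_ADMIN (and
everything else) across the exec.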

The above allows us to run most of the image build under the current user's uid
instead of root, only switching to root when running package managers, invoking
systemd-repart or systemd-tmpfiles, or when chroot-ing into the image. This allows
us to get rid of various hacks we had to look up the proper user name or home
directory.

Specifically, we can get rid of the following:

- mkosi-as-caller can become a no-op since we now run the build as the caller by
  default.
- Lots of chmod()'s and chown()'s can be removed.
- All uses of INVOKING_USER.uid/gid can be removed, and most can be replaced with
  simple os.getuid()/os.getgid().
- We can use /etc/passwd and /etc/group from the host instead of building our own.
- We can get rid of the Acl= option as the user will now be able to remove (almost)
  all files written by mkosi.
- We don't have to rchown the package manager cache directory anymore after each
  build. Root user builds will now use the system cache instead of the per user
  cache.

One thing to note is that if we're invoked as root, none of the seccomp or capabilities
logic applies; it is all skipped as it's not required in that case. This means that
when building as root it's still possible to have more than one user in the generated
image, unlike when building unprivileged. Also note that users can still be added to
/etc/passwd and such; they just can't own any files or directories in the image itself
until the image is booted.
DaanDeMeyer committed Aug 17, 2024
1 parent aa67388 commit 61dc66f
Showing 23 changed files with 846 additions and 615 deletions.
2 changes: 1 addition & 1 deletion kernel-install/50-mkosi.install
@@ -12,11 +12,11 @@ from typing import Optional
 
 from mkosi import identify_cpu
 from mkosi.archive import make_cpio
+from mkosi.cage import umask
 from mkosi.config import OutputFormat, __version__
 from mkosi.log import die, log_setup
 from mkosi.run import run, uncaught_exception_handler
 from mkosi.types import PathString
-from mkosi.util import umask
 
 
 @dataclasses.dataclass(frozen=True)
315 changes: 90 additions & 225 deletions mkosi/__init__.py

Large diffs are not rendered by default.

3 changes: 0 additions & 3 deletions mkosi/__main__.py
@@ -12,7 +12,6 @@
 from mkosi.config import parse_config
 from mkosi.log import log_setup
 from mkosi.run import find_binary, run, uncaught_exception_handler
-from mkosi.user import INVOKING_USER
 from mkosi.util import resource_path
 
 
@@ -26,8 +25,6 @@ def main() -> None:
     signal.signal(signal.SIGHUP, onsignal)
 
     log_setup()
-    # Ensure that the name and home of the user we are running as are resolved as early as possible.
-    INVOKING_USER.init()
 
     with resource_path(mkosi.resources) as resources:
         args, images = parse_config(sys.argv[1:], resources=resources)
8 changes: 6 additions & 2 deletions mkosi/archive.py
@@ -5,11 +5,12 @@
 from pathlib import Path
 from typing import Optional
 
+from mkosi.cage import umask
 from mkosi.log import log_step
 from mkosi.run import run
 from mkosi.sandbox import Mount, SandboxProtocol, finalize_passwd_mounts, nosandbox
 from mkosi.types import PathString
-from mkosi.util import chdir, umask
+from mkosi.util import chdir
 
 
 def tar_exclude_apivfs_tmp() -> list[str]:
@@ -42,6 +43,8 @@ def make_tar(src: Path, dst: Path, *, sandbox: SandboxProtocol = nosandbox) -> N
 "--pax-option=delete=atime,delete=ctime,delete=mtime",
 "--sparse",
 "--force-local",
+*(["--owner=root:0"] if os.getuid() != 0 else []),
+*(["--group=root:0"] if os.getuid() != 0 else []),
 *tar_exclude_apivfs_tmp(),
 ".",
 ],
@@ -78,7 +81,7 @@ def extract_tar(
 "--keep-directory-symlink",
 "--no-overwrite-dir",
 "--same-permissions",
-"--same-owner" if (dst / "etc/passwd").exists() else "--numeric-owner",
+"--same-owner" if (dst / "etc/passwd").exists() and os.getuid() == 0 else "--numeric-owner",
 "--same-order",
 "--acls",
 "--selinux",
@@ -120,6 +123,7 @@ def make_cpio(
 "--format=newc",
 "--quiet",
 "--directory", src,
+*(["--owner=0:0"] if os.getuid() != 0 else []),
 ],
 input="\0".join(os.fspath(f) for f in files),
 stdout=f,