Introduce mkosi-sandbox and stop using subuids for image builds #2956

DaanDeMeyer · 2024-08-16T22:21:37Z

Over the last years, we've accumulated a rather nasty set of workarounds
for various issues in bubblewrap:

- We contributed setpgid to util-linux and use it if available because
  bubblewrap does not support making its child process the foreground
  process.
- We added the innerpid logic to run() because bubblewrap does not forward
  signals to the separate child process it runs in the sandbox which meant
  they were getting SIGKILLed when we killed bubblewrap, preventing proper
  cleanup from happening.
- bubblewrap does not provide a proper way to detect whether the command
  was found in the sandbox or not, which meant we had to execute command -v
  within the sandbox separately to check whether the command exists or not.
- We had to add extra logic to make sure / was a mount in the initramfs to
  allow running mkosi in the initramfs as bubblewrap does not fall back to
  MS_MOVE if pivot_root() doesn't work.
- We had to stitch together shell invocations after bubblewrap but before
  executing the actual command we want to run to make sure directories had
  the correct mode as bubblewrap creates everything with mode 0700 which was
  too restrictive in many cases for us. This was fixed with new --perms and
  --chmod options in bubblewrap 0.5 but we had to keep compat with 0.4
  because that's what's shipped in CentOS Stream 9.
- We had to figure out a shell hack to do overlayfs mounts as these are not
  supported by bubblewrap (even though a PR for the feature has been open for
  years).
- We had to introduce a Mount struct to pass around mounts so we could deduplicate
  and sort them before passing them to bubblewrap as bubblewrap did not do this
  itself.
- Debugging all the above was made all the harder by the fact that bubblewrap's
  source code is full of tech debt from its history of being a setuid tool
  instead of using user namespaces. Getting any fixes into upstream is almost
  impossible as the tool is practically unmaintained.

Aside from bubblewrap, our other source of troubles has been newuidmap/newgidmap.
Running as a user within the subuid range configured in /etc/sub{u,g}id has
meant we're constantly fixing ownership and permissions issues where stuff needs
to be chowned and chmodded everywhere to make sure the current user and the
subuid user can access the proper files. Another unfortunate side effect is that
users end up with many files owned by the subuid root user in their home
directories when building images with mkosi.

Let's fix all these issues at once by getting rid of bubblewrap and
newuidmap/newgidmap.

bubblewrap is replaced with a new tool mkosi-sandbox. It looks and behaves a
lot like bubblewrap, except it's much less code and much more flexible to fit
our needs, allowing us to get rid of all the hacks we've built up over the years to
work around issues that didn't get fixed in bubblewrap.

To get rid of newuidmap/newgidmap, a rework of our user namespacing was needed.
The need to use newuidmap/newgidmap came from the assumption that we need a full
65k subuid range to do unprivileged image builds, as distributions ship packages
containing files and directories that are not owned by the root user. After some
investigation, it turns out that there are very few files and directories not owned
by root in distribution packages if you ignore /var. If we could temporarily
ignore the ownership on these files and directories until we can get distributions
to only ship root owned files in /usr and /etc of their packages, we could simply
map the current user to root in a user namespace and get rid of the subuid range
completely.

Turns out that's possible with a seccomp filter. seccomp allows you to make all
chown() syscalls succeed without actually doing anything. The files and directories
end up owned by the root user instead. If we assume this is OK and are OK with
instructing users to use tmpfiles to fix up the permissions on first boot if needed,
a seccomp filter like this is sufficient to allow us to get rid of doing image
builds within a subuid user namespace.
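
As a rough illustration of the kind of filter meant here, the sketch below builds a classic BPF program that returns "errno 0" (that is, success without running the syscall) for the chown() family and allows everything else. It is not the filter mkosi-sandbox installs; the helper names, the hardcoded x86_64 syscall numbers and the tiny evaluator are assumptions made for the example, and installing the program (no_new_privs plus seccomp()) is left out.

```python
import struct

# Classic BPF opcodes used by seccomp filters (values from <linux/filter.h>).
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K

SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000  # low 16 bits carry the errno; 0 reads as success

AUDIT_ARCH_X86_64 = 0xC000003E
# x86_64 syscall numbers; other architectures use different numbers.
CHOWN_SYSCALLS = {"chown": 92, "fchown": 93, "lchown": 94, "fchownat": 260}


def build_filter():
    """Return the program as a list of (code, jt, jf, k) instructions."""
    nrs = list(CHOWN_SYSCALLS.values())
    prog = [
        (BPF_LD_W_ABS, 0, 0, 4),               # load seccomp_data.arch
        (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),  # known arch: skip the next return
        (BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),  # other arch: leave untouched
        (BPF_LD_W_ABS, 0, 0, 0),               # load seccomp_data.nr
    ]
    for i, nr in enumerate(nrs):
        # On a match, jump past the remaining comparisons and the ALLOW return.
        prog.append((BPF_JEQ_K, len(nrs) - i, 0, nr))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | 0))
    return prog


def pack_filter(prog):
    """Serialize to an array of struct sock_filter (u16 code, u8 jt, u8 jf, u32 k)."""
    return b"".join(struct.pack("HBBI", *insn) for insn in prog)


def evaluate(prog, nr, arch=AUDIT_ARCH_X86_64):
    """Tiny interpreter for the three opcodes above, to check the jump offsets."""
    data = struct.pack("<iI", nr, arch)  # first two fields of struct seccomp_data
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        if code == BPF_LD_W_ABS:
            (acc,) = struct.unpack_from("<I", data, k)
            pc += 1
        elif code == BPF_JEQ_K:
            pc += 1 + (jt if acc == k else jf)
        else:  # BPF_RET_K
            return k
```

With a filter of this shape loaded, a package manager calling chown() gets a successful return value while the file stays owned by the caller, which is root inside the user namespace.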

We can go one step further. For the majority of the image build, one doesn't
actually need to be the root user. Only package
managers and systemd-repart need the current user to be mapped to root to do their
job correctly. The reason we did the entire build mapped to root until now was
that we need to do a few mounts as part of the image build process, and
I was under the assumption that you needed to be root for that. It turns out that
when you unshare a user namespace, you get a full set of capabilities regardless
of whether you're root or some other uid in the user namespace. The only difference
is that when you exec a subprocess as root, the capabilities aren't lost, whereas
they are when you exec a subprocess as a non-root user. This can be avoided by
adding the capabilities of the non-root user to the inheritable and ambient set.
Once that's done, any subprocess exec'd by a non-root user in the user namespace
can mount as many bind and overlay mounts as they can think of.
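
The reasoning above follows from the execve() transformation rules documented in capabilities(7). The sketch below models those rules for a binary without file capabilities; it only simulates the bit arithmetic and does not touch the real process. The Caps class, the helper names, the 64-bit "full" mask and the starting state with an empty inheritable set are assumptions made for the example.

```python
from dataclasses import dataclass, replace

CAP_SYS_ADMIN = 1 << 21  # bit for CAP_SYS_ADMIN, which mount() requires
FULL = (1 << 64) - 1     # illustrative "all capabilities" mask


@dataclass(frozen=True)
class Caps:
    permitted: int
    inheritable: int
    ambient: int
    bounding: int
    effective: int


def after_unshare() -> Caps:
    """Assumed state right after unsharing a user namespace: full permitted and
    effective sets, nothing inheritable or ambient yet."""
    return Caps(permitted=FULL, inheritable=0, ambient=0, bounding=FULL, effective=FULL)


def raise_ambient(caps: Caps, mask: int) -> Caps:
    """Ambient capabilities must already be both permitted and inheritable."""
    if (caps.permitted & mask) != mask or (caps.inheritable & mask) != mask:
        raise PermissionError("capability not permitted and inheritable")
    return replace(caps, ambient=caps.ambient | mask)


def execve_plain(caps: Caps, uid: int) -> Caps:
    """Apply the capabilities(7) execve() rules for a binary without file caps."""
    root = uid == 0
    # For uid 0 the file permitted and inheritable sets are treated as all ones
    # and the file effective bit as set; otherwise they are empty for this binary.
    f_inheritable = f_permitted = FULL if root else 0
    ambient = caps.ambient  # the file is not privileged, so ambient survives
    permitted = (caps.inheritable & f_inheritable) | (f_permitted & caps.bounding) | ambient
    effective = permitted if root else ambient
    return Caps(permitted, caps.inheritable, ambient, caps.bounding, effective)
```

In this model, exec'ing as uid 1000 straight after unsharing leaves an empty effective set, exec'ing as uid 0 keeps everything, and setting the inheritable set followed by raise_ambient() lets the uid 1000 subprocess keep CAP_SYS_ADMIN.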

The above allows us to run most of the image build under the current user uid
instead of root, only switching to root when running package managers, invoking
systemd-repart or systemd-tmpfiles, or when chroot-ing into the image. This allows
us to get rid of various hacks we had to look up the proper user name or home
directory.

Specifically, we can get rid of the following:

- mkosi-as-caller can become a noop since we now by default run the build as the
  caller.
- Lots of chmod()'s and chown()'s can be removed
- All uses of INVOKING_USER.uid/gid can be removed, and most can be replaced with
  simple os.getuid()/os.getgid()
- We can use /etc/passwd and /etc/group from the host instead of building our own
- We can get rid of the Acl= option as the user will now be able to remove (almost)
  all files written by mkosi.
- We don't have to rchown the package manager cache directory anymore after each
  build. Root user builds will now use the system cache instead of the per user
  cache.
- We can get rid of the Mount struct as mkosi-sandbox dedups and sorts operations
  itself.

One thing to note is that if we're invoked as root, none of the seccomp or capabilities
stuff applies and it is all skipped as it's not required in that case. This means that
when building as root it's still possible to have more than one user in the generated
image unlike when building unprivileged. Also note that users can still be added to
/etc/passwd and such, they just can't own any files or directories in the image itself
until the image is booted.

NekkoDroid · 2024-08-17T12:51:28Z

I just attempted to build my image with this and immediately ran into a problem:

‣ Copying in package manager file trees…
‣ Syncing package manager metadata for image-initrd image
Traceback (most recent call last):
  File "/home/nekko/.local/src/mkosi/mkosi/cage.py", line 543, in <module>
  File "/home/nekko/.local/src/mkosi/mkosi/cage.py", line 435, in main
  File "/home/nekko/.local/src/mkosi/mkosi/cage.py", line 188, in mount_rbind
  File "/home/nekko/.local/src/mkosi/mkosi/cage.py", line 82, in oserror
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/wsl/resolv.conf'
‣ "pacman --root=/buildroot --logfile=/dev/null --dbpath=/var/lib/pacman --cachedir=/var/cache/pacman/mkosi --cachedir=/var/cache/pacman/pkg --hookdir=/buildroot/etc/pacman.d/hooks --arch x86_64 --color auto --noconfirm --sync --refresh" returned non-zero exit code 1.

Seems it doesn't copy/use the target of /etc/resolv.conf, which I assume might also fail when using sd-resolved.

DaanDeMeyer · 2024-08-17T13:56:06Z

@NekkoDroid Should be fixed, please try again.

NekkoDroid · 2024-08-17T14:49:26Z

Indeed the resolv.conf issue is gone. Afterwards I had some issues with pacman failing to lock the database, but I managed to fix it by using mkosi -fff.

Now I was able to build the image successfully.

septatrix · 2024-08-18T10:07:43Z

I like the idea of not having to deal with bubblewrap, but I am not so sure about dropping uid mapping. I fear that most distros are not there yet, and for any larger image this would mean having to create dozens of sd-tmpfile configs (and likely still missing a few, leading to hard-to-troubleshoot bugs). Based on the motivation, these seem like independent things, so maybe it would be better to first adopt mkosi-cage and leave the uid mapping in place until a later time.

mkosi/run.py

+        binary: Optional[PathString],
+        vartmp: bool = False,
+        options: Sequence[PathString] = (),
+    ) -> AbstractContextManager[list[PathString]]: ...


mkosi/sandbox/__init__.py

+                f.write(f"{uid} {os.getuid()} 1\n".encode())
+        except OSError as e:
+            os._exit(e.errno)
+        except BaseException:
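
For context on the uid_map line written in the fragment above: each line of /proc/<pid>/uid_map has the form "inside outside count". A minimal sketch of mapping only the caller's own IDs, either to root or to themselves, might look as follows. The helper names are hypothetical, os.unshare() needs Python 3.12 or newer on Linux, and unsharing a user namespace fails in a multi-threaded process.

```python
import os


def id_map_line(inside: int, outside: int, count: int = 1) -> bytes:
    """Format one line of /proc/<pid>/uid_map or /proc/<pid>/gid_map."""
    return f"{inside} {outside} {count}\n".encode()


def unshare_user_namespace(as_root: bool) -> None:
    """Hypothetical helper: enter a new user namespace mapping only the caller."""
    uid, gid = os.getuid(), os.getgid()  # read before unsharing, they change after
    os.unshare(os.CLONE_NEWUSER)
    # An unprivileged process must deny setgroups() before it may write gid_map.
    with open("/proc/self/setgroups", "wb") as f:
        f.write(b"deny")
    with open("/proc/self/uid_map", "wb") as f:
        f.write(id_map_line(0 if as_root else uid, uid))
    with open("/proc/self/gid_map", "wb") as f:
        f.write(id_map_line(0 if as_root else gid, gid))
```

No newuidmap/newgidmap is involved here because a process is always allowed to map its own uid and gid into a user namespace it created.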


github-advanced-security bot found potential problems Aug 16, 2024

DaanDeMeyer force-pushed the var branch 10 times, most recently from 01388f7 to e659084 Compare August 17, 2024 11:01

github-advanced-security bot found potential problems Aug 17, 2024

DaanDeMeyer force-pushed the var branch from e659084 to 3d8439f Compare August 17, 2024 11:10

DaanDeMeyer force-pushed the var branch from 3d8439f to 7f8734c Compare August 17, 2024 13:55

DaanDeMeyer force-pushed the var branch 3 times, most recently from 61dc66f to 6d1eb0b Compare August 17, 2024 14:07

DaanDeMeyer force-pushed the var branch 2 times, most recently from e5848d7 to 6df286f Compare August 18, 2024 09:35

DaanDeMeyer force-pushed the var branch from 6df286f to af4f45c Compare August 18, 2024 11:17

DaanDeMeyer mentioned this pull request Aug 18, 2024

Handle unprivileged user namespaces gracefully in tests systemd/systemd#34026

Merged

DaanDeMeyer force-pushed the var branch 5 times, most recently from 0d449d9 to 14b4bcd Compare August 18, 2024 15:17

DaanDeMeyer force-pushed the var branch from a976df5 to 7ae572e Compare August 21, 2024 10:31

behrmann reviewed Aug 21, 2024

DaanDeMeyer force-pushed the var branch 2 times, most recently from d57f975 to 8c41c6b Compare August 21, 2024 13:24

github-advanced-security bot found potential problems Aug 21, 2024

DaanDeMeyer force-pushed the var branch 4 times, most recently from dd66daa to c52724b Compare August 21, 2024 16:03

behrmann reviewed Aug 21, 2024

DaanDeMeyer force-pushed the var branch 2 times, most recently from 7bcfb93 to d5d5b68 Compare August 21, 2024 20:17

github-advanced-security bot found potential problems Aug 21, 2024

DaanDeMeyer force-pushed the var branch from d5d5b68 to fb03e10 Compare August 22, 2024 07:48

DaanDeMeyer changed the title ~~Stop using bubblewrap and subuids for image builds~~ Introduce mkosi-sandbox and stop using subuids for image builds Aug 22, 2024

DaanDeMeyer added a commit to DaanDeMeyer/systemd that referenced this pull request Aug 22, 2024

mkosi: Make sure systemd/mkosi#2956 works

49b50ec

DaanDeMeyer added a commit to DaanDeMeyer/systemd that referenced this pull request Aug 22, 2024

mkosi: Make sure systemd/mkosi#2956 works

935e9aa

DaanDeMeyer force-pushed the var branch 5 times, most recently from a1b6639 to 9b16f3c Compare August 22, 2024 08:16

behrmann reviewed Aug 22, 2024

DaanDeMeyer force-pushed the var branch from 9b16f3c to a7e4fcd Compare August 22, 2024 08:39

behrmann reviewed Aug 22, 2024

DaanDeMeyer force-pushed the var branch from a7e4fcd to b3a3e7e Compare August 22, 2024 09:26

behrmann approved these changes Aug 22, 2024

DaanDeMeyer merged commit 2d338ab into systemd:main Aug 22, 2024
28 of 31 checks passed

DaanDeMeyer deleted the var branch August 22, 2024 09:28
