support User= in systemd for running rootless services #20573

Gchbg · 2022-01-09T13:12:03Z

Gchbg
Jan 9, 2022

Is this a BUG REPORT or FEATURE REQUEST?

/kind bug

Description

I want to have a systemd system service that runs a rootless container under an isolated user, but systemd rejects the sd_notify call and terminates the service.

Got notification message from PID 15150, but reception only permitted for main PID 14978

A similar problem was mentioned in #5572, which seems to have been closed without a resolution.

Happy to help tracking this down.
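
For anyone hitting the same message, a small filter like the one below can pull the two PIDs out of the journal so they can be compared with the processes of the unit. This is only a sketch: the function name parse_notify_rejection is made up for illustration, and the pattern assumes the exact wording of the systemd message quoted above.

```shell
#!/bin/sh
# parse_notify_rejection (hypothetical helper): read journal lines on stdin
# and print "<sender_pid> <main_pid>" for each rejected sd_notify message.
# Assumes the message wording shown in this report.
parse_notify_rejection() {
    sed -n 's/.*Got notification message from PID \([0-9][0-9]*\), but reception only permitted for main PID \([0-9][0-9]*\).*/\1 \2/p'
}

# Example usage (not run here):
#   journalctl -b | parse_notify_rejection | sort -u
```

Fed the line from this report, it prints the pair 15150 14978. In the log below, 15150 is the podman process and 14978 is the process logged as systemd[14978].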

Steps to reproduce the issue:

Start with a Debian testing system. Create a system user with an empty home dir, and enable lingering:

groupadd -g 200 nginx
useradd -r -s /usr/sbin/nologin -l -b /var/lib -M -g nginx -u 200 nginx
usermod -v 165536-231071 -w 165536-231071 nginx
mkdir -m 770 /var/lib/nginx
chown nginx:nginx /var/lib/nginx
loginctl enable-linger nginx

Use this unit file, adapted from podman generate systemd --new:

❯ cat /etc/systemd/system/nginx.service
[Unit]
Description=Nginx
Wants=network-online.target
After=network-online.target

[Service]
WorkingDirectory=/var/lib/nginx
User=nginx
Group=nginx
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=no
TimeoutStopSec=70
Type=notify
NotifyAccess=all
ExecStartPre=/bin/rm -f %T/%N.ctr-id
ExecStart=/usr/bin/podman run --cidfile=%T/%N.ctr-id --replace --rm -d --sdnotify=conmon --cgroups=no-conmon --name nginx nginx:mainline
ExecStop=/usr/bin/podman stop --cidfile=%T/%N.ctr-id -i
ExecStopPost=/usr/bin/podman rm --cidfile=%T/%N.ctr-id -f -i
KillMode=none

[Install]
WantedBy=default.target

❯ sudo systemctl daemon-reload

Start the unit:

❯ sudo systemctl start nginx

Describe the results you received:

Jan 09 14:54:00 Cubert systemd[1]: /etc/systemd/system/nginx.service:24: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Jan 09 14:54:00 Cubert systemd[1]: Starting Nginx...
Jan 09 14:54:00 Cubert systemd[14978]: Started podman-15150.scope.
Jan 09 14:54:00 Cubert podman[15150]: Resolving "nginx" using unqualified-search registries (/etc/containers/registries.conf)
Jan 09 14:54:00 Cubert podman[15150]: Trying to pull docker.io/library/nginx:mainline...
Jan 09 14:54:03 Cubert podman[15150]: Getting image source signatures
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a0bcbecc962ed2552e817f45127ffb3d14be31642ef3548997f58ae054deb5b2
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a2abf6c4d29d43a4bf9fbb769f524d0fb36a2edab49819c1bf3e76f409f953ea
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a9edb18cadd1336142d6567ebee31be2a03c0905eeefe26cb150de7b0fbc520b
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:589b7251471a3d5fe4daccdddfefa02bdc32ffcba0a6d6a2768bf2c401faf115
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:186b1aaa4aa6c480e92fbd982ee7c08037ef85114fbed73dbb62503f24c1dd7d
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:b4df32aa5a72e2a4316aad3414508ccd907d87b4ad177abd7cbd62fa4dab2a2f
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:589b7251471a3d5fe4daccdddfefa02bdc32ffcba0a6d6a2768bf2c401faf115
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a0bcbecc962ed2552e817f45127ffb3d14be31642ef3548997f58ae054deb5b2
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a9edb18cadd1336142d6567ebee31be2a03c0905eeefe26cb150de7b0fbc520b
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:b4df32aa5a72e2a4316aad3414508ccd907d87b4ad177abd7cbd62fa4dab2a2f
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a2abf6c4d29d43a4bf9fbb769f524d0fb36a2edab49819c1bf3e76f409f953ea
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:186b1aaa4aa6c480e92fbd982ee7c08037ef85114fbed73dbb62503f24c1dd7d
Jan 09 14:54:12 Cubert podman[15150]: Copying config sha256:605c77e624ddb75e6110f997c58876baa13f8754486b461117934b24a9dc3a85
Jan 09 14:54:12 Cubert podman[15150]: Writing manifest to image destination
Jan 09 14:54:12 Cubert podman[15150]: Storing signatures
Jan 09 14:54:12 Cubert podman[15150]:
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:12.101247642 +0200 EET m=+11.607938154 container create 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, maintainer=NGINX Docker Maintainers <[email protected]>, PODMAN_SYSTEMD_UNIT=nginx.service)
Jan 09 14:54:12 Cubert systemd[14978]: Started libcrun container.
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:00.536382139 +0200 EET m=+0.043073791 image pull  nginx:mainline
Jan 09 14:54:12 Cubert systemd[1]: [email protected]: Got notification message from PID 15150, but reception only permitted for main PID 14978
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:12.141137063 +0200 EET m=+11.647827815 container init 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, PODMAN_SYSTEMD_UNIT=nginx.service, maintainer=NGINX Docker Maintainers <[email protected]>)
Jan 09 14:54:12 Cubert systemd[1]: [email protected]: Got notification message from PID 15150, but reception only permitted for main PID 14978
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:12.145611861 +0200 EET m=+11.652302766 container start 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, PODMAN_SYSTEMD_UNIT=nginx.service, maintainer=NGINX Docker Maintainers <[email protected]>)
Jan 09 14:54:12 Cubert podman[15150]: 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
Jan 09 14:54:12 Cubert conmon[15215]: 10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
Jan 09 14:54:12 Cubert conmon[15215]: 10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Configuration complete; ready for start up
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: using the "epoll" event method
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: nginx/1.21.5
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6)
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: OS: Linux 5.15.0-2-amd64
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 524288:524288
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: start worker processes
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: start worker process 26
Jan 09 14:54:12 Cubert systemd[14978]: Started podman-15271.scope.
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: signal 3 (SIGQUIT) received, shutting down
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 26#26: gracefully shutting down
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 26#26: exiting
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 26#26: exit
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: signal 17 (SIGCHLD) received from 26
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: worker process 26 exited with code 0
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: exit
Jan 09 14:54:12 Cubert podman[15299]: 2022-01-09 14:54:12.393064442 +0200 EET m=+0.052274069 container remove 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, PODMAN_SYSTEMD_UNIT=nginx.service, maintainer=NGINX Docker Maintainers <[email protected]>)
Jan 09 14:54:12 Cubert podman[15271]: 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a
Jan 09 14:54:12 Cubert systemd[14978]: podman-15150.scope: Consumed 7.547s CPU time.
Jan 09 14:54:12 Cubert systemd[1]: nginx.service: Failed with result 'protocol'.
Jan 09 14:54:12 Cubert systemd[1]: Failed to start Nginx.

Describe the results you expected:

Nginx runs until the end of time.

Output of podman version:

Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.17.5
Built:        Thu Jan  1 02:00:00 1970
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: 'conmon: /usr/bin/conmon'
    path: /usr/bin/conmon
    version: 'conmon version 2.0.25, commit: unknown'
  cpus: 1
  distribution:
    distribution: debian
    version: unknown
  eventLogger: journald
  hostname: Cubert
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 200
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 200
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
  kernel: 5.15.0-2-amd64
  linkmode: dynamic
  logDriver: journald
  memFree: 1015083008
  memTotal: 2041786368
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version 0.17
      commit: 0e9229ae34caaebcb86f1fde18de3acaf18c6d9a
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/200/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.0.1
      commit: 6a7b16babc95b6a3056b33fb45b74a6f62262dd4
      libslirp: 4.6.1
  swapFree: 0
  swapTotal: 0
  uptime: 8h 1m 8.23s (Approximately 0.33 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/lib/nginx/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/lib/nginx/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 0
  runRoot: /run/user/200/containers
  volumePath: /var/lib/nginx/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.4
  Built: 0
  BuiltTime: Thu Jan  1 02:00:00 1970
  GitCommit: ""
  GoVersion: go1.17.5
  OsArch: linux/amd64
  Version: 3.4.4

Package info (e.g. output of apt list podman):

podman/testing,now 3.4.4+ds1-1 amd64 [installed]

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes and yes.

Additional environment details (AWS, VirtualBox, physical, etc.):

Machine is a VM.

mheon · 2022-01-10T14:49:23Z

mheon
Jan 10, 2022
Maintainer

This is a limitation on the systemd side. They will only accept notifications, or PID files, that are created by or sent by root, for security reasons - even if the User and Group of the unit file are explicitly set to start the process as a non-root user. Their recommendation was to start the container as a user service of the user in question via systemctl --user. There have been a few other issues about this, I'll try and dig them up.

eriksjolund · 2022-01-15T07:31:54Z

eriksjolund
Jan 15, 2022

Previous discussion: #9642
It contains links to some issues.

Gchbg · 2022-01-16T12:59:48Z

Gchbg
Jan 16, 2022
Author

Thank you both. For now I've worked around it by managing the service under the user's systemd which is clunky to say the least. I don't understand systemd's security argument - if the process is run as a given user, why would systemd not allow that user's process to send sd_notify? Who else could? But I guess this is no flaw of podman.

#9642 mentions some code changes that need to happen to podman for sd_notify, what are those? And have they progressed since March?

I guess you could close this issue or use it to track progress.

vrothberg · 2022-01-17T10:35:47Z

vrothberg
Jan 17, 2022
Maintainer

#9642 mentions some code changes that need to happen to podman for sd_notify, what are those? And have they progressed since March?

Yes, there is some progress. The main PID is now communicated via sd notify but there are still some remaining issues. For instance, %t resolves to the root's runtime dir - even when User=foo is set.
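
Since %t cannot be relied on here, one option is to spell out the per-user runtime directory explicitly. A minimal sketch, assuming the conventional /run/user/<uid> layout; runtime_dir_for is a hypothetical helper name, not something Podman or systemd provides:

```shell
#!/bin/sh
# runtime_dir_for USER (hypothetical helper): print the conventional
# per-user runtime directory, /run/user/<uid>, for the given account.
# Fails if the account does not exist.
runtime_dir_for() {
    uid=$(id -u "$1") || return 1
    printf '/run/user/%s\n' "$uid"
}

# Example usage (not run here):
#   podman run --cidfile="$(runtime_dir_for nginx)/nginx.ctr-id" ...
```

Inside a unit file there is no command substitution, so the path would have to be written out literally, as the /run/user/1000/%N.cid paths in the example further down this thread do.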

vrothberg · 2022-01-17T10:41:59Z

vrothberg
Jan 17, 2022
Maintainer

I think the next big thing to tackle is finding a way to lift the User= setting. While the process in ExecStart itself runs as the specified User/Group, the systemd specifiers (e.g., %t, %U) still resolve to root's values.

Gchbg · 2022-01-17T10:45:46Z

Gchbg
Jan 17, 2022
Author

[...] The main PID is now communicated via sd notify [...]

But even that is rejected by systemd, as seen in the logs above.

vrothberg · 2022-01-24T14:34:07Z

vrothberg
Jan 24, 2022
Maintainer

I fear there's not much Podman can do at the moment.

wc7086 · 2022-01-26T14:44:03Z

wc7086
Jan 26, 2022

Only after this problem is solved can Podman become truly rootless.

So I have to keep using the root account for now.

svdHero · 2022-01-31T08:34:11Z

svdHero
Jan 31, 2022

Is there a quick overview what, at the moment, the best approach / workaround is for starting podman containers with systemd as a specific non-root user?

Furthermore, if a container is run as root, is there a workaround how to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

wc7086 · 2022-01-31T08:48:36Z

wc7086
Jan 31, 2022

Furthermore, if a container is run as root, is there a workaround how to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

use -e PUID=useruid -e PGID=usergid

use id username check UID and GID

vrothberg · 2022-01-31T12:07:36Z

vrothberg
Jan 31, 2022
Maintainer

Is there a quick overview what, at the moment, the best approach / workaround is for starting podman containers with systemd as a specific non-root user?

The services need to be started and managed as the specific non-root user. Using the User= directive does not work yet.

Gchbg · 2022-01-31T19:42:01Z

Gchbg
Jan 31, 2022
Author

Is there a quick overview what, at the moment, the best approach / workaround is for starting podman containers with systemd as a specific non-root user?

For the moment my workaround is to run such containers in a systemd --user. This means that for every system service I want to run as a rootless container, I need to create a separate system user, enable linger, and run a separate systemd --user instance for that user.

It works but it's clunky, e.g. restarting Nginx is sudo su -l nginx -s /bin/sh -c 'XDG_RUNTIME_DIR="/run/user/$(id -u)" DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus" systemctl --user restart nginx' and running a command inside such a container might be something like sudo su -l nextcloud -s /bin/sh -c 'XDG_RUNTIME_DIR="/run/user/$(id -u)" DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus" podman exec -u www-data -w /var/www/html nextcloud ./occ status'.
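
The long one-liners above could be wrapped in a small function. The sketch below is not the exact command from this comment: it swaps su -l for sudo -u plus env, which is an assumption about what is acceptable on the host, and the names run and as_service_user are hypothetical. Setting DRY_RUN=1 only prints the command instead of executing it.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# as_service_user USER CMD... (hypothetical helper): run CMD as USER with
# the two variables the user manager needs, derived from the user's UID.
as_service_user() {
    svc_user=$1
    shift
    uid=$(id -u "$svc_user") || return 1
    run sudo -u "$svc_user" env \
        XDG_RUNTIME_DIR="/run/user/$uid" \
        DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$uid/bus" \
        "$@"
}

# Example usage (not run here):
#   as_service_user nginx systemctl --user restart nginx
#   as_service_user nextcloud podman exec -u www-data -w /var/www/html nextcloud ./occ status
```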

Inside these rootless containers root is mapped to the system user, which is a different uid for each service. If something inside the containers runs as non-root, that gets mapped to a high-numbered host uid by default. However with some magic on the host you can map a specific non-root uid in the container to a host uid of your choice, which can then be mapped to a different non-root uid in a different container running under a different user.
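
To make the default mapping concrete, here is a sketch that translates a container UID to a host UID using the two uidmap ranges from the podman info output at the top of this thread (container 0 to host 200, size 1; container 1 to host 165536, size 65536). The function name host_uid_for is hypothetical, and the table is hard-coded for that one setup.

```shell
#!/bin/sh
# host_uid_for CONTAINER_UID (hypothetical helper): translate a UID inside
# the container to the host UID, using "container_id host_id size" ranges
# copied from the idMappings.uidmap section of the report above.
host_uid_for() {
    cuid=$1
    while read -r cid hid size; do
        if [ "$cuid" -ge "$cid" ] && [ "$cuid" -lt $((cid + size)) ]; then
            echo $((hid + cuid - cid))
            return 0
        fi
    done <<EOF
0 200 1
1 165536 65536
EOF
    # No range covers this UID: it is unmapped in the container.
    return 1
}
```

With this table, container root (0) maps to host UID 200, and container UID 33 maps to 165536 + 33 - 1 = 165568.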

I should probably document my setup one of these days...

eriksjolund · 2022-01-31T20:09:53Z

eriksjolund
Jan 31, 2022

@Gchbg If you are running a recent systemd version (for instance by running Fedora 35), I think you could run

sudo systemd-run --machine=nginx@ --quiet --user --collect --pipe --wait systemctl --user restart nginx

No need to set DBUS_SESSION_BUS_ADDRESS and XDG_RUNTIME_DIR

svdHero · 2022-02-01T08:23:52Z

svdHero
Feb 1, 2022

@wc7086

Furthermore, if a container is run as root, is there a workaround how to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

use -e PUID=useruid -e PGID=usergid

use id username check UID and GID

Is that -e as in the podman run option --env for environment variables?

@vrothberg

Is there a quick overview what, at the moment, the best approach / workaround is for starting podman containers with systemd as a specific non-root user?

The services need to be started and managed as the specific non-root user. Using the User= directive does not work yet.

How does that relate to what @Gchbg and @eriksjolund wrote above? Do I have to run several instances of systemd or is there another way?

For systemd beginners like me, it is quite difficult to understand the various layers of abstraction and user permission between systemd, host processes and containers.
It would be really helpful to have a complete example in the podman generate docs that shows how to start a container or pod under a specific user at boot time.

After all, I would assume that this is the use case for 80 % of the users: run some container service that gets restarted automatically when the machine boots and that is as restricted as possible (by means of user permissions).

wc7086 · 2022-02-01T08:44:53Z

wc7086
Feb 1, 2022

@wc7086

Furthermore, if a container is run as root, is there a workaround how to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

use -e PUID=useruid -e PGID=usergid
use id username check UID and GID

Is that -e as in the podman run option --env for environment variables?

I got it wrong; modifying UID and GID via env requires entrypoint.sh.

https://docs.docker.com/engine/security/userns-remap/
Most of the docker documentation applies to podman.

quulah · 2023-09-20T07:46:58Z

quulah
Sep 20, 2023

@markstos Not really, just documenting the fact that you need to use --user -M when kicking these services.
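
A sketch of what "--user -M" looks like in practice, wrapped so the target user only has to be named once. It assumes a systemd recent enough to accept the USER@ form of -M, and that the unit already lives in that user's ~/.config/systemd/user; run and user_systemctl are hypothetical helper names. With DRY_RUN=1 the command is only printed.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# user_systemctl USER ARGS... (hypothetical helper): talk to USER's
# systemd --user instance from an admin account.
user_systemctl() {
    svc_user=$1
    shift
    run sudo systemctl --user -M "${svc_user}@" "$@"
}

# Example usage (not run here):
#   user_systemctl nginx daemon-reload
#   user_systemctl nginx enable --now nginx.service
```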

ppenguin · 2023-10-12T14:15:41Z

ppenguin
Oct 12, 2023

Specifically what I would like is to be able to use DynamicUser= so I don't even have to worry about pre-creating users for each service, never mind writing user units for them all.

This is also what I need, and what seems to be a pretty valid use case. Additionally, for persistent state one can still use StateDirectory and User and Group in combination with DynamicUser. The advantage is that with this one doesn't have to take care of creating StateDirectory and can use LoadCredential transparently.

I actually almost got it working but ran against a brick wall with this issue, i.e. podman(-compose) appears to choke on newuidmap, presumably because the DynamicUser environment has in some unknown way limited permissions.

Hacking away with things like AmbientCapabilities = "CAP_SETUID" and/or verifying the capabilities on newuidmap didn't make a difference.

(I got the normal systemd --user stuff working pretty well, but it's extremely cumbersome (even on a declarative system like NixOS), because you have to manually take care of ensuring the service users etc. and their respective home dirs, like @Gchbg and @tomhughes already mentioned).

Visne · 2023-11-01T23:05:58Z

Visne
Nov 1, 2023

I've kind of forgotten everything that's been tried, but what's wrong with using Type=simple

Type=simple should work as well but only for simple use cases. But it's leaving supported terrain.

Since I don't think it was mentioned yet, you should probably not do that since it can happen that the Podman process is killed while the container keeps running (see #9642 (reply in thread)).

sjpb · 2023-11-02T10:50:23Z

sjpb
Nov 2, 2023

I'm still confused why we're all having problems with this; clearly using User= is not the recommended/supported approach. So the recommended/supported approach really is to run containers as root? Am I missing something and people generally think that's ok? Non-containerised services wouldn't be running as root right? So why is it ok to run containerised services as root?

mattventura · 2023-11-02T20:37:08Z

mattventura
Nov 2, 2023

I'm still confused why we're all having problems with this; clearly using User= is not the recommended/supported approach. So the recommended/supported approach really is to run containers as root? Am I missing something and people generally think that's ok? Non-containerised services wouldn't be running as root right? So why is it ok to run containerised services as root?

Personally, I have taken to just running it as a user service with lingering enabled. It still lets me start it on boot, and manage/observe it via systemctl and journalctl.

rhatdan · 2023-11-02T21:28:41Z

rhatdan
Nov 2, 2023
Maintainer

I think at this point we should change this to a discussion. User= causes lots of issues with running podman and rootless support is fairly easy. I also recommend that people look at using rootful with --userns=auto, which will run your containers each in a unique user namespace.

13 replies

rhatdan Nov 9, 2023
Maintainer

IDMap is only available in rootful mode; the kernel does not support it in rootless mode.

If you want to share the directory between containers running in different user namespaces, then you either need to map the same group into each, or set up the directory as setgid, leak that group into each container and make the directory writable by group access, or make the directory world-writable.

kaivol Nov 9, 2023

Thank you very much for the answer!

I have just come across #17753, stating essentially the same.
It mentions "talk of relaxing" the root restriction, do you happen to know if that's still the case, and if so, where I can follow the discussion?

kaivol Nov 12, 2023

If you want to share the directory between containers running in different user namespaces, then you either need to map the same group into each, or set up the directory as setgid, leak that group into each container and make the directory writable by group access, or make the directory world-writable.

@rhatdan I would be very grateful if you could elaborate on this approach a little further.

If I understand correctly, you propose creating a new group which owns the directory of interest (directory writable by group and with setgid bit set, and probably also setting default ACLs like setfacl -d -m g::rwx).
But at this point I'm unsure how to proceed:

should my podman user join the group?
should the group be added to the podman user's subgids?
what do I need to configure in podman to use the group inside of the container (--userns=auto:gidmapping=?, --group-add=?, run.oci.keep_original_groups=1)?

I am happy about every pointer!

markstos Nov 12, 2023

I'm also interested in this. Each container will have their own data directory from the host's perspective, but the host wants the files created there to have same owner/group across all of them. For my threat model, it's fine if all the containers run as the same non-root host user.

I seem to recall a podman option that allowed all user ids within a container to be squashed to a single user id from the host's perspective. I can't find an option like that now, but it might help for me if it exists.

I've added the book to my "to read" list.

So far I'm having luck using --userns=auto as long as my container only needs to read data from a bind mount and not write it back as a certain host user.

rhatdan Nov 12, 2023
Maintainer

Podman allows you to squash all of the UIDs/GIDs in an image to a single UID. But if the process within the container attempts to setuid to a different User, then Podman does nothing, the kernel would take over and either deny or allow it based on the user namespace the container is running in.

The primary use case of squashing the UIDs/GIDs is for users with only their own UID, i.e., no entries in /etc/subuid and /etc/subgid.

rhatdan · 2023-11-12T13:31:43Z

rhatdan
Nov 12, 2023
Maintainer

If you setup a directory that is setgid to a GID (foobar), and the user running podman is in the foobar group, then you can leak the foobar group into the container with --group-add keep-groups. This should allow any containers written this way to write to the group.
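
The host side of that setup might look like the sketch below. The helper name make_shared_dir, the mode 2775 and the example path are assumptions for illustration; only the setgid directory and --group-add keep-groups come from the comment above.

```shell
#!/bin/sh
# make_shared_dir DIR GROUP (hypothetical helper): create DIR, hand it to
# GROUP, and set the setgid bit so new files inherit the group.
# Mode 2775 = setgid + rwx for owner and group, r-x for others.
make_shared_dir() {
    dir=$1
    group=$2
    mkdir -p "$dir" &&
    chgrp "$group" "$dir" &&
    chmod 2775 "$dir"
}

# Example usage (not run here; /srv/shared and IMAGE are placeholders):
#   make_shared_dir /srv/shared foobar
#   podman run --group-add keep-groups -v /srv/shared:/data IMAGE
```

The user running podman still has to be a member of the group for keep-groups to have anything to pass through.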

rhatdan · 2023-11-12T13:32:20Z

rhatdan
Nov 12, 2023
Maintainer

@giuseppe WDYT of allowing something like.

podman run --userns=auto --gidmap=1000:1000:1 alpine cat /proc/self/uid_map

Error: --userns and --uidmap/--gidmap/--subuidname/--subgidname are mutually exclusive

2 replies

giuseppe Nov 13, 2023
Maintainer

if you want to force a specific mapping with --userns=auto you can use the following command:

$ podman run --rm --userns=auto:gidmapping=1:1:1 fedora cat /proc/self/gid_map
         0          2          1
         2          3       1022
         1          1          1

rhatdan Nov 13, 2023
Maintainer

Nice, so you can do this for two different containers as well, with both sharing the same GID.

eriksjolund · 2023-11-19T17:53:55Z

eriksjolund
Nov 19, 2023

I tried some more with User=test, Type=notify but this time I tried to stay as close as possible to the style of services that Quadlet generates. The service started and the nginx container is active.

$ curl localhost:80 | head -4
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>

I don't know how robust the solution is but at least something is working.

/etc/systemd/system/example3.service

[Unit]
Wants=network-online.target
After=network-online.target
[email protected]
[email protected]
RequiresMountsFor=/run/user/1000/containers

[Service]
User=test
Environment=PODMAN_SYSTEMD_UNIT=%n
KillMode=mixed
ExecStop=/usr/bin/podman rm -f -i --cidfile=/run/user/1000/%N.cid
ExecStopPost=-/usr/bin/podman rm -f -i --cidfile=/run/user/1000/%N.cid
Delegate=yes
Type=notify
NotifyAccess=all
SyslogIdentifier=%N
ExecStart=/usr/bin/podman run \
     --cidfile=/run/user/1000/%N.cid \
     --cgroups=split \
     --rm \
     --env "NGINX=3;" \
      -d \
     --replace \
     --name systemd-%N \
     --sdnotify=conmon \
     docker.io/library/nginx

/etc/systemd/system/example3.socket

[Unit]
Description=Example 3 socket

[Socket]
ListenStream=0.0.0.0:80

[Install]
WantedBy=sockets.target

Note that rootless podman runs the nginx container with socket activation (port 80) without being blocked by the ip_unprivileged_port_start value which is normally set to 1024.

$ cat /proc/sys/net/ipv4/ip_unprivileged_port_start
1024
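
As far as I understand it, this works because the listening socket is bound by the system manager that owns example3.socket and is then handed to the container, so the unprivileged process never binds port 80 itself. A small sketch for checking whether a given port would need that kind of help; port_is_privileged is a hypothetical helper name, and the optional second argument exists only so the threshold can be overridden:

```shell
#!/bin/sh
# port_is_privileged PORT [START] (hypothetical helper): succeed if PORT is
# below the unprivileged port threshold. START defaults to the kernel's
# current ip_unprivileged_port_start value.
port_is_privileged() {
    port=$1
    start=${2:-$(cat /proc/sys/net/ipv4/ip_unprivileged_port_start)}
    [ "$port" -lt "$start" ]
}

# Example usage (not run here):
#   if port_is_privileged 80; then echo "use socket activation or a higher port"; fi
```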

I added this as Example 3 in
https://github.com/eriksjolund/podman-nginx-socket-activation

sjpb · 2024-01-25T10:28:43Z

sjpb
Jan 25, 2024

In case it helps anyone else trying to use rootless podman to run systemd user units, configured by ansible, on Rocky8 or Rocky9 I did a bit of experimenting here: https://github.com/sjpb/systemd-podman-experiments.

The conclusion I drew is that actually this is still pretty user-unfriendly for my use-case and I think it'd still be really nice to just be able to set User= so that e.g. you can start/stop the service and see logging etc just using a user with sudo rights. I can see that sudo isn't the systemd way so that's probably not considered helpful though! But I'm putting this out here in case there's some useful discussion/things I've missed.

1 reply

markstos Jan 26, 2024

I also considered trying to get this to work and gave up.

My attempts to use --userns=auto only worked for a first container that had that, but not a second.

I guess the automatically selected ranges conflicted.

My next attempt might be to set explicitly non-overlapping ranges for the containers, or to get --userns=auto to work for multiple containers. (Maybe I need to set a much larger pool of IDs in the first place? The default range had something like 65,000 IDs in it, and I only need fewer than 10 per container, so it seems like there should be plenty.)

robbycuenot · 2024-02-26T23:20:15Z

robbycuenot
Feb 26, 2024

I've been working on setting up a bare-metal environment that runs exclusively in ram on fcos and boots over pxe with an ignition file. My goal is to go from zero -> running github actions container / terraform cloud agent container securely without manual intervention.

Admittedly I am pretty unfamiliar with linux internals and systemd, but I am trying to familiarize myself. Ignition only supports systemd for launching services, so understanding this is pretty crucial.

At this point I've been able to figure out every step of my config, but getting rootless containers working has been a thorn in my side. Trying to set the User= parameter brought me to this thread. I've also tried having systemd launch a script that invokes systemd-run --machine=, and while that works when running interactively, when launching the service it causes systemd to close the connection for some reason.

I'm a bit stumped here and am reverting back to running as root for now, but I'm open to any pointers (including this ignition process as a whole):

Machine boots UEFI PXE w/ Secureboot
DHCP sends shim.efi
shim.efi loads grubx64.efi
grub loads fcos kernel and initrd
Ignition file is loaded and executed
- Management user created with public key
- ghactions and tfcagent users created
- udev file created to allow users to use the TPM
- helper scripts downloaded
- TPM encrypted api keys downloaded
- systemd definition created
  - launches helper script
  - helper script decrypts api keys
  - api keys passed in as podman -e args
  - podman should launch the container(s) as rootless tfcagent / ghactions user

joshqou · 2024-03-23T10:51:59Z

joshqou
Mar 23, 2024

+1 As quadlets were created with the idea of "integrating better with systemd", not supporting the systemd way of managing rootless services does not make sense.

markstos · 2024-03-26T13:23:58Z

markstos
Mar 26, 2024

What's the next-best-thing approach for running a container as systemd service, but as a non-root user that's easy and repeatable?

I tried --userns=auto, but it failed for the second container. Apparently the first container used the entire available range of IDs.
I tried using UIDmap= today with Podman 4.9.3, but got a strange error: "Unmounting /var/lib/containers/storage/overlay/81a4cfebf5de5af08b205735ae3d3394118bf0496a5031fcd3f0a2959a65d44f/merged: invalid argument". If there was a problem with my UID mapping, it would be great if Podman could validate that as reasonable before failing like this.

I would be happy if all uids in the container were collapsed to a single user and I could specify that user with User=. For the common case of a container that runs a single service, it seems like a single uid for all processes in the container should work.

13 replies

rhatdan Mar 27, 2024
Maintainer

Yes it does. --userns=auto will look at all containers created and pick a range of UIDs/GIDs that are not in use.

If you run with --userns=auto in rootless mode, then you can quickly run out of UIDs/GIDs. In rootful mode you have around 4 billion UIDs to share. In rootless by default you only have 65k.
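The arithmetic behind this can be sketched in shell. The entry below follows the /etc/subuid format (name:start:count); the user name, the helper function names, and the namespace sizes of 65536 and 1024 IDs are illustrative assumptions, not values Podman is guaranteed to pick.

```shell
# Extract the count field from a subuid entry (name:start:count).
subuid_count() {
    printf '%s\n' "$1" | awk -F: '{ print $3 }'
}

# Number of non-overlapping user namespaces of size $2 that fit into $1 IDs.
namespaces_that_fit() {
    echo $(( $1 / $2 ))
}

entry="exampleuser:165536:65536"   # hypothetical entry
count=$(subuid_count "$entry")

echo "subordinate IDs: $count"
echo "fit at 65536 IDs each: $(namespaces_that_fit "$count" 65536)"
echo "fit at 1024 IDs each: $(namespaces_that_fit "$count" 1024)"
```

If one container claims a namespace as large as the whole allocation, nothing is left for a second one.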

giuseppe Mar 27, 2024
Maintainer

Well one of the big selling points of podman is rootless mode so why make it so hard to actually use? Yes you can use user services but they're a massive pain to work with - you have to enable linger, quadlets are tied to a directory based on the numeric UID which isn't very portable, and actually using systemctl to manage user services for a different user requires a weird baroque command that you have to google every time you want to use it :-(

rootless podman is "running podman in a user namespace" + "running workload in a user namespace".

If you use --userns with root, you only lose the first half of rootless. Once the container runs there are no differences between root with userns and rootless.

The confusion here seems to be around User=. That doesn't set the environment for rootless to work correctly: it does not set the user session so there are no tmp dirs, as well as no journal for logs.
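One way to check whether that environment is in place is to look for the per-user runtime directory from inside the service's context. This is a minimal sketch: the function name is hypothetical, and the /run/user/UID fallback assumes the usual systemd-logind layout.

```shell
# Report whether the per-user runtime directory exists and is writable.
# Falls back to /run/user/<uid> when XDG_RUNTIME_DIR is unset (assumption).
check_runtime_dir() {
    dir="${XDG_RUNTIME_DIR:-/run/user/$(id -u)}"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir"
    else
        echo "missing or not writable: $dir" >&2
        return 1
    fi
}

check_runtime_dir || echo "no usable runtime directory; a user session does not appear to be set up"
```

Running this as an ExecStartPre= step of a unit with User= set would show what the service actually sees, which can differ from an interactive login.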

mheon Mar 27, 2024
Maintainer

I will note that the issues @giuseppe raised are fundamentally systemd problems - if those were fixed, rootless Podman would work with User= without issue. If you'd like this to work, please talk to the systemd maintainers; there's nothing the Podman team can do about it. For reference, we have had this conversation with the systemd team in the past, and the recommended solution back then was to use systemd --user instead of the main systemd instance + User=

runiq Mar 27, 2024

In rootful mode you have around 4 billion UIDs to share. In rootless by default you only have 65k.

@rhatdan Is there a way to work around this by e.g. changing the overflowuid value for a user namespace? Or is that a system-wide setting?

rhatdan Mar 27, 2024
Maintainer

No way that I know of.

dv1618 · 2024-06-09T09:58:08Z

dv1618
Jun 9, 2024

I was able to run rootless podman containers using systemd units with "User=" without using the --user option.

I just use normal service commands like systemctl stop podman-web or systemctl start podman-web.

Podman was run with the following options, where 1111 is the serviceuser's uid:

--cidfile=/run/user/1111/podman-web.ctr-id
--cgroups=no-conmon
--sdnotify=conmon

Systemd unit file has the following lines (I posted only the most important lines):

Environment=PODMAN_SYSTEMD_UNIT=podman-web.service
NotifyAccess=all
Type=notify
User=serviceuser
Group=servicegroup
Delegate=yes

Subuids were configured for the serviceuser:

$ cat /etc/subuid | grep serviceuser

serviceuser:165536:65536

Also, linger state was enabled for the serviceuser account:

# loginctl enable-linger serviceuser
# ls /var/lib/systemd/linger | grep serviceuser
serviceuser

The key option is Delegate=yes; without it, systemd stopped the service with the following error in the logs:

[email protected]: Got notification message from PID 161000, but reception only permitted for main PID 69181

@eriksjolund posted this option in his message.
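Put together, the pieces above might form a unit along these lines. This is a sketch only: the Description, the [Unit] and [Install] sections, the -d, --rm, --replace and --name flags, the ExecStop line, and the image registry.example/web:latest are assumptions filled in for illustration (their shape is borrowed from the unit in the original report at the top of this thread). The file is written to a scratch directory for review rather than installed.

```shell
# Write a candidate unit to a scratch directory for review (not installed).
outdir=$(mktemp -d)
unit="$outdir/podman-web.service"

# Quoted heredoc delimiter: nothing inside is expanded by the shell.
cat > "$unit" <<'EOF'
[Unit]
Description=Rootless web container (example)
Wants=network-online.target
After=network-online.target

[Service]
User=serviceuser
Group=servicegroup
Type=notify
NotifyAccess=all
Delegate=yes
Environment=PODMAN_SYSTEMD_UNIT=podman-web.service
# Image name and the -d/--rm/--replace/--name flags are illustrative assumptions.
ExecStart=/usr/bin/podman run --cidfile=/run/user/1111/podman-web.ctr-id --cgroups=no-conmon --sdnotify=conmon -d --rm --replace --name podman-web registry.example/web:latest
ExecStop=/usr/bin/podman stop --cidfile=/run/user/1111/podman-web.ctr-id -i

[Install]
WantedBy=default.target
EOF

echo "wrote $unit"
```

After review, such a file would go under /etc/systemd/system/, followed by systemctl daemon-reload, as in the original report.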

1 reply

sjpb Jul 23, 2024

Interesting, thank you. We use something similar except without Delegate=yes and without podman .. --cidfile and it does work.

vlk-charles · 2024-07-22T00:29:26Z

vlk-charles
Jul 22, 2024

This seems to be the magic combination that makes --sdnotify conmon work:

Type=notify
NotifyAccess=all
Delegate=yes

--sdnotify healthy, however, crashes with the following:

Jul 21 23:58:12 systemd[1]: Starting rootless Podman notify...
Jul 21 23:58:14 podman[2561982]: panic: operation not permitted
Jul 21 23:58:14 podman[2561982]: goroutine 103 [running]:
Jul 21 23:58:14 podman[2561982]: panic({0x556e716b2ec0?, 0xc000754cf0?})
[very long trace cut out]
Jul 21 23:58:14 conmon[2562059]: conmon c8c03da35012b398b195 <nwarn>: Failed to write to remote console socket
Jul 21 23:58:14 systemd-coredump[2562092]: Process 2561982 (podman) of user 987 dumped core.
Jul 21 23:59:42 systemd[1]: rootless-podman-notify.service: start operation timed out. Terminating.
Jul 21 23:59:44 systemd[1]: rootless-podman-notify.service: Failed with result 'timeout'.
Jul 21 23:59:44 systemd[1]: Failed to start rootless Podman notify.
Jul 21 23:59:44 systemd[1]: rootless-podman-notify.service: Consumed 1.279s CPU time.

2 replies

markstos Jul 22, 2024

In an attempt to clarify this method, here are the docs for all the mentioned options. I have not yet tested this method myself. Let's start with the podman options:

podman run options

--cidfile=file Write the container ID to file.
--cgroups=no-conmon Determines whether the container creates CGroups. The no-conmon option disables a new CGroup only for the conmon process.
--sdnotify=conmon sets MAINPID to conmon's pid, and sends READY when the container has started. The socket is never passed to the runtime or the container.

systemd unit options.

Type=notify Behavior of notify is similar to exec; however, it is expected that the service sends a "READY=1" notification message via sd_notify(3) or an equivalent call when it has finished starting up. If this option is used, NotifyAccess= (see below) should be set to open access to the notification socket provided by systemd.
NotifyAccess=all NotifyAccess= Controls access to the service status notification socket, as accessible via the sd_notify(3) call. If all, all services updates from all members of the service's control group are accepted.
Delegate=yes Turns on delegation of further resource control partitioning to processes of the unit. Units where this is enabled may create and manage their own private subhierarchy of control groups below the control group of the unit itself. For unprivileged services (i.e. those using the User= setting) the unit's control group will be made accessible to the relevant user.

Interpretation

It's not clear why including --cidfile matters. The other bits don't seem to reference it.
Maybe --cgroups=no-conmon is trying to say in the docs that the conmon process is in the same cgroup as the main process, which might support the other bits.
--sdnotify=conmon and Type=notify work together: the unit expects a readiness notification, and --sdnotify=conmon sends one. However, it seems that "ready" is issued as soon as the container considers itself started, not necessarily when the contained app considers itself ready. So while this solves the User= problem, it may signal readiness too early, which might be a problem for slow-starting apps.
The docs for NotifyAccess=all also mention the sdnotify logic, so I presume this is required for the notifications to get through.
Delegate=yes sounds like it might be related to allowing the service to run as a different user, but the docs aren't clear here.

Summary

It's amazing that someone found a combination of flags and directives which seems to enable User=, because even from reading all their docs together, it's still not clear how they could work together to accomplish that!

runiq Jul 22, 2024

Thanks for the overview! Shallow and pedantic, but:

--cgroups=no-conman

conmon

--sdnotify=conman

conmon

Edit: Some clarifications, as far as I know:

It's not clear why including --cidfile matters. The other bits don't seem to reference it.

The CID file is used as a sentinel so the service doesn't leak resources while stopping. Cf. #13236

Maybe --cgroups=no-conman is trying to say in the docs that the conman process is in the same cgroup as the main process, which might support the other bits.

Yes.

The docs for NotifyAccess=all also mention the sdnotify logic, so I presume this is required for the notifications to get through.

It's not the conmon process that handles service readiness, but the podman process. Since systemd looks at the conmon process to determine service uptime, NotifyAccess=all is required so systemd doesn't ignore the READY=1 signal from Podman.

Delegate=yes sounds like it might be related to allowing the service to run as a different user, but the docs aren't clear here.

No, delegation here refers to cgroup handling. Normally only systemd is allowed to create and remove cgroups, but with Delegate=yes, a service is allowed to create its own cgroup hierarchy at its point in the cgroup tree. --cgroups=no-conmon plays into that as well: it separates the resources for conmon (the monitor) from the resources for the actual service.
