forked from systemd/systemd
TODO
Bugfixes:
* Many manager configuration settings that are only applicable to user
manager or system manager can be always set. It would be better to reject
them when parsing config.
* Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
External:
* Fedora: add an rpmlint check that verifies that all unit files in the RPM are listed in %systemd_post macros.
* dbus:
- natively watch for dbus-*.service symlinks (PENDING)
- teach dbus to activate all services it finds in /etc/systemd/services/org-*.service
* fedora: suggest auto-restart on failure, but not on success and not on coredump. also, ask people to think about changing the start limit logic. Also point people to RestartPreventExitStatus=, SuccessExitStatus=
* neither pkexec nor sudo initialize environ[] from the PAM environment?
* fedora: update policy to declare access mode and ownership of unit files to root:root 0644, and add an rpmlint check for it
* register catalog database signature as file magic
* zsh shell completion:
- <command> <verb> -<TAB> should complete options, but currently does not
- systemctl add-wants,add-requires
- systemctl reboot --boot-loader-entry=
* systemctl status should know about 'systemd-analyze calendar ... --iterations='
* If timer has just OnInactiveSec=..., it should fire after a specified time
after being started.
* write blog stories about:
- hwdb: what belongs into it, lsusb
- enabling dbus services
- how to make changes to sysctl and sysfs attributes
- remote access
- how to pass throw-away units to systemd, or dynamically change properties of existing units
- auto-restart
- how to develop against journal browsing APIs
- the journal HTTP iface
- non-cgroup resource management
- dynamic resource management with cgroups
- refreshed, longer mission statement
- calendar time events
- init=/bin/sh vs. "emergency" mode, vs. "rescue" mode, vs. "multi-user" mode, vs. "graphical" mode, and the debug shell
- how to create your own target
- instantiated apache, dovecot and so on
- hooking a script into various stages of shutdown/early boot
Regularly:
* look for close() vs. close_nointr() vs. close_nointr_nofail()
* check for strerror(r) instead of strerror(-r)
* pahole
* set_put(), hashmap_put() return values check. i.e. == 0 does not free()!
* use secure_getenv() instead of getenv() where appropriate
* link up selected blog stories from man pages and unit files Documentation= fields
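The strerror(r) versus strerror(-r) check in the list above comes from the convention that functions return a negative errno-style value on failure, so the sign has to be flipped before the code is turned into a message. The following is a minimal sketch of that pitfall, written in Python rather than C purely for illustration; try_open() and describe() are hypothetical helpers, not systemd functions.

```python
import errno
import os


def try_open(path):
    """Hypothetical helper: return an fd (>= 0) on success, or -errno on failure."""
    try:
        return os.open(path, os.O_RDONLY)
    except OSError as e:
        return -e.errno


def describe(r):
    """Turn a negative-errno return value into a message.

    The sign flip is the whole point: os.strerror(r) with a negative r
    would not describe the actual error.
    """
    assert r < 0
    return os.strerror(-r)


# Assumes this path does not exist on the system running the sketch.
r = try_open("/this/path/should/not/exist")
message = describe(r)
```

The same review applies to any call site where the error variable is passed on unchanged to a function that expects a positive errno.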
Janitorial Clean-ups:
* rework mount.c and swap.c to follow proper state enumeration/deserialization
semantics, like we do for device.c now
* get rid of prefix_roota() and similar, only use chase() and related
calls instead.
* get rid of basename() and replace by path_extract_filename()
* Replace our fstype_is_network() with a call to libmount's mnt_fstype_is_netfs()?
Having two lists is not nice, but maybe it's now worth making a dependency on
libmount for something so trivial.
* drop set_free_free() and switch things over from string_hash_ops to
string_hash_ops_free everywhere, so that destruction is implicit rather than
explicit. Similarly for other special hashmap/set/ordered_hashmap destructors.
* generators sometimes apply C escaping and sometimes specifier escaping to
paths and similar strings they write out. Sometimes both. We should clean
this up, and should probably always apply both, i.e. introduce
unit_file_escape() or so, which applies both.
* xopenat() should pin the parent dir of the inode it creates before doing its
thing, so that it can create, open, label somewhat atomically.
* use CHASE_MUST_BE_DIRECTORY and CHASE_MUST_BE_REGULAR at more places (the
majority of places that currently employ chase() probably should use this)
Deprecations and removals:
* Remove any support for booting without /usr pre-mounted in the initrd entirely.
Update INITRD_INTERFACE.md accordingly.
* remove cgroups v1 support EOY 2023. As per
https://lists.freedesktop.org/archives/systemd-devel/2022-July/048120.html
and then rework cgroupsv2 support around fds, i.e. keep one fd per active
unit around, and always operate on that, instead of cgroup fs paths.
* drop support for getrandom()-less kernels. (GRND_INSECURE means once kernel
5.6 becomes our baseline). See
https://github.com/systemd/systemd/pull/24101#issuecomment-1193966468 for
details. Maybe before that: add taint-flags/warn about kernels that lack
getrandom()/environments where it is blocked.
* drop support for LOOP_CONFIGURE-less loopback block devices, once kernel
baseline is 5.8.
* drop fd_is_mount_point() fallback mess once we can rely on
STATX_ATTR_MOUNT_ROOT to exist i.e. kernel baseline 5.8
* Once baseline is 5.10, remove support for MS_NOSYMFOLLOW-less kernels
* Remove /dev/mem ACPI FPDT parsing when /sys/firmware/acpi/fpdt is ubiquitous.
That requires distros to enable CONFIG_ACPI_FPDT, and have kernels v5.12 for
x86 and v6.2 for arm.
* Once baseline is 4.13, remove support for INTERFACE_OLD= checks in "udevadm
trigger"'s waiting logic, since we can then rely on uuid-tagged uevents
* In v260: remove support for deprecated FactoryReset EFI variable in
systemd-repart, replaced by FactoryResetRequest.
Features:
* pcrextend: when we fail to measure, reboot the system (at least optionally).
important because certain measurements are supposed to "destroy" tpm object
access.
* pcrextend: after measuring get an immediate quote from the TPM, and validate
it. if it doesn't check out, i.e. the measurement we made doesn't appear in
the PCR then also reboot.
* cryptsetup: add boolean for disabling use of any password/recovery key slots.
* dissect: when mounting a file system, look into certain xattrs on / in them, and
if that exists, check if gpt partition flags + type uuid + uuid match the
data encoded therein, so that attackers cannot make us misuse our file
systems
* complete varlink introspection comments:
- io.systemd.BootControl
- io.systemd.Hostname
- io.systemd.Journal
- io.systemd.ManagedOOM
- io.systemd.MountFileSystem
- io.systemd.Network
- io.systemd.PCRExtend
- io.systemd.PCRLock
- io.systemd.Resolve.Monitor
- io.systemd.Resolve
- io.systemd.oom
- io.systemd.sysext
* dissect: instead of searching for root and /usr partitions first, look for
verity signature partitions first instead, then match up what we find with
locally available keys, and then use the first that works.
* gpt-auto-root doesn't take image policy into account.
* maybe define a /etc/machine-info field for the ANSI color to associate with a
hostname. Then use it for the shell prompt to highlight the hostname. Maybe
even hash it from the hostname as a fallback, in a reasonable way.
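One possible shape of the hostname-hash fallback mentioned above, as a rough sketch only. The choice of SHA-256 and of the 256-color palette are assumptions made for illustration; the item itself specifies neither, and an explicitly configured color would presumably take precedence over the hashed one.

```python
import hashlib


def hostname_color(hostname):
    """Map a hostname to a stable index in the 256-color ANSI palette.

    Only the 6x6x6 color cube is used, and each component is kept at 1..5
    so that the darkest entries of the cube are never selected.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    r, g, b = (1 + digest[i] % 5 for i in range(3))
    return 16 + 36 * r + 6 * g + b


def colored_hostname(hostname):
    """Wrap the hostname in SGR sequences selecting the hashed color."""
    return f"\x1b[38;5;{hostname_color(hostname)}m{hostname}\x1b[0m"
```

Because the color is derived only from the name, the same host gets the same color in every shell without any stored state.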
* unify how blockdev_get_root() and sysupdate find the default root block device
* Maybe rename pkcs7 and public verbs of systemd-keyutil to be more verb like.
* add "homectl export" and "homectl import" that gets you an "atomic" snapshot
of your homedir, i.e. either a tarball or a snapshot of the underlying disk
(use FREEZE/THAW to make it consistent, btrfs snapshots)
* maybe introduce a new partition that we can store debug logs and similar at
the very last moment of shutdown. idea would be to store reference to block
device (major + minor + partition id + diskseq?) in /run somewhere, then use
that from systemd-shutdown, just write a raw JSON blob into the partition.
Include timestamp, boot id and such, plus kmsg. on next boot immediately
import into journal. maybe use timestamp for making clock more monotonic.
also use this to detect unclean shutdowns, boot into special target if
detected
* fix homed/homectl confusion around terminology, i.e. "home directory"
vs. "home" vs. "home area". Stick to one term for the concept, and it
probably shouldn't contain "area".
* sd-boot: do something useful if we find exactly zero entries (ignoring items
such as reboot/poweroff/factory reset). Show a help text or so.
* sd-boot: optionally ask for confirmation before executing certain operations
(e.g. factory resets, storagetm with world access, and so on)
* add field to bls type 1 and type 2 profiles that ensures an item is never
considered for automatic selection
* add "conditions" for bls type 1 and type 2 profiles that allow suppressing
them under various conditions: 1. if tpm2 is available or not available;
2. if sb is on or off; 3. if we are netbooted or not; …
* logind: invoke a service manager for "area" logins too. i.e. instantiate
[email protected] also for logins where XDG_AREA is set, in per-area fashion, and
ref count it properly. Benefit: graphical logins should start working with
the area logic.
* repart: introduce concept of "ghost" partitions, that we setup in almost all
ways like other partitions, but do not actually register in the actual gpt
table, but only tell the kernel about via BLKPG ioctl. These partitions are
disk backed (hence can be large), but not persistent (as they are invisible
on next boot). Could be used by live media and similar, to boot up as usual
but automatically start at zero on each boot. There should also be a way to
make ghost partitions properly persistent on request.
* repart: introduce MigrateFileSystem= or so which is a bit like
CopyFiles=/CopyBlocks= but operates via btrfs device logic: adds target as
new device then removes source from btrfs. Usecase: a live medium which uses
"ghost" partitions as suggested above, which can become persistent on request
on another device.
* make nspawn containers, portable services and vmspawn VMs optionally survive
soft reboot wholesale.
* Turn systemd-networkd-wait-online into a small varlink service that people
can talk to and specify exactly what to wait for via a method call, and get a
response back once that level of "online" is reached.
* introduce a small "systemd-installer" tool or so, that glues
systemd-repart-as-installer and bootctl-install into one. Would just
interactively ask user for target disk (with completion and so on), and then do
two varlink calls to the two tools with the right parameters. To support
"offline" operation, optionally invoke the two tools directly as child
processes with varlink communication over socketpair(). This all should be
useful as blueprint for graphical installers which should do the same.
* make "systemd-vmspawn -n" work unprivileged properly, i.e. acquire tap netif
from nsresourced.
* Make run0 forward various signals to the forked process so that sending
signals to a child process works roughly the same regardless of whether the
child process is spawned via run0 or not.
* write a document explaining how to write correct udev rules. Mention things
such as:
1. do not do lists of vid/pid matches, use hwdb for that
2. add|change action matches are typically wrong, should be != remove
3. use GOTO, make rules short
4. people shouldn't try to make rules files non-world-readable
* make killing more debuggable: when we kill a service do so setting the
.si_code field with a little bit of info. Specifically, we can set a
recognizable value to first of all indicate that it's systemd that did the
killing. Secondly, we can give a reason for the killing, i.e. OOM or so, and
also the phase we are in, and which process we think we are killing (i.e.
main vs control process, useful in case of sd_notify() MAINPID= debugging).
Net result: people who try to debug why their process gets killed should have
some minimal, nice metadata directly on the signal event.
* sd-boot/sd-stub: install a uefi "handle" to a sidecar dir of bls type #1
entries with an "uki" or "uki-url" stanza, and make sd-stub look for
that. That way we can parameterize type #1 entries nicely.
* add a system-wide seccomp filter list for syscalls, kill "acct()" "@obsolete"
and a few other legacy syscalls that way.
* maybe introduce "@icky" as a seccomp filter group, which contains acct() and
certain other syscalls that aren't quite obsolete, but certainly icky.
* revisit how we pass fs images and initrd to the kernel. take uefi http boot
ramdisks as inspiration: for any confext/sysext/initrd erofs/DDI image simply
generate a fake pmem region in the UEFI memory tables, that Linux then turns
into /dev/pmemX. Then turn off cpio-based initrd logic in linux kernel,
instead let kernel boot directly into /dev/pmem0. In order to allow our usual
cpio-based parameterization, teach PID 1 to just uncompress cpio ourselves
early on, from another pmem device. (Related to this, maybe introduce a new
PE section .ramdisk that just synthesizes pmem devices from arbitrary
blobs. Could be particularly useful in add-ons)
* also parse out primary GPT disk label uuid from gpt partition device path at
boot and pass it as efi var to OS.
* maybe rework invocation of stub's inner PE payload: since we already parse PE
anyway, maybe jump directly into the image, after finding the linux UEFI
entrypoint. After all we invest quite some effort to disable
validation/measurement of the inner image, i.e. we want nothing from UEFI's
own image loading code paths. Given that everything's statically linked
anyway on UEFI it should be easy to just jump into the already loaded image.
* storagetm: maybe also serve the specified disk via HTTP? we have glue for
microhttpd anyway already. Idea would also be to serve the currently booted UKI as
separate HTTP resource, so that EFI http boot on another system could
directly boot from our system, with full access to the hdd.
* support specifying download hash sum in systemd-import-generator expression
to pin image/tarball.
* support boot into nvme-over-tcp: add generator that allows specifying nvme
devices on kernel cmdline + credentials. Also maybe add interactive mode
(where the user is prompted for nvme info), in order to boot from other
system's HDD.
* ptyfwd: use osc context information in vmspawn/nspawn/… to optionally only
listen to ^]]] key when no further vmspawn/nspawn context is allocated
* ptyfwd: use osc context information to propagate status messages from
vmspawn/nspawn to service manager's "status" string, reporting what is
currently in the fg
* nspawn/vmspawn: define hotkey that one can hit on the primary interface to
ask for a friendly, acpi style shutdown.
* for better compat with major clouds: implement simple PTP device support in
timesyncd
* for better compat with major clouds: recognize clouds via hwdb on DMI device,
and add udev properties to it that help with handling IMDS, i.e. entrypoint
URL, which fields to find ip hostname, ssh key, …
* for better compat with major clouds: introduce imds mini client service that
sets up primary netif in a private netns (ipvlan?) to query imds without
affecting rest of the host. pick up literal credentials from there plus the
fields the hwdb reports for the other fields and turn them into credentials.
then write a generator that uses detected virtualization info and plugs this
service into the early boot, waiting for the DMI and network device to show
up.
* Add UKI profile conditioning so that profiles are only available if secure
boot is turned off, or only on. similar, add conditions on TPM availability,
network boot, and other conditions.
* fix bug around run0 background color on ls in fresh terminal
* Reset TPM2 DA bit on each successful boot
* systemd-repart: add --installer or so, that will interactively ask for a
target disk, maybe ask for confirmation, and install something on disk. Then,
hook that into installer.target or so, so that it can be used to
install/replicate installs
* systemd-cryptenroll: add --firstboot or so, that will interactively ask user
whether recovery key shall be enrolled and do so
* bootctl: add tool for registering BootXXX entry that boots from some http
server of your choice (i.e. like kernel-bootcfg --add-uri=)
* maybe introduce [email protected] or so, to match
container-getty.service but skips authentication, so you get a shell prompt
directly. Usecase: wsl-like stuff (they have something pretty much like
that). Question: how to pick user for this. Instance parameter? somehow from
credential (would probably require some binary that converts credential to
User= parameter?)
* sd-varlink should probably enforce a limit on queued outgoing replies
* systemd-firstboot: optionally install an ssh key for root for offline use.
* add a small tool that reads user records/group records from a credential, and
then places them in the userdb drop-in dirs (either /run/ or /var/). While
doing so it processes them:
- split privileged part from unprivileged part (the way userdb dropins want
it)
- write out membership files based on the listed group memberships
- maybe: also allocate a UID if none is included.
* the ordering cycle log messages in transaction_verify_order_one() should
really be recognizable via a message id and come with an explanatory catalog
message
* introduce new ANSI sequence for communicating log level and structured error
metadata to terminals.
* in pid1: include ExecStart= cmdlines (and other Exec*= cmdlines) in polkit
request, so that policies can match against command lines.
* account number of units currently in activating/active/deactivating state in
each slice, and expose this as a property of the slice, given this is a key
metric of the resource management entity that a slice is. (maybe add a 2nd
metric for units assigned to the slice, that also includes those with pending
jobs)
* maybe allow putting a "soft" limit on the number of concurrently active units in
a slice, as per the previous item. When the limit is reached delay further
activations until number is below the threshold again, leaving the unit in
the job queue. Thus, multiple resource intensive tasks can be scheduled as
units in the same slice and they will be executed with an upper limit on
concurrently running tasks.
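The queueing behaviour described in this item can be modelled in a few lines. This is a toy sketch under stated assumptions, not a proposal for how the job engine would implement it: the Slice class, its max_active parameter and the first-in-first-out ordering of delayed units are all made up for illustration.

```python
from collections import deque


class Slice:
    """Toy model of a slice with a soft limit on concurrently active units."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.active = set()     # units currently running
        self.queue = deque()    # units whose start job is delayed

    def start(self, unit):
        """Activate the unit now if below the limit, otherwise leave it queued."""
        if len(self.active) < self.max_active:
            self.active.add(unit)
            return True
        self.queue.append(unit)
        return False

    def stop(self, unit):
        """Deactivate a unit and activate queued units while there is room."""
        self.active.discard(unit)
        while self.queue and len(self.active) < self.max_active:
            self.active.add(self.queue.popleft())


build = Slice(max_active=2)
started = [build.start(u) for u in ("a.service", "b.service", "c.service")]
build.stop("a.service")
```

Here the third unit stays queued until the first one stops, which is the "upper limit on concurrently running tasks" the item asks for; len(build.active) corresponds to the per-slice counter proposed in the previous item.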
* importd: introduce a per-user instance, that downloads into per-user DDI dirs
* sysupdated: similar
* portabled: similar
* machined: implement a per-user instance, that manages per-user DDI dirs for
images. systemd-nspawn/systemd-vmspawn should probably register with both the
system and the user scoped machined instance. The former to get the machine
name registered as hostname, and the latter so that the image stuff is nicely
per-user managed.
* resolved: make resolved process DNR DHCP info
* Teach systemd-ssh-generator to generate a /run/issue.d/ drop-in telling
users how to connect to the system via the AF_VSOCK, as per:
https://github.com/systemd/systemd/issues/35071#issuecomment-2462803142
* maybe introduce an OSC sequence that signals when we ask for a password, so
that terminal emulators can maybe connect a password manager or so, and
highlight things specially.
* start using STATX_SUBVOL in btrfs_is_subvol(). Also, make use of it
generically, so that image discovery recognizes bcachefs subvols too.
* "systemd-export tar" should reuse the libarchive export code from systemd-dissect
--archive.
* "systemd-import tar" should be moved to libarchive
* foreign uid:
- add support to export-fs, import-fs, import-tar, export-tar
- add tool for deleting foreign UID held container images
- systemd-dissect should learn mappings, too, when doing mtree and such
* format-table: introduce new cell type for strings with ansi sequences in
them. display them in regular output mode (via strip_tab_ansi()), but
suppress them in json mode.
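A rough sketch of the two output modes for such a cell type follows. It is a simplification: the regular expression only covers CSI and OSC sequences, render_cell() is a hypothetical name, and nothing here reproduces the actual behaviour of strip_tab_ansi().

```python
import json
import re

# Matches CSI sequences (e.g. SGR colors) and OSC sequences terminated by BEL or ST.
ANSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]|\x1b\].*?(?:\x07|\x1b\\)")


def strip_ansi(s):
    """Remove the escape sequences matched by ANSI_RE, keeping the visible text."""
    return ANSI_RE.sub("", s)


def render_cell(value, json_mode):
    """Hypothetical cell renderer: keep sequences for terminals, drop them for JSON."""
    return json.dumps(strip_ansi(value)) if json_mode else value


cell = "\x1b[1;31mfailed\x1b[0m"
```

With this split, consumers of the JSON output only ever see the plain string, while terminal output keeps its highlighting.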
* machined: when registering a machine, also take a relative cgroup path,
relative to the machine's unit. This is useful when registering unpriv
machines, as they might sit down the cgroup tree, below a cgroup delegation
boundary. Then, install an inotify watch on that cgroup to track when the
machine's local cgroup goes down.
* resolved: report ttl in resolution replies if we know it. This data is useful
for tools such as wireguard which want to periodically re-resolve DNS names,
and might want to use the TTL as a hint for that.
* journald: beef up ClientContext logic to store pidfd_id of peer, to validate
we really use the right cache entry
* journald: log client's pidfd id as a new automatic field _PIDFDID= or so.
* journald: split up ClientContext cache in two: one cache keyed by pid/pidfdid
with process information, and another one keyed by cgroup path/cgroupid with
cgroup information. This way a service consisting of many logging
processes can benefit from the cgroup caching.
* system lsmbpf policy that prohibits creating files owned by "nobody"
system-wide
* system lsmbpf policy that prohibits creating or opening device nodes outside
of devtmpfs/tmpfs, except if they are the pseudo-devices /dev/null,
/dev/zero, /dev/urandom and so on.
* system lsmbpf policy that enforces that block device backed mounts may only
be established on top of dm-crypt or dm-verity devices, or an allowlist of
file systems (which should probably include vfat, for compat with the ESP)
* $LISTEN_PID, $SYSTEMD_EXECPID env vars that the service manager sets should
be augmented with $LISTEN_PIDFDID, and $SYSTEMD_EXECPIDFD (and similar for
other env vars we might send).
* port copy.c over to use LabelOps for all labelling.
* port remaining getmntent() users over to libmount. There are subtle
differences in the parsers (see #25371 for example), and it hence makes sense
if we stick to one set of parsers on this, not mix both.
* get rid of compat with libidn.so.11 (retain only for libidn.so.12)
* get rid of compat with libbpf.so.0 (retain only for libbpf.so.1)
* define a generic "report" varlink interface, which services can implement to
provide health/statistics data about themselves. then define a dir somewhere
in /run/ where components can bind such sockets. Then make journald, logind,
and pid1 itself implement this and expose various stats on things there. Then
issue parallel calls to these interfaces from the systemd-report tool,
combine into one json document, and include measurement logs and tpm
quote. tpm quote should protect the json doc via the nonce field
stuff. Allow shipping this off elsewhere for analysis.
* The bind(AF_UNSPEC) construct (for resetting sockets to their initial state)
should be blocked in many cases because it punches holes in many sandboxes.
* find a nice way to opt-in into auto-masking SIGCHLD on first
sd_event_add_child(), and then get rid of many more explicit sigprocmask()
calls.
* introduce new structure Tpm2CombinedPolicy, that combines the various TPM2
policy bits into one structure, i.e. public key info, pcr masks, pcrlock
stuff, pin and so on. Then pass that around in tpm2_seal() and tpm2_unseal().
* look at nsresourced, mountfsd, homed, importd, and try to come up with a way
how the forked off worker processes can be moved into transient services with
sandboxing, without breaking notify socket stuff and so on.
* replace all \x1b, \x1B, \033 C string escape sequences in our codebase with a
more readable \e. It's a GNU extension, but a ton more readable than the
others, and most importantly it doesn't result in confusing errors if you
suffix the escape sequence with one more decimal digit, because compilers
think you might actually specify a value outside the 8bit range with that.
* confext/sysext: instead of mounting the overlayfs directly on /etc/ + /usr/,
insert an intermediary bind mount on itself there. This has the benefit that
services where mount propagation from the root fs is off, can still have
confext/sysext propagated in.
* generic interface for varlink for setting log level and stuff that all our daemons can implement
* maybe teach repart.d/ dropins a new setting MakeMountNodes= or so, which is
just like MakeDirectories=, but uses an access mode of 0000 and sets the +i
chattr bit. This is useful as protection against early uses of /var/ or /tmp/
before their contents are mounted.
* go through all uses of table_new() in our codebase, and make sure we support
all three of:
1. --no-legend properly
2. --json= properly
3. --no-pager properly
* go through all --help texts in our codebases, and make sure:
1. the one sentence description of the tool is highlighted via ANSI how we
usually do it
2. If more than one or two commands are supported (as opposed to switches),
separate commands + switches from each other, using underlined --help sections.
3. If there are many switches, consider adding additional --help sections.
* go through our codebase, and convert "vertical tables" (i.e. things such as
"systemctl status") to use table_new_vertical() for output
* pcrlock: add support for multi-profile UKIs
* initrd: when transitioning from initrd to host, validate that
/lib/modules/`uname -r` exists, refuse otherwise
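The check itself is small. The sketch below shows one way it could look, assuming the host root is already mounted at some path passed in as root; modules_dir_present() is a hypothetical name and the refusal logic around it is left out.

```python
import os


def modules_dir_present(root="/", release=None):
    """Return True if <root>/lib/modules/<kernel release> is a directory.

    `release` defaults to the running kernel's release, the same string
    `uname -r` prints.
    """
    if release is None:
        release = os.uname().release
    return os.path.isdir(os.path.join(root, "lib", "modules", release))
```

A caller would run this against the new root before switching to it, and refuse the transition if it returns False.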
* signed bpf loading: to address need for signature verification for bpf
programs when they are loaded, and given the bpf folks don't think this is
realistic in kernel space, maybe add small daemon that facilitates this
loading on request of clients, validates signatures and then loads the
programs. This daemon should be the only daemon with privs to load BPF on
the system. It might be a good idea to run this daemon already in the initrd,
and leave it around during the initrd transition, to continue serving requests.
Should then live in its own fs namespace that inherits from the initrd's
fs tree, not from the host, to isolate it properly. Should set
PR_SET_DUMPABLE so that it cannot be ptraced from the host. Should have
CAP_SYS_BPF as only service around.
* add a mechanism we can drop capabilities from pid1 *before* transitioning
from initrd to host. i.e. before we transition into the slightly lower trust
domain that is the host system, we might want to get rid of some caps.
Example: CAP_SYS_BPF in the signed bpf loading logic above. (We already have
CapabilityBoundingSet= in system.conf, but that is enforced when pid 1
initializes, rather than when it transitions to the next.)
* maybe add a new standard slice where processes that are started in the initrd
and stick around for the whole system runtime (i.e. root fs storage daemons,
the bpf loader daemon discussed above, and such) are placed. maybe
protected.slice or so? Then write docs that suggest that services like this
set Slice=protected.slice, RefuseManualStart=yes, RefuseManualStop=yes and a
couple of other things.
* rough proposed implementation design for remote attestation infra: add a tool
that generates a quote of local PCRs and NvPCRs, along with synchronous log
snapshot. use "audit session" logic for that, so that we get read-outs and
signature in one step. Then turn this into a JSON object. Use the "TCG TSS 2.0
JSON Data Types and Policy Language" format to encode the signature. And CEL
for the measurement log.
* creds: add a new cred format that reuses the JSON structures we use in the
LUKS header, so that we get the various newer policies for free.
* systemd-analyze: port "pcrs" verb to talk directly to TPM device, instead of
using sysfs interface (well, or maybe not, as that would require privileges?)
* pcrextend/tpm2-util: add a concept of "rotation" to event log. i.e. allow
trailing parts of the logs if time or disk space limit is hit. Protect the
boot-time measurements however (i.e. up to some point where things are
settled), since we need those for pcrlock measurements and similar. When
deleting entries for rotation, place an event that declares how many items
have been dropped, and what the hash was before and after that.
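To make the rotation idea concrete, here is a toy model of a hash-chained log in which a dropped range is replaced by a single marker event. It is a sketch under assumptions: "the hash before and after" is read as the running PCR-style value on either side of the dropped range, SHA-256 is used as the only bank, and the entry layout is invented for illustration. A real verifier would also need some way to trust the marker, which this sketch does not address.

```python
import hashlib

ZERO = bytes(32)


def extend(pcr, digest):
    """PCR-style extension: new value = H(old value || event digest)."""
    return hashlib.sha256(pcr + digest).digest()


def replay(log, pcr=ZERO):
    """Recompute the final PCR value from a (possibly rotated) log."""
    for entry in log:
        if entry["type"] == "rotation":
            # The dropped range cannot be replayed; check continuity and jump over it.
            if pcr != entry["before"]:
                raise ValueError("log does not match rotation marker")
            pcr = entry["after"]
        else:
            pcr = extend(pcr, entry["digest"])
    return pcr


def rotate(log, protected, drop):
    """Drop `drop` entries following the first `protected` (boot-time) entries,
    replacing them with one marker recording the count and the PCR value
    before and after the dropped range."""
    before = replay(log[:protected])
    after = replay(log[protected:protected + drop], before)
    marker = {"type": "rotation", "dropped": drop, "before": before, "after": after}
    return log[:protected] + [marker] + log[protected + drop:]


events = [{"type": "event", "digest": hashlib.sha256(f"event {i}".encode()).digest()}
          for i in range(10)]
rotated = rotate(events, protected=3, drop=4)
```

Replaying the rotated log yields the same final value as replaying the full log, so the remaining entries can still be checked against the PCR while the protected boot-time prefix stays intact.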
* measure information about all DDIs as we activate them to an NvPCR. We
probably should measure the dm-verity root hash from the kernel side, but
DDI meta info from userspace.
* use name_to_handle_at() with AT_HANDLE_FID instead of .st_ino (inode
number) for identifying inodes, for example in copy.c when finding hard
links, or loop-util.c for tracking backing files, and other places.
* cryptenroll/cryptsetup/homed: add unlock mechanism that combines tpm2 and
fido2, as well as tpm2 + ssh-agent, inspired by ChromeOS' logic: encrypt the
volume key with the TPM, with a policy that insists that a nonce is signed by
the fido2 device's key or ssh-agent key. Thus, at unlock/login time the TPM
generates a nonce, which is sent as a challenge to the fido2/ssh-agent, which
returns a signature which is handed to the tpm, which then reveals the volume
key to the PC.
* cryptenroll/cryptsetup/homed: similar to this, implement TOTP backed by TPM.
* expose the handoff timestamp fully via the D-Bus properties that contain
ExecStatus information
* properly serialize the ExecStatus data from all ExecCommand objects
associated with services, sockets, mounts and swaps. Currently, the data is
flushed out on reload, which is quite a limitation.
* Clean up "reboot argument" handling, i.e. set it through some IPC service
instead of directly via /run/, so that it can be sensibly set remotely.
* systemd-tpm2-support: add some logic that detects if system is in DA
lockout mode, and queries the user for TPM recovery PIN then.
* systemd-repart should probably enable btrfs' "temp_fsid" feature for all file
systems it creates, as we have no interest in RAID for repart, and it should
make sure that we can mount them trivially everywhere.
* systemd-nspawn should get the same SSH key support that vmspawn now has.
* move documentation about our common env vars (SYSTEMD_LOG_LEVEL,
SYSTEMD_PAGER, …) into a man page of its own, and just link it from our
various man pages that so far embed the whole list again and again, in an
attempt to reduce clutter and noise a bit.
* vmspawn: switch default swtpm PCR bank to SHA384-only (away from SHA256), at
least on 64bit archs, simply because SHA384 typically has double the hashing
speed of SHA256 on 64bit archs (since it is based on 64bit words, unlike
SHA256 which uses 32bit words).
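The speed claim above is easy to check on a given machine. A rough sketch
using Python's hashlib (the helper names time_hash() and compare() are
hypothetical, and the buffer size and round count are arbitrary choices):

```python
import hashlib
import time


def time_hash(name: str, data: bytes, rounds: int = 20) -> float:
    """Return the seconds spent hashing data `rounds` times with `name`."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(name, data).digest()
    return time.perf_counter() - start


def compare(size: int = 1 << 20) -> dict:
    """Time SHA256 and SHA384 over the same zero-filled buffer."""
    data = bytes(size)
    return {name: time_hash(name, data) for name in ("sha256", "sha384")}


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.4f}s")
```

Results depend on the CPU and the crypto library in use; on CPUs with
dedicated SHA256 instructions the ordering may well be reversed, so this is
worth measuring before changing the default.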
* In vmspawn/nspawn/machined wait for X_SYSTEMD_UNIT_ACTIVE=ssh-active.target
and X_SYSTEMD_SIGNALS_LEVEL=2 as indication whether/when SSH and the POSIX
signals are available. Similar for D-Bus (but just use sockets.target for
that). Report as property for the machine.
* teach nspawn/machined a new bus call/verb that gets you a
shell in containers that have no sensible pid1, via joining the container,
and invoking a shell directly. Then provide another new bus call/verb that is
somewhat automatic: if we detect that pid1 is running and fully booted up we
provide a proper login shell, otherwise just a joined shell. Then expose that
as primary way into the container.
* make vmspawn/nspawn/importd/machined a bit more usable in a WSL-like
fashion. i.e. teach unpriv systemd-vmspawn/systemd-nspawn a reasonable
--bind-user= behaviour that mounts the calling user through into the
machine. Then, ship importd with a small database of well known distro images
along with their pinned signature keys. Then add some minimal glue that binds
this together: downloads a suitable image if not done so yet, starts it in
the bg via vmspawn/nspawn if not done so yet and then requests a shell inside
it for the invoking user.
* add a new specifier to unit files that figures out the DDI the unit file is
from, tracing through overlayfs, DM, loopback block device.
* importd/importctl
- port tar handling to libarchive
- complete varlink interface
- download images into .v/ dirs
* in os-release define a field that can be initialized at build time from
SOURCE_DATE_EPOCH (maybe even under that name?). Would then be used to
initialize the timestamp logic of ConditionNeedsUpdate=.
* nspawn/vmspawn/pid1: add ability to easily insert fully booted VMs/FOSC into
shell pipelines, i.e. add easy to use switch that turns off console status
output, and generates the right credentials for systemd-run-generator so that
a program is invoked, and its output captured, with correct EOF handling and
exit code propagation
* new systemd-analyze "join" verb or so, for debugging services. Would be
nsenter on steroids, i.e. invoke a shell or command line in an environment as
close as we can make it for the MainPID of a service. Should be built around
pidfd, so that we can reasonably robustly do this. Would only cover the
execution environment like namespaces, but not the privilege settings.
* Introduce a CGroupRef structure, inspired by PidRef. Should contain cgroup
path, cgroup id, and cgroup fd. Use it to continuously pin all v2 cgroups via
a cgroup_ref field in the CGroupRuntime structure. Eventually switch things
over to do all cgroupfs access only via that structure's fd.
* Get rid of the symlinks in /run/systemd/units/* and exclusively use cgroupfs
xattrs to convey info about invocation ids, logging settings and so on.
Support for cgroupfs xattrs in the "trusted." namespace was added in Linux
3.7, i.e. a kernel version older than any we still pretend to support.
* rewrite bpf-devices in libbpf/C code, rather than home-grown BPF assembly, to
match bpf-restrict-fs, bpf-restrict-ifaces, bpf-socket-bind
* ditto: rewrite bpf-firewall in libbpf/C code
* credentials: if we ever acquire a secure way to derive cgroup id of socket
peers (i.e. SO_PEERCGROUPID), then extend the "scoped" credential logic to
allow cgroup-scoped (i.e. app or service scoped) credentials. Then, as next
step use this to implement per-app/per-service encrypted directories, where
we set up fscrypt on the StateDirectory= with a randomized key which is
stored as xattr on the directory, encrypted as a credential.
* credentials: optionally include a per-user secret in scoped user-credential
encryption keys. should come from homed in some way, derived from the luks
volume key or fscrypt directory key.
* credentials: add a flag to the scoped credentials that if set require PK
reauthentication when unlocking a secret.
* extend the smbios11 logic for passing credentials so that instead of passing
the credential data literally it can also just reference an AF_VSOCK CID/port
to read them from. This way the data doesn't remain in the SMBIOS blob during
runtime, but only in the credentials fs.
* machined: optionally track nspawn unix-export/ runtime for each machine, and
then update systemd-ssh-proxy so that it can connect to that.
* add a new ExecStart= flag that inserts the configured user's shell as first
word in the command line. (maybe use character '.'). Usecase: tool such as
run0 can use that to spawn the target user's default shell.
* introduce mntid_t, and make it 64bit, as apparently the kernel switched to
64bit mount ids
* mountfsd/nsresourced
- userdb: maybe allow callers to map one uid to their own uid
- bpflsm: allow writes if resulting UID on disk would be userns' owner UID
- make encrypted DDIs work (password…)
- add API for creating a new file system from scratch (together with some
dm-integrity/HMAC key). Should probably work using systemd-repart (access
via varlink).
- add api to make an existing file "trusted" via dm-integrity/HMAC key
- port: portabled
- port: tmpfiles, sysusers and similar
- let's see if we can make runtime bind mounts into unpriv nspawn work
* add a kernel cmdline switch (and cred?) for marking a system to be
"headless", in which case we never open /dev/console for reading, only for
writing. This would then mean: systemd-firstboot would process creds but not
ask interactively, getty would not be started and so on.
* cryptsetup: new crypttab option to auto-grow a luks device to its backing
partition size. new crypttab option to reencrypt a luks device with a new
volume key.
* we probably should have some infrastructure to acquire sysexts with
drivers/firmware for local hardware automatically. Idea: reuse the modalias
logic of the kernel for this: make the main OS image install a hwdb file
that matches against local modalias strings, and adds properties to relevant
devices listing names of sysexts needed to support the hw. Then provide some
tool that goes through all devices and tries to acquire/download the
specified images.
* repart + cryptsetup: support file systems that are encrypted and use verity
on top. Usecase: confexts that shall be signed by the admin but also be
confidential. Then, add a new --make-ddi=confext-encrypted for this.
* tmpfiles: add new line type for moving files from some source dir to some
target dir. then use that to move sysexts/confexts and stuff from initrd
tmpfs to /run/, so that host can pick things up.
* tiny varlink service that takes a fd passed in and serves it via http. Then
make use of that in networkd, and expose some EFI binary of choice for
DHCP/HTTP based EFI boot.
* bootctl: add reboot-to-disk which takes a block device name, and
automatically sets things up so that system reboots into that device next.
* maybe: in PID1, when we detect we run in an initrd, make superblock read-only
early on, but provide opt-out via kernel cmdline.
* systemd-pcrextend:
- support measuring to nvindex with PCR update semantics ("fake PCRs")
- add api for "allocating" such an nvindex
- once we have that start measuring every sysext we apply, every confext,
every RootImage= we apply, every nspawn and so on. All in separate fake
PCRs.
* vmspawn:
- run in scope unit when invoked from command line, and machined registration is off
- sd_notify support
- --ephemeral support
- --read-only support
- automatically suspend/resume the VM if the host suspends. Use logind
suspend inhibitor to implement this. request clean suspend by generating
suspend key presses.
- support for "real" networking via "-n" and --network-bridge=
- translate SIGTERM to clean ACPI shutdown event
- implement hotkeys ^]^]r and ^]^]p like nspawn
* systemd-pcrmachine should probably also measure the SMBIOS system UUID.
* sd-boot: allow synthesizing additional type1 entries via SMBIOS vendor strings
* storagetm:
- add USB mass storage device logic, so that all local disks are also exposed
as mass storage devices on systems that have a USB controller that can
operate in device mode
- add NVMe authentication
* add support for activating nvme-oF devices at boot automatically via kernel
cmdline, and maybe even support a syntax such as
root=nvme:<trtype>:<traddr>:<trsvcid>:<nqn>:<partition> to boot directly from
nvme-oF
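A sketch of how the proposed root= syntax above could be parsed. The syntax is
only a suggestion in this item, so everything here is an assumption: the
NvmeRoot type and parse_nvme_root() are hypothetical names, and the parser
assumes that trtype, traddr and trsvcid contain no colons (so an IPv6 traddr
would need some escaping that is not defined here). Since NQNs commonly
contain colons themselves, the first three fields are split off from the left
and the partition from the right.

```python
from dataclasses import dataclass


@dataclass
class NvmeRoot:
    trtype: str
    traddr: str
    trsvcid: str
    nqn: str
    partition: str


def parse_nvme_root(value: str) -> NvmeRoot:
    """Parse nvme:<trtype>:<traddr>:<trsvcid>:<nqn>:<partition>."""
    prefix = "nvme:"
    if not value.startswith(prefix):
        raise ValueError("not an nvme: root specification")
    # Split off the three leading fields; the remainder is "<nqn>:<partition>".
    head = value[len(prefix):].split(":", 3)
    if len(head) != 4:
        raise ValueError("too few fields in nvme: root specification")
    trtype, traddr, trsvcid, tail = head
    # The NQN may contain colons, so take the partition from the right.
    nqn, sep, partition = tail.rpartition(":")
    if not sep or not nqn or not partition:
        raise ValueError("missing NQN or partition")
    return NvmeRoot(trtype, traddr, trsvcid, nqn, partition)
```

For example, "nvme:tcp:192.168.100.77:4420:nqn.2014-08.org.example:disk1:2"
would yield the NQN "nqn.2014-08.org.example:disk1" and partition "2".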
* pcrlock:
- add kernel-install plugin that automatically creates UKI .pcrlock file when
UKI is installed, and removes it when it is removed again
- automatically install PE measurement of sd-boot on "bootctl install"
- pre-calc sysext + kernel cmdline measurements
- pre-calc cryptsetup root key measurement
- maybe make systemd-repart generate .pcrlock for old and new GPT header in
/run?
- Add support for more than 8 branches per PCR OR
- add "systemd-pcrlock lock-kernel-current" or so which synthesizes .pcrlock
policy from currently booted kernel/event log, to close gap for first boot
for pre-built images
* in sd-boot and sd-stub measure the SMBIOS vendor strings to some PCR (at
least some subset of them that look like systemd stuff), because apparently
some firmware does not, but systemd honours it. avoid duplicate measurement
by sd-boot and sd-stub by adding LoaderFeatures/StubFeatures flag for this,
so that sd-stub can avoid it if sd-boot already did it.
* cryptsetup: a mechanism that allows signing a volume key with some key that
has to be present in the kernel keyring, or similar, to ensure that confext
DDIs can be encrypted against the local SRK but signed with the admin's key
and thus can be authenticated locally before they are decrypted.
* image policy should be extended to allow dictating *how* a disk is unlocked,
i.e. root=encrypted-tpm2+encrypted-fido2 would mean "root fs must be
encrypted and unlocked via fido2 or tpm2, but not otherwise"
* systemd-repart: add support for formatting dm-crypt + dm-integrity file
systems.
* homed: use systemd-storagetm to expose home dirs via nvme-tcp. Then,
teach homed/pam_systemd_homed with a user name such as
lennart%nvme_tcp_192.168.100.77_8787 to log in from any linux host with the
same home dir. Similar maybe for nbd, iscsi? this should then first ask for
the local root pw, to authenticate that logging in like this is ok, and would
then be followed by another password prompt asking for the user's own
password. Also, do something similar for CIFS: if you log in via
lennart%cifs-someserver_someshare, then set up the homed dir for it
automatically. The PAM module should update the user name used for login to
the short version once it set up the user. Some care should be taken, so that
the long version can still be resolved via NSS afterwards, to deal with
PAM clients that do not support PAM sessions where PAM_USER changes half-way.
* redefine /var/lib/extensions/ as the dir where one can place all three of
sysext, confext as well as multi-modal DDIs that qualify as both. Then introduce
/var/lib/sysexts/ which can be used to place only DDIs that shall be used as
sysext
* Varlinkification of the following command line tools, to open them up to
other programs via IPC:
- bootctl
- journalctl (allowing journal read access via IPC)
- coredumpctl
- systemd-bless-boot
- systemd-measure
- systemd-cryptenroll (to allow UIs to enroll FIDO2 keys and such)
- systemd-dissect
- systemd-sysupdate
- systemd-analyze
- kernel-install
- systemd-mount (with PK so that desktop environments could use it to mount disks)
* enumerate virtiofs devices during boot-up in a generator, and synthesize
mounts for rootfs, /usr/, /home/, /srv/ and some others from it, depending on
the "tag". (waits for: https://gitlab.com/virtio-fs/virtiofsd/-/issues/128)
* automatically mount one virtiofs during early boot phase to /run/host/,
similar to how we do that for nspawn, based on some clear tag.
* add some service that makes an atomic snapshot of PCR state and event log up
to that point available, possibly even with quote by the TPM.
* encode type1 entries in some UKI section to add additional entries to the
menu.
* Add ACL-based access management to .socket units. i.e. add AllowPeerUser= +
AllowPeerGroup= that installs additional user/group ACL entries on AF_UNIX
sockets.
* systemd-tpm2-setup should support a mode where we refuse booting if the SRK
changed. (Must be opt-in, to not break systems which are supposed to be
migratable between PCs)
* when systemd-sysext learns mutable /usr/ (and systemd-confext mutable /etc/)
then allow them to store the result in a .v/ versioned subdir, for some basic
snapshot logic
* add a new PE binary section ".mokkeys" or so which sd-stub will insert into
Mok keyring, by overriding/extending whatever shim sets in the EFI
var. Benefit: we can extend the kernel module keyring at ukify time,
i.e. without recompiling the kernel, taking an upstream OS' kernel and adding
a local key to it.
* PidRef conversion work:
- cg_pid_get_xyz()
- pid_from_same_root_fs()
- get_ctty_devnr()
- actually wait for POLLIN on pidref's pidfd in service logic
- openpt_allocate_in_namespace()
- unit_attach_pid_to_cgroup_via_bus()
- cg_attach() – requires new kernel feature
- journald's process cache
* ddi must be listed as block device fstype
* measure some string via pcrphase whenever we end up booting into emergency
mode.
* similar, measure some string via pcrphase whenever we resume from hibernate
* homed: add a basic form of secrets management to homed, that stores
secrets in $HOME somewhere, and is protected by the account's own authentication
mechanisms. Should implement something PKCS#11-like that can be used to
implement emulated FIDO2 in unpriv userspace on top (which should happen
outside of homed), emulated PKCS11, and libsecrets support. Operate with a
2nd key derived from volume key of the user, with which to wrap all
keys. maintain keys in kernel keyring if possible.
* use sd-event ratelimit feature optionally for journal stream clients that log
too much
* systemd-mount should only consider modern file systems when mounting, similar
to systemd-dissect
* add another PE section ".fname" or so that encodes the intended filename for
PE file, and validate that when loading add-ons and similar before using
it. This is particularly relevant when we load multiple add-ons and want to
sort them to apply them in a defined order. The order should not be under
control of the attacker.
* also include packaging metadata (á la
https://systemd.io/PACKAGE_METADATA_FOR_EXECUTABLE_FILES/) in our UEFI PE
binaries, using the same JSON format.
* make "bootctl install" + "bootctl update" useful for installing shim too. For
that introduce new dir /usr/lib/systemd/efi/extra/ which we copy mostly 1:1
into the ESP at install time. Then make the logic smart enough so that we
don't overwrite bootx64.efi with our own if the extra tree already contains
one. Also, follow symlinks when copying, so that shim rpm can symlink their
stuff into our dir (which is safe since the target ESP is generally VFAT and
thus does not have symlinks anyway). Later, teach the update logic to look at
the ELF package metadata (which we also should include in all PE files, see
above) for version info in all *.EFI files, and use it to only update if
newer.
* in sd-stub: optionally add support for a new PE section .keyring or so that
contains additional certificates to include in the Mok keyring, extending
what shim might have placed there. why? let's say I use "ukify" to build +
sign my own fedora-based UKIs, and only enroll my personal lennart key via
shim. Then, I want to include the fedora keyring in it, so that kmods work.
But I might not want to enroll the fedora key in shim, because this would
also mean that the key would be in effect whenever I boot an archlinux UKI
built the same way, signed with the same lennart key.
* resolved: take possession of some IPv6 ULA address (let's say
fd00:5353:5353:5353:5353:5353:5353:5353), and listen on port 53 on it for the
local stubs, so that we can make the stub available via ipv6 too.
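A quick check, as a sketch, that the example address above really falls into
the IPv6 unique local address range (fc00::/7, of which fd00::/8 is the
locally assigned half per RFC 4193). The helper name is_ula() is hypothetical.

```python
import ipaddress

# Example address taken from the item above.
STUB = ipaddress.ip_address("fd00:5353:5353:5353:5353:5353:5353:5353")

ULA = ipaddress.ip_network("fc00::/7")               # all unique local addresses
LOCALLY_ASSIGNED = ipaddress.ip_network("fd00::/8")  # locally assigned half


def is_ula(addr) -> bool:
    """Return True if addr is an IPv6 unique local address."""
    return addr in ULA


if __name__ == "__main__":
    print(STUB, "is ULA:", is_ula(STUB))
```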
* Maybe add SwitchRootEx() as new bus call that takes env vars to set for new
PID 1 as argument. When adding SwitchRootEx() we should maybe also add a
flags param that allows disabling and enabling whether serialization is
requested during switch root.
* introduce a .acpitable section for early ACPI table override
* add proper .osrel matching for PE addons. i.e. refuse applying an addon
intended for a different OS. Take inspiration from how confext/sysext are
matched against OS.
* figure out what to do about credentials sealed to PCRs in kexec + soft-reboot
scenarios. Maybe insist sealing is done additionally against some keypair in
the TPM to which access is updated on each boot, for the next, or so?
* logind: when logging in, always take an fd to the home dir, to keep the dir
busy, so that autofs release can never happen. (this is generally a good
idea, and specifically works around the fact that autofs ignores busy by mount
namespaces)
* mount most file systems with a restrictive uidmap. e.g. mount /usr/ with a
uidmap that blocks out anything outside 0…1000 (i.e. system users) and similar.
* mount the root fs with MS_NOSUID by default, and then mount /usr/ without
both so that suid executables can only be placed there. Do this already in
the initrd. If /usr/ is not split out create a bind mount automatically.
* fix our various hwdb lookup keys to end with ":" again. The original idea was
that hwdb patterns can match arbitrary fields with expressions like
"*:foobar:*", to wildcard match both the start and the end of the string.
This only works safely for later extensions of the string if the strings
always end in a colon. This requires updating our udev rules, as well as
checking if the various hwdb files are fine with that.
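The trailing-colon problem can be illustrated with glob matching. This sketch
uses Python's fnmatch as a stand-in for the hwdb matcher, and the lookup keys
are made-up examples rather than real hwdb keys.

```python
from fnmatch import fnmatchcase

# A hwdb-style pattern that wants to match the field "foobar" anywhere.
PATTERN = "*:foobar:*"


def matches(key: str) -> bool:
    """Return True if the lookup key matches PATTERN (case-sensitive glob)."""
    return fnmatchcase(key, PATTERN)


# Hypothetical lookup keys, for illustration only.
KEY_WITHOUT_COLON = "example:v1234:foobar"
KEY_WITH_COLON = "example:v1234:foobar:"
KEY_EXTENDED = "example:v1234:foobar:newfield:"

if __name__ == "__main__":
    for key in (KEY_WITHOUT_COLON, KEY_WITH_COLON, KEY_EXTENDED):
        print(f"{key!r}: {matches(key)}")
```

Without the trailing colon the last field is not followed by ":", so the
pattern misses it; with the colon in place the same pattern keeps matching
even after further fields are appended to the key later.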
* mount /tmp/ and /var/tmp with a uidmap applied that blocks out "nobody" user
among other things such as dynamic uid ranges for containers and so on. That
way no one can create files there with these uids and we enforce they are only
used transiently, never persistently.
* rework loopback support in fstab: when "loop" option is used, then