Complications arising from having a complex init

2017-05-31T12:00:13Z

Useful, secure, finished. Pick two.

I've just spent a long time writing about how systemd is the solution to all your process management reliability woes.

As with everything though, as I've alluded to in the subtitle, there are trade-offs.

What's the catch?

It is arguable that increasing the responsibilities for init, historically a very simple daemon, is a dangerous thing to do.

I believe these changes have been warranted, since the traditional UNIX process model assumes processes are well-written and benign.

Security updates

To accommodate the changing world, init is now sufficiently complicated that it requires security updates.

This is a problem because you can only have one init process, so you can't just kill the old version and start a new one.

systemd has to work around this by re-executing /sbin/init when it, or any of its dependent libraries, have been updated.

This mechanism should not be relied upon, since it can fail and if it does fail recovery requires a reboot, so if you need to be prepared to reboot on update, why not just reboot the system when an update is required?

Rebooting woes

Rebooting is also further complicated by init being extended.

If a library that a process depends on is removed as part of an update then the running process may keep a copy of it open until the process re-executes or terminates.

This means file systems will refuse to be remounted as read-only until the process stops using certain files. This is hugely problematic if the filesystem is the root file system and the process is init, since init will want to remount the file system before terminating and the file system will want init to terminate before remounting.

Previously the approach would be to shut-down without remounting the file system read-only, but this doesn't cleanly unmount the file system so was a source of file system corruption.

The solution to this employed by systemd is for the init process to execute a systemd-shutdown binary.

So why not move the complicated bits out of PID 1?

PID 1 is complex, and this is a problem. Therefore either systemd's developers don't consider the problems important or there are good reasons why it can't be otherwise.

So, what responsibilities does PID 1 have, and why do they have to be in PID 1?

Process reaping

When a process terminates before reaping its child subprocesses, all those subprocesses are adopted by PID 1, which is then responsible for reaping them.

PR_SET_CHILD_SUBREAPER was added to prctl(2) which allows a different process subreaper in the process hierarchy, so that gets to reap orphaned subprocesses instead of PID 1.

However PID 1 still neads to be able to reap subreapers, so PID 1 needs the same reaping logic, and both implementations need to be either shared or maintained, at which point it's less difficult to just rely on PID 1 doing it.

Traditional init systems perform this function, so it is not controversial for systemd to perform this.

Spawning processes

There are no special requirements necessary to spawn subprocesses, so a separate process could be started to spawn subprocesses.

Unfortunately this has the same bootstrapping problem, where PID 1 needs the same logic for starting its helpers as needs to be used for arbitrary code in the rest of the system.

Traditional init systems perform this function, so it is not controversial for systemd to perform this.

Managing cgroups

Because processes can't be trusted to not escape, cgroups are required to contain them.

A single process is required to manage them.

If services started by init are to be contained by cgroups, then the cgroup management service must either be the init process or must be started by the init process and have special logic to contain itself first.

This is tractable, but if it's a separate process, then some form of IPC is required, which adds extra latency, complexity and points of failure.

A similar concern exists in the form of journald, which is a separate service that systemd needs to communicate with to get it to log the output of new services to a file.

This complexity already causes systemd trouble, as a crashing journald can bring the whole system to a halt, so similar complications should be avoided.

Communicating via DBus

The init process needs some form of IPC to instruct it to do things.

Historically this was just telinit writing to the /dev/initctl FIFO, so was a pretty trivial form of IPC.

However we've established that init now requires new responsibilities, so requires a much richer form of IPC.

Rather than inventing some bespoke IPC mechanism, systemd uses DBus.

systemd also participates in the system bus, once the DBus daemon has been started, which adds extra complexity since the DBus daemon is started by systemd.

This is handled by systemd also handling point-to-point DBus, though attempts have been made to move DBus into the kernel in the form of AF_BUS, kdbus and most recently bus1, and there has also been discussion of whether systemd should be a DBus daemon to break this circular dependency.

Summary

The traditional UNIX process model wasn't designed to support a complex init, because it assumed that programs would be benign and well written.

Because you can't trust processes to clean up after themselves properly you need to make init more complicated to cope with it.

Because init is complicated it needs to be able to be updated.

Because the UNIX process model doesn't have a way to safely replace init you have to allow for it failing and needing a reboot, so you can't safely perform live updates.

Alternative ways of structuring init would make it even more complex so more opportunity for things to go wrong.

Is your process running 4 - Linux-specific approaches

2017-05-10T12:00:12Z

We previously discussed the traditional UNIX mechanisms for service management, and how they assumed benign and well written software.

Fortunately Linux provides more than just traditional UNIX system calls, so offers some features that can be used to track processes more completely.

Intercepting processes with ptrace(2)

If you could run some code when a process creates a subprocess or exits then you could use this to track which processes are active and where they came from.

Debuggers like gdb(1) also need to know this information since you might want to set a breakpoint for subprocesses too.

So it would be possible to do this using the same mechanism as debuggers.

This is what Upstart does to work out which process to track for double-forking daemons.

Unfortunately a process cannot be traced by multiple processes, so if Upstart is tracing a process to track its subprocesses then a debugger cannot be attached to the process.

For Upstart it detaches the debugger after it has worked out the main PID, so it's a small window where it is undebuggable, so it's only a problem for debugging faults during startup, but detaching after the double-fork means it can't trace any further subprocesses.

Continuing to trace subprocesses adds a noticeable performance impact though, so it's for the best that it stops tracing after the double-fork.

Store process in a cgroup

cgroups are a Linux virtual filesystem that lets you create hierarchies to organise processes, and apply resource controls at each level.

cgroups were created to handle the deficiency of traditional UNIX resource control system calls such as setrlimit(2), which only apply to a single process and can be thwarted by creating subprocesses, since while a process inherits limits of its parent process it does not share them with it.

Subprocesses of a process in a cgroup on the other hand are part of the same cgroup and share the same resource limits.

In each cgroup directory there is a cgroup.procs virtual file, which lists the process IDs of every process in the cgroup, making it effectively a kernel-maintained PIDfile.

This is what systemd uses for its services, and you can request a cgroup for your own processes by asking systemd (via systemd-run(1) or the DBus interface) or cgmanager (via cgm(1) or the DBus interface) to do so on your behalf.

Why can't I mount my own cgroupfs?

Unfortunately you can only safely have 1 process using a cgroup tree at a time, and you can only have one cgroupfs mounted at a time, so you always need to ask some daemon to manage cgroups on your behalf.

See Changes coming for systemd and control groups for why a single writer and a single hierarchy are required.

Conclusion

It is necessary to track all the subprocesses of a service somehow, using ptrace(2) prevents it being used for debugging, cgroups are an interface designed for this purpose but technical limitations mean you need to ask another service to do it.

So I would recommend writing a systemd service if your processes are a per-system or per-user service, or to use the DBus API to create cgroups if not.

Thus cgroups allow us to know our processes are running, and currently the best way to use cgroups is via systemd. The implications of relying on systemd to do this are best served as a subject of another article.

If you are interested in learning more about cgroups, I recommend reading Neil Brown's excellent series on LWN .

pages tagged cgroup