<p>yakking: pages tagged cgroup (http://yakking.branchable.com/tags/cgroup/)</p>
<p>Complications arising from having a complex init, by Richard Maw, published 2017-05-31T12:00:05Z (http://yakking.branchable.com/posts/complexity-in-systemd/)</p>
<p>Useful, secure, finished. Pick two.</p>
<p>I've just spent a long time writing
about how <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> is the solution
to all your process management reliability woes.</p>
<p>As with everything though,
as I've alluded to in the subtitle,
there are trade-offs.</p>
<h2>What's the catch?</h2>
<p>It is arguable that increasing the responsibilities for <a href="https://en.wikipedia.org/wiki/Init">init</a>,
historically a very simple daemon,
is a dangerous thing to do.</p>
<p>I believe these changes have been warranted,
since the traditional UNIX process model
assumes processes are well-written and benign.</p>
<h3>Security updates</h3>
<p>To accommodate the changing world,
<a href="https://en.wikipedia.org/wiki/Init">init</a> is now sufficiently complicated that it requires security updates.</p>
<p>This is a problem because you can only have one <a href="https://en.wikipedia.org/wiki/Init">init</a> process,
so you can't just kill the old version and start a new one.</p>
<p><a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> has to work around this by re-executing <code>/sbin/init</code>
when it, or any of the libraries it depends on, has been updated.</p>
<p>This mechanism should not be relied upon:
it can fail, and if it does, recovery requires a reboot.
If you need to be prepared to reboot on update anyway,
why not just reboot the system whenever an update is required?</p>
<h3>Rebooting woes</h3>
<p>Rebooting is also further complicated by <a href="https://en.wikipedia.org/wiki/Init">init</a> being extended.</p>
<p>If a library that a process depends on is removed as part of an update
then the running process may keep a copy of it open
until the process re-executes or terminates.</p>
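<p>The behaviour described above can be seen with any file, not just a library.
The following is a minimal sketch, assuming a Linux or other UNIX-like system;
<code>read_after_unlink</code> is a name made up for this illustration.</p>

```python
import os
import tempfile


def read_after_unlink(payload=b"library contents"):
    """Open a file, delete its directory entry, then read it via the open fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)  # stands in for the update removing the file
        still_listed = os.path.exists(path)
        # The data stays on the file system until the last descriptor closes.
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))
    finally:
        os.close(fd)
    return still_listed, data
```

<p>The file is gone from the directory but still readable,
which is why the file system holding it is still considered busy.</p>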
<p>This means file systems will refuse to be remounted as read-only
until the process stops using certain files.
This is hugely problematic if the file system is the root file system
and the process is init,
since init will want to remount the file system before terminating
and the file system will want init to terminate before remounting.</p>
<p>Previously the approach was to shut down
without remounting the file system read-only,
but this doesn't cleanly unmount the file system,
so it was a source of file system corruption.</p>
<p>The solution to this employed by <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a>
is for the init process to execute a <a href="https://www.freedesktop.org/software/systemd/man/systemd-halt.service.html">systemd-shutdown</a> binary.</p>
<h2>So why not move the complicated bits out of PID 1?</h2>
<p>PID 1 is complex, and this is a problem.
Therefore either <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a>'s developers don't consider the problems important
or there are good reasons why it can't be otherwise.</p>
<p>So, what responsibilities does PID 1 have,
and why do they have to be in PID 1?</p>
<h3>Process reaping</h3>
<p>When a process terminates before reaping its child subprocesses,
all those subprocesses are adopted by PID 1,
which is then responsible for reaping them.</p>
<p><code>PR_SET_CHILD_SUBREAPER</code> was added to <a href="http://man7.org/linux/man-pages/man2/prctl.2.html">prctl(2)</a>
to allow a different process in the process hierarchy to act as a subreaper,
so that it gets to reap orphaned subprocesses instead of PID 1.</p>
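<p>As a rough illustration of a subreaper, here is a sketch that calls prctl(2) through <code>ctypes</code>.
It assumes Linux 3.4 or later; the constants are copied from <code>&lt;linux/prctl.h&gt;</code>
and the helper names (<code>become_subreaper</code>, <code>adopt_orphan</code>) are invented for this example.</p>

```python
import ctypes
import os
import time

PR_SET_CHILD_SUBREAPER = 36  # values from <linux/prctl.h>
PR_GET_CHILD_SUBREAPER = 37

_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2):
    zero = ctypes.c_ulong(0)
    if _libc.prctl(ctypes.c_int(option), arg2, zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    _prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1))


def is_subreaper():
    flag = ctypes.c_int(0)
    _prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag))
    return bool(flag.value)


def adopt_orphan():
    """Orphan a grandchild; return (its new parent PID, its wait status)."""
    read_end, write_end = os.pipe()
    middle = os.fork()
    if middle == 0:
        middle_pid = os.getpid()
        if os.fork() == 0:
            # Grandchild: wait until the middle process has gone away.
            while os.getppid() == middle_pid:
                time.sleep(0.01)
            os.write(write_end, f"{os.getppid()} {os.getpid()}".encode())
            os._exit(0)
        os._exit(0)  # middle process exits without reaping its child
    os.close(write_end)
    os.waitpid(middle, 0)
    new_parent, orphan = map(int, os.read(read_end, 64).split())
    os.close(read_end)
    # Waiting on the orphan only works because it was reparented to us.
    _, status = os.waitpid(orphan, 0)
    return new_parent, status
```

<p>After <code>become_subreaper()</code>, the orphan reports this process as its parent rather than PID 1,
and this process is the one that has to call <code>waitpid</code> on it.</p>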
<p>However PID 1 still needs to be able to reap subreapers,
so PID 1 needs the same reaping logic,
and the two implementations need to be either shared or maintained separately,
at which point it's simpler to just rely on PID 1 doing it.</p>
<p>Traditional init systems perform this function,
so it is not controversial for <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> to perform this.</p>
<h3>Spawning processes</h3>
<p>Nothing special is required to spawn subprocesses,
so a separate process could be started to spawn them.</p>
<p>Unfortunately this has the same bootstrapping problem,
where PID 1 needs the same logic for starting its helpers
as is used to start arbitrary code in the rest of the system.</p>
<p>Traditional init systems perform this function,
so it is not controversial for <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> to perform this.</p>
<h3>Managing cgroups</h3>
<p>Because processes can't be trusted to not escape,
<a href="https://en.wikipedia.org/wiki/Cgroups">cgroups</a> are required to contain them.</p>
<p>A single process is required to manage them.</p>
<p>If services started by init are to be contained by cgroups,
then the cgroup management service
must either be the init process
or must be started by the init process
and have special logic to contain itself first.</p>
<p>This is tractable, but if it's a separate process,
then some form of IPC is required,
which adds extra latency, complexity and points of failure.</p>
<p>A similar concern exists in the form of <a href="https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html">journald</a>,
which is a separate service that <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> needs to communicate with
to get it to log the output of new services to a file.</p>
<p>This complexity already causes <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> trouble,
as a crashing <a href="https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html">journald</a> can bring the whole system to a halt,
so similar complications should be avoided.</p>
<h3>Communicating via <a href="https://www.freedesktop.org/wiki/Software/dbus/">DBus</a></h3>
<p>The init process needs some form of IPC to instruct it to do things.</p>
<p>Historically this was just <code>telinit</code>
writing to the <code>/dev/initctl</code> FIFO,
so was a pretty trivial form of IPC.</p>
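<p>To give a feel for how little machinery a FIFO needs, here is a sketch of the pattern.
It is not the real <code>/dev/initctl</code> protocol: the path is a temporary file,
the newline-terminated text command is an assumption made for readability,
and <code>send_command</code> and <code>demo</code> are hypothetical names.</p>

```python
import os
import tempfile


def send_command(fifo_path, command):
    """Write one newline-terminated command to a FIFO that has a reader."""
    fd = os.open(fifo_path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        os.write(fd, command.encode() + b"\n")
    finally:
        os.close(fd)


def demo():
    """Play both roles: the daemon owning the FIFO and the client writing to it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "initctl")
        os.mkfifo(path, 0o600)
        # The daemon side opens the FIFO first so that writers do not fail.
        reader = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
        try:
            send_command(path, "runlevel 3")
            return os.read(reader, 4096).decode().splitlines()
        finally:
            os.close(reader)
```

<p>There is no reply channel and no structured data here,
which is fine for "change runlevel" but not for much more.</p>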
<p>However we've established that init now has new responsibilities,
so it requires a much richer form of IPC.</p>
<p>Rather than inventing some bespoke IPC mechanism,
<a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> uses <a href="https://www.freedesktop.org/wiki/Software/dbus/">DBus</a>.</p>
<p><a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> also participates in the system bus,
once the <a href="https://www.freedesktop.org/wiki/Software/dbus/">DBus</a> daemon has been started,
which adds extra complexity since the <a href="https://www.freedesktop.org/wiki/Software/dbus/">DBus</a> daemon is started by <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a>.</p>
<p><a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> handles this by also supporting point-to-point <a href="https://www.freedesktop.org/wiki/Software/dbus/">DBus</a> connections.
Attempts have been made to move <a href="https://www.freedesktop.org/wiki/Software/dbus/">DBus</a> into the kernel
in the form of <a href="https://lwn.net/Articles/504970/">AF_BUS</a>, <a href="https://www.freedesktop.org/wiki/Software/systemd/kdbus/">kdbus</a> and most recently <a href="http://www.bus1.org/">bus1</a>,
and there has also been discussion
of whether <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> should itself be a <a href="https://www.freedesktop.org/wiki/Software/dbus/">DBus</a> daemon
to break this circular dependency.</p>
<h1>Summary</h1>
<p>The traditional UNIX process model wasn't designed to support a complex init,
because it assumed that programs would be benign and well written.</p>
<p>Because you can't trust processes to clean up after themselves properly
you need to make init more complicated to cope with it.</p>
<p>Because init is complicated it needs to be able to be updated.</p>
<p>Because the UNIX process model doesn't have a way to safely replace init
you have to allow for it failing and needing a reboot,
so you can't safely perform live updates.</p>
<p>Alternative ways of structuring init would make it even more complex,
creating more opportunity for things to go wrong.</p>
<p>Is your process running 4 - Linux-specific approaches, by Richard Maw, published 2017-05-10T12:00:06Z (http://yakking.branchable.com/posts/procrun-4-linux/)</p>
<p>We previously discussed the traditional UNIX mechanisms for service management,
and how they assumed benign and well written software.</p>
<p>Fortunately Linux provides more than just traditional UNIX system calls,
so offers some features that can be used to track processes more completely.</p>
<h1>Intercepting processes with <a href="http://man7.org/linux/man-pages/man2/ptrace.2.html">ptrace(2)</a></h1>
<p>If you could run some code when a process creates a subprocess or exits
then you could use this to track which processes are active
and where they came from.</p>
<p>Debuggers like <a href="http://man7.org/linux/man-pages/man1/gdb.1.html">gdb(1)</a> also need to know this information
since you might want to set a breakpoint for subprocesses too.</p>
<p>So it would be possible to do this using the same mechanism as debuggers.</p>
<p>This is what <a href="http://upstart.ubuntu.com/">Upstart</a> does to work out which process to track
for double-forking daemons.</p>
<p>Unfortunately a process cannot be traced by multiple processes,
so if <a href="http://upstart.ubuntu.com/">Upstart</a> is tracing a process to track its subprocesses
then a debugger cannot be attached to the process.</p>
<p><a href="http://upstart.ubuntu.com/">Upstart</a> stops tracing once it has worked out the main PID,
so there is only a small window in which the process is undebuggable,
which is only a problem when debugging faults during startup.
However, detaching after the double-fork means
it can't track any further subprocesses.</p>
<p>Continuing to trace subprocesses would have a noticeable performance cost though,
so it's for the best that it stops tracing after the double-fork.</p>
<h1>Store processes in a <a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroup</a></h1>
<p><a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroups</a> are a Linux kernel feature, exposed through a <a href="https://en.wikipedia.org/wiki/Virtual_file_system">virtual filesystem</a>,
that lets you create hierarchies to organise processes,
and apply resource controls at each level.</p>
<p><a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroups</a> were created to handle the deficiency
of traditional UNIX resource control system calls
such as <a href="http://man7.org/linux/man-pages/man2/setrlimit.2.html">setrlimit(2)</a>,
which only apply to a single process
and can be thwarted by creating subprocesses,
since while a process inherits limits of its parent process
it does not share them with it.</p>
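<p>The "inherited but not shared" behaviour can be sketched with Python's <code>resource</code> module.
This assumes a UNIX-like system where the hard limit on open files is at least the value passed in;
<code>child_nofile_limit</code> is a name invented for this example.</p>

```python
import resource
import subprocess
import sys


def child_nofile_limit(soft):
    """Lower RLIMIT_NOFILE in a child only and return the limit it reports."""

    def lower_limit():
        # Runs in the child between fork and exec, so the parent is untouched.
        _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    report = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"
    output = subprocess.check_output(
        [sys.executable, "-c", report], preexec_fn=lower_limit
    )
    return int(output)
```

<p>The child starts with its own copy of the limit, and so would each of its children,
so a process that wants more than its allowance only has to fork.</p>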
<p>Subprocesses of a process in a <a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroup</a> on the other hand
are part of the same <a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroup</a> and share the same resource limits.</p>
<p>In each <a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroup</a> directory there is a <code>cgroup.procs</code> virtual file,
which lists the process IDs of every process in the <a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroup</a>,
making it effectively a kernel-maintained PIDfile.</p>
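<p>As a sketch of reading this "kernel-maintained PIDfile":
<code>/proc/[pid]/cgroup</code> lists the cgroups a process belongs to,
one <code>hierarchy-ID:controller-list:cgroup-path</code> entry per line,
and <code>cgroup.procs</code> holds one PID per line.
The function names below are made up for this example,
and where the cgroup tree is mounted (commonly under <code>/sys/fs/cgroup</code>) is left to the caller.</p>

```python
import os


def parse_proc_cgroup(text):
    """Parse the contents of /proc/[pid]/cgroup into (id, controllers, path)."""
    entries = []
    for line in text.splitlines():
        hierarchy_id, controllers, path = line.split(":", 2)
        names = [name for name in controllers.split(",") if name]
        entries.append((int(hierarchy_id), names, path))
    return entries


def pids_in_cgroup(cgroup_dir):
    """Return every PID listed in the cgroup.procs file of a cgroup directory."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as procs:
        return [int(line) for line in procs if line.strip()]
```

<p>A service manager can call something like <code>pids_in_cgroup</code> at any time
and get every process of the service, however many times it has forked.</p>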
<p>This is what <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> uses for its services,
and you can request a <a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroup</a> for your own processes
by asking <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> (via <a href="http://man7.org/linux/man-pages/man1/systemd-run.1.html">systemd-run(1)</a> or the DBus interface)
or <a href="https://linuxcontainers.org/cgmanager/introduction/">cgmanager</a> (via <a href="http://manpages.ubuntu.com/manpages/yakkety/man1/cgm.1.html">cgm(1)</a> or the DBus interface)
to do so on your behalf.</p>
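<p>For the systemd-run(1) route, a wrapper might look something like this.
It is only a sketch: it assumes <code>systemd-run</code> is installed and a user manager is running,
and the function names and the unit name you pass in are hypothetical.</p>

```python
import subprocess


def scope_command(unit_name, argv, user=True):
    """Build a systemd-run command line that runs argv in its own scope unit."""
    command = ["systemd-run", "--scope", f"--unit={unit_name}"]
    if user:
        command.append("--user")  # talk to the per-user manager, not the system one
    return command + ["--"] + list(argv)


def run_in_scope(unit_name, argv):
    """Run argv inside a new scope (and so a new cgroup); return its exit code."""
    return subprocess.run(scope_command(unit_name, argv), check=False).returncode
```

<p>With <code>--scope</code> the command runs as a child of the caller rather than of systemd,
but systemd still creates and manages the cgroup it lives in.</p>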
<h2>Why can't I mount my own cgroupfs?</h2>
<p>Unfortunately you can only safely have one process using a cgroup tree at a time,
and you can only have one cgroupfs mounted at a time,
so you always need to ask some daemon to manage cgroups on your behalf.</p>
<p>See <a href="https://lwn.net/Articles/555920">Changes coming for systemd and control groups</a>
for why a single writer and a single hierarchy are required.</p>
<h1>Conclusion</h1>
<p>It is necessary to track all the subprocesses of a service somehow.
Using <a href="http://man7.org/linux/man-pages/man2/ptrace.2.html">ptrace(2)</a> prevents it from being used for debugging.
<a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroups</a> are an interface designed for this purpose,
but technical limitations mean you need to ask another service to manage them for you.</p>
<p>So I would recommend writing a <a href="https://www.freedesktop.org/software/systemd/man/systemd.service.html">systemd service</a>
if your processes are a per-system or per-user service,
or using the <a href="https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/#theapis">DBus API</a> to create <a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroups</a> if not.</p>
<p>Thus <a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroups</a> allow us to know our processes are running,
and currently the best way to use cgroups is via <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a>.
The implications of relying on <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> to do this
are best left as the subject of another article.</p>
<p>If you are interested in learning more about <a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroups</a>,
I recommend reading <a href="https://lwn.net/Articles/604609/">Neil Brown's excellent series on LWN</a>.</p>