Daniel Silverstone Principle of least surprise

In many things in life, but particularly in the world of user interfaces and software behaviours, good programmers subscribe to the principle of least surprise. This means that we, as programmers, attempt to ensure that the ways we accept input, produce output, and perform work are as obvious to the user as they can be, as consistent as possible across software projects, and therefore as unlikely as possible to surprise the user.

Unlike when we're receiving gifts or attending a pot-luck dinner, when we're writing a configuration file or running a fundamental tool, surprise is a bad thing. Users simply do not like it when something they expect to work doesn't, which is why this is also known as the principle of least astonishment. Following it leads to a consistency and uniformity of interface which users appreciate and programmers find handy. For example, many applications use INI-style configuration files so that users can be confident that they understand the syntax of the file they're editing.

Programs often use similar key bindings to navigate content. Anyone who has played with Vi/Vim will know about hjkl (bless you). It might therefore surprise you to know that XMonad defaults to using M-h and M-l as "move boundary left/right" and M-j and M-k as focus "up" or focus "down" in the window list. Or that in G+ (yes, Google Plus, and in fact in gmail too) you can navigate between entries using j and k (as do Tweetdeck, Twitter, Facebook and a myriad of others).

An example of where the principle breaks down is the use of the home and end keys. I personally expect home to mean start of document and end to mean end of document, so when someone else tries to operate my editor and home doesn't take them to the start of the line, they're very confused. Equally, my fingers are used to C-w being 'delete word' in a shell, but some terminal emulators take that to mean 'close window', which can be a tad confusing.

So next time you're developing a piece of software the user interacts with in any way (even if only on the command line) -- take a look at how other programs do the sorts of things your UI will do, and try to ensure that no user will be surprised by how you choose to do things.

Posted Wed Jan 7 12:00:07 2015

One of the elegant parts of Unix is that the shell expands wildcards, so that they work the same way for every command. As a result, each of the following will work on the same set of files:

ls *.bak
rm *.bak
mv *.bak /tmp/wastebasket

However, sometimes they don't work. And sometimes ls and rm work, but the mv doesn't. That happens when you hit one of the hidden internal limits of a Unix system: the maximum size of a command line when executing a new program.

In Unix, to execute a new program, you use one of the exec family of system calls. These replace the program running in the current process with a new one, and as part of that they pass in a set of command line arguments. The kernel has a maximum size for the command line arguments; on Linux this has traditionally been 128 KiB (newer kernels allow somewhat more, but there is still a limit).

In other words, a command such as rm *.bak may fail if the pattern matches so many files that the combined length of their filenames exceeds the limit. That didn't use to be much of a problem, but as disks have grown and people have more files, it happens more often now.
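
If you're curious what the limit is on a particular system, getconf can report it, and the error you get when you exceed it is fairly distinctive (the exact figure varies between systems and kernel versions):

# Report the maximum combined size of command line arguments and
# environment, in bytes:
getconf ARG_MAX

# Exceeding it makes the exec call fail with E2BIG, which the shell
# reports as something like:
#   rm: Argument list too long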

You have to find ways around the limit. One common trick is to run multiple commands with more specific patterns to limit the command line arguments for each run. This can be quite a bit of tedious work, and can be quite error-prone. If only there were a way to automate it.

Of course there is.

The xargs command does exactly that. Here's an example:

ls *.bak | xargs rm

xargs reads its standard input to get a list of filenames, breaks the list down into chunks small enough to fit within the limit, and runs the command given to it once for each chunk. Thus, you can remove all the files more easily than by having to craft more specific filename patterns manually.
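
You can watch the chunking happen on a small scale by capping the number of arguments per invocation with xargs -n (purely for illustration; normally you let xargs pick chunk sizes that fit the real limit):

# Run echo with at most two arguments at a time:
printf 'a b c d e\n' | xargs -n 2 echo
# prints "a b", then "c d", then "e"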

Ah, but the ls *.bak | xargs rm example doesn't really work, does it? It still runs ls *.bak, which runs into the same command line length limit.

The find tool helps here. It finds files that match some criteria, and writes the pathname of each matching file to its stdout. If we feed that list to xargs, we get something better:

find -name '*.bak' | xargs rm

This will work better, but it's still got a gotcha. xargs reads filenames delimited by any whitespace, by default, including plain old space characters. That means that it will get somewhat confused when you have a filename such as 001 March of Cambreadth.flac in your music collection.
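
Here is a quick demonstration of the problem, using a throwaway file (the name is only an example):

# Create a file whose name contains spaces:
touch '001 March of Cambreadth.flac'

# xargs splits on the spaces, so the single file becomes four "filenames":
ls *.flac | xargs -n 1 echo
# prints 001, March, of and Cambreadth.flac on separate lines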

The solution here is to use a delimiter that can't ever be part of a pathname, and the only such character (byte) is the NUL one. Handily, find and xargs have options to deal with that:

find -name '*.bak' -print0 | xargs -0 rm

find can run commands itself:

find -name '*.bak' -exec rm '{}' ';'

find replaces {} with the pathname of a file it's found. This way, xargs isn't needed at all. However, in the above example, find will run rm once for each file it finds. If you replace the semicolon in the example above with a plus sign (+), find will group files into larger batches, just like xargs. (Beware if you need portability: the + form is in POSIX these days, but old implementations of find may not support it.)

find can delete files directly, as well, but that's a special case, so we'll skip an example. Check the manual page.

Perhaps you need to do something more complicated than removing files, for example compressing them. You may want to compress several files at the same time, to make better use of the multiple CPU cores you have available. For this, you will probably want to use the parallel tool. There are at least two implementations of this: one in moreutils, and GNU parallel.

find -name '*.bak' -print0 | xargs -0 parallel gzip --

This example takes a bit of unravelling:

  • find writes the pathnames of matching files, delimited by NUL bytes, to its stdout
  • xargs reads files from its stdin, and assumes NUL delimiters
  • the command to run is parallel gzip --
  • the -- tells parallel that it should run gzip on any arguments following the --, or in other words, the -- separates the command to be run from the filenames to give the command as arguments
  • parallel starts an instance of the command for each CPU core, and gives each instance the next filename argument; when an instance terminates, it starts a new instance with the next filename argument, until it has run the command for each argument

This should be much more efficient than running one gzip at a time. The example combines find and xargs rather than using find -exec, just for kicks. Simplification is left as an exercise for the reader.

find, xargs, and parallel are a very powerful set of tools. If you work on the Unix command line more than a little, it pays to read their manual pages and become familiar with them, as they can save you a ton of work when used properly. (Let's not worry about the time and effort spent on debugging complex invocations. We all write perfect code the first time.)

They are also a good example of Unix tools that are designed to be combined in powerful ways to achieve things that might otherwise require writing a lot of custom code.

Posted Wed Jan 14 12:00:08 2015

It is a well known joke that "C programmers can write C in any programming language", though we can abstract this joke away to "$LANGUAGE programmers can write $LANGUAGE in any programming language".

This is usually language snobbery, poking fun at newcomers with a background in a different programming paradigm.

As well as being rude, unhelpful, insular and likely to drive newcomers away, the joke is wrong in suggesting that this is a bad thing.

Yes, there may be better ways of accomplishing what the newcomer seeks to do, but the important part is that it's a learning experience from which they may bootstrap themselves up to a proper understanding of the language; a process which you can assist.

Ranting aside, I have been trying to teach myself Haskell. I come from an imperative and object-oriented programming background, with mutable objects and strict evaluation, so Haskell is somewhat of a culture shock, being functional, immutable and lazily evaluated.

A refuge in the storm of unfamiliarity is that the syntax resembles that of shell, so I'm starting from being able to plug existing shell commands together, until I can understand the equivalent native operation.

This is not a bad idea, since it allows you to be more productive while learning; and since one advantage of shell is that you have a lot of commands available to re-use, the same approach is applicable in many programming environments once you work out how to execute subprocesses.

One of the nice ways you can do this in Haskell is the Shelly package.

In this example, we have a git foreach command that runs a git command in every git repository under the current directory.

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ExtendedDefaultRules #-}
{-# OPTIONS_GHC -fno-warn-type-defaults #-}

import Control.Monad
import System.Environment

import Shelly
import Data.Text as T
default (T.Text)

-- For each entry in the current directory that contains a .git
-- directory, change into it and run the given git command there.
git_foreach args = shelly $ do
    contents <- ls "."
    forM_ contents $ \entry -> do
        isDir <- test_d $ entry </> ".git"
        when isDir $
            chdir entry $ cmd "git" args

main = getArgs >>= git_foreach
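
Assuming you save this as git-foreach.hs (the filename is just an example) and have the shelly package installed (for instance via cabal install shelly), you can run it directly with runghc:

# e.g. run 'git status' in every repository in the current directory
runghc git-foreach.hs status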

More detail on how this works can be found in the excellent article by the author.

Posted Wed Jan 21 12:00:08 2015
Daniel Silverstone Virtualised systems

Virtualisation, when applied as a term to full systems, is the process by which a full computer is emulated such that an operating system can run as though it were on real hardware. That is, of course, a simplification - though it holds. Virtualisation has been around for a very long time in one form or another, but as it is used these days, it tends only to apply to the above definition.

I must warn you now that this article will only be dealing with virtualisation technologies which exist for Linux and I will only be touching on some of the very many options out there. Linux has a number of built-in virtualisation technologies already and more are being developed, and while many if not all of them are available in some form on other *NIX operating systems (and indeed Windows), I am only going to consider the forms they take on Linux.

At the "lightest" level of virtualisation are chroots. These are very simple and offer very little in the way of protection but they are extremely lightweight and allow you to have multiple operating systems present and run software in effectively isolated instances of those operating systems with little effort. In particular, chroots still share process trees, mount namespaces, file descriptors etc, between themselves and as such are not a way to effectively prevent attacks from code of unknown provenance in and of themselves.

The next level "up" in virtualisation are Linux containers where the various aspects of the *NIX system such as PIDs, FDs, mount namespaces, etc can be unshared. These containers, managed by tools such as Docker are somewhat more isolated than chroots, but they still share the same kernel and as such are still risky endeavours since only the one piece of software (the kernel) stands between an attacker and the rest of the system(s).

Beyond what we have already discussed, we get into "proper" system virtualisation, where a full computer is emulated in some fashion. This might be partial or full virtualisation, and there are a number of ways of achieving it on a modern Linux system. There is a built-in virtualisation mechanism called kvm which, along with some user-land software, allows a Linux kernel to act as a hypervisor for a number of virtualised systems limited, in theory, only by the available resources. KVM itself isn't particularly friendly to use, but there is a project called libvirt which abstracts kvm (and other virtualisation mechanisms) and can help with that. The kvm system supports a method of providing efficiently virtualised emulated hardware to the guest systems, called virtio. In this manner, kvm can provide both partial and full virtualisation, although typically only Linux guests will use virtio.
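
If you want to try kvm without the comfort of libvirt, you can drive qemu directly; the following is only a sketch (the filenames are examples), and libvirt's virsh and virt-install tools wrap the same machinery up more pleasantly:

# Create a 20G copy-on-write disk image for the guest:
qemu-img create -f qcow2 guest.img 20G

# Boot an installer ISO in a KVM-accelerated guest, exposing the
# disk via virtio:
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=guest.img,format=qcow2,if=virtio \
    -cdrom installer.iso -boot d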

If you require something a little more cross-platform (for example if you want to run the same virtualisation system on your Linux systems as your colleagues and friends could run on their Windows or Mac OS systems) then there is Oracle's VirtualBox. VirtualBox has a GUI to help you organise your systems a little like libvirt can. It works on other platforms as well as Linux and it is consistent in its behaviour across those platforms, in all of the features it offers to guests, users, and command-line programs. VirtualBox has a feature called 'Guest Additions' which provides partial virtualisation features where the guest and host collude to provide features such as mounting parts of the host filesystem into the guest, or accelerating video driver access. VirtualBox is, for the most part, open source software.
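
VirtualBox can also be driven entirely from the command line via VBoxManage, which is useful on headless machines; the VM name and memory size below are just an example:

# Create and register a new virtual machine:
VBoxManage createvm --name "test-vm" --register

# Give it some memory:
VBoxManage modifyvm "test-vm" --memory 2048

# Start it without opening a GUI window:
VBoxManage startvm "test-vm" --type headless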

If you are, however, interested in pay-for virtualisation solutions then the one option which springs to most people's minds is VMWare Workstation. VMWare is proprietary and costs a non-trivial amount of money but perhaps offers the best results if you wish to virtualise Windows to avoid having to run it natively.

And as if all the above wasn't enough, it's possible to nest these techniques to form nested virtualisation at various levels. Some of the above techniques can also be accelerated by hardware. Indeed, if you have an x86 system and you wish to play with any of kvm, VirtualBox or VMWare then you should probably pop into your BIOS (or whatever you have) and check that the VT bit is enabled for your CPUs. (In theory there are attack vectors which take advantage of VT to leak data, and it is also a way to potentially fool you, so it is usually turned off by default.)
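
On Linux you can check whether your CPU advertises the relevant extensions (vmx for Intel VT-x, svm for AMD-V) without rebooting into the firmware setup:

# A non-zero count means the CPU supports hardware virtualisation
# (it may still need enabling in the BIOS/firmware):
grep -E -c 'vmx|svm' /proc/cpuinfo

# On a working kvm setup, with the modules loaded, this device exists:
ls -l /dev/kvm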


As you can tell, if you read up on the above software and the concepts and technologies referenced in the very first link of this article, virtualisation is a powerful technique which can be applied at various levels to provide security or isolation to you and to software you wish to run. Nesting various virtualisation mechanisms at various points in your software architecture can result in it being significantly harder for an attacker to break out and do harm to your systems.

Posted Wed Jan 28 12:00:09 2015