Whitespace is the set of blank characters, commonly defined as space, tab, newline and possibly carriage return.

Its significance in shell scripts is that command line arguments are separated by whitespace, unless the arguments are quoted.

For illustration, we have a shell script called printargs, which writes each of the arguments we give it on a new line.

$ install /dev/stdin printargs <<'EOF'
#!/bin/sh
for arg; do
    echo "$arg"
done
EOF
$ ./printargs foo bar baz
foo
bar
baz

You can see that it behaves as expected for simple, one-word arguments.

However, suppose we want to print the string hello world. If we type it in as is, hello and world are printed on separate lines.

$ ./printargs hello world
hello
world

This is where quoting comes in. If you surround a string with quotation marks, i.e. ' or ", it is treated as a single argument.

$ ./printargs "hello world"
hello world
$ ./printargs 'hello world'
hello world

Alternatively, special characters can be escaped with a \ (backslash).

$ ./printargs hello\ world
hello world

However, this looks ugly.

Similarly, if you want to put a " in a string that is quoted with double quotes, you can escape it or use the other quoting style.

$ ./printargs "hello \"material\" world"
hello "material" world
$ ./printargs 'hello "material" world'
hello "material" world
$ ./printargs "hello 'material' world"
hello 'material' world

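The equivalent for ' is very ugly. The only thing that terminates a single-quoted sequence is a single quote, so escaping is not possible inside one. Instead you have to close the quotes, add an escaped single quote, and open the quotes again.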

$ ./printargs 'hello '\''material'\'' world'
hello 'material' world

Having read that, you may wonder how people make whitespace errors in shell commands, but it becomes less obvious when variables are involved.

$ var="hello \"material\" world"
$ ./printargs $var
hello
"material"
world

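This goes wrong because $var is expanded in the command line and the result is then split on whitespace. The quote characters stored in the variable are ordinary data by this point, not shell syntax, so to your shell it looks like ./printargs hello \"material\" world.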

This can be prevented by quoting the variable substitution.

$ ./printargs "$var"
hello "material" world
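One way to see the difference is to count the arguments a command receives. The following is a small sketch; count_args is a hypothetical helper introduced only for this illustration.

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() {
    echo "$#"
}

var='hello "material" world'

# Unquoted: the expanded value is split on whitespace into three words.
unquoted=$(count_args $var)

# Quoted: the expanded value is passed as a single word.
quoted=$(count_args "$var")

echo "unquoted: $unquoted"
echo "quoted: $quoted"
```

The unquoted call reports 3 arguments and the quoted call reports 1.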

You may wonder why the shell behaves this way. It's mostly historical: that's how shells have always done it, and it's kept that way for backwards compatibility. Some shells, like zsh, break with backwards compatibility in favour of a more sensible default.

It does occasionally come in useful, when strings aren't whitespace sensitive.

$ names="what who why"
$ for name in $names; do
    echo my name is $name
done; \
echo Slim shady
my name is what
my name is who
my name is why
Slim shady

However, if you're dealing with filenames, this is entirely inappropriate.

$ mkdir temp
$ cd temp
$ touch "foo bar" baz
$ for fn in `find . -type f`; do rm "$fn"; done
rm: cannot remove `./foo': No such file or directory
rm: cannot remove `bar': No such file or directory
$ ls
foo bar

Admittedly, this example is a little contrived, but it can be the difference between cleaning up your temporary files and deleting your music collection.

$ ls ~
temp music.mp3
$ ls -1 ~/temp
not music.mp3
scrap
$ cd ~
$ for fn in `find ~/temp -type f`; do rm "$fn"; done
rm: cannot remove `~/temp/not': No such file or directory
$ ls
temp

There are a few ways this could have been avoided.

  1. Use arrays
  2. Process the files directly with find
  3. Have find pass the files on to another program which handles whitespace better
  4. Handle whitespace yourself

Using arrays

I mentioned $* and $@ when I first talked about shell. These expand to all of the command line arguments, and the same notation is used for other array variables, as in ${array[@]}.

They behave identically unless quoted with ". When quoted, $@ expands to one word per argument, while $* becomes a single string with the arguments joined by the first character of IFS, which is a space by default.

Behold!

$ set -- "foo bar" baz qux
$ ./printargs $@
foo
bar
baz
qux
$ ./printargs "$@"
foo bar
baz
qux
$ ./printargs "$*"
foo bar baz qux
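Strictly speaking, the separator used by "$*" is the first character of IFS, which happens to be a space by default. A minimal sketch of that, assuming IFS has not already been changed in your shell (the variable names are only for illustration):

```shell
set -- "foo bar" baz qux

# The default IFS starts with a space, so the arguments are joined with spaces.
joined_space="$*"

# Temporarily make a comma the first character of IFS, then restore it.
old_ifs=$IFS
IFS=,
joined_comma="$*"
IFS=$old_ifs

echo "$joined_space"
echo "$joined_comma"
```

This prints foo bar baz qux followed by foo bar,baz,qux, which makes "$*" a handy way of joining a list with a separator of your choice.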

The previous example used set -- to use the command line argument array, since every shell has one. If you are using bash, you can have other arrays.

$ array=( "foo bar" baz qux )
$ ./printargs "${array[@]}"
foo bar
baz
qux
$ array+=(badger)
$ ./printargs "${array[@]}"
foo bar
baz
qux
badger

Glob expressions work in arrays too, so you can have an array of all files in a directory.

$ toremove=( ~/temp/* )
$ rm "${toremove[@]}"

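Unfortunately, there is no built-in way of recursively reading the contents of a directory into an array that works in every version of bash (the globstar option only arrived in bash 4, and it matches directories as well as files), so one of the later techniques will be required in conjunction.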

$ declare -a toremove
$ while IFS= read -r f; do toremove+=("$f"); done < <(find ~/temp -type f)
$ rm "${toremove[@]}"

Removing directly with find

Find can remove the files itself if you use the -delete option. It would be used like this:

$ find ~/temp -type f -delete

However, this only works for deleting files. We may instead want to do something else, like ensure a file can't be executed. This can be done with chmod a-x and find's -exec option.

$ find ~/temp -type f -exec chmod a-x {} +

Removing a file is similarly achieved, just by using rm instead of chmod.

$ find ~/temp -type f -exec rm {} +
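The + that ends these -exec commands tells find to pass as many paths as will fit to a single invocation of the command, whereas the alternative \; terminator runs the command once per file. A rough sketch of the difference, using a scratch directory created with mktemp (an assumption for this illustration; any directory holding three files would do):

```shell
# Scratch directory with three files, one of which has a space in its name.
dir=$(mktemp -d)
touch "$dir/foo bar" "$dir/baz" "$dir/qux"

# With \; the command is run once per file, so sh is started three times.
runs_each=$(find "$dir" -type f -exec sh -c 'echo run' sh {} \; | grep -c run)

# With + the paths are batched, so sh is started once...
runs_batch=$(find "$dir" -type f -exec sh -c 'echo run' sh {} + | grep -c run)

# ...and that single sh receives all three paths as separate arguments.
args_batch=$(find "$dir" -type f -exec sh -c 'echo "$#"' sh {} +)

echo "runs with \\;: $runs_each"
echo "runs with +: $runs_batch"
echo "arguments with +: $args_batch"

rm -r "$dir"
```

Either way the filename containing a space arrives intact, because find hands each path to the command as its own argument without going through the shell's word splitting.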

More complicated operations are possible with an inline shell command. The following makes a backup of all the files as their name with .bak suffixed.

$ find ~/temp -type f -exec sh -c 'cp "$1" "$1.bak"' - {} \;

This is pretty ugly for anything complicated, is difficult to remember, and requires you to know how to invoke a command in another shell the hard way.

For more details on how this works, refer to the find(1) man page, looking for the alternative form of -exec, and the bash(1) man page for what -c means and how to pass extra arguments to a shell command.
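As a quick illustration of the sh -c part: the first argument after the script becomes $0 and the rest become $1, $2 and so on. This is why the backup command has a - before {}; the - fills $0 so that the filename lands in $1. The argument values below are made up for the demonstration.

```shell
# "placeholder" becomes $0; "foo bar" and baz become $1 and $2.
result=$(sh -c 'printf "%s|%s|%s\n" "$0" "$1" "$2"' placeholder "foo bar" baz)

echo "$result"
```

This prints placeholder|foo bar|baz, with the space inside the second argument preserved.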

Passing commands to xargs

If you have a deficient find command, or want to minimise the number of commands run, you can use xargs.

This can be used similarly to the previous find commands:

$ find ~/temp -type f | xargs rm

For people with a wide knowledge of shell programming (i.e. those who know lots of commands and how to put them together, rather than all the options of each command) this is more readable. However, it is not yet whitespace safe. find separates the paths it prints with newline characters, and xargs by default splits its input on spaces and tabs as well as newlines, and treats quotes and backslashes specially. A filename containing any of these would be handled wrongly.

This can be solved by find's -print0 argument and xargs' -0 argument.

$ find ~/temp -type f -print0 | xargs -0 rm

What this will do is instead of separating file paths with newline characters, it will separate them with NUL bytes (i.e. all bits 0). Since NUL is not a valid character in filenames, it cannot misinterpret the strings.

This is, if anything, slightly uglier than using -exec with find for backing up files.

$ find ~/temp -type f -print0 | xargs -0 -n 1 sh -c 'cp "$1" "$1.bak"' -
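To see what xargs does with its input without involving any real files, you can feed it names with printf and count the arguments that arrive. This is only a sketch; it assumes an xargs that supports -0, as the GNU and BSD versions do.

```shell
# By default xargs splits on blanks as well as newlines,
# so "foo bar" arrives as two separate arguments.
default_count=$(printf 'foo bar\nbaz\n' | xargs sh -c 'echo "$#"' sh)

# With -0 only NUL bytes separate arguments, so "foo bar" stays in one piece.
nul_count=$(printf 'foo bar\0baz\0' | xargs -0 sh -c 'echo "$#"' sh)

echo "default: $default_count arguments"
echo "with -0: $nul_count arguments"
```

The default run reports 3 arguments for two names, while the -0 run reports 2.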

Handling whitespace yourself

xargs is not the only command that can be used to handle input. If you're using bash, you can do it in shell directly with read -d.

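$ find ~/temp -type f -print0 | while IFS= read -r -d '' fn; do
    cp "$fn" "$fn.bak"
done

read is a shell builtin command which reads lines from standard input, or another file. -d is an option to change the input line delimiter; we change it to NUL to match what find produces. A shell string cannot actually contain a NUL byte, so bash's read treats an empty delimiter, written here as '', as meaning NUL. You may also see this written as $'\0', a bashism (ANSI-C quoting) which expands to the same empty string. IFS= stops read from stripping leading and trailing whitespace, and -r stops it from interpreting backslashes.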
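The same loop can also fill an array, which combines this technique with the earlier one. The sketch below is bash only, since arrays, process substitution and read -d are not available in plain sh. The scratch directory from mktemp is an assumption made for the illustration.

```shell
# bash only: arrays, <( ) and read -d are not POSIX sh features.
dir=$(mktemp -d)
touch "$dir/foo bar" "$dir/baz"

# Read NUL-delimited paths from find, appending each one to the array.
# The loop runs in the current shell, so the array survives after it ends.
files=()
while IFS= read -r -d '' fn; do
    files+=("$fn")
done < <(find "$dir" -type f -print0)

echo "found ${#files[@]} files"

rm -r "$dir"
```

Feeding the loop with < <(...) rather than a pipe matters here. With a pipe, bash normally runs the loop in a subshell, and the array would be empty once the loop finished.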

Summary

  1. Quote your variables.

    Always. Then when you do occasionally need to do it un-quoted you'll think about it.

  2. Use NUL delimited input when possible.

    Most commands that can take a list of files as input from another file will have an option to allow NUL termination.

    xargs has -0, tar -T <file> has --null, cpio has both.

    If a command has its command-line structured such that an arbitrary number of files can be passed, use it with xargs.

    If an argument needs further processing between input and passing to a command, it can be more readable to pipe it to a while read loop.

  3. Use arrays when possible.

    You can use the command line arguments array of any shell with set:

    $ set -- "foo bar" baz qux
    $ ./printargs "$@"
    foo bar
    baz
    qux
    

    You can initialize a new array in bash with:

    $ array=("foo bar" baz qux)
    $ ./printargs "${array[@]}"
    foo bar
    baz
    qux