One of the elegant parts of Unix is that the shell expands wildcards, so that they work the same way for every command. As a result, each of the following will work on the same set of files:
ls *.bak
rm *.bak
mv *.bak /tmp/wastebasket
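You can preview what the shell will do by prefixing the command with echo. The shell expands *.bak before echo runs, so this prints the exact argument list rm would have received, without deleting anything:
echo rm *.bak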
However, sometimes they don't work. And sometimes ls and rm work, but the mv doesn't. That happens when you hit one of the hidden internal limits of a Unix system: the maximum size of a command line when executing a new program.
In Unix, to execute a new program, you use one of the exec family of system calls. These replace the program running in the current process with a new one, and as part of that they pass in a set of command line arguments. The kernel has a fixed maximum size for the command line arguments, which in Linux is 128 KiB.
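You can ask the system for the limit with getconf. Note that the exact number varies between systems, and the value reported has to accommodate the environment variables passed to the new program as well as the arguments:
getconf ARG_MAX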
In other words, a command such as rm *.bak may fail if the pattern matches so many files that the combined length of their filenames exceeds 128 KiB. That didn't use to be much of a problem, but as disks have grown and people have more files, it happens more often now.
You have to find ways around the limit. One common trick is to run multiple commands with more specific patterns, to limit the command line arguments for each run. This can be quite a bit of tedious work, and it's error prone.
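For instance, you might split the deletion by the first character of the filenames (a sketch, assuming the names are spread reasonably across the alphabet):
rm [a-m]*.bak
rm [n-z]*.bak
Doing that by hand gets old fast. If only there were a way to automate it.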
Of course there is.
The xargs command does exactly that. Here's an example:
ls *.bak | xargs rm
xargs reads its standard input to get a list of filenames, breaks the list down into chunks of 128 KiB or less, and runs the command given to it once for each chunk. Thus, you can remove all the files more easily than by having to find filename patterns manually.
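You can watch the chunking in miniature by forcing tiny chunks with the -n option, which caps the number of arguments per invocation rather than the byte count:
seq 5 | xargs -n 2 echo
This runs echo three times: with 1 2, then 3 4, then 5.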
Ah, but the ls *.bak | xargs rm example doesn't work, does it? It still runs ls *.bak, which runs into the problem of the command line length limit.
The find tool helps here. It finds files that match some criteria, and writes the pathname of each matching file to its stdout. If we feed that list to xargs, we get something better:
find -name '*.bak' | xargs rm
This will work better, but it's still got a gotcha. By default, xargs reads filenames delimited by any whitespace, including plain old space characters. That means that it will get somewhat confused when you have a filename such as 001 March of Cambreadth.flac in your music collection.
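You can see the failure mode for yourself. Assuming that file exists in the current directory, the following makes xargs split the name on its spaces, so ls is asked to list four files, none of which exist:
echo 001 March of Cambreadth.flac | xargs ls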
The solution here is to use a delimiter that can't ever be part of a pathname, and the only such character (byte) is the NUL one. Handily, find and xargs have options to deal with that:
find -name '*.bak' -print0 | xargs -0 rm
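With those options, the Cambreadth file survives intact: find terminates each pathname with a NUL byte, and xargs splits its input only on NUL, so the whole name arrives at the command as a single argument. You can check with a harmless command before committing to rm:
find -name '*.flac' -print0 | xargs -0 ls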
Alternatively, find can run commands itself:
find -name '*.bak' -exec rm '{}' ';'
find replaces {} with the pathname of a file it's found. This way, xargs isn't needed at all. However, in the above example, find will run rm once for each file it finds. If you replace the semicolon in the example above with a plus sign (+), find will group files into larger batches, just like xargs. (Beware if you need portability: this may be a feature available only in relatively recent versions of the GNU implementation of find.)
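The grouped variant looks like this:
find -name '*.bak' -exec rm '{}' +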
find can delete files directly, as well, but that's a special case, so we'll skip an example. Check the manual page.
Perhaps you need to do something more complicated than removing files, for example compressing them. You may want to compress several files at the same time, to make better use of the multiple CPU cores you have available. For this, you will probably want to use the parallel tool. There are at least two implementations of this: one in moreutils, and GNU parallel.
find -name '*.bak' -print0 | xargs -0 parallel gzip --
This example takes a bit of unravelling:
- find writes the pathnames of matching files, delimited by NUL bytes, to its stdout
- xargs reads filenames from its stdin, and assumes NUL delimiters
- the command to run is parallel gzip --
- the -- tells parallel that it should run gzip on any arguments following the --, or in other words, the -- separates the command to be run from the filenames to give the command as arguments
- parallel starts an instance of the command for each CPU core, and gives each instance the next filename argument; when an instance terminates, it starts a new instance with the next filename argument, until it's run the command for each argument
This should be much more efficient than running one gzip at a time. The example combines find and xargs rather than using find's -exec, just for kicks. Simplification is left as an exercise to the reader.
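(If you do want a hint: GNU parallel, though not the moreutils one as far as I know, can read NUL-delimited names itself with its --null option, so the xargs step could be dropped:
find -name '*.bak' -print0 | parallel --null gzip
Check which parallel you have before relying on that, as the two implementations take different options.)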
find, xargs, and parallel are a very powerful set of tools. If you work on the Unix command line more than a little, it pays to read their manual pages and become familiar with them, as they can save you a ton of work when used properly. (Let's not worry about the time and effort spent on debugging complex invocations. We all write perfect code the first time.)
They are also a good example of Unix tools that are designed to be combined in powerful ways to achieve things that might otherwise require writing a lot of custom code.