One of the elegant parts of Unix is that the shell expands wildcards, so that they work the same way for every command. As a result, each of the following will work on the same set of files:
ls *.bak
rm *.bak
mv *.bak /tmp/wastebasket
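You can preview what the shell will do by prefixing the command with echo. The shell expands *.bak before echo runs, so this prints the exact argument list rm would have received, without deleting anything:
echo rm *.bak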
However, sometimes they don't work. And sometimes ls and rm work, but the mv doesn't. That happens when you hit one of the hidden internal limits of a Unix system: the maximum size of a command line when executing a new program.
In Unix, to execute a new program, you use one of the exec family of system calls. These replace the program running in the current process with a new one, and as part of that they pass in a set of command line arguments. The kernel has a fixed maximum size for the command line arguments, which in Linux is 128 KiB.
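You can ask the system for the limit with getconf. Note that the exact number varies between systems, and the value reported has to accommodate the environment variables passed to the new program as well as the arguments:
getconf ARG_MAX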
In other words, a command such as rm *.bak may fail if the pattern matches so many files that the combined length of their filenames exceeds 128 KiB. That didn't use to be much of a problem, but as disks have grown and people have more files, it happens more often now.
You have to find ways around the limit. One common trick is to run multiple commands with more specific patterns, to limit the command line arguments for each run. This can be quite a bit of tedious work, and it's error prone.
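For instance, you might split the deletion by the first character of the filenames (a sketch, assuming the names are spread reasonably across the alphabet):
rm [a-m]*.bak
rm [n-z]*.bak
Doing that by hand gets old fast. If only there were a way to automate it.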
Of course there is.
The xargs command does exactly that. Here's an example:
ls *.bak | xargs rm
xargs reads its standard input to get a list of filenames, breaks the list down into chunks of 128 KiB or less, and runs the command given to it once for each chunk. Thus, you can remove all the files more easily than by having to find filename patterns manually.
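You can watch the chunking in miniature by forcing tiny chunks with the -n option, which caps the number of arguments per invocation rather than the byte count:
seq 5 | xargs -n 2 echo
This runs echo three times: with 1 2, then 3 4, then 5.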
Ah, but the ls *.bak | xargs rm example doesn't work, does it? It still runs ls *.bak, which runs into the problem of the command line length limit.
The find tool helps here. It finds files that match some criteria, and writes the pathname of each matching file to its stdout. If we feed that list to xargs, we get something better:
find -name '*.bak' | xargs rm
This will work better, but it's still got a gotcha. By default, xargs reads filenames delimited by any whitespace, including plain old space characters. That means that it will get somewhat confused when you have a filename such as 001 March of Cambreadth.flac in your music collection.
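You can see the failure mode for yourself. Assuming that file exists in the current directory, the following makes xargs split the name on its spaces, so ls is asked to list four files, none of which exist:
echo 001 March of Cambreadth.flac | xargs ls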
The solution here is to use a delimiter that can't ever be part of a pathname, and the only such character (byte) is the NUL one. Handily, find and xargs have options to deal with that:
find -name '*.bak' -print0 | xargs -0 rm
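With those options, the Cambreadth file survives intact: find terminates each pathname with a NUL byte, and xargs splits its input only on NUL, so the whole name arrives at the command as a single argument. You can check with a harmless command before committing to rm:
find -name '*.flac' -print0 | xargs -0 ls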
Alternatively, find can run commands itself:
find -name '*.bak' -exec rm '{}' ';'
find replaces {} with the pathname of a file it's found. This way, xargs isn't needed at all. However, in the above example, find will run rm once for each file it finds. If you replace the semicolon in the example above with a plus sign (+), find will group files into larger batches, just like xargs. (Beware if you need portability: this may be a feature available only in relatively recent versions of the GNU implementation of find.)
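The grouped variant looks like this:
find -name '*.bak' -exec rm '{}' +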
find can delete files directly, as well, but that's a special case, so we'll skip an example. Check the manual page.
Perhaps you need to do something more complicated than removing files, for example compressing them. You may want to compress several files at the same time, to make better use of the multiple CPU cores you have available. For this, you will probably want to use the parallel tool. There are at least two implementations of this: one in moreutils, and GNU parallel.
find -name '*.bak' -print0 | xargs -0 parallel gzip --
This example takes a bit of unravelling:
- find writes the pathnames of matching files, delimited by NUL bytes, to its stdout
- xargs reads filenames from its stdin, and assumes NUL delimiters
- the command to run is parallel gzip --
- the -- tells parallel that it should run gzip on any arguments following the --, or in other words, the -- separates the command to be run from the filenames to give the command as arguments
- parallel starts an instance of the command for each CPU core, and gives each instance the next filename argument; when an instance terminates, it starts a new instance with the next filename argument, until it's run the command for each argument
This should be much more efficient than running one gzip at a time. The example combines find and xargs rather than using find's -exec, just for kicks. Simplification is left as an exercise to the reader.
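(If you do want a hint: GNU parallel, though not the moreutils one as far as I know, can read NUL-delimited names itself with its --null option, so the xargs step could be dropped:
find -name '*.bak' -print0 | parallel --null gzip
Check which parallel you have before relying on that, as the two implementations take different options.)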
find, xargs, and parallel are a very powerful set of tools. If you work on the Unix command line more than a little, it pays to read their manual pages and become familiar with them, as they can save you a ton of work when used properly. (Let's not worry about the time and effort spent on debugging complex invocations. We all write perfect code the first time.)
They are also a good example of Unix tools that are designed to be combined in powerful ways to achieve things that might otherwise require writing a lot of custom code.