cut

cut(1) is used to extract data from delimiter-separated fields. This can be useful for processing tabular data.

$ cat >testdata <<EOF
1,2,3
4,5,6
7,8,9
EOF
$ cut -d, -f3 testdata
3
6
9

The -d option specifies the delimiter to use, defaulting to tab. The -f option specifies which fields to include; this can be a comma-separated list of individual fields or ranges.

$ cat >testdata <<EOF
1:2:3:4
5:6:7:8
9:0:a:b
c:d:e:f
EOF
$ cut -d: -f1,3-4 testdata
1:3:4
5:7:8
9:a:b
c:e:f
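
Open-ended ranges work too; -f2- selects from the second field to the end of the line.

$ cut -d: -f2- testdata
2:3:4
6:7:8
0:a:b
d:e:f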

When combined with head(1) and tail(1), it is possible to extract data by row too.

$ head -n 3 testdata | tail -n 2 | cut -d: -f2-3
6:7
0:a
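
If sed(1) is to hand, the same rows can be selected in a single step:

$ sed -n '2,3p' testdata | cut -d: -f2-3
6:7
0:a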

A space can be used as the delimiter too, but cut(1) treats every delimiter as separating a field, so runs of spaces produce empty fields.

$ cat >testdata <<EOF
a  b c
EOF
$ cut -d ' ' -f1-2 <testdata #expects a b
a 

If you want different behaviour, a different tool is required, such as the shell's read built-in or awk(1).

$ read a b c <testdata
$ echo "$a $b"
a b

$ awk <testdata '{ print $1 $2 }'
ab
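
awk's print concatenates adjacent arguments; separating them with commas inserts the output field separator, a space by default, which recovers the output we originally wanted:

$ awk <testdata '{ print $1, $2 }'
a b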

It is not possible to define a character sequence as a delimiter.

$ cat >testdata <<EOF
a->b->c
EOF
$ cut -d '->' -f 2 testdata
cut: the delimiter must be a single character
Try `cut --help' for more information.

For this, more complicated tools need to be used.

$ awk <testdata '{ split($0, A, /->/); print A[2] }'
b

paste

paste(1) joins two delimiter-separated files together, line by line. This can be used to move the columns of a file around.

$ cat >testdata1 <<EOF
1,2
3,4
EOF
$ cat >testdata2 <<EOF
a,b
c,d
EOF
$ paste -d, testdata1 testdata2
1,2,a,b
3,4,c,d
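
paste(1) can also work serially; with -s it joins all the lines of a single file into one:

$ paste -s -d, testdata1
1,2,3,4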

When combined with cut(1), fields can be rearranged.

$ cat >testdata <<EOF
1:a:e:5
2:b:f:6
3:c:g:7
4:d:h:8
EOF
$ paste -d: <(cut -d: -f1,4 testdata) <(cut -d: -f2,3 testdata)
1:5:a:e
2:6:b:f
3:7:c:g
4:8:d:h
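
The <(...) constructs are process substitution, a feature of bash and zsh that presents a command's output as if it were a file.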

join

join(1) merges two files that share a common field. Lines with the same value in that field are combined, similar to paste(1).

$ cat >names <<EOF
1:Richard
2:Jonathan
3:Zwingbor the terrible of planet Flarg
EOF
$ cat >colours <<EOF
1:Red
2:Blue
3:Putrescent Green
EOF
$ join -t: -j1 names colours
1:Richard:Red
2:Jonathan:Blue
3:Zwingbor the terrible of planet Flarg:Putrescent Green
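
join(1) expects both inputs to be sorted on the join field. The files above are already in order; if they were not, they could be sorted on the fly with process substitution:

$ join -t: -j1 <(sort names) <(sort colours)
1:Richard:Red
2:Jonathan:Blue
3:Zwingbor the terrible of planet Flarg:Putrescent Green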

split

split(1) splits a file into multiple smaller files. This can be useful for splitting up text-based data when files become too long.

$ seq 1000 >lines
$ split -e -n2 lines split-lines-
$ wc -l split-lines-aa
513 split-lines-aa
$ wc -l split-lines-ab
487 split-lines-ab
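
The two counts sum to 1000, but -n2 splits at byte offsets, so the line at the boundary has actually been cut in half across the two files. If lines must be kept whole, the -l option splits after a fixed number of lines instead:

$ split -l 500 lines split-lines-
$ wc -l split-lines-aa
500 split-lines-aa
$ wc -l split-lines-ab
500 split-lines-ab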

fold and fmt

fold(1) is a primitive tool to wrap lines. The width to wrap at can be specified with -w.

$ cat >text <<EOF
hello world, I am a long line, that needs to be shortened
EOF
$ fold -w 40 text
hello world, I am a long line, that need
s to be shortened

Where lines are broken can be tweaked: -s will break lines at spaces, rather than in the middle of words.

$ fold -s -w 40 text
hello world, I am a long line, that 
needs to be shortened

fmt(1) is a more advanced tool for wrapping lines. As well as splitting lines at whitespace when they become too long, it will re-flow paragraphs when the lines are too short.

$ cat >text <<EOF
Hello world.
I am text.
I need to be a paragraph.
EOF
$ fmt text
Hello world.  I am text.  I need to be a paragraph.

nl

nl(1) puts line numbers before each line in a file (by default only non-blank lines are numbered).

$ cat >text <<EOF
Hello
World
EOF
$ nl text
     1  Hello
     2  World

sort and uniq

sort(1) can be used to re-order the lines of a file based on various criteria, defaulting to lexical order (ASCIIbetical in the C locale).

$ cat >data <<EOF
2:Jonathan:Blue
1:Richard:Red
3:Zwingbor the terrible of planet Flarg:Putrescent Green
EOF
$ sort data
1:Richard:Red
2:Jonathan:Blue
3:Zwingbor the terrible of planet Flarg:Putrescent Green

sort(1) can sort by field.

$ sort -t: -k3 data
2:Jonathan:Blue
3:Zwingbor the terrible of planet Flarg:Putrescent Green
1:Richard:Red

The sort order can be reversed.

$ sort -r data
3:Zwingbor the terrible of planet Flarg:Putrescent Green
2:Jonathan:Blue
1:Richard:Red
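
The default comparison is lexical, which trips up on numbers of differing lengths; -n selects numeric comparison instead:

$ printf '10\n9\n2\n' | sort
10
2
9
$ printf '10\n9\n2\n' | sort -n
2
9
10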

uniq(1) removes duplicate lines. Its algorithm expects the data to be sorted, since it only removes consecutive, identical lines.

$ cat >data <<EOF
1
1
1
2
3
EOF
$ uniq data
1
2
3

Since data is rarely sorted, this usually means that the command you need to run is sort data | uniq.

$ cat >data <<EOF
1
2
1
5
6
1
2
5
EOF
$ sort data | uniq
1
2
5
6
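
uniq(1) can also count how many times each line occurred with -c, making sort data | uniq -c a quick way to build a frequency table (the exact spacing of the counts varies between implementations):

$ sort data | uniq -c
      3 1
      2 2
      2 5
      1 6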

However, since this is such a common operation, and executing a separate subprocess would be wasteful, GNU's sort(1) accepts a -u parameter which does this.

$ sort -u data
1
2
5
6

comm

comm(1) will tell you which lines are _comm_on between two sorted files.

$ cat >file1 <<EOF
1
2
3
EOF
$ cat >file2 <<EOF
2
3
4
EOF
$ comm file1 file2
1
        2
        3
    4

The first column is lines unique to the first file, the second column is lines unique to the second, and the third is lines common to both.

The options of comm(1) are a little unusual: -1, -2, and -3 each remove the corresponding column, rather than select it. Since the common operation is to keep just the one column you are interested in, the flags are usually combined; passing -12 removes the first two columns, leaving only the lines common to both files.

$ comm -12 file1 file2
2
3
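
Similarly, -13 leaves only the lines unique to the second file:

$ comm -13 file1 file2
4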

Hi, thanks for this great blog!

Regarding the cut example: the last line says "For this more complicated tools need to be used. $ awk /); print A[2] }' b", I think something is missing. Regarding the sort/uniq paragraph: the uniq option I find most useful is "-c" for counting occurrences.

Thank you again, Riccardo

Comment by riccio Sat Nov 18 10:58:40 2017

Thanks for your comment.

Yes it's missing content because of a formatting error in the source (there should be an empty line before the code block). The output that it was supposed to have is:

$ awk <testdata '{ split($0, A, /->/); print A[2] }'
b

I'll get it fixed.

Comment by Richard Maw Mon Nov 20 13:33:51 2017