cut
cut(1) is used to extract data from delimiter-separated fields. This can be useful for processing tabulated data.
$ cat >testdata <<EOF
1,2,3
4,5,6
7,8,9
EOF
$ cut -d, -f3 testdata
3
6
9
The -d option specifies which delimiter to use, defaulting to tab. The -f option specifies which fields to include; this can be a list of fields or ranges.
$ cat >testdata <<EOF
1:2:3:4
5:6:7:8
9:0:a:b
c:d:e:f
EOF
$ cut -d: -f1,3-4 testdata
1:3:4
5:7:8
9:a:b
c:e:f
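cut(1) can also slice by character position rather than by field, using -c, which takes the same list syntax:
$ cut -c1-3 testdata
1:2
5:6
9:0
c:d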
When combined with head and tail, it is possible to extract data by row too.
$ head -n 3 testdata | tail -n 2 | cut -d: -f2-3
6:7
0:a
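sed(1) can do the same row selection in a single process, using its line addressing:
$ sed -n '2,3p' testdata | cut -d: -f2-3
6:7
0:a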
Spaces can be used as delimiters too, but cut(1) only supports formats that have exactly one delimiter per field.
$ cat >testdata <<EOF
a  b c
EOF
$ cut -d ' ' -f1-2 <testdata #expects a b
a
If you want different behaviour, a different tool is required, such as the shell's read built-in, or awk(1).
$ read a b c <testdata
$ echo "$a $b"
a b
$ awk <testdata '{ print $1 $2 }'
ab
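read splits on the characters in IFS, so by changing IFS it copes with other delimiters too. A quick sketch using bash's <<< here-string (the variable names are just for illustration):
$ IFS=: read x y z <<<'1:2:3'
$ echo "$y"
2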
It is not possible to define a character sequence as a delimiter.
$ cat >testdata <<EOF
a->b->c
EOF
$ cut -d'->' -f2 testdata
cut: the delimiter must be a single character
Try `cut --help' for more information.
For this, more complicated tools need to be used.
$ awk <testdata '{ split($0, A, /->/); print A[2] }'
b
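In this case GNU awk's -F option is simpler still, since a multi-character field separator is treated as a regular expression:
$ awk -F'->' '{ print $2 }' testdata
b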
paste
paste(1) joins two delimiter-separated files together, line by line. This can be used to move columns of a file around.
$ cat >testdata1 <<EOF
1,2
3,4
EOF
$ cat >testdata2 <<EOF
a,b
c,d
EOF
$ paste -d, testdata1 testdata2
1,2,a,b
3,4,c,d
When combined with cut(1), fields can be rearranged.
$ cat >testdata <<EOF
1:a:e:5
2:b:f:6
3:c:g:7
4:d:h:8
EOF
$ paste -d: <(cut -d: -f1,4 testdata) <(cut -d: -f2,3 testdata)
1:5:a:e
2:6:b:f
3:7:c:g
4:8:d:h
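paste(1) is also useful on a single file: with -s it joins all the lines of one input into a single line, turning a column into a row:
$ seq 3 | paste -sd, -
1,2,3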
join
join(1) merges two files that share a field. Lines with the same value for that field are combined like paste(1).
$ cat >names <<EOF
1:Richard
2:Jonathan
3:Zwingbor the terrible of planet Flarg
EOF
$ cat >colours <<EOF
1:Red
2:Blue
3:Putrescent Green
EOF
$ join -t: -j1 names colours
1:Richard:Red
2:Jonathan:Blue
3:Zwingbor the terrible of planet Flarg:Putrescent Green
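Note that join(1) expects both inputs to be sorted on the join field. Unpairable lines are dropped by default; -a keeps them, giving something like a left join. A sketch, with a hypothetical sizes file that has no entry for key 2:
$ cat >sizes <<EOF
1:Small
3:Huge
EOF
$ join -t: -a 1 names sizes
1:Richard:Small
2:Jonathan
3:Zwingbor the terrible of planet Flarg:Huge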
split
split(1) splits a file into multiple smaller files. This could be useful for splitting up text-based data when files become too long.
$ seq 1000 >lines
$ split -e -n2 lines split-lines-
$ wc -l split-lines-aa
513 split-lines-aa
$ wc -l split-lines-ab
487 split-lines-ab
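Splitting with -n divides the input by bytes, so a line can end up cut in half, which is why the counts above are uneven. If lines must stay intact, -l sets a fixed number of lines per output file instead:
$ split -l 300 lines part-
$ wc -l part-*
 300 part-aa
 300 part-ab
 300 part-ac
 100 part-ad
1000 total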
fold and fmt
fold(1) is a primitive tool to wrap lines. The width to wrap at can be specified with -w.
$ cat >text <<EOF
hello world, I am a long line, that needs to be shortened
EOF
$ fold -w 40 text
hello world, I am a long line, that need
s to be shortened
When to break a line can be tweaked: -s will break lines at spaces, rather than in the middle of words.
$ fold -s -w 40 text
hello world, I am a long line, that
needs to be shortened
fmt(1) is a more advanced tool for wrapping lines. As well as splitting lines at whitespace when they become too long, it will re-flow paragraphs when the lines are too short.
$ cat >text <<EOF
Hello world.
I am text.
I need to be a paragraph.
EOF
$ fmt text
Hello world. I am text. I need to be a paragraph.
nl
nl(1) puts line numbers before each line in a file.
$ cat >text <<EOF
Hello
World
EOF
$ nl text
     1  Hello
     2  World
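By default nl(1) numbers only non-empty lines; -ba makes it number every line in the body:
$ cat >text <<EOF
Hello

World
EOF
$ nl -ba text
     1  Hello
     2
     3  World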
sort and uniq
sort(1) can be used to re-order the lines of a file based on various criteria, defaulting to ASCIIbetical.
$ cat >data <<EOF
2:Jonathan:Blue
1:Richard:Red
3:Zwingbor the terrible of planet Flarg:Putrescent Green
EOF
$ sort data
1:Richard:Red
2:Jonathan:Blue
3:Zwingbor the terrible of planet Flarg:Putrescent Green
sort(1) can sort by field.
$ sort -t: -k3 data
2:Jonathan:Blue
3:Zwingbor the terrible of planet Flarg:Putrescent Green
1:Richard:Red
The sort order can be reversed.
$ sort -r data
3:Zwingbor the terrible of planet Flarg:Putrescent Green
2:Jonathan:Blue
1:Richard:Red
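One gotcha: ASCIIbetical order compares character by character, so 10 sorts before 9. The -n flag compares numerically instead:
$ printf '10\n9\n2\n' | sort
10
2
9
$ printf '10\n9\n2\n' | sort -n
2
9
10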
uniq(1) removes duplicate lines. Its algorithm expects the data to be sorted, since it removes consecutive, identical lines.
$ cat >data <<EOF
1
1
1
2
3
EOF
$ uniq data
1
2
3
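uniq(1) can also count how many times each line occurs, with -c:
$ uniq -c data
      3 1
      1 2
      1 3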
Since data is rarely sorted, this usually means that the command you need to run is sort data | uniq.
$ cat >data <<EOF
1
2
1
5
6
1
2
5
EOF
$ sort data | uniq
1
2
5
6
However, since this is such a common operation, and executing a separate subprocess would be wasteful, GNU's sort(1) accepts a -u parameter which does this.
$ sort -u data
1
2
5
6
comm
comm(1) will tell you which lines are _comm_on between two sorted files.
$ cat >file1 <<EOF
1
2
3
EOF
$ cat >file2 <<EOF
2
3
4
EOF
$ comm file1 file2
1
		2
		3
	4
The first column is lines unique to the first file, the second column is lines unique to the second, and the third is lines common to both.
The options of comm(1) are a little odd. You can pass -1, -2, or -3 to remove that column. This is a bit of an odd decision, given that the common operation is to use the flags to keep only the one column you are interested in. So you would pass -12 to get only the lines that are common to both files.
$ comm -12 file1 file2
2
3
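The same idea gives a set difference: -23 leaves only the lines unique to the first file.
$ comm -23 file1 file2
1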