hn-classics/_stories/2008/13249834.md

642 lines
19 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
created_at: '2016-12-24T14:17:27.000Z'
title: Set Operations in the Unix Shell (2008)
url: http://www.catonmat.net/blog/set-operations-in-unix-shell/
author: polygot
points: 71
story_text:
comment_text:
num_comments: 10
story_id:
story_title:
story_url:
parent_id:
created_at_i: 1482589047
_tags:
- story
- author_polygot
- story_13249834
objectID: '13249834'
year: 2008
---
![set
operations](http://www.catonmat.net/blog/wp-content/uploads/2008/10/set-union-intersect-complement.jpg)
A while ago I wrote about how I solved the [Google Treasure Hunt Puzzle
Nr. 4](http://www.catonmat.net/blog/solving-google-treasure-hunt-prime-number-problem-four/)
about prime numbers. I took an unusual approach and solved this problem
entirely from the Unix shell. The solution involved finding the
intersection between a bunch of files containing numbers. This lead me
to an idea to write a post about how to do various set operations from
the shell by using common utilities such as sort, uniq, diff, grep,
head, tail, comm, and others.
I'll cover the following set operations in this article:
- **Set Membership**. Test if an element belongs to a set.
- **Set Equality**. Test if two sets contain the same elements.
- **Set Cardinality**. Return the number of elements in the set.
- **Subset Test**. Test if a given set is a subset of another set.
- **Set Union**. Find union of two sets.
- **Set Intersection**. Find intersection of two sets.
- **Set Complement**. Given two sets **A** and **B**, find all
elements in **A** that are not in **B**.
- **Set Symmetric Difference**. Find symmetric difference of two sets.
- **Power Set**. Generate all subsets of a set.
- **Set Cartesian Product**. Find **A** x **B**.
- **Disjoint Set Test**. Test if two sets are disjoint.
- **Empty Set Test**. Test if a given set is empty.
- **Minimum**. Find the smallest element of a set.
- **Maximum**. Find the largest element of a set.
**Update**: I wrote [another
post](http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/)
about these operations and created a cheat sheet.
Download cheat sheet: [set operations in unix shell
(.txt)](/download/setops.txt "Download \"set operations in unix shell (.txt)\"")
To illustrate these operations, I created a few random sets to work
with. Each set is represented as a file with one element per line. The
elements are positive numbers.
First I created two sets **A** and **B** with 5 elements each so that I
could easily check that the operations really work.
Sets **A** and **B** are hand crafted. It's easy to see that only
elements 1, 2 and 3 are in common:
$ cat A $ cat B
3 11
5 1
1 12
2 3
4 2
I also created a set **Asub** which is a subset of set **A** and
**Anotsub** which is not a subset of A (to test the Subset Test
operation):
$ cat Asub $ cat Anotsub
3 6
2 7
5 8
Next I created two equal sets **Aequal** and **Bequal** again with 5
elements each:
$ cat Aequal $ cat Bequal
103 100
102 101
101 102
104 103
100 104
Then I created two huge sets **Abig** and **Bbig** with 100,000 elements
(some of them are repeated, but that's ok).
The easiest way to generate sets Abig and Bbig is to take natural
numbers from /dev/urandom. There are two shell commands that can easily
do that. The first is "**od**" and the second is "**hexdump**".
Here is how to create two files with 100,000 natural numbers with both
commands.
With hexdump:
$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Abig
$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Bbig
The "-e" switch specifies a hand-crafted output format. It says take 1
element of size 4 bytes and output it as an unsigned integer. The "-n"
switch specifies how many bytes to read, in this case 400000 (400000
bytes / 4 bytes per element = 100000 elements).
With od:
$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Abig
$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Bbig
The "-An" switch specifies that no line address is necessary. The "-w4"
switch specifies number of bytes to output per line. The "-tu4" says to
output unsigned 4-byte numbers and "-N400000" limits the output to
400000 bytes (400000/4 = 100000 elements). The output from od has to be
filtered through sed to drop the leading whitespace characters.
Okay, now let's look at various set operations.
## Set Membership
The set membership operation tests if an element belongs to a set. We
write **a****A**, if element **a** belongs to set **A**, and we write
**a****A**, if it does not.
The easiest way to test if an element is in a set is to use "**grep**"
command. Grep searches the file for lines matching a pattern:
$ grep -xc 'element' set
The "-c" flag outputs number of elements in the set. If it is not a
multi-set, the number of elements should be 0 or 1. The "-x" option
specifies to match the whole line only (no partial matches).
Here is an example of this operation run on set A:
$ grep -xc '4' A
1
$ grep -xc '999' A
0
That's correct. Set A contains element 4 but does not contain element
999.
If the membership operation has to be used from a shell script, the
return code from grep can be used instead. Unix commands succeed if the
return code is 0, and fail otherwise:
$ grep -xq 'element' set
# returns 0 if element ∈ set
# returns 1 if element ∉ set
The "-q" flag makes sure that grep does not output the element if it is
in the set.
## Set Equality
The set equality operation tests if two sets are the same, i.e., contain
the same elements. We write **A** = **B** if sets **A** and **B** are
equal and **A****B** if they are not.
The easiest way to test if two sets are equal is to use "**diff**"
command. Diff command compares two files for differences. It will find
that the order of lines differ, so the files have to be sorted first. If
they are multi-sets, the output of sort has to be run through "uniq"
command to eliminate duplicate elements:
$ diff -q <(sort set1 | uniq) <(sort set2 | uniq)
# returns 0 if set1 = set2
# returns 1 if set1 ≠ set2
The "-q" flag quiets the output of diff command.
Let's test this operation on sets A, B, Aequal and Bequal:
$ diff -q <(sort A | uniq) <(sort B | uniq)
# return code 1 -- sets A and B are not equal
$ diff -q <(sort Aequal | uniq) <(sort Bequal | uniq)
# return code 0 -- sets A and B are equal
If you have already sorted sets, then just run:
$ diff -q set1 set2
## Set Cardinality
The set cardinality operations returns the number of elements in the
set. We write |**A**| to denote the cardinality of the set **A**.
The simplest way to count the number of elements in a set is to use
"**wc**" command. Wc command counts the number of characters, words or
lines in a file. Since each element in the set appears on a new line,
counting the number of lines in the file will return the cardinality of
the set:
$ wc -l set | cut -d' ' -f1
Cut command is necessary because "wc -l" also outputs the name of the
file it was ran on. The cut command outputs the first field which is
number of lines in the file.
We can actually get rid of cut:
$ wc -l < set
Let's test if on sets A and Abig:
$ wc -l A | cut -d' ' -f1
5
$ wc -l Abig | cut -d' ' -f1
100000
$ wc -l < A
5
$ wc -l < Abig
100000
## Subset Test
The subset test tests if the given set is a subset of another set. We
write **S** **A** if **S** is a subset of **A**, and **S** **A**, if
it's not.
I found a very easy way to do it using the "**comm**" utility. Comm
compares two sorted files line by line. It may be run in such a way that
it outputs lines that appear only in the first specified file. If the
first file is subset of the second, then all the lines in the 1st file
also appear in the 2nd, so no output is produced:
$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
# comm returns no output if subset set
# comm outputs something if subset set
Please remember that if you have a numeric set, then sort must take "-n"
option.
Let's test if Asub is a subset of A:
$ comm -23 <(sort -n Asub|uniq) <(sort -n A|uniq) | head -1
# no output - yes, Asub A
Now let's test if Anotsub is a subset of A:
$ comm -23 <(sort -n Anotsub|uniq) <(sort -n A|uniq) | head -1
6 # has output - no, Anotsub A
If you want to use it from a shell script, you'd have to test if the
output from this command was empty or not.
## Set Union
The set union operation unions two sets, i.e., join them into one set.
We write **C** = **A** **B** to denote union of sets **A** and **B**
which produces set **C**.
Set union is extremely easy to create. Just use the "**cat**" utility to
concatenate two files:
$ cat set1 set2
If the duplicates (elements which are both in set1 and set2) are not
welcome, then the output of cat can be filtered via **awk**:
```
$ cat set1 set2 | awk '!found[$1]++'
# we can also get rid of cat by just using awk:
$ awk '!found[$1]++' set1 set2
```
If we don't want to use awk, which is a whole-blown programming
language, then we can **sort** the output of cat and filter it via
**uniq**:
$ cat set1 set2 | sort | uniq
# we can get rid of cat by specifying arguments to sort:
$ sort set1 set2 | uniq
# finally we can get rid of uniq by specifying -u flag to sort
$ sort -u set1 set2
If the sets set1 and set2 are already sorted, then the union operation
can be made much faster by specifying the "-m" command line option,
which merges the files (like the final step of merge-sort algorithm):
$ sort -m set1 set2 | uniq
# or
$ set -um set1 set2
Let's test this operation on sets A and B:
$ cat A B # with duplicates
3
5
1
2
4
11
1
12
3
2
$ awk '!found[$1]++' # without dupes
3
5
1
2
4
11
12
$ sort -n A B | uniq # with sort && uniq
1
2
3
4
5
11
12
## Set Intersection
The set intersection operation finds elements that are in both sets at
the same time. We write **C** = **A** **B** to denote the intersection
of sets **A** and **B**, which produces the set **C**.
There are many ways to do set intersection. The first way that I am
going to show you uses "comm":
$ comm -12 <(sort set1) <(sort set2)
The "-12" option to comm directs it to suppress output of lines
appearing just in the 1st and the 2nd file and makes it output lines
appearing in both 1st and 2nd, which is the intersection of two sets.
Please remember that if you have a numeric set, then sort must take "-n"
option.
Another way to do it is to use "grep" utility. I actually found about
this method as I was writing this article:
$ grep -xF -f set1 set2
The "-x" option forces grep to match the whole lines (no partial
matches). The "-f set1" specifies the patterns to use for searching. The
"-F" option makes grep interpret the given patterns literally (no
regexes). It works by matching all lines of set1 in set2. The lines that
appear just in set1 or just in set2 are never output.
The next way to find intersection is by using "sort" and "uniq":
$ sort set1 set2 | uniq -d
The "-d" option to uniq forces it to print only the duplicate lines.
Obviously, if a line appears in set1 and set2, after sorting there will
be two consecutive equal lines in the output. The "uniq -d" command
prints such repeated lines (but only 1 copy of it), thus it's the
intersection operation.
Just a few minutes before publishing this article I found another way to
do intersection with "join" command. Join command joins files on a
common field:
$ join <(sort -n A) <(sort -n B)
Here is a test run:
$ sort -n A B | uniq -d
1
2
3
$ grep -xF -f A B
1
3
2
$ comm -12 <(sort -n A) <(sort -n B)
1
2
3
## Set Complement
The set complement operation finds elements that are in one set but not
the other. We write **A** - **B** or **A** \\ **B** to denote set's
**B** complement in set **A**.
**Comm** has become a pretty useful command for operating on sets. It
can be applied to implement set complement operation as well:
$ comm -23 <(sort set1) <(sort set2)
The option "-23" specifies that comm should not print elements that
appear just in set2 and that are common to both. It leaves comm to print
elements which are just in set1 (and not in set2).
The "**grep**" command can also be used to implement this operation:
$ grep -vxF -f set2 set1
Notice that the order of sets has been reversed from that of comm.
That's because we are searching those elements in set1, which are not in
set2.
Another way to do it is, of course, with "**sort**" and "**uniq**":
$ sort set2 set2 set1 | uniq -u
This is a pretty tricky command. Suppose that a line appears in set1 but
does not appear in set2. Then it will be output just once and will not
get removed by uniq. All other lines get removed.
Let's put these commands to test:
$ comm -23 <(sort -n A) <(sort -n B)
4
5
$ grep -vxF -f B A
5
4
$ sort -n B B A | uniq -u
4
5
## Set Symmetric Difference
The set symmetric difference operation finds elements that are in one
set, or in the other but not both. We write **A** Δ **B** to denote
symmetric difference of sets **A** and **B**.
The operation can be implemented very easily with "**comm**" utility:
$ comm -3 <(sort set1) <(sort set2) | sed 's/\t//g'
# sed can be replaced with tr
$ comm -3 <(sort set1) <(sort set2) | tr -d '\t'
Here comm is instructed via "-3" not to output fields that are common to
both files, but to output fields that are just in set1 and just in set2.
Sed is necessary because comm outputs two columns of data and some of it
is right padded with a \\t tab character.
It can also be done with "**sort**" and "**uniq**":
$ sort set1 set2 | uniq -u
We can use mathematics and derive a few formulas involving previously
used operations for symmetric difference: **A** Δ **B** = (**A** -
**B**) (**B** - **A**). Now we can use grep:
$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)
# does (B - A) (A - B)
# this can be simplified
$ grep -vxF -f set1 set2; grep -vxF -f set2 set1
Let's test it:
$ comm -3 <(sort -n A) >(sort -n B) | sed 's/\t//g'
11
12
4
5
$ sort -n A B | uniq -u
4
5
11
12
$ cat <(grep -vxF -f B A) <(grep -vxF -f A B)
5
4
11
12
## Power Set
The power set operation generates a power-set of a set. What's a power
set? It's a set that contains all subsets of the set. We write P(**A**)
or 2**A** to denote all subsets of A. For a set with n elements, the
power set contains 2n elements.
For example, the power-set of the set { a, b, c } contains 23 = 8
elements. The power-set is { {}, {a}, {b}, {c}, {a, b}, {a, c}, {b, c},
{a, b, c} }.
It's not easy to do that with simple Unix tools. I could not think of
anything better than a silly Perl solution:
$ perl -le '
sub powset {
return [[]] unless @_;
my $head = shift;
my $list = &powset;
[@$list, map { [$head, @$_] } @$list]
}
chomp(my @e = <>);
for $p (@{powset(@e)}) {
print @$p;
}' set
Can you think of a way to do it with Unix tools?
## Set Cartesian Product
The set Cartesian product operation produces produces a new set that
contains all possible pairs of elements from one set and the other. The
notation for Cartesian product of sets **A** and **B** is **A** x **B**.
For example, if set **A** = { a, b, c } and set **B** = { 1, 2 } then
the Cartesian product **A** x **B** = { (a, 1), (a, 2), (b, 1), (b, 2),
(c, 1), (c, 2) }.
I can't think of a great solution. I have a very silly solution in
bash:
$ while read a; do while read b; do echo "$a, $b"; done < set1; done < set2
Can you think of other solutions?
## Disjoint Set Test
The disjoint set test operation finds if two sets are disjoint, i.e.,
they do not contain common elements.
Two sets are disjoint if their intersection is the empty set. Any of the
set intersection commands (mentioned earlier) can be applied on the sets
and the output can be tested for emptiness. If it is empty, then the
sets are disjoint, if it is not, then the sets are not disjoint.
Another way to test if two sets are disjoint is to use awk:
$ awk '{ if (++seen[$0]==2) exit 1 }' set1 set2
# returns 0 if sets are disjoint
# returns 1 if sets are not disjoint
It works by counting seen elements in set1 and then set2. If any of the
elements appear both in set1 and set2, seen count for that element would
be 2 and awk would quit with exit code 1.
## Empty Set Test
The empty set test tests if the set is empty, i.e., contains no
elements. The empty set is usually written as Ø.
It's very easy to test if the set is empty. The cardinality of an empty
set is 0:
$ wc -l set | cut -d' ' -f1
# outputs 0 if the set is empty
# outputs > 0 if the set is not empty
Getting rid of cut:
$ wc -l < set
# outputs 0 if the set is empty
# outputs > 0 if the set is not empty
## Minimum
The minimum operation returns the smallest number in the set. We write
min(**A**) to denote the minimum operation on the set **A**.
The minimum element of a set can be found by first sorting it in
ascending order and then taking the first element. The first element can
be taken with "head" Unix command which outputs the first part of the
file:
$ head -1 <(sort set)
The "-1" option specifies to output the first line only.
If the set is already sorted, then it's even simpler:
$ head -1 set
Remember to use "sort -n" command if the set contains numeric data.
Example of running minimum operation on sets A and Abig:
$ head -1 <(sort -n A)
1
$ head -1 <(sort -n Abig)
2798
## Maximum
The maximum operation returns the biggest number in the set. We write
max(**A**) to denote the maximum operation on the set **A**.
The maximum element of a set can be found by first sorting it in
ascending order and then taking the last element. The last element can
be taken with "tail" Unix command which outputs the last part of the
file:
$ tail -1 <(sort set)
The "-1" option specifies to output the last line only.
If the set is already sorted, then it's even simpler:
$ tail -1 set
Remember to use "sort -n" command if the set contains numeric data.
Example of running maximum operation on sets A and Abig:
$ tail -1 <(sort -n A)
5
$ head -1 <(sort -n Abig)
4294906714
## Have Fun\!
Have fun working with these set operations\! Thanks to lhunath and
waldner from \#bash for helping. :)