2018-03-03 09:35:28 +00:00
|
|
|
|
---
|
|
|
|
|
created_at: '2016-12-24T14:17:27.000Z'
|
|
|
|
|
title: Set Operations in the Unix Shell (2008)
|
|
|
|
|
url: http://www.catonmat.net/blog/set-operations-in-unix-shell/
|
|
|
|
|
author: polygot
|
|
|
|
|
points: 71
|
|
|
|
|
story_text:
|
|
|
|
|
comment_text:
|
|
|
|
|
num_comments: 10
|
|
|
|
|
story_id:
|
|
|
|
|
story_title:
|
|
|
|
|
story_url:
|
|
|
|
|
parent_id:
|
|
|
|
|
created_at_i: 1482589047
|
|
|
|
|
_tags:
|
|
|
|
|
- story
|
|
|
|
|
- author_polygot
|
|
|
|
|
- story_13249834
|
|
|
|
|
objectID: '13249834'
|
2018-06-08 12:05:27 +00:00
|
|
|
|
year: 2008
|
2018-03-03 09:35:28 +00:00
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
![set
|
|
|
|
|
operations](http://www.catonmat.net/blog/wp-content/uploads/2008/10/set-union-intersect-complement.jpg)
|
|
|
|
|
|
|
|
|
|
A while ago I wrote about how I solved the [Google Treasure Hunt Puzzle
|
|
|
|
|
Nr. 4](http://www.catonmat.net/blog/solving-google-treasure-hunt-prime-number-problem-four/)
|
|
|
|
|
about prime numbers. I took an unusual approach and solved this problem
|
|
|
|
|
entirely from the Unix shell. The solution involved finding the
|
|
|
|
|
intersection between a bunch of files containing numbers. This lead me
|
|
|
|
|
to an idea to write a post about how to do various set operations from
|
|
|
|
|
the shell by using common utilities such as sort, uniq, diff, grep,
|
|
|
|
|
head, tail, comm, and others.
|
|
|
|
|
|
|
|
|
|
I'll cover the following set operations in this article:
|
|
|
|
|
|
|
|
|
|
- **Set Membership**. Test if an element belongs to a set.
|
|
|
|
|
- **Set Equality**. Test if two sets contain the same elements.
|
|
|
|
|
- **Set Cardinality**. Return the number of elements in the set.
|
|
|
|
|
- **Subset Test**. Test if a given set is a subset of another set.
|
|
|
|
|
- **Set Union**. Find union of two sets.
|
|
|
|
|
- **Set Intersection**. Find intersection of two sets.
|
|
|
|
|
- **Set Complement**. Given two sets **A** and **B**, find all
|
|
|
|
|
elements in **A** that are not in **B**.
|
|
|
|
|
- **Set Symmetric Difference**. Find symmetric difference of two sets.
|
|
|
|
|
- **Power Set**. Generate all subsets of a set.
|
|
|
|
|
- **Set Cartesian Product**. Find **A** x **B**.
|
|
|
|
|
- **Disjoint Set Test**. Test if two sets are disjoint.
|
|
|
|
|
- **Empty Set Test**. Test if a given set is empty.
|
|
|
|
|
- **Minimum**. Find the smallest element of a set.
|
|
|
|
|
- **Maximum**. Find the largest element of a set.
|
|
|
|
|
|
|
|
|
|
**Update**: I wrote [another
|
|
|
|
|
post](http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/)
|
|
|
|
|
about these operations and created a cheat sheet.
|
|
|
|
|
Download cheat sheet: [set operations in unix shell
|
|
|
|
|
(.txt)](/download/setops.txt "Download \"set operations in unix shell (.txt)\"")
|
|
|
|
|
|
|
|
|
|
To illustrate these operations, I created a few random sets to work
|
|
|
|
|
with. Each set is represented as a file with one element per line. The
|
|
|
|
|
elements are positive numbers.
|
|
|
|
|
|
|
|
|
|
First I created two sets **A** and **B** with 5 elements each so that I
|
|
|
|
|
could easily check that the operations really work.
|
|
|
|
|
|
|
|
|
|
Sets **A** and **B** are hand crafted. It's easy to see that only
|
|
|
|
|
elements 1, 2 and 3 are in common:
|
|
|
|
|
|
|
|
|
|
$ cat A $ cat B
|
|
|
|
|
3 11
|
|
|
|
|
5 1
|
|
|
|
|
1 12
|
|
|
|
|
2 3
|
|
|
|
|
4 2
|
|
|
|
|
|
|
|
|
|
I also created a set **Asub** which is a subset of set **A** and
|
|
|
|
|
**Anotsub** which is not a subset of A (to test the Subset Test
|
|
|
|
|
operation):
|
|
|
|
|
|
|
|
|
|
$ cat Asub $ cat Anotsub
|
|
|
|
|
3 6
|
|
|
|
|
2 7
|
|
|
|
|
5 8
|
|
|
|
|
|
|
|
|
|
Next I created two equal sets **Aequal** and **Bequal** again with 5
|
|
|
|
|
elements each:
|
|
|
|
|
|
|
|
|
|
$ cat Aequal $ cat Bequal
|
|
|
|
|
103 100
|
|
|
|
|
102 101
|
|
|
|
|
101 102
|
|
|
|
|
104 103
|
|
|
|
|
100 104
|
|
|
|
|
|
|
|
|
|
Then I created two huge sets **Abig** and **Bbig** with 100,000 elements
|
|
|
|
|
(some of them are repeated, but that's ok).
|
|
|
|
|
|
|
|
|
|
The easiest way to generate sets Abig and Bbig is to take natural
|
|
|
|
|
numbers from /dev/urandom. There are two shell commands that can easily
|
|
|
|
|
do that. The first is "**od**" and the second is "**hexdump**".
|
|
|
|
|
|
|
|
|
|
Here is how to create two files with 100,000 natural numbers with both
|
|
|
|
|
commands.
|
|
|
|
|
|
|
|
|
|
With hexdump:
|
|
|
|
|
|
|
|
|
|
$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Abig
|
|
|
|
|
$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Bbig
|
|
|
|
|
|
|
|
|
|
The "-e" switch specifies a hand-crafted output format. It says take 1
|
|
|
|
|
element of size 4 bytes and output it as an unsigned integer. The "-n"
|
|
|
|
|
switch specifies how many bytes to read, in this case 400000 (400000
|
|
|
|
|
bytes / 4 bytes per element = 100000 elements).
|
|
|
|
|
|
|
|
|
|
With od:
|
|
|
|
|
|
|
|
|
|
$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Abig
|
|
|
|
|
$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Bbig
|
|
|
|
|
|
|
|
|
|
The "-An" switch specifies that no line address is necessary. The "-w4"
|
|
|
|
|
switch specifies number of bytes to output per line. The "-tu4" says to
|
|
|
|
|
output unsigned 4-byte numbers and "-N400000" limits the output to
|
|
|
|
|
400000 bytes (400000/4 = 100000 elements). The output from od has to be
|
|
|
|
|
filtered through sed to drop the leading whitespace characters.
|
|
|
|
|
|
|
|
|
|
Okay, now let's look at various set operations.
|
|
|
|
|
|
|
|
|
|
## Set Membership
|
|
|
|
|
|
|
|
|
|
The set membership operation tests if an element belongs to a set. We
|
|
|
|
|
write **a** ∈ **A**, if element **a** belongs to set **A**, and we write
|
|
|
|
|
**a** ∉ **A**, if it does not.
|
|
|
|
|
|
|
|
|
|
The easiest way to test if an element is in a set is to use "**grep**"
|
|
|
|
|
command. Grep searches the file for lines matching a pattern:
|
|
|
|
|
|
|
|
|
|
$ grep -xc 'element' set
|
|
|
|
|
|
|
|
|
|
The "-c" flag outputs number of elements in the set. If it is not a
|
|
|
|
|
multi-set, the number of elements should be 0 or 1. The "-x" option
|
|
|
|
|
specifies to match the whole line only (no partial matches).
|
|
|
|
|
|
|
|
|
|
Here is an example of this operation run on set A:
|
|
|
|
|
|
|
|
|
|
$ grep -xc '4' A
|
|
|
|
|
1
|
|
|
|
|
$ grep -xc '999' A
|
|
|
|
|
0
|
|
|
|
|
|
|
|
|
|
That's correct. Set A contains element 4 but does not contain element
|
|
|
|
|
999.
|
|
|
|
|
|
|
|
|
|
If the membership operation has to be used from a shell script, the
|
|
|
|
|
return code from grep can be used instead. Unix commands succeed if the
|
|
|
|
|
return code is 0, and fail otherwise:
|
|
|
|
|
|
|
|
|
|
$ grep -xq 'element' set
|
|
|
|
|
# returns 0 if element ∈ set
|
|
|
|
|
# returns 1 if element ∉ set
|
|
|
|
|
|
|
|
|
|
The "-q" flag makes sure that grep does not output the element if it is
|
|
|
|
|
in the set.
|
|
|
|
|
|
|
|
|
|
## Set Equality
|
|
|
|
|
|
|
|
|
|
The set equality operation tests if two sets are the same, i.e., contain
|
|
|
|
|
the same elements. We write **A** = **B** if sets **A** and **B** are
|
|
|
|
|
equal and **A** ≠ **B** if they are not.
|
|
|
|
|
|
|
|
|
|
The easiest way to test if two sets are equal is to use "**diff**"
|
|
|
|
|
command. Diff command compares two files for differences. It will find
|
|
|
|
|
that the order of lines differ, so the files have to be sorted first. If
|
|
|
|
|
they are multi-sets, the output of sort has to be run through "uniq"
|
|
|
|
|
command to eliminate duplicate elements:
|
|
|
|
|
|
|
|
|
|
$ diff -q <(sort set1 | uniq) <(sort set2 | uniq)
|
|
|
|
|
# returns 0 if set1 = set2
|
|
|
|
|
# returns 1 if set1 ≠ set2
|
|
|
|
|
|
|
|
|
|
The "-q" flag quiets the output of diff command.
|
|
|
|
|
|
|
|
|
|
Let's test this operation on sets A, B, Aequal and Bequal:
|
|
|
|
|
|
|
|
|
|
$ diff -q <(sort A | uniq) <(sort B | uniq)
|
|
|
|
|
# return code 1 -- sets A and B are not equal
|
|
|
|
|
|
|
|
|
|
$ diff -q <(sort Aequal | uniq) <(sort Bequal | uniq)
|
|
|
|
|
# return code 0 -- sets A and B are equal
|
|
|
|
|
|
|
|
|
|
If you have already sorted sets, then just run:
|
|
|
|
|
|
|
|
|
|
$ diff -q set1 set2
|
|
|
|
|
|
|
|
|
|
## Set Cardinality
|
|
|
|
|
|
|
|
|
|
The set cardinality operations returns the number of elements in the
|
|
|
|
|
set. We write |**A**| to denote the cardinality of the set **A**.
|
|
|
|
|
|
|
|
|
|
The simplest way to count the number of elements in a set is to use
|
|
|
|
|
"**wc**" command. Wc command counts the number of characters, words or
|
|
|
|
|
lines in a file. Since each element in the set appears on a new line,
|
|
|
|
|
counting the number of lines in the file will return the cardinality of
|
|
|
|
|
the set:
|
|
|
|
|
|
|
|
|
|
$ wc -l set | cut -d' ' -f1
|
|
|
|
|
|
|
|
|
|
Cut command is necessary because "wc -l" also outputs the name of the
|
|
|
|
|
file it was ran on. The cut command outputs the first field which is
|
|
|
|
|
number of lines in the file.
|
|
|
|
|
|
|
|
|
|
We can actually get rid of cut:
|
|
|
|
|
|
|
|
|
|
$ wc -l < set
|
|
|
|
|
|
|
|
|
|
Let's test if on sets A and Abig:
|
|
|
|
|
|
|
|
|
|
$ wc -l A | cut -d' ' -f1
|
|
|
|
|
5
|
|
|
|
|
|
|
|
|
|
$ wc -l Abig | cut -d' ' -f1
|
|
|
|
|
100000
|
|
|
|
|
|
|
|
|
|
$ wc -l < A
|
|
|
|
|
5
|
|
|
|
|
|
|
|
|
|
$ wc -l < Abig
|
|
|
|
|
100000
|
|
|
|
|
|
|
|
|
|
## Subset Test
|
|
|
|
|
|
|
|
|
|
The subset test tests if the given set is a subset of another set. We
|
|
|
|
|
write **S** ⊆ **A** if **S** is a subset of **A**, and **S** ⊊ **A**, if
|
|
|
|
|
it's not.
|
|
|
|
|
|
|
|
|
|
I found a very easy way to do it using the "**comm**" utility. Comm
|
|
|
|
|
compares two sorted files line by line. It may be run in such a way that
|
|
|
|
|
it outputs lines that appear only in the first specified file. If the
|
|
|
|
|
first file is subset of the second, then all the lines in the 1st file
|
|
|
|
|
also appear in the 2nd, so no output is produced:
|
|
|
|
|
|
|
|
|
|
$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
|
|
|
|
|
# comm returns no output if subset ⊆ set
|
|
|
|
|
# comm outputs something if subset ⊊ set
|
|
|
|
|
|
|
|
|
|
Please remember that if you have a numeric set, then sort must take "-n"
|
|
|
|
|
option.
|
|
|
|
|
|
|
|
|
|
Let's test if Asub is a subset of A:
|
|
|
|
|
|
|
|
|
|
$ comm -23 <(sort -n Asub|uniq) <(sort -n A|uniq) | head -1
|
|
|
|
|
# no output - yes, Asub ⊆ A
|
|
|
|
|
|
|
|
|
|
Now let's test if Anotsub is a subset of A:
|
|
|
|
|
|
|
|
|
|
$ comm -23 <(sort -n Anotsub|uniq) <(sort -n A|uniq) | head -1
|
|
|
|
|
6 # has output - no, Anotsub ⊊ A
|
|
|
|
|
|
|
|
|
|
If you want to use it from a shell script, you'd have to test if the
|
|
|
|
|
output from this command was empty or not.
|
|
|
|
|
|
|
|
|
|
## Set Union
|
|
|
|
|
|
|
|
|
|
The set union operation unions two sets, i.e., join them into one set.
|
|
|
|
|
We write **C** = **A** ∪ **B** to denote union of sets **A** and **B**
|
|
|
|
|
which produces set **C**.
|
|
|
|
|
|
|
|
|
|
Set union is extremely easy to create. Just use the "**cat**" utility to
|
|
|
|
|
concatenate two files:
|
|
|
|
|
|
|
|
|
|
$ cat set1 set2
|
|
|
|
|
|
|
|
|
|
If the duplicates (elements which are both in set1 and set2) are not
|
|
|
|
|
welcome, then the output of cat can be filtered via **awk**:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
$ cat set1 set2 | awk '!found[$1]++'
|
|
|
|
|
|
|
|
|
|
# we can also get rid of cat by just using awk:
|
|
|
|
|
|
|
|
|
|
$ awk '!found[$1]++' set1 set2
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
If we don't want to use awk, which is a whole-blown programming
|
|
|
|
|
language, then we can **sort** the output of cat and filter it via
|
|
|
|
|
**uniq**:
|
|
|
|
|
|
|
|
|
|
$ cat set1 set2 | sort | uniq
|
|
|
|
|
|
|
|
|
|
# we can get rid of cat by specifying arguments to sort:
|
|
|
|
|
|
|
|
|
|
$ sort set1 set2 | uniq
|
|
|
|
|
|
|
|
|
|
# finally we can get rid of uniq by specifying -u flag to sort
|
|
|
|
|
|
|
|
|
|
$ sort -u set1 set2
|
|
|
|
|
|
|
|
|
|
If the sets set1 and set2 are already sorted, then the union operation
|
|
|
|
|
can be made much faster by specifying the "-m" command line option,
|
|
|
|
|
which merges the files (like the final step of merge-sort algorithm):
|
|
|
|
|
|
|
|
|
|
$ sort -m set1 set2 | uniq
|
|
|
|
|
|
|
|
|
|
# or
|
|
|
|
|
|
|
|
|
|
$ set -um set1 set2
|
|
|
|
|
|
|
|
|
|
Let's test this operation on sets A and B:
|
|
|
|
|
|
|
|
|
|
$ cat A B # with duplicates
|
|
|
|
|
3
|
|
|
|
|
5
|
|
|
|
|
1
|
|
|
|
|
2
|
|
|
|
|
4
|
|
|
|
|
11
|
|
|
|
|
1
|
|
|
|
|
12
|
|
|
|
|
3
|
|
|
|
|
2
|
|
|
|
|
|
|
|
|
|
$ awk '!found[$1]++' # without dupes
|
|
|
|
|
3
|
|
|
|
|
5
|
|
|
|
|
1
|
|
|
|
|
2
|
|
|
|
|
4
|
|
|
|
|
11
|
|
|
|
|
12
|
|
|
|
|
|
|
|
|
|
$ sort -n A B | uniq # with sort && uniq
|
|
|
|
|
1
|
|
|
|
|
2
|
|
|
|
|
3
|
|
|
|
|
4
|
|
|
|
|
5
|
|
|
|
|
11
|
|
|
|
|
12
|
|
|
|
|
|
|
|
|
|
## Set Intersection
|
|
|
|
|
|
|
|
|
|
The set intersection operation finds elements that are in both sets at
|
|
|
|
|
the same time. We write **C** = **A** ∩ **B** to denote the intersection
|
|
|
|
|
of sets **A** and **B**, which produces the set **C**.
|
|
|
|
|
|
|
|
|
|
There are many ways to do set intersection. The first way that I am
|
|
|
|
|
going to show you uses "comm":
|
|
|
|
|
|
|
|
|
|
$ comm -12 <(sort set1) <(sort set2)
|
|
|
|
|
|
|
|
|
|
The "-12" option to comm directs it to suppress output of lines
|
|
|
|
|
appearing just in the 1st and the 2nd file and makes it output lines
|
|
|
|
|
appearing in both 1st and 2nd, which is the intersection of two sets.
|
|
|
|
|
|
|
|
|
|
Please remember that if you have a numeric set, then sort must take "-n"
|
|
|
|
|
option.
|
|
|
|
|
|
|
|
|
|
Another way to do it is to use "grep" utility. I actually found about
|
|
|
|
|
this method as I was writing this article:
|
|
|
|
|
|
|
|
|
|
$ grep -xF -f set1 set2
|
|
|
|
|
|
|
|
|
|
The "-x" option forces grep to match the whole lines (no partial
|
|
|
|
|
matches). The "-f set1" specifies the patterns to use for searching. The
|
|
|
|
|
"-F" option makes grep interpret the given patterns literally (no
|
|
|
|
|
regexes). It works by matching all lines of set1 in set2. The lines that
|
|
|
|
|
appear just in set1 or just in set2 are never output.
|
|
|
|
|
|
|
|
|
|
The next way to find intersection is by using "sort" and "uniq":
|
|
|
|
|
|
|
|
|
|
$ sort set1 set2 | uniq -d
|
|
|
|
|
|
|
|
|
|
The "-d" option to uniq forces it to print only the duplicate lines.
|
|
|
|
|
Obviously, if a line appears in set1 and set2, after sorting there will
|
|
|
|
|
be two consecutive equal lines in the output. The "uniq -d" command
|
|
|
|
|
prints such repeated lines (but only 1 copy of it), thus it's the
|
|
|
|
|
intersection operation.
|
|
|
|
|
|
|
|
|
|
Just a few minutes before publishing this article I found another way to
|
|
|
|
|
do intersection with "join" command. Join command joins files on a
|
|
|
|
|
common field:
|
|
|
|
|
|
|
|
|
|
$ join <(sort -n A) <(sort -n B)
|
|
|
|
|
|
|
|
|
|
Here is a test run:
|
|
|
|
|
|
|
|
|
|
$ sort -n A B | uniq -d
|
|
|
|
|
1
|
|
|
|
|
2
|
|
|
|
|
3
|
|
|
|
|
|
|
|
|
|
$ grep -xF -f A B
|
|
|
|
|
1
|
|
|
|
|
3
|
|
|
|
|
2
|
|
|
|
|
|
|
|
|
|
$ comm -12 <(sort -n A) <(sort -n B)
|
|
|
|
|
1
|
|
|
|
|
2
|
|
|
|
|
3
|
|
|
|
|
|
|
|
|
|
## Set Complement
|
|
|
|
|
|
|
|
|
|
The set complement operation finds elements that are in one set but not
|
|
|
|
|
the other. We write **A** - **B** or **A** \\ **B** to denote set's
|
|
|
|
|
**B** complement in set **A**.
|
|
|
|
|
|
|
|
|
|
**Comm** has become a pretty useful command for operating on sets. It
|
|
|
|
|
can be applied to implement set complement operation as well:
|
|
|
|
|
|
|
|
|
|
$ comm -23 <(sort set1) <(sort set2)
|
|
|
|
|
|
|
|
|
|
The option "-23" specifies that comm should not print elements that
|
|
|
|
|
appear just in set2 and that are common to both. It leaves comm to print
|
|
|
|
|
elements which are just in set1 (and not in set2).
|
|
|
|
|
|
|
|
|
|
The "**grep**" command can also be used to implement this operation:
|
|
|
|
|
|
|
|
|
|
$ grep -vxF -f set2 set1
|
|
|
|
|
|
|
|
|
|
Notice that the order of sets has been reversed from that of comm.
|
|
|
|
|
That's because we are searching those elements in set1, which are not in
|
|
|
|
|
set2.
|
|
|
|
|
|
|
|
|
|
Another way to do it is, of course, with "**sort**" and "**uniq**":
|
|
|
|
|
|
|
|
|
|
$ sort set2 set2 set1 | uniq -u
|
|
|
|
|
|
|
|
|
|
This is a pretty tricky command. Suppose that a line appears in set1 but
|
|
|
|
|
does not appear in set2. Then it will be output just once and will not
|
|
|
|
|
get removed by uniq. All other lines get removed.
|
|
|
|
|
|
|
|
|
|
Let's put these commands to test:
|
|
|
|
|
|
|
|
|
|
$ comm -23 <(sort -n A) <(sort -n B)
|
|
|
|
|
4
|
|
|
|
|
5
|
|
|
|
|
|
|
|
|
|
$ grep -vxF -f B A
|
|
|
|
|
5
|
|
|
|
|
4
|
|
|
|
|
|
|
|
|
|
$ sort -n B B A | uniq -u
|
|
|
|
|
4
|
|
|
|
|
5
|
|
|
|
|
|
|
|
|
|
## Set Symmetric Difference
|
|
|
|
|
|
|
|
|
|
The set symmetric difference operation finds elements that are in one
|
|
|
|
|
set, or in the other but not both. We write **A** Δ **B** to denote
|
|
|
|
|
symmetric difference of sets **A** and **B**.
|
|
|
|
|
|
|
|
|
|
The operation can be implemented very easily with "**comm**" utility:
|
|
|
|
|
|
|
|
|
|
$ comm -3 <(sort set1) <(sort set2) | sed 's/\t//g'
|
|
|
|
|
|
|
|
|
|
# sed can be replaced with tr
|
|
|
|
|
|
|
|
|
|
$ comm -3 <(sort set1) <(sort set2) | tr -d '\t'
|
|
|
|
|
|
|
|
|
|
Here comm is instructed via "-3" not to output fields that are common to
|
|
|
|
|
both files, but to output fields that are just in set1 and just in set2.
|
|
|
|
|
Sed is necessary because comm outputs two columns of data and some of it
|
|
|
|
|
is right padded with a \\t tab character.
|
|
|
|
|
|
|
|
|
|
It can also be done with "**sort**" and "**uniq**":
|
|
|
|
|
|
|
|
|
|
$ sort set1 set2 | uniq -u
|
|
|
|
|
|
|
|
|
|
We can use mathematics and derive a few formulas involving previously
|
|
|
|
|
used operations for symmetric difference: **A** Δ **B** = (**A** -
|
|
|
|
|
**B**) ∪ (**B** - **A**). Now we can use grep:
|
|
|
|
|
|
|
|
|
|
$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)
|
|
|
|
|
# does (B - A) ∪ (A - B)
|
|
|
|
|
|
|
|
|
|
# this can be simplified
|
|
|
|
|
|
|
|
|
|
$ grep -vxF -f set1 set2; grep -vxF -f set2 set1
|
|
|
|
|
|
|
|
|
|
Let's test it:
|
|
|
|
|
|
|
|
|
|
$ comm -3 <(sort -n A) >(sort -n B) | sed 's/\t//g'
|
|
|
|
|
11
|
|
|
|
|
12
|
|
|
|
|
4
|
|
|
|
|
5
|
|
|
|
|
|
|
|
|
|
$ sort -n A B | uniq -u
|
|
|
|
|
4
|
|
|
|
|
5
|
|
|
|
|
11
|
|
|
|
|
12
|
|
|
|
|
|
|
|
|
|
$ cat <(grep -vxF -f B A) <(grep -vxF -f A B)
|
|
|
|
|
5
|
|
|
|
|
4
|
|
|
|
|
11
|
|
|
|
|
12
|
|
|
|
|
|
|
|
|
|
## Power Set
|
|
|
|
|
|
|
|
|
|
The power set operation generates a power-set of a set. What's a power
|
|
|
|
|
set? It's a set that contains all subsets of the set. We write P(**A**)
|
|
|
|
|
or 2**A** to denote all subsets of A. For a set with n elements, the
|
|
|
|
|
power set contains 2n elements.
|
|
|
|
|
|
|
|
|
|
For example, the power-set of the set { a, b, c } contains 23 = 8
|
|
|
|
|
elements. The power-set is { {}, {a}, {b}, {c}, {a, b}, {a, c}, {b, c},
|
|
|
|
|
{a, b, c} }.
|
|
|
|
|
|
|
|
|
|
It's not easy to do that with simple Unix tools. I could not think of
|
|
|
|
|
anything better than a silly Perl solution:
|
|
|
|
|
|
|
|
|
|
$ perl -le '
|
|
|
|
|
sub powset {
|
|
|
|
|
return [[]] unless @_;
|
|
|
|
|
my $head = shift;
|
|
|
|
|
my $list = &powset;
|
|
|
|
|
[@$list, map { [$head, @$_] } @$list]
|
|
|
|
|
}
|
|
|
|
|
chomp(my @e = <>);
|
|
|
|
|
for $p (@{powset(@e)}) {
|
|
|
|
|
print @$p;
|
|
|
|
|
}' set
|
|
|
|
|
|
|
|
|
|
Can you think of a way to do it with Unix tools?
|
|
|
|
|
|
|
|
|
|
## Set Cartesian Product
|
|
|
|
|
|
|
|
|
|
The set Cartesian product operation produces produces a new set that
|
|
|
|
|
contains all possible pairs of elements from one set and the other. The
|
|
|
|
|
notation for Cartesian product of sets **A** and **B** is **A** x **B**.
|
|
|
|
|
|
|
|
|
|
For example, if set **A** = { a, b, c } and set **B** = { 1, 2 } then
|
|
|
|
|
the Cartesian product **A** x **B** = { (a, 1), (a, 2), (b, 1), (b, 2),
|
|
|
|
|
(c, 1), (c, 2) }.
|
|
|
|
|
|
|
|
|
|
I can't think of a great solution. I have a very silly solution in
|
|
|
|
|
bash:
|
|
|
|
|
|
|
|
|
|
$ while read a; do while read b; do echo "$a, $b"; done < set1; done < set2
|
|
|
|
|
|
|
|
|
|
Can you think of other solutions?
|
|
|
|
|
|
|
|
|
|
## Disjoint Set Test
|
|
|
|
|
|
|
|
|
|
The disjoint set test operation finds if two sets are disjoint, i.e.,
|
|
|
|
|
they do not contain common elements.
|
|
|
|
|
|
|
|
|
|
Two sets are disjoint if their intersection is the empty set. Any of the
|
|
|
|
|
set intersection commands (mentioned earlier) can be applied on the sets
|
|
|
|
|
and the output can be tested for emptiness. If it is empty, then the
|
|
|
|
|
sets are disjoint, if it is not, then the sets are not disjoint.
|
|
|
|
|
|
|
|
|
|
Another way to test if two sets are disjoint is to use awk:
|
|
|
|
|
|
|
|
|
|
$ awk '{ if (++seen[$0]==2) exit 1 }' set1 set2
|
|
|
|
|
# returns 0 if sets are disjoint
|
|
|
|
|
# returns 1 if sets are not disjoint
|
|
|
|
|
|
|
|
|
|
It works by counting seen elements in set1 and then set2. If any of the
|
|
|
|
|
elements appear both in set1 and set2, seen count for that element would
|
|
|
|
|
be 2 and awk would quit with exit code 1.
|
|
|
|
|
|
|
|
|
|
## Empty Set Test
|
|
|
|
|
|
|
|
|
|
The empty set test tests if the set is empty, i.e., contains no
|
|
|
|
|
elements. The empty set is usually written as Ø.
|
|
|
|
|
|
|
|
|
|
It's very easy to test if the set is empty. The cardinality of an empty
|
|
|
|
|
set is 0:
|
|
|
|
|
|
|
|
|
|
$ wc -l set | cut -d' ' -f1
|
|
|
|
|
# outputs 0 if the set is empty
|
|
|
|
|
# outputs > 0 if the set is not empty
|
|
|
|
|
|
|
|
|
|
Getting rid of cut:
|
|
|
|
|
|
|
|
|
|
$ wc -l < set
|
|
|
|
|
# outputs 0 if the set is empty
|
|
|
|
|
# outputs > 0 if the set is not empty
|
|
|
|
|
|
|
|
|
|
## Minimum
|
|
|
|
|
|
|
|
|
|
The minimum operation returns the smallest number in the set. We write
|
|
|
|
|
min(**A**) to denote the minimum operation on the set **A**.
|
|
|
|
|
|
|
|
|
|
The minimum element of a set can be found by first sorting it in
|
|
|
|
|
ascending order and then taking the first element. The first element can
|
|
|
|
|
be taken with "head" Unix command which outputs the first part of the
|
|
|
|
|
file:
|
|
|
|
|
|
|
|
|
|
$ head -1 <(sort set)
|
|
|
|
|
|
|
|
|
|
The "-1" option specifies to output the first line only.
|
|
|
|
|
|
|
|
|
|
If the set is already sorted, then it's even simpler:
|
|
|
|
|
|
|
|
|
|
$ head -1 set
|
|
|
|
|
|
|
|
|
|
Remember to use "sort -n" command if the set contains numeric data.
|
|
|
|
|
|
|
|
|
|
Example of running minimum operation on sets A and Abig:
|
|
|
|
|
|
|
|
|
|
$ head -1 <(sort -n A)
|
|
|
|
|
1
|
|
|
|
|
$ head -1 <(sort -n Abig)
|
|
|
|
|
2798
|
|
|
|
|
|
|
|
|
|
## Maximum
|
|
|
|
|
|
|
|
|
|
The maximum operation returns the biggest number in the set. We write
|
|
|
|
|
max(**A**) to denote the maximum operation on the set **A**.
|
|
|
|
|
|
|
|
|
|
The maximum element of a set can be found by first sorting it in
|
|
|
|
|
ascending order and then taking the last element. The last element can
|
|
|
|
|
be taken with "tail" Unix command which outputs the last part of the
|
|
|
|
|
file:
|
|
|
|
|
|
|
|
|
|
$ tail -1 <(sort set)
|
|
|
|
|
|
|
|
|
|
The "-1" option specifies to output the last line only.
|
|
|
|
|
|
|
|
|
|
If the set is already sorted, then it's even simpler:
|
|
|
|
|
|
|
|
|
|
$ tail -1 set
|
|
|
|
|
|
|
|
|
|
Remember to use "sort -n" command if the set contains numeric data.
|
|
|
|
|
|
|
|
|
|
Example of running maximum operation on sets A and Abig:
|
|
|
|
|
|
|
|
|
|
$ tail -1 <(sort -n A)
|
|
|
|
|
5
|
|
|
|
|
$ head -1 <(sort -n Abig)
|
|
|
|
|
4294906714
|
|
|
|
|
|
|
|
|
|
## Have Fun\!
|
|
|
|
|
|
|
|
|
|
Have fun working with these set operations\! Thanks to lhunath and
|
|
|
|
|
waldner from \#bash for helping. :)
|