They are small and can do small things well.
Today, I am going to demonstrate a very useful linux command join
.
join
is used to join two files based on a common "ID" column.
This is a common task when you have information stored in two files
and have a common column to link them together.
I first created two dummy tab delimited files: A.txt and B.txt
Let's have a look at the files:
In [1]:
cat A.txt
In [2]:
cat B.txt
column 3 in file A and column 1 in file B are common IDs that link these two files together.
Let's do an inner join:
In [3]:
join -1 3 -2 1 A.txt B.txt
-1 means file 1, which is A file. 3 means the third column.
-2 means file 2, which is B file. 1 means the first column.
The output is not we expected. A file and B file should have 4
rows linked together.
It turns out files need to be sorted first according to the common
field for join to work. Let's sort them first and join them:
In [4]:
sort -k3,3 A.txt > A.sorted.txt
sort -k1,1 B.txt > B.sorted.txt
join -1 3 -2 1 A.sorted.txt B.sorted.txt
we can also do right outer join (-a 1) :
In [5]:
join -1 3 -2 1 -a 1 A.sorted.txt B.sorted.txt
The default input seprator is space or tab, and output seprator is space.
we can change the output seprator to tab:
In [6]:
gjoin -t$'\t' -1 3 -2 1 -a 1 -o auto A.sorted.txt B.sorted.txt
Let's do a left outer join (-a 2) and fill the empty field with NA.
Note that only GNU join has the auto flag.
In [7]:
gjoin -t$'\t' -e "NA" -1 3 -2 1 -a 2 -o auto A.sorted.txt B.sorted.txt
You may think the intermediate sorted file is annoying. One can avoid them using
process substitution:
In [8]:
join -1 3 -2 1 -t $'\t' <(sort -k3,3 A.txt) <(sort -k1,1 B.txt)
Wow! It is so magic! Using process substitution can speed up your workflow.
So far, the files we have are without headers. What if they contain headers?
Let's make some dummy files first:
In [9]:
printf "name\tstate\tid\n" | cat - A.txt > A_header.txt
printf "id\tsalary\n" | cat - B.txt > B_header.txt
In [10]:
cat A_header.txt
In [11]:
cat B_header.txt
Let's join them together:
In [12]:
join -1 3 -2 1 -t $'\t' <(sort -k3,3 A_header.txt) <(sort -k1,1 B_header.txt)
Unfortunately, after sorting, the header went to the bottom of the file. Many times,
the header may just go to the middle of the file. We can always delete the header first,
join the file and then add the header back, but we will need to have intermediate files.
I will use body
function which will ignore the header.
In [13]:
join -1 3 -2 1 -t $'\t' <(cat A_header.txt | body sort -k3,3) <(cat B_header.txt | body sort -k1,1)