linux commands are very powerful.

They are small and can do small things well.
Today, I am going to demonstrate a very useful linux command join.
join is used to join two files based on a common "ID" column.
This is a common task when you have information stored in two files
and have a common column to link them together.

I first created two dummy tab delimited files: A.txt and B.txt
Let's have a look at the files:


In [1]:
cat A.txt


Jack	TX	5
John	CA	3
Luke	WA	4
Tammy	FL	2
Tommy	FL	1

In [2]:
cat B.txt


1	450
2	300
3	400
5	150
6	600

column 3 in file A and column 1 in file B are common IDs that link these two files together.
Let's do an inner join:


In [3]:
join -1 3 -2 1 A.txt B.txt


5 Jack TX 150

-1 means file 1, which is A file. 3 means the third column.
-2 means file 2, which is B file. 1 means the first column.
The output is not we expected. A file and B file should have 4 rows linked together.
It turns out files need to be sorted first according to the common
field for join to work.
Let's sort them first and join them:


In [4]:
sort -k3,3 A.txt > A.sorted.txt
sort -k1,1 B.txt > B.sorted.txt
join -1 3 -2 1 A.sorted.txt B.sorted.txt


1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150

we can also do right outer join (-a 1) :


In [5]:
join -1 3 -2 1 -a 1 A.sorted.txt B.sorted.txt


1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
Luke WA 4
5 Jack TX 150

The default input seprator is space or tab, and output seprator is space.
we can change the output seprator to tab:


In [6]:
gjoin -t$'\t' -1 3 -2 1 -a 1 -o auto A.sorted.txt B.sorted.txt


1	Tommy	FL	450
2	Tammy	FL	300
3	John	CA	400
4	Luke	WA	
5	Jack	TX	150

Let's do a left outer join (-a 2) and fill the empty field with NA.
Note that only GNU join has the auto flag.


In [7]:
gjoin -t$'\t' -e "NA" -1 3 -2 1 -a 2 -o auto A.sorted.txt B.sorted.txt


1	Tommy	FL	450
2	Tammy	FL	300
3	John	CA	400
5	Jack	TX	150
6	NA	NA	600

You may think the intermediate sorted file is annoying. One can avoid them using
process substitution:


In [8]:
join -1 3 -2 1 -t $'\t' <(sort -k3,3 A.txt) <(sort -k1,1 B.txt)


1	Tommy	FL	450
2	Tammy	FL	300
3	John	CA	400
5	Jack	TX	150

Wow! It is so magic! Using process substitution can speed up your workflow.

So far, the files we have are without headers. What if they contain headers?
Let's make some dummy files first:


In [9]:
printf "name\tstate\tid\n" | cat - A.txt > A_header.txt
printf "id\tsalary\n" | cat - B.txt > B_header.txt




In [10]:
cat A_header.txt


name	state	id
Jack	TX	5
John	CA	3
Luke	WA	4
Tammy	FL	2
Tommy	FL	1

In [11]:
cat B_header.txt


id	salary
1	450
2	300
3	400
5	150
6	600

Let's join them together:


In [12]:
join -1 3 -2 1 -t $'\t' <(sort -k3,3 A_header.txt) <(sort -k1,1 B_header.txt)


1	Tommy	FL	450
2	Tammy	FL	300
3	John	CA	400
5	Jack	TX	150
id	name	state	salary

Unfortunately, after sorting, the header went to the bottom of the file. Many times, the header may just go to the middle of the file. We can always delete the header first,
join the file and then add the header back, but we will need to have intermediate files.
I will use body function which will ignore the header.


In [13]:
join -1 3 -2 1 -t $'\t' <(cat A_header.txt | body sort -k3,3) <(cat B_header.txt | body sort -k1,1)


id	name	state	salary
1	Tommy	FL	450
2	Tammy	FL	300
3	John	CA	400
5	Jack	TX	150