AWK is a programming language named after the initials of its three inventors: Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is incredibly powerful at processing files, particularly column-based files, which are commonplace in Bioinformatics. For example, BED, GFF, and SAM files.
Although long programs, put into a separate file, can be written using AWK, we will use it directly on the command line. Effectively, these are very short AWK programs, often called "one-liners".
Before we start, change into the Unix/awk
directory and double check that the following commands gives you a similar output:
In [ ]:
pwd
ls
In [ ]:
cat genes.gff
The columns in the GFF file are separated by tabs and have the following meanings:
The score, strand, and frame can be set to '.' if it is not relevant for that feature. The final column 9 may or may not be present and could contain any number of key, value pairs.
We can use awk
to just print the first column of the file. awk
calls the columns $1
, $2
, ... etc, and the complete line is called $0
.
In [ ]:
awk -F"\t" '{print $1}' genes.gff
A little explanation is needed.
-F"\t"
was needed to tell awk
that the columns are separated by tabs (more on this later).awk
does what is inside the curly brackets. In this case, we simply print the first column.The repeated chromosome names are not nice. It is more likely to want to know just the unique names, which can be found by piping into the Unix command sort
.
In [ ]:
awk -F"\t" '{print $1}' genes.gff | sort -u
Similarly to grep
, awk
can be used to filter out lines of a file. However, since awk
is column-based, it makes it easy to filter based on properties of any columns of interest. The filtering criteria can be added before the braces. For example, the following extracts just chromosome 1 from the file.
In [ ]:
awk -F"\t" '$1=="chr1" {print $0}' genes.gff
There are two important things to note from the above command:
$1=="chr1"
means that column 1 must be exactly equal to "chr1". This means that "chr10" is not found.{print $0}
" part only happens when the first column is equal to "chr1", otherwise awk
does nothing (the line gets ignored).Awk commands are made up of two parts, a pattern (e.g. $1=="chr1"
) and an action (e.g. print $0
) which is contained in curly braces. The pattern defines which lines the action is applied to.
In fact, the action (the part in curly braces) can be omitted in this example. awk
assumes that you want to print the whole line, unless it is told otherwise. This gives a simple method of filtering based on columns.
In [ ]:
awk -F"\t" '$1=="chr1"' genes.gff
You might remember using another of awk's defaults in a previous exercise. In that example we supplied an action but no pattern. In this case, awk assumes that you want to apply the action to every line in the file. For example:
In [ ]:
awk -F"\t" '{print $1}' genes.gff
Multiple patterns can be combined using "&&
" to mean "and". For example, to find just the genes from chromosome 1:
In [ ]:
awk -F"\t" '$1=="chr1" && $3=="gene"' genes.gff
The entire line need not be printed (remember, if not specified, awk
assumes a print $0
). Suppose we want only the sources of the genes on chromosome 1:
In [ ]:
awk -F"\t" '$1=="chr1" && $3=="gene" {print $2}' genes.gff | sort -u
Similarly to using "&&
" for "and", there is "||
" to mean "or". To find features that are repeats or made by the tool "source2":
In [ ]:
awk -F"\t" '$2=="source2" || $3=="repeat"' genes.gff
So far, we have only used strings for the filtering. Numbers can also be used. We could ask awk
to return all the genes on chromosome 1 that start before position 1100:
In [ ]:
awk -F"\t" '$1=="chr1" && $3=="gene" && $4 < 1100' genes.gff
Instead of looking for exact matches to strings, regular expressions can be used. The symbol "~
" is used instead of "==
". For example, to find all the autosomes, we need to use a regular expression for matches to the first column. The regular expression is written between forward slashes.
In [ ]:
awk -F"\t" '$1 ~ /^chr[0-9]+$/' genes.gff
Like with grep
, matches can be inverted. grep
has the option -v
, but with awk
we use "!~
" to mean "does not match". This inverts the previous example:
In [ ]:
awk -F"\t" '$1 !~ /^chr[0-9]+$/' genes.gff
If we do not specify a column, awk
looks for a match anywhere in the whole line (it assumes we wrote $0 ~ /regex/
). So, in some sense, awk
can be used as a replacement for grep
:
In [ ]:
awk '/repeat/' genes.gff
(the -F"\t"
was omitted because the match is to the whole line, so how the columns are separated is not relevant.)
In [ ]:
grep repeat genes.gff
However, with awk
we can easily pull out information from the matching lines. Suppose we want to know which chromosomes have repeats. It is easy with awk
.
In [ ]:
awk -F"\t" '/repeat/ {print $1}' genes.gff | sort -u
In [ ]:
awk -F"\t" '$3=="gene" && !($7 == "+" || $7 == "-")' genes.gff
Something went wrong when this file was made: gene3 has an unknown strand.
Do the start and end coordinates of all the features make sense?
In [ ]:
awk -F"\t" '$5 < $4' genes.gff
According to the file, this gene starts at position 10000 and ends at position 1200, which does not make sense. Also, it has no name (the final optional column is empty). We could check if there are any other genes with no name. One way to do this is to use the special variable "NF
", which is the number of columns (fields) in the current line. Since the final column is optional, each line might have 8 or 9 columns. We need to write a command that will check:
In [ ]:
awk -F"\t" '$3=="gene" && (NF<9 || $NF !~/name/)' genes.gff
Note the distinction between NF
(the number of columns) and "$NF
" (the contents of the final column).
As promised earlier, we now consider the relevance of the option "-F"\t"
", to tell awk
that the columns in the input file are separated with tab characters. If we forgot to use this option, then awk
will use its default behaviour, which is to separate on any whitespace (which usually means tabs and/or spaces). However, consider the final column of the file - it can contain whitespace, which means that messy things happen. Suppose we try to extract the optional extra final column of the file, when it is present. Compare the effect of running awk
with and without "-F"\t"
".
In [ ]:
awk -F"\t" 'NF>8 {print $NF}' genes.gff
In [ ]:
awk 'NF>8 {print $NF}' genes.gff
One more sanity check: each line should have 8 or 9 columns (remembering to use -F"\t"
!)
In [ ]:
awk -F"\t" 'NF<8 || NF>9' genes.gff
There was no output, which means that every line does indeed have 8 or 9 columns.
Every value in a column could be changed to something else, for example suppose we want to change the source column (column number 2) to something else.
In [ ]:
awk -F"\t" '{$2="new_source"; print $0}' genes.gff
This is close, but look carefully at the output. What happened? The output is not tab-separated, but is instead separated with spaces. To restore the tabs, we need to use another special variable called "OFS
" (Output Field Separator), and change it before awk
does any processing of the input file. This can be achieved by adding "BEGIN{OFS="\t"}
", as in the next example. Before awk
reads any lines of the file it runs the BEGIN
block of code, which in this case changes OFS
to be a tab character.
In [ ]:
awk -F"\t" 'BEGIN{OFS="\t"} {$2="new_source"; print $0}' genes.gff
In [ ]:
awk -F"\t" '$3=="repeat" {print $5 - $4 + 1}' genes.gff | sort -n
Perhaps we would like to know the total length of the repeats. We need to use a variable to add up the total lengths and print the final total. In the same way that awk
has a BEGIN
block, it can also be given an END
block that is only run when awk
has finished reading all lines of the input file.
In [ ]:
awk -F"\t" 'BEGIN{sum=0} $3=="repeat" \
{sum = sum + $5 - $4 + 1} \
END{print sum}' genes.gff
The total repeat length was stored in a variable called sum
. The previous awk
command can be broken down into three parts:
BEGIN{sum=0}
sets sum
to zero before any lines of the file are read.awk
reads each line of the file. Each time a repeat is found, the length of that repeat is added to sum
.awk
runs the END
block: END{print sum}
. This prints the value of sum
.In fact, the command can be shortened a little. Adding a number to a variable is so common, that there is a shorthand way to write it. Instead of
sum = sum + $5 - $4 + 1
we can use
sum += $5 - $4 + 1
to get the same result.
In [ ]:
awk -F"\t" 'BEGIN{sum=0} \
$3=="repeat" {sum += $5 - $4 + 1} \
END{print sum}' genes.gff
Maybe we would like to know the mean score of the genes. We need to calculate the total score, and divide this by the number of genes. To keep track of the number of genes, we use a variable called count
. Each time a new gene is found, 1 must be added to count
. This could be done by writing
count = count + 1
but instead we will use the shorthand
count++
In [ ]:
awk -F"\t" 'BEGIN{sum=0; count=0} \
$3=="gene" {sum += $6; count++} \
END{print sum/count}' genes.gff
Finally, awk
has a default behaviour that means we do not even need the BEGIN
block. It can be completely omitted in this example because we are setting sum
and count
to zero. The first time awk
sees a variable being used, it will set it to zero by default. For example, when awk
reads the first line of the file, the piece of code
count++
tells awk
to add 1 to count
. However, if awk
has not encountered the variable count
before, it assumes it is zero (as if we had written BEGIN{count=0}
), then adds 1 to it. The result is that count
is equal to 1. Similar comments apply to the variable sum
.
In [ ]:
awk -F"\t" '$3=="gene" {sum += $6; count++} \
END{print sum/count}' genes.gff
If this confuses you, then be explicit and use the BEGIN
block of code. The result is the same.
The following exercises all use the BED file exercises.bed
. Before starting the exercises, open a new terminal and navigate to the awk/
directory, which contains exercises.bed
.
Use awk
to find the answers to the following questions about the file exercises.bed
. Many questions will require using pipes (eg "awk ... | sort -u
" for question 1).