AWK is a programming language named after the initials of its three inventors: Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is very effective at processing text files, particularly column-based files, which are commonplace in bioinformatics (for example, BED, GFF, and SAM files).
Although long programs, put into a separate file, can be written using AWK, we will use it directly on the command line. Effectively, these are very short AWK programs, often called "one-liners".
Before we start, change into the Unix/awk directory:
In [ ]:
cd awk
In [ ]:
cat genes.gff
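The columns in the GFF file are separated by tabs and have the following meanings:
1. Sequence name (here, the chromosome)
2. Source (the tool or database that produced the feature)
3. Feature type (e.g. gene or repeat)
4. Start position
5. End position
6. Score
7. Strand (+ or -)
8. Frame
9. Optional attributes (key-value pairs)
The score, strand, and frame can be set to '.' if they are not relevant for that feature. The final column 9 may or may not be present and can contain any number of key-value pairs.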
We can use awk to print just the first column of the file. awk calls the columns $1, $2, and so on, and the complete line is called $0. Try:
In [ ]:
awk -F"\t" '{print $1}' genes.gff
A little explanation is needed.
-F"\t" was needed to tell awk that the columns are separated by tabs (more on this later).
awk does what is inside the curly brackets. In this case, we simply print the first column.
The repeated chromosome names are not very useful. We are more likely to want just the unique names, which can be found by piping the output into the Unix command sort with the -u option.
In [ ]:
awk -F"\t" '{print $1}' genes.gff | sort -u
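To see these two commands in isolation, here is a minimal sketch that runs them on a made-up, three-line tab-separated file. The file contents and the shell variable names (sample, first_column, unique_names) are assumptions for illustration only and are not taken from genes.gff.

```shell
# Build a small, hypothetical tab-separated file (not the real genes.gff).
sample=$(mktemp)
{
  printf 'chr1\tsource1\tgene\t100\t300\n'
  printf 'chr2\tsource1\tgene\t500\t900\n'
  printf 'chr1\tsource2\trepeat\t10\t50\n'
} > "$sample"

# $1 is the first column of each line, so duplicates appear as they occur.
first_column=$(awk -F"\t" '{print $1}' "$sample")
echo "$first_column"

# sort -u keeps a single copy of each name.
unique_names=$(awk -F"\t" '{print $1}' "$sample" | sort -u)
echo "$unique_names"

rm -f "$sample"
```

The first echo lists chr1 twice, while the second lists each name once.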
Similarly to grep, awk can be used to filter out lines of a file. However, since awk is column-based, it makes it easy to filter based on properties of any columns of interest. The filtering criteria can be added before the braces. For example, the following extracts just chromosome 1 from the file.
In [ ]:
awk -F"\t" '$1=="chr1" {print $0}' genes.gff
There are two important things to note from the above command:
$1=="chr1" means that column 1 must be exactly equal to "chr1". This means that "chr10" is not found.
The {print $0} part only happens when the first column is equal to "chr1"; otherwise awk does nothing (the line is ignored).
Awk commands are made up of two parts: a pattern (e.g. $1=="chr1") and an action (e.g. print $0), which is contained in curly braces. The pattern defines which lines the action is applied to.
In fact, the action (the part in curly braces) can be omitted in this example. awk assumes that you want to print the whole line, unless it is told otherwise. This gives a simple method of filtering based on columns.
In [ ]:
awk -F"\t" '$1=="chr1"' genes.gff
You might remember using another of awk's defaults earlier, when we printed the first column. In that example we supplied an action but no pattern. In this case, awk assumes that you want to apply the action to every line in the file. For example:
In [ ]:
awk -F"\t" '{print $1}' genes.gff
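The two defaults can be compared side by side. The following is a small sketch on made-up data (the file contents and variable names are hypothetical): it runs a full pattern-plus-action command, the same pattern with the action omitted, and an action with no pattern.

```shell
# Hypothetical three-line sample, tab-separated.
sample=$(mktemp)
{
  printf 'chr1\tsource1\tgene\t100\t300\n'
  printf 'chr10\tsource1\tgene\t500\t900\n'
  printf 'chr1\tsource2\trepeat\t10\t50\n'
} > "$sample"

# Pattern and action both given.
with_action=$(awk -F"\t" '$1=="chr1" {print $0}' "$sample")

# Pattern only: awk falls back to printing the whole line.
pattern_only=$(awk -F"\t" '$1=="chr1"' "$sample")

# Action only: the action is applied to every line.
action_only=$(awk -F"\t" '{print $3}' "$sample")

echo "$pattern_only"
echo "$action_only"
rm -f "$sample"
```

The first two commands give identical output (the two chr1 lines, without chr10), and the third prints the feature type of all three lines.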
Multiple patterns can be combined using "&&" to mean "and". For example, to find just the genes from chromosome 1:
In [ ]:
awk -F"\t" '$1=="chr1" && $3=="gene"' genes.gff
The entire line need not be printed (remember, if not specified, awk assumes a print $0). Suppose we want only the sources of the genes on chromosome 1:
In [ ]:
awk -F"\t" '$1=="chr1" && $3=="gene" {print $2}' genes.gff | sort -u
Similarly to using "&&" for "and", there is "||" to mean "or". To find features that are repeats or made by the tool "source2":
In [ ]:
awk -F"\t" '$2=="source2" || $3=="repeat"' genes.gff
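When "&&" and "||" appear in the same pattern, it helps to know that awk evaluates "&&" before "||", so parentheses are needed to group the "or" part. The sketch below illustrates this on a made-up four-line file; the data and variable names are assumptions for illustration.

```shell
# Hypothetical sample with two chromosomes and two feature types.
sample=$(mktemp)
{
  printf 'chr1\tsource1\tgene\n'
  printf 'chr1\tsource2\trepeat\n'
  printf 'chr2\tsource2\trepeat\n'
  printf 'chr2\tsource1\tgene\n'
} > "$sample"

# Without parentheses this reads as: (chr1 AND gene) OR repeat.
ungrouped=$(awk -F"\t" '$1=="chr1" && $3=="gene" || $3=="repeat" {print $1, $3}' "$sample")

# With parentheses: chr1 AND (gene OR repeat).
grouped=$(awk -F"\t" '$1=="chr1" && ($3=="gene" || $3=="repeat") {print $1, $3}' "$sample")

echo "$ungrouped"
echo "$grouped"
rm -f "$sample"
```

The ungrouped version also returns the repeat on chr2, which is probably not what was intended.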
So far, we have only used strings for the filtering. Numbers can also be used. We could ask awk to return all the genes on chromosome 1 that start before position 1100:
In [ ]:
awk -F"\t" '$1=="chr1" && $3=="gene" && $4 < 1100' genes.gff
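Note that the number 1100 is written without quotes. A minimal sketch of why this matters, using three made-up start positions: with an unquoted number awk compares numerically, but if the number is quoted it becomes a string and the comparison is done character by character.

```shell
# Hypothetical start positions, one per line.
starts=$(printf '900\n1000\n1500\n')

# Unquoted number: numeric comparison, so 900 and 1000 are both below 1100.
numeric=$(printf '%s\n' "$starts" | awk '$1 < 1100')

# Quoted number: string comparison, so "900" sorts after "1100" and is lost.
as_text=$(printf '%s\n' "$starts" | awk '$1 < "1100"')

echo "numeric: $numeric"
echo "as text: $as_text"
```

The numeric version keeps 900 and 1000, while the string version keeps only 1000.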
Instead of looking for exact matches to strings, regular expressions can be used. The symbol "~" is used instead of "==". For example, to find all the autosomes, we need to use a regular expression for matches to the first column. The regular expression is written between forward slashes.
In [ ]:
awk -F"\t" '$1 ~ /^chr[0-9]+$/' genes.gff
As with grep, matches can be inverted. grep has the option -v, but with awk we use "!~" to mean "does not match". This inverts the previous example:
In [ ]:
awk -F"\t" '$1 !~ /^chr[0-9]+$/' genes.gff
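The anchors "^" and "$" in the regular expression are doing important work. The sketch below compares an exact match, an unanchored regular expression, the anchored one used above, and its inverse, on four made-up sequence names (the names are hypothetical and chosen only to show the differences).

```shell
# Hypothetical sequence names, one per line.
names=$(printf 'chr1\nchr10\nchrX\nscaffold_chr1\n')

# Exact string match: only chr1.
exact=$(printf '%s\n' "$names" | awk '$1=="chr1"')

# Unanchored regex: anything containing "chr1".
loose=$(printf '%s\n' "$names" | awk '$1 ~ /chr1/')

# Anchored regex: "chr" followed by digits only, and nothing else.
anchored=$(printf '%s\n' "$names" | awk '$1 ~ /^chr[0-9]+$/')

# Inverted match: everything the anchored regex rejects.
inverted=$(printf '%s\n' "$names" | awk '$1 !~ /^chr[0-9]+$/')

printf 'exact:\n%s\nloose:\n%s\nanchored:\n%s\ninverted:\n%s\n' \
  "$exact" "$loose" "$anchored" "$inverted"
```

Without the anchors, chr10 and scaffold_chr1 both match /chr1/.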
If we do not specify a column, awk looks for a match anywhere in the whole line (it assumes we wrote $0 ~ /regex/). So, in some sense, awk can be used as a replacement for grep:
In [ ]:
awk '/repeat/' genes.gff
(The -F"\t" was omitted because the match is against the whole line, so how the columns are separated is not relevant.) The equivalent grep command is:
In [ ]:
grep repeat genes.gff
However, with awk we can easily pull out information from the matching lines. Suppose we want to know which chromosomes have repeats. It is easy with awk.
In [ ]:
awk -F"\t" '/repeat/ {print $1}' genes.gff | sort -u
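awk is also useful for sanity-checking files. For example, we can check that every gene has a strand of either "+" or "-":
In [ ]: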
awk -F"\t" '$3=="gene" && !($7 == "+" || $7 == "-")' genes.gff
Something went wrong when this file was made: gene3 has an unknown strand.
Do the start and end coordinates of all the features make sense?
In [ ]:
awk -F"\t" '$5 < $4' genes.gff
According to the file, this gene starts at position 10000 and ends at position 1200, which does not make sense. Also, it has no name (the final optional column is empty). We could check if there are any other genes with no name. One way to do this is to use the special variable "NF", which is the number of columns (fields) in the current line. Since the final column is optional, each line might have 8 or 9 columns. We need to write a command that will check:
In [ ]:
awk -F"\t" '$3=="gene" && (NF<9 || $NF !~/name/)' genes.gff
Note the distinction between NF (the number of columns) and "$NF" (the contents of the final column).
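The difference between NF and $NF is easiest to see by printing both. This is a minimal sketch on three made-up gene lines, one of which lacks the ninth column; the data and variable names are assumptions for illustration.

```shell
# Hypothetical genes: named, missing column 9, and column 9 without a name.
sample=$(mktemp)
{
  printf 'chr1\ts1\tgene\t100\t300\t0.5\t+\t0\tname=gene1\n'
  printf 'chr1\ts1\tgene\t400\t600\t0.7\t-\t0\n'
  printf 'chr2\ts1\tgene\t50\t90\t0.2\t+\t0\tnote=hypothetical\n'
} > "$sample"

# NF is a count of columns; $NF is the text of the last column.
counts=$(awk -F"\t" '{print NF, $NF}' "$sample")
echo "$counts"

# Start positions of genes with no column 9, or with no name in it.
unnamed_starts=$(awk -F"\t" '$3=="gene" && (NF<9 || $NF !~ /name/) {print $4}' "$sample")
echo "$unnamed_starts"

rm -f "$sample"
```

On the second line NF is 8, so $NF is the frame column rather than the attributes.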
As promised earlier, we now consider the relevance of the option "-F"\t"", which tells awk that the columns in the input file are separated with tab characters. If we forget to use this option, awk uses its default behaviour, which is to split on any run of whitespace (which usually means tabs and/or spaces). However, consider the final column of the file: it can contain whitespace, which means that messy things happen. Suppose we try to extract the optional final column of the file, when it is present. Compare the effect of running awk with and without "-F"\t"".
In [ ]:
awk -F"\t" 'NF>8 {print $NF}' genes.gff
In [ ]:
awk 'NF>8 {print $NF}' genes.gff
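The effect can be reproduced on a single line. The sketch below uses one made-up GFF-like line whose final column contains a space, and counts the columns with and without -F"\t"; the line and the variable names are hypothetical.

```shell
# One hypothetical line; the ninth column contains a space.
line=$(printf 'chr1\ts1\tgene\t100\t300\t0.5\t+\t0\tname=gene1;product=hypothetical protein')

# Splitting on tabs: 9 columns, and the last one is intact.
tab_nf=$(printf '%s\n' "$line" | awk -F"\t" '{print NF}')
tab_last=$(printf '%s\n' "$line" | awk -F"\t" '{print $NF}')

# Default splitting on whitespace: the space creates an extra column.
ws_nf=$(printf '%s\n' "$line" | awk '{print NF}')
ws_last=$(printf '%s\n' "$line" | awk '{print $NF}')

echo "tabs:       NF=$tab_nf last=$tab_last"
echo "whitespace: NF=$ws_nf last=$ws_last"
```

With the default separator the last "column" is just the word after the final space.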
One more sanity check: each line should have 8 or 9 columns (remembering to use -F"\t"!)
In [ ]:
awk -F"\t" 'NF<8 || NF>9' genes.gff
There was no output, which means that every line does indeed have 8 or 9 columns.
The values in a column can also be changed. For example, suppose we want to change every value in the source column (column 2) to "new_source".
In [ ]:
awk -F"\t" '{$2="new_source"; print $0}' genes.gff
This is close, but look carefully at the output. What happened? The output is not tab-separated, but is instead separated with spaces. To restore the tabs, we need to use another special variable called "OFS" (Output Field Separator), and change it before awk does any processing of the input file. This can be achieved by adding "BEGIN{OFS="\t"}", as in the next example. Before awk reads any lines of the file it runs the BEGIN block of code, which in this case changes OFS to be a tab character.
In [ ]:
awk -F"\t" 'BEGIN{OFS="\t"} {$2="new_source"; print $0}' genes.gff
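As a rough illustration of when OFS matters, the sketch below runs four variations on one made-up line. It also shows two alternative spellings that are not used elsewhere in this tutorial: setting both separators in the BEGIN block with FS=OFS="\t", and setting OFS with awk's -v option. The data and variable names are hypothetical.

```shell
# One hypothetical tab-separated line.
line=$(printf 'chr1\tsource1\tgene')

# Printing $0 without changing a field leaves the original tabs alone.
untouched=$(printf '%s\n' "$line" | awk -F"\t" '{print $0}')

# Changing a field makes awk rebuild $0 using OFS, a space by default.
spaces=$(printf '%s\n' "$line" | awk -F"\t" '{$2="new_source"; print $0}')

# Setting the input and output separators together in the BEGIN block.
tabs_begin=$(printf '%s\n' "$line" | awk 'BEGIN{FS=OFS="\t"} {$2="new_source"; print $0}')

# Setting OFS on the command line with -v instead.
tabs_v=$(printf '%s\n' "$line" | awk -F"\t" -v OFS="\t" '{$2="new_source"; print $0}')

printf '%s\n%s\n%s\n%s\n' "$untouched" "$spaces" "$tabs_begin" "$tabs_v"
```

Only the second command loses the tabs; the last two are equivalent ways of restoring them.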
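awk can also do arithmetic on columns. For example, the length of each repeat is its end position minus its start position, plus one:
In [ ]: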
awk -F"\t" '$3=="repeat" {print $5 - $4 + 1}' genes.gff | sort -n
Perhaps we would like to know the total length of the repeats. We need to use a variable to add up the total lengths and print the final total. In the same way that awk has a BEGIN block, it can also be given an END block that is only run when awk has finished reading all lines of the input file.
In [ ]:
awk -F"\t" 'BEGIN{sum=0} $3=="repeat" \
{sum = sum + $5 - $4 + 1} \
END{print sum}' genes.gff
The total repeat length was stored in a variable called sum. The previous awk command can be broken down into three parts:
BEGIN{sum=0} sets sum to zero before any lines of the file are read.
awk reads each line of the file. Each time a repeat is found, the length of that repeat is added to sum.
awk runs the END block: END{print sum}. This prints the value of sum.
In fact, the command can be shortened a little. Adding a number to a variable is so common, that there is a shorthand way to write it. Instead of
sum = sum + $5 - $4 + 1
we can use
sum += $5 - $4 + 1
to get the same result.
In [ ]:
awk -F"\t" 'BEGIN{sum=0} \
$3=="repeat" {sum += $5 - $4 + 1} \
END{print sum}' genes.gff
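To watch the variable being built up, the following sketch prints a running total each time a repeat is found, and the final total in the END block. The three input lines are made up (two repeats of length 41 and 61, and one gene that should be ignored), and the variable names are assumptions for illustration.

```shell
# Hypothetical features: only the two repeats should be counted.
sample=$(mktemp)
{
  printf 'chr1\ts1\trepeat\t10\t50\n'     # length 41
  printf 'chr1\ts1\tgene\t100\t300\n'     # not a repeat, ignored
  printf 'chr2\ts1\trepeat\t200\t260\n'   # length 61
} > "$sample"

# Print sum after every repeat, then once more when the input is finished.
trace=$(awk -F"\t" '$3=="repeat" {sum += $5 - $4 + 1; print "running total:", sum}
  END{print "final total:", sum}' "$sample")
echo "$trace"

# The same calculation, printing only the final answer.
total=$(awk -F"\t" '$3=="repeat" {sum += $5 - $4 + 1} END{print sum}' "$sample")
echo "$total"

rm -f "$sample"
```

The END block sees the value that sum had after the last line was processed.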
Maybe we would like to know the mean score of the genes. We need to calculate the total score, and divide this by the number of genes. To keep track of the number of genes, we use a variable called count. Each time a new gene is found, 1 must be added to count. This could be done by writing
count = count + 1
but instead we will use the shorthand
count++
In [ ]:
awk -F"\t" 'BEGIN{sum=0; count=0} \
$3=="gene" {sum += $6; count++} \
END{print sum/count}' genes.gff
Finally, awk has a default behaviour that means we do not even need the BEGIN block. It can be completely omitted in this example because we are setting sum and count to zero. The first time awk sees a variable being used, it will set it to zero by default. For example, when awk reads the first line of the file, the piece of code
count++
tells awk to add 1 to count. However, if awk has not encountered the variable count before, it assumes it is zero (as if we had written BEGIN{count=0}), then adds 1 to it. The result is that count is equal to 1. Similar comments apply to the variable sum.
In [ ]:
awk -F"\t" '$3=="gene" {sum += $6; count++} \
END{print sum/count}' genes.gff
If this confuses you, then be explicit and use the BEGIN block of code. The result is the same.
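Two details are worth checking before trusting a mean calculated this way. A score of '.' is treated as zero when it is added to sum, but the line is still counted, and if no lines match at all then count is zero and the division fails. The sketch below shows one possible way to guard against both, using made-up data; the extra condition, the message text, and the variable names are assumptions for illustration and not part of the original commands.

```shell
# Hypothetical genes with scores 2, 4, and a missing score (".").
sample=$(mktemp)
{
  printf 'chr1\ts1\tgene\t100\t300\t2\n'
  printf 'chr1\ts1\tgene\t400\t600\t4\n'
  printf 'chr2\ts1\tgene\t50\t90\t.\n'
} > "$sample"

# The "." adds 0 to sum but still increases count: (2 + 4 + 0) / 3.
naive=$(awk -F"\t" '$3=="gene" {sum += $6; count++} END{print sum/count}' "$sample")

# Skip missing scores, and only divide when something was counted.
guarded=$(awk -F"\t" '$3=="gene" && $6 != "." {sum += $6; count++}
  END{if (count > 0) print sum/count; else print "no scored features"}' "$sample")

# A feature type that is absent from the file: count stays at zero.
empty=$(awk -F"\t" '$3=="exon" && $6 != "." {sum += $6; count++}
  END{if (count > 0) print sum/count; else print "no scored features"}' "$sample")

echo "naive: $naive, guarded: $guarded, empty: $empty"
rm -f "$sample"
```

Whether a missing score should be skipped or counted as zero depends on the question being asked, so this is a choice rather than a correction.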
The following exercises all use the BED file exercises.bed. Before starting the exercises, open a new terminal and navigate to the awk/ directory, which contains exercises.bed.
Use awk to find the answers to the following questions about the file exercises.bed. Many questions will require using pipes (e.g. "awk ... | sort -u" for question 1).
In [ ]:
Now go to the next part of the tutorial, BASH scripts.