File processing with AWK

AWK is a programming language named after the initials of its three inventors: Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is incredibly powerful at processing files, particularly column-based files, which are commonplace in Bioinformatics. For example, BED, GFF, and SAM files.

Although long programs, put into a separate file, can be written using AWK, we will use it directly on the command line. Effectively, these are very short AWK programs, often called "one-liners".

Before we start, change into the Unix/awk directory and double check that the following commands gives you a similar output:



In [ ]:

    
pwd
ls

Extracting columns from files

awk reads a file line-by-line, splitting each line into columns. This makes it easy to do simple things like extract a column from a file. We will use the following GFF file for our examples.



In [ ]:

    
cat genes.gff

The columns in the GFF file are separated by tabs and have the following meanings:

Sequence name
Source - the name of the program that made the feature
Feature - the type of feature, for example gene or CDS
Start position
Stop position
Score
Strand (+ or -)
Frame (0, 1, or 2)
Optional extra information, in the form key1=value1;key2=value2;...

The score, strand, and frame can be set to '.' if it is not relevant for that feature. The final column 9 may or may not be present and could contain any number of key, value pairs.

We can use awk to just print the first column of the file. awk calls the columns $1, $2, ... etc, and the complete line is called $0.



In [ ]:

    
awk -F"\t" '{print $1}' genes.gff

A little explanation is needed.

The option -F"\t" was needed to tell awk that the columns are separated by tabs (more on this later).
For each line of the file, awk does what is inside the curly brackets. In this case, we simply print the first column.

The repeated chromosome names are not nice. It is more likely to want to know just the unique names, which can be found by piping into the Unix command sort.



In [ ]:

    
awk -F"\t" '{print $1}' genes.gff | sort -u

Filtering the input file

Similarly to grep, awk can be used to filter out lines of a file. However, since awk is column-based, it makes it easy to filter based on properties of any columns of interest. The filtering criteria can be added before the braces. For example, the following extracts just chromosome 1 from the file.



In [ ]:

    
awk -F"\t" '$1=="chr1" {print $0}' genes.gff

There are two important things to note from the above command:

$1=="chr1" means that column 1 must be exactly equal to "chr1". This means that "chr10" is not found.
The "{print $0}" part only happens when the first column is equal to "chr1", otherwise awk does nothing (the line gets ignored).

Awk commands are made up of two parts, a pattern (e.g. $1=="chr1") and an action (e.g. print $0) which is contained in curly braces. The pattern defines which lines the action is applied to.

In fact, the action (the part in curly braces) can be omitted in this example. awk assumes that you want to print the whole line, unless it is told otherwise. This gives a simple method of filtering based on columns.



In [ ]:

    
awk -F"\t" '$1=="chr1"' genes.gff

You might remember using another of awk's defaults in a previous exercise. In that example we supplied an action but no pattern. In this case, awk assumes that you want to apply the action to every line in the file. For example:



In [ ]:

    
awk -F"\t" '{print $1}' genes.gff

Multiple patterns can be combined using "&&" to mean "and". For example, to find just the genes from chromosome 1:



In [ ]:

    
awk -F"\t" '$1=="chr1" && $3=="gene"' genes.gff

The entire line need not be printed (remember, if not specified, awk assumes a print $0). Suppose we want only the sources of the genes on chromosome 1:



In [ ]:

    
awk -F"\t" '$1=="chr1" && $3=="gene" {print $2}' genes.gff | sort -u

Similarly to using "&&" for "and", there is "||" to mean "or". To find features that are repeats or made by the tool "source2":



In [ ]:

    
awk -F"\t" '$2=="source2" || $3=="repeat"' genes.gff

So far, we have only used strings for the filtering. Numbers can also be used. We could ask awk to return all the genes on chromosome 1 that start before position 1100:



In [ ]:

    
awk -F"\t" '$1=="chr1" && $3=="gene" && $4 < 1100' genes.gff

Instead of looking for exact matches to strings, regular expressions can be used. The symbol "~" is used instead of "==". For example, to find all the autosomes, we need to use a regular expression for matches to the first column. The regular expression is written between forward slashes.



In [ ]:

    
awk -F"\t" '$1 ~ /^chr[0-9]+$/' genes.gff

Like with grep, matches can be inverted. grep has the option -v, but with awk we use "!~" to mean "does not match". This inverts the previous example:



In [ ]:

    
awk -F"\t" '$1 !~ /^chr[0-9]+$/' genes.gff

If we do not specify a column, awk looks for a match anywhere in the whole line (it assumes we wrote $0 ~ /regex/). So, in some sense, awk can be used as a replacement for grep:



In [ ]:

    
awk '/repeat/' genes.gff

(the -F"\t" was omitted because the match is to the whole line, so how the columns are separated is not relevant.)



In [ ]:

    
grep repeat genes.gff

However, with awk we can easily pull out information from the matching lines. Suppose we want to know which chromosomes have repeats. It is easy with awk.



In [ ]:

    
awk -F"\t" '/repeat/ {print $1}' genes.gff | sort -u

Sanity checking files

Never, ever trust the contents of Bioinformatics files (even if you made them!). We now have enough skills to do some basic sanity checking of a GFF file. For example, to check that every gene has been assigned a strand:



In [ ]:

    
awk -F"\t" '$3=="gene" && !($7 == "+" || $7 == "-")' genes.gff

Something went wrong when this file was made: gene3 has an unknown strand.

Do the start and end coordinates of all the features make sense?



In [ ]:

    
awk -F"\t" '$5 < $4' genes.gff

According to the file, this gene starts at position 10000 and ends at position 1200, which does not make sense. Also, it has no name (the final optional column is empty). We could check if there are any other genes with no name. One way to do this is to use the special variable "NF", which is the number of columns (fields) in the current line. Since the final column is optional, each line might have 8 or 9 columns. We need to write a command that will check:

If the feature is a gene, and if it is:
check if the number of columns is less than 9. When there are 9 columns, check if there is a name defined.



In [ ]:

    
awk -F"\t" '$3=="gene" && (NF<9 || $NF !~/name/)' genes.gff

Note the distinction between NF (the number of columns) and "$NF" (the contents of the final column).

As promised earlier, we now consider the relevance of the option "-F"\t"", to tell awk that the columns in the input file are separated with tab characters. If we forgot to use this option, then awk will use its default behaviour, which is to separate on any whitespace (which usually means tabs and/or spaces). However, consider the final column of the file - it can contain whitespace, which means that messy things happen. Suppose we try to extract the optional extra final column of the file, when it is present. Compare the effect of running awk with and without "-F"\t"".



In [ ]:

    
awk -F"\t" 'NF>8 {print $NF}' genes.gff



In [ ]:

    
awk 'NF>8 {print $NF}' genes.gff

One more sanity check: each line should have 8 or 9 columns (remembering to use -F"\t"!)



In [ ]:

    
awk -F"\t" 'NF<8 || NF>9' genes.gff

There was no output, which means that every line does indeed have 8 or 9 columns.

Changing the output

In addition to filtering, awk can be used to change the output.

Every value in a column could be changed to something else, for example suppose we want to change the source column (column number 2) to something else.



In [ ]:

    
awk -F"\t" '{$2="new_source"; print $0}' genes.gff

This is close, but look carefully at the output. What happened? The output is not tab-separated, but is instead separated with spaces. To restore the tabs, we need to use another special variable called "OFS" (Output Field Separator), and change it before awk does any processing of the input file. This can be achieved by adding "BEGIN{OFS="\t"}", as in the next example. Before awk reads any lines of the file it runs the BEGIN block of code, which in this case changes OFS to be a tab character.



In [ ]:

    
awk -F"\t" 'BEGIN{OFS="\t"} {$2="new_source"; print $0}' genes.gff

Processing the data

More in-depth processing is possible. For example, we could print the length of each repeat (and then sort the results numerically)



In [ ]:

    
awk -F"\t" '$3=="repeat" {print $5 - $4 + 1}' genes.gff | sort -n

Perhaps we would like to know the total length of the repeats. We need to use a variable to add up the total lengths and print the final total. In the same way that awk has a BEGIN block, it can also be given an END block that is only run when awk has finished reading all lines of the input file.



In [ ]:

    
awk -F"\t" 'BEGIN{sum=0} $3=="repeat" \
            {sum = sum + $5 - $4 + 1} \
            END{print sum}' genes.gff

The total repeat length was stored in a variable called sum. The previous awk command can be broken down into three parts:

The BEGIN{sum=0} sets sum to zero before any lines of the file are read.
awk reads each line of the file. Each time a repeat is found, the length of that repeat is added to sum.
Once all lines of the file have been read, awk runs the END block: END{print sum}. This prints the value of sum.

In fact, the command can be shortened a little. Adding a number to a variable is so common, that there is a shorthand way to write it. Instead of

sum = sum + $5 - $4 + 1

we can use

sum += $5 - $4 + 1

to get the same result.



In [ ]:

    
awk -F"\t" 'BEGIN{sum=0} \
            $3=="repeat" {sum += $5 - $4 + 1} \
            END{print sum}' genes.gff

Maybe we would like to know the mean score of the genes. We need to calculate the total score, and divide this by the number of genes. To keep track of the number of genes, we use a variable called count. Each time a new gene is found, 1 must be added to count. This could be done by writing

count = count + 1

but instead we will use the shorthand

count++



In [ ]:

    
awk -F"\t" 'BEGIN{sum=0; count=0} \
            $3=="gene" {sum += $6; count++} \
            END{print sum/count}' genes.gff

Finally, awk has a default behaviour that means we do not even need the BEGIN block. It can be completely omitted in this example because we are setting sum and count to zero. The first time awk sees a variable being used, it will set it to zero by default. For example, when awk reads the first line of the file, the piece of code

count++

tells awk to add 1 to count. However, if awk has not encountered the variable count before, it assumes it is zero (as if we had written BEGIN{count=0}), then adds 1 to it. The result is that count is equal to 1. Similar comments apply to the variable sum.



In [ ]:

    
awk -F"\t" '$3=="gene" {sum += $6; count++} \
            END{print sum/count}' genes.gff

If this confuses you, then be explicit and use the BEGIN block of code. The result is the same.

Exercises

The following exercises all use the BED file exercises.bed. Before starting the exercises, open a new terminal and navigate to the awk/ directory, which contains exercises.bed.

Use awk to find the answers to the following questions about the file exercises.bed. Many questions will require using pipes (eg "awk ... | sort -u" for question 1).

What are the names of the contigs in the file?
How many contigs are there?
How many features are on the positive strand?
How many features are on the negative strand?
How many genes are there?
How many genes have no strand assigned to them (ie the final column is not there)?
Are any gene names repeated? (Hint: you do not need to find their names, just a yes or no answer. Consider the number of unique gene names.)
What is the total score of the repeats?
How many features are in contig-1?
How many repeats are in contig-1?
What is the mean score of the repeats in contig-1?