This workshop is designed to build your skills with one of the most popular forms of the unix command-line, Bash
.
Before you begin this workshop you should have familiarised yourself with chapters 1-20 of the interactive guide.
At the end of this workshop you should be able to;
wc
, head
, tail
, grep
, sort
and cut
.|
operator.>
operatorRun the Setup Code
In order for this notebook to work properly you need to run the cell below before doing anything else. This will load custom functions and settings required to make the self assessment exercises work.
If you restart your kernel you will also need to rerun the setup code
Don't use the
cd
command
The answers to all self assessment exercises assume that you don't change your directory from the default. You shouldn't ever need to use the cd
command to answer an exercise.
In [1]:
wget -q https://www.dropbox.com/s/4tv0mebn25vy1cq/setup.sh?dl=0 -O setup.sh
source setup.sh
In this tutorial we will work with the genome of Helicobacter pylori. Australian scientist Barry Marshal famously infected himself with this bacterium in order to prove that it was the causative agent of gastric ulcers. He and Robin Warren won the 2005 Nobel prize for their discovery.
The genome of H. pylori is available for download from all major sequence databases. As an exercise try finding the genome page for this organism.
The webpage that appears is a typical genome entry. Not all organisms have sequenced genomes but those that do typically have a dedicated page like this with links to key genomic resources. Notice that there are links to the genome itself, the sequences of all proteins as well as a bunch of files under the heading of genome annotation.
In this tutorial we will work with the genome annotation for H. pylori
As a side note you will probably notice that the genome page for this organism indicates that there are a large number of available genome sequences (755 at last count). Since this is a well studied bacterium the genomes of many strains have been sequenced. NCBI page defaults to showing the genome of the reference strain but the genomes of all the others are also available.
Use the ls
command to list the files in current directory.
In [ ]:
ls
You should see a file called h_pylori.gff
. This file is in a text based tabular format called gff
(genome features file). The complete listing of positions and descriptions of genes and other features in a genome is called a genome annotation. The gff
format is a common way of storing annotations.
The h_pylori.gff
file contains entries for many thousands of genes. One command which is very useful for peeking inside a large file like this is the head
command.
The head
command shows the top few lines of a file. Try it out by entering the command shown below in a test cell.
head h_pylori.gff
As you might expect there is also a command called tail
which shows the last few lines of a file. Try it out like this
tail h_pylori.gff
Now look at the help for the head
command like this
head --help
Your Task: Write code to display the first 5 lines of h_pylori.gff
In [7]:
e1_answer(){
### BEGIN SOLUTION
head -n 5 h_pylori.gff
### END SOLUTION
}
In [8]:
test_e1
Examine the entries in h_pylori.gff
. The gff format is a tabular text based format that consists of up to $9$ columns;
The columns (in order) are;
Pay attention to the columns in bold. We will work with these in this tutorial
The wc
(word count) command is very simple. It counts lines, words and characters in text.
For example try it on the h_pylori.gff
file
wc h_pylori.gff
The output includes (from left to right) the number of lines, words and characters.
Look at the help for wc
and find the option for printing just the number of lines.
wc --help
Your Task: Write code to print the number of lines in the h_pylori.gff
file.
In [4]:
e2_answer(){
### BEGIN SOLUTION
wc -l h_pylori.gff
### END SOLUTION
}
In [5]:
test_e2
Up to now we have always used just a single command but the real power of unix comes from the ability to combine commands. One way to combine commands is by using the pipe operator, which is just a vertical bar |
.
For example we could pipe the output of the head
command into wc
like this
head h_pylori.gff | wc
Notice that the wc
command on the right doesn't need to specify an input file. This is because it's input is the output from the previous command. Not all commands can be used in pipes like this but most of the basic unix ones can.
Your task: Write a command which prints the number of words in the top 5 lines of h_pylori.gff
In [6]:
e3_answer(){
### BEGIN SOLUTION
head -n 5 h_pylori.gff | wc -w
### END SOLUTION
}
In [7]:
test_e3
The grep
command is used to search for patterns in text. For example we could search for all entries containing the word pathogenicity
to find features related to pathogenicity.
grep 'pathogenicity' h_pylori.gff
The output is a bit messy because it includes a lot of information but if you read closely you should see that all these entries are related to the cag pathogenicity island. This is a group of genes that are thought to have been incorporated into the H. pylori genome through lateral gene transfer (ie from another organism). Variations in this region are associated with H. pylori infections that lead to gastric cancer.
Your task: Write a command that outputs a count of the number of pathogenicity related features in the H. pylori genome
Hint: Each feature is on a separate line so counting lines is the same as counting features.
In [8]:
e4_answer(){
### BEGIN SOLUTION
grep 'pathogenicity' h_pylori.gff | wc -l
### END SOLUTION
}
In [9]:
test_e4
In exercise 3 we saw that a pipe can be used to send the output of one command to another. Another thing we can do with output is send it to a file. This is done using the output redirection operator >
.
For example we could create a subset of h_pylori.gff
containing only pathogenicity related features like this
grep 'pathogenicity' h_pylori.gff > h_pylori_path.gff
Try running this command in a test cell. Then try the ls
command and you should see the new file that was created.
A single operator like this, >
will always overwrite the contents of the file where the output is sent.
A double operator like this >>
will append to the end of the file.
Your task: Use output redirection to create a file with the name h_pylori_path_cds.gff
containing only entries that include the words 'pathogenicity' and 'CDS'
Note: An easy way to grep for A AND B is to use two grep commands, piping output of the first to the second.
In [10]:
e5_answer(){
### BEGIN SOLUTION
grep 'pathogenicity' h_pylori.gff | grep 'CDS' > h_pylori_path_cds.gff
### END SOLUTION
}
In [11]:
test_e5
Unix includes a powerful suite of tools for working with tabular data. This includes sort, cut, join as well as the much more general purpose tool awk.
The cut
command can be used to extract columns from a file.
For example we could extract just the first three columns from h_pylori.bed
like this
cut -f 1,2,3 h_pylori.gff
Here we used -f
to specify that we wanted to cut in field
mode (ie column delimited as opposed to byte or character). The most common way to use cut
is with the -f
option as this is the most natural. We also used 1,2,3
to specify that we wanted columns 1, 2 and 3.
Your task: Write a command to extract only the start and end coordinates (columns 4 and 5) from pathogenicity related entries of h_pylori.gff
In [12]:
e6_answer(){
### BEGIN SOLUTION
grep 'pathogenicity' h_pylori.gff | cut -f 4,5
### END SOLUTION
}
In [13]:
test_e6
Another very useful command for use with tables is sort
. To get an idea of how the pathogenicity genes are spaced it might be useful to sort by their start coordinate. Since the start coordinate is in column 4 we can sort as follows;
grep 'pathogenicity' h_pylori.gff | sort -n -k 4
To make this easier to read we cut
out only columns 1-5
grep 'pathogenicity' h_pylori.gff | sort -n -k 4 | cut -f 1-5
Look closely at the start and end coordinates in the output from the above command (run it in a test cell). You should be able to see that all of the genes occur together with relatively little space between them.
Another use of the sort
command is to find all the unique entries in a column. We can do this using the -u
(unique) option as follows. Remember that column 3 in gff
is the feature type so this should show a list of all the possible feature types in the file.
cut -f 3 h_pylori.gff | sort -u
Notice that this results in a few items that begin with a #
character. These aren't useful because they aren't feature types at all. They are comments which means that they should be ignored.
One way to remove these would be to use the -v
option of the grep command. This allows us to use grep
to search for text and exclude all lines containing that text. It is essentially a way of inverting a search.
cut -f 3 h_pylori.gff | sort -u | grep '#' -v
Your task: Write a command to show all possible unique values for the first column of h_pylori.gff
, excluding comments
Interpretation: Given what you know about the gff
format and the fact that H. pylori is a Bacterium why is there only a single unique value in column 1 of the file.
In [14]:
e7_answer(){
### BEGIN SOLUTION
cut -f 1 h_pylori.gff | sort -u | grep '#' -v
### END SOLUTION
}
In [15]:
test_e7