Workshop 2 - Unix Shell Basics

This workshop is designed to build your skills with one of the most popular forms of the unix command-line, Bash.

Before you begin this workshop you should have familiarised yourself with chapters 1-20 of the interactive guide.

At the end of this workshop you should be able to;

  1. Use many new unix commands including wc, head, tail, grep, sort and cut.
  2. Chain commands together with the pipe | operator.
  3. Redirect output of commands to a file with the > operator

IMPORTANT

Run the Setup Code

In order for this notebook to work properly you need to run the cell below before doing anything else. This will load custom functions and settings required to make the self assessment exercises work.

If you restart your kernel you will also need to rerun the setup code

Don't use the cd command

The answers to all self assessment exercises assume that you don't change your directory from the default. You shouldn't ever need to use the cd command to answer an exercise.


In [1]:
wget -q https://www.dropbox.com/s/4tv0mebn25vy1cq/setup.sh?dl=0 -O setup.sh
source setup.sh


Setup Done

Background

In this tutorial we will work with the genome of Helicobacter pylori. Australian scientist Barry Marshal famously infected himself with this bacterium in order to prove that it was the causative agent of gastric ulcers. He and Robin Warren won the 2005 Nobel prize for their discovery.

The genome of H. pylori is available for download from all major sequence databases. As an exercise try finding the genome page for this organism.

  1. Visit https://www.ncbi.nlm.nih.gov/genome
  2. Type Helicobacter pylori in the search box

The webpage that appears is a typical genome entry. Not all organisms have sequenced genomes but those that do typically have a dedicated page like this with links to key genomic resources. Notice that there are links to the genome itself, the sequences of all proteins as well as a bunch of files under the heading of genome annotation.

In this tutorial we will work with the genome annotation for H. pylori

As a side note you will probably notice that the genome page for this organism indicates that there are a large number of available genome sequences (755 at last count). Since this is a well studied bacterium the genomes of many strains have been sequenced. NCBI page defaults to showing the genome of the reference strain but the genomes of all the others are also available.

Use the ls command to list the files in current directory.


In [ ]:
ls

You should see a file called h_pylori.gff. This file is in a text based tabular format called gff (genome features file). The complete listing of positions and descriptions of genes and other features in a genome is called a genome annotation. The gff format is a common way of storing annotations.

Exercise 1

The h_pylori.gff file contains entries for many thousands of genes. One command which is very useful for peeking inside a large file like this is the head command.

The head command shows the top few lines of a file. Try it out by entering the command shown below in a test cell.

head h_pylori.gff

As you might expect there is also a command called tail which shows the last few lines of a file. Try it out like this

tail h_pylori.gff

Now look at the help for the head command like this

head --help

Your Task: Write code to display the first 5 lines of h_pylori.gff


In [7]:
e1_answer(){
### BEGIN SOLUTION
head -n 5 h_pylori.gff
### END SOLUTION
}

In [8]:
test_e1


test_e1: command not found

GFF Format

Examine the entries in h_pylori.gff. The gff format is a tabular text based format that consists of up to $9$ columns;

The columns (in order) are;

  1. seqname - name of the chromosome.
  2. source - name of the program or project that generated this feature
  3. feature - feature type name, e.g. Gene, Exon, rRNA
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A numeric score. Not important for this workshop
  7. strand - defined as + (forward) or - (reverse).
  8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
  9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

Pay attention to the columns in bold. We will work with these in this tutorial

Exercise 2

The wc (word count) command is very simple. It counts lines, words and characters in text.

For example try it on the h_pylori.gff file

wc h_pylori.gff

The output includes (from left to right) the number of lines, words and characters.

Look at the help for wc and find the option for printing just the number of lines.

wc --help

Your Task: Write code to print the number of lines in the h_pylori.gff file.


In [4]:
e2_answer(){
### BEGIN SOLUTION
wc -l h_pylori.gff
### END SOLUTION
}

In [5]:
test_e2


Your answer is correct

Exercise 3

Up to now we have always used just a single command but the real power of unix comes from the ability to combine commands. One way to combine commands is by using the pipe operator, which is just a vertical bar |.

For example we could pipe the output of the head command into wc like this

head h_pylori.gff | wc

Notice that the wc command on the right doesn't need to specify an input file. This is because it's input is the output from the previous command. Not all commands can be used in pipes like this but most of the basic unix ones can.

Your task: Write a command which prints the number of words in the top 5 lines of h_pylori.gff


In [6]:
e3_answer(){
### BEGIN SOLUTION
head -n 5 h_pylori.gff | wc -w
### END SOLUTION
}

In [7]:
test_e3


Your answer is correct

Exercise 4

The grep command is used to search for patterns in text. For example we could search for all entries containing the word pathogenicity to find features related to pathogenicity.

grep 'pathogenicity' h_pylori.gff

The output is a bit messy because it includes a lot of information but if you read closely you should see that all these entries are related to the cag pathogenicity island. This is a group of genes that are thought to have been incorporated into the H. pylori genome through lateral gene transfer (ie from another organism). Variations in this region are associated with H. pylori infections that lead to gastric cancer.

Your task: Write a command that outputs a count of the number of pathogenicity related features in the H. pylori genome

Hint: Each feature is on a separate line so counting lines is the same as counting features.


In [8]:
e4_answer(){
### BEGIN SOLUTION
grep 'pathogenicity' h_pylori.gff | wc -l
### END SOLUTION
}

In [9]:
test_e4


Your answer is correct

Exercise 5

In exercise 3 we saw that a pipe can be used to send the output of one command to another. Another thing we can do with output is send it to a file. This is done using the output redirection operator >.

For example we could create a subset of h_pylori.gff containing only pathogenicity related features like this

grep 'pathogenicity' h_pylori.gff > h_pylori_path.gff

Try running this command in a test cell. Then try the ls command and you should see the new file that was created.

A single operator like this, > will always overwrite the contents of the file where the output is sent.

A double operator like this >> will append to the end of the file.

Your task: Use output redirection to create a file with the name h_pylori_path_cds.gff containing only entries that include the words 'pathogenicity' and 'CDS'

Note: An easy way to grep for A AND B is to use two grep commands, piping output of the first to the second.


In [10]:
e5_answer(){
### BEGIN SOLUTION
grep 'pathogenicity' h_pylori.gff | grep 'CDS' > h_pylori_path_cds.gff
### END SOLUTION
}

In [11]:
test_e5


Your answer is correct

Exercise 6

Unix includes a powerful suite of tools for working with tabular data. This includes sort, cut, join as well as the much more general purpose tool awk.

The cut command can be used to extract columns from a file.

For example we could extract just the first three columns from h_pylori.bed like this

cut -f 1,2,3 h_pylori.gff

Here we used -f to specify that we wanted to cut in field mode (ie column delimited as opposed to byte or character). The most common way to use cut is with the -f option as this is the most natural. We also used 1,2,3 to specify that we wanted columns 1, 2 and 3.

Your task: Write a command to extract only the start and end coordinates (columns 4 and 5) from pathogenicity related entries of h_pylori.gff


In [12]:
e6_answer(){
### BEGIN SOLUTION
grep 'pathogenicity' h_pylori.gff | cut -f 4,5
### END SOLUTION
}

In [13]:
test_e6


Your answer is correct

Sorting coordinates

Another very useful command for use with tables is sort. To get an idea of how the pathogenicity genes are spaced it might be useful to sort by their start coordinate. Since the start coordinate is in column 4 we can sort as follows;

grep 'pathogenicity' h_pylori.gff | sort -n -k 4

To make this easier to read we cut out only columns 1-5

grep 'pathogenicity' h_pylori.gff | sort -n -k 4 | cut -f 1-5

Look closely at the start and end coordinates in the output from the above command (run it in a test cell). You should be able to see that all of the genes occur together with relatively little space between them.

Exercise 7

Another use of the sort command is to find all the unique entries in a column. We can do this using the -u (unique) option as follows. Remember that column 3 in gff is the feature type so this should show a list of all the possible feature types in the file.

cut -f 3 h_pylori.gff | sort -u

Notice that this results in a few items that begin with a # character. These aren't useful because they aren't feature types at all. They are comments which means that they should be ignored.

One way to remove these would be to use the -v option of the grep command. This allows us to use grep to search for text and exclude all lines containing that text. It is essentially a way of inverting a search.

cut -f 3 h_pylori.gff | sort -u | grep '#' -v

Your task: Write a command to show all possible unique values for the first column of h_pylori.gff, excluding comments

Interpretation: Given what you know about the gff format and the fact that H. pylori is a Bacterium why is there only a single unique value in column 1 of the file.


In [14]:
e7_answer(){
### BEGIN SOLUTION
cut -f 1 h_pylori.gff | sort -u | grep '#' -v
### END SOLUTION
}

In [15]:
test_e7


Your answer is correct