Assessment Exercise

This notebook is part of your formal assessment for BC2023. It will not count toward your final mark; however, you must pass this notebook in order to pass BC2023. The pass mark is 5/10.

Complete and submit the notebook exactly as you have done for all previous notebooks. Before you submit you should make sure your answers are correct by running the test code for each. You can also use the Validate button to check the entire notebook.

All except two of the exercises have a test box to allow you to check your answer before you submit. Two questions require short textual answers and are not auto-graded.

If you are having trouble with a question, here are some troubleshooting steps:

  1. First, make sure you have worked through the interactive tutorial again, and through workshops 1-4. Do this carefully and try to get every answer. If you try to skip ahead you will have trouble.
  2. Make sure you have read the question carefully.
  3. Many questions include introductory text with example code. Make a test cell (you can make as many as you like) and actually run the examples. Most answers require only small changes to these examples, so if you understand the examples you should be able to answer the question.

IMPORTANT

Run the Setup Code

In order for this notebook to work properly you need to run the cell below before doing anything else. This will load the custom functions and settings required to make the self-assessment exercises work.

If you restart your kernel you will also need to rerun the setup code.

Don't use the cd command

The answers to all self-assessment exercises assume that you don't change your directory from the default. You should never need to use the cd command to answer an exercise.


In [1]:
# Essential Setup Code : Must be run first.
wget https://www.dropbox.com/s/uhua9bbsndfbcs8/setup.sh?dl=0 -O setup.sh
source ./setup.sh


--2017-09-15 14:22:15--  https://www.dropbox.com/s/uhua9bbsndfbcs8/setup.sh?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.83.1, 2620:100:6033:1::a27d:5301
Connecting to www.dropbox.com (www.dropbox.com)|162.125.83.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://dl.dropboxusercontent.com/content_link/ilrYsNtLJl1nwUzFUjbFjDSxV6w1G4YtqDdAvwwFNBpw13zQjjsDw3CfhNRnF9iP/file [following]
--2017-09-15 14:22:16--  https://dl.dropboxusercontent.com/content_link/ilrYsNtLJl1nwUzFUjbFjDSxV6w1G4YtqDdAvwwFNBpw13zQjjsDw3CfhNRnF9iP/file
Resolving dl.dropboxusercontent.com (dl.dropboxusercontent.com)... 162.125.83.6, 2620:100:6033:6::a27d:5306
Connecting to dl.dropboxusercontent.com (dl.dropboxusercontent.com)|162.125.83.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1522 (1.5K) [text/x-sh]
Saving to: ‘setup.sh’

setup.sh            100%[===================>]   1.49K  --.-KB/s    in 0s      

2017-09-15 14:22:17 (130 MB/s) - ‘setup.sh’ saved [1522/1522]

Setup Done

Question 1

Background

The echo command can be used to print text. For example, try the following command in a test cell. It should print "Hello".

echo "Hello"

Your task

Write a command to print the phrase "Keyboard good, mouse bad". Make sure you print the phrase exactly (including correct capitalisation).


In [2]:
e1_answer(){
### BEGIN SOLUTION
echo "Keyboard good, mouse bad"
### END SOLUTION
}

In [3]:
test_e1


Your answer is correct

Question 2

Background

First use the ls command to list the files in the current directory. You should see several files, including one called h_pylori.bed.

The head command can be used to print the first few lines of a file. For example, the following command will print the first 10 lines of h_pylori.bed.

head h_pylori.bed

The number of lines printed by head can be changed using the -n option.
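As a minimal sketch of how -n changes the output, the example below builds a throwaway 12-line file with seq and mktemp (the file and its contents are made up for illustration and are not part of the exercise data) and compares the default behaviour of head with -n 3.

```shell
# Create a temporary file containing the numbers 1 to 12, one per line.
# (Hypothetical sample data, unrelated to h_pylori.bed.)
sample=$(mktemp)
seq 1 12 > "$sample"

# With no option, head prints the first 10 lines.
default_count=$(head "$sample" | wc -l | tr -d ' ')

# With -n 3, head prints only the first 3 lines.
first_three=$(head -n 3 "$sample")

echo "lines printed by default: $default_count"
echo "$first_three"

rm -f "$sample"
```

The same pattern works for any line count: the number after -n is the number of lines kept.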

Your task

Write a command to print the first 5 lines of the file h_pylori.bed.


In [4]:
e2_answer(){
### BEGIN SOLUTION
head -n 5 h_pylori.bed
### END SOLUTION
}

In [5]:
test_e2


Your answer is correct

Question 3

Background

The wc command counts lines, words and characters (strictly, bytes) in text. By default it prints all three of these counts as well as the name of the file, for example:

wc h_pylori.bed
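To see what each count means, here is a small sketch on a made-up two-line file (created with mktemp purely for illustration). It uses the -l, -w and -c options one at a time. Reading the file through `<` instead of naming it as an argument makes wc print the number alone, which is convenient when storing the result in a variable; `tr -d ' '` strips the padding that some versions of wc add.

```shell
# Hypothetical sample: two lines, three words, 14 bytes in total.
sample=$(mktemp)
printf 'one two\nthree\n' > "$sample"

lines=$(wc -l < "$sample" | tr -d ' ')   # number of newline characters
words=$(wc -w < "$sample" | tr -d ' ')   # whitespace-separated words
bytes=$(wc -c < "$sample" | tr -d ' ')   # bytes, including the newlines

echo "lines=$lines words=$words bytes=$bytes"
# prints: lines=2 words=3 bytes=14

rm -f "$sample"
```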

Your task

Using the appropriate option to the wc command, write a command that prints just the number of lines and the name of the file for the file h_pylori.bed.


In [6]:
e3_answer(){
### BEGIN SOLUTION
wc -l h_pylori.bed
### END SOLUTION
}

In [7]:
test_e3


Your answer is correct

Question 4

Background

The pipe symbol, |, can be used to send the output of one command (on the left) to the input of another command (on the right). For example, I could pipe the output of echo to wc to count the number of characters printed for the word "Pikachu" (the count includes the newline that echo adds at the end).

echo "Pikachu" | wc -c
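The command above reports 8 rather than 7, because echo appends a newline and wc -c counts it. The sketch below shows the difference by comparing echo with printf '%s', which prints the word without a trailing newline. The variable names are chosen only for this illustration.

```shell
# echo adds a trailing newline, so wc -c sees 7 letters + 1 newline.
with_newline=$(echo "Pikachu" | wc -c | tr -d ' ')

# printf '%s' prints the word exactly, with no newline.
without_newline=$(printf '%s' "Pikachu" | wc -c | tr -d ' ')

echo "echo: $with_newline, printf: $without_newline"
# prints: echo: 8, printf: 7
```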

Your task

Write a command which prints just the number of characters in the first 10 lines of the file h_pylori.bed. (You will need to pipe head to wc).


In [8]:
e4_answer(){
### BEGIN SOLUTION
head -n 10 h_pylori.bed | wc -c
### END SOLUTION
}

In [9]:
test_e4


Your answer is correct

Question 5

Background

Use the ls command to list files in the current directory again. This time look for a file called h_pylori.faa.

Now use the head command to look at the first few lines of this file:

head h_pylori.faa

Your task

Answer the following question in the answer field below. This answer is not auto-graded and you may simply write your response as free form text.

What is the format of the file h_pylori.faa? Note: Your answer should be a bioinformatics-specific format, not simply "text".


In [ ]:

Question 6

Background

The grep command can be used to search for patterns in a stream of text. For example, the following command will find the header lines of all proteins with "pathogen" in the name within the file h_pylori.faa.

grep 'pathogen' h_pylori.faa
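The following sketch shows how grep behaves on a tiny made-up protein file (the sequence names and residues are invented for illustration and do not come from h_pylori.faa). It searches for a word, then uses the -c option, which prints the number of matching lines instead of the lines themselves. The `^` in the second pattern anchors the match to the start of the line.

```shell
# Hypothetical three-entry file in the same general layout: a line
# starting with '>' describes each entry, followed by sequence lines.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 hypothetical protein
MKKLAV
GHTRQ
>seq2 example pathogenicity protein
MSTNPK
>seq3 hypothetical protein
MAAL
EOF

# Lines containing the text "pathogen" (substring matches count too).
matches=$(grep 'pathogen' "$fasta")

# Number of lines that begin with '>'.
headers=$(grep -c '^>' "$fasta")

echo "$matches"
echo "header lines: $headers"

rm -f "$fasta"
```

Piping the matching lines to wc -l gives the same count as -c; which to use is mostly a matter of taste.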

Your task

The file h_pylori.faa contains one entry for every protein encoded on the Helicobacter pylori genome. Write a command that outputs just this number (i.e. the number of proteins encoded on the H. pylori genome).


In [10]:
e6_answer(){
### BEGIN SOLUTION
grep '>' h_pylori.faa | wc -l
### END SOLUTION
}

In [11]:
test_e6


Your answer is correct

Question 7

Background

Use the grep command to find the definition line of the entry for the cag14 gene product in the file h_pylori.faa.

grep 'cag14' h_pylori.faa

At the start of this text you should see the accession number corresponding to this protein: NP_207331.1.

Your task

Use grep to search the file h_pylori.bed to find the bed entry for the gene that encodes the cag14 protein. (You will need to use the information in the background above for this).

What are the start and end positions of this gene?

Note that the bed format was described in workshop 4 (04_bedtools).


In [ ]:

Question 8

Background

The sort command can be used to sort tabular files. By default it sorts on the first column, but this can be changed by specifying a column to sort on using -k <column>. sort also orders lines alphabetically by default; the -n option switches to a numerical sort. Putting this together, we can sort the h_pylori.bed file according to start coordinate as follows:

sort -k 2 -n h_pylori.bed

In the command above -k 2 specifies sorting on the second column (which contains start coordinates) and -n specifies a numerical sort.
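To illustrate why -n matters, here is a minimal sketch on a made-up three-line, tab-separated file (the gene names and coordinates are invented for this example). Without -n the second column is compared as text, so 1000 sorts before 85 because the character 1 comes before 8; with -n the values are compared as numbers.

```shell
# Hypothetical bed-like sample: chromosome, start, end, name.
bed=$(mktemp)
printf 'chr1\t900\t1200\tgeneA\n'   > "$bed"
printf 'chr1\t85\t400\tgeneB\n'    >> "$bed"
printf 'chr1\t1000\t1500\tgeneC\n' >> "$bed"

# Text comparison on column 2: "1000" sorts first.
alpha_first=$(sort -k 2 "$bed" | head -n 1 | cut -f 4)

# Numerical comparison on column 2: 85 sorts first.
numeric_first=$(sort -k 2 -n "$bed" | head -n 1 | cut -f 4)

echo "alphabetical: $alpha_first, numerical: $numeric_first"
# prints: alphabetical: geneC, numerical: geneB

rm -f "$bed"
```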

Your task

Write a command using sort and head (combined with a pipe) that outputs only the bed entry corresponding to the gene with the smallest start coordinate in the H. pylori genome.

Note: You may see an error like this:

sort: write failed: 'standard output': Broken pipe

You can safely ignore it.


In [18]:
e8_answer(){
### BEGIN SOLUTION
cat h_pylori.bed | sort -k 2 -n | head -n 1
### END SOLUTION
}

In [19]:
test_e8


Your answer is correct

Question 9

Background

The default sort order of the sort command can be reversed using the -r option.
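As a small sketch of the effect of -r, the example below sorts four made-up numbers in both directions and keeps the first line of each result. Combined with -n, the first line is the smallest value in the normal order and the largest value in the reversed order.

```shell
# Hypothetical input: four numbers, one per line.
smallest=$(printf '40\n7\n1500\n230\n' | sort -n | head -n 1)
largest=$(printf '40\n7\n1500\n230\n' | sort -n -r | head -n 1)

echo "smallest: $smallest, largest: $largest"
# prints: smallest: 7, largest: 1500
```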

Your task

Building on your answer to question 8, write a command which prints only the bed entry corresponding to the gene with the largest end coordinate in the H. pylori genome.


In [20]:
e9_answer(){
### BEGIN SOLUTION
cat h_pylori.bed | sort -k 3 -n -r | head -n 1
### END SOLUTION
}

In [21]:
test_e9


sort: write failed: 'standard output': Broken pipe
sort: write error
Your answer is correct

Question 10

Background

The cut command extracts columns from tabular data. For example, to cut the name field from the first 10 lines of h_pylori.bed we would do the following:

head h_pylori.bed | cut -f 4

The uniq command removes repeated lines from text input. It compares only adjacent lines, so it removes all duplicates only if the input is sorted first, and for that reason it is often combined with the sort command. For example, we could find only the unique entries in the first field of h_pylori.bed:

cat h_pylori.bed | cut -f 1 | sort | uniq

Note that this just produces a single line of output. This is because the H. pylori genome contains just a single chromosome.
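The sketch below shows why the sort step matters, using a made-up list of names in which the duplicates are never next to each other. Run directly, uniq leaves every line in place; after sorting, the duplicates become adjacent and are collapsed. The file and names are hypothetical.

```shell
# Hypothetical input: two distinct names, interleaved.
names=$(mktemp)
printf 'geneA\ngeneB\ngeneA\ngeneB\ngeneA\n' > "$names"

# No two identical lines are adjacent, so uniq removes nothing.
unsorted_count=$(uniq "$names" | wc -l | tr -d ' ')

# Sorting groups identical lines together, so uniq keeps one of each.
sorted_count=$(sort "$names" | uniq | wc -l | tr -d ' ')

echo "without sort: $unsorted_count, with sort: $sorted_count"
# prints: without sort: 5, with sort: 2

rm -f "$names"
```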

Your task

Write a command to count the number of uniquely named entries (using column 4, the name field) in h_pylori.bed.


In [16]:
e10_answer(){
### BEGIN SOLUTION
cat h_pylori.bed | cut -f 4 | sort | uniq | wc -l
### END SOLUTION
}

In [17]:
test_e10


Your answer is correct