This notebook is part of your formal assessment for BC2023. It will not count toward your final mark, but you must pass it in order to pass BC2023. The pass mark is 5/10.
Complete and submit the notebook exactly as you have done for all previous notebooks. Before you submit, make sure your answers are correct by running the test code for each exercise. You can also use the Validate button to check the entire notebook.
All except two of the exercises have a test box to allow you to check your answer before you submit. Two questions require short textual answers and are not auto-graded.
If you are having trouble with a question, here are some troubleshooting steps:
Run the Setup Code
In order for this notebook to work properly, you need to run the setup code cell before doing anything else. This will load custom functions and settings required to make the self-assessment exercises work.
If you restart your kernel, you will also need to rerun the setup code.
Don't use the cd command
The answers to all self-assessment exercises assume that you don't change your directory from the default. You shouldn't ever need to use the cd command to answer an exercise.
In [1]:
# Essential Setup Code : Must be run first.
wget "https://www.dropbox.com/s/uhua9bbsndfbcs8/setup.sh?dl=0" -O setup.sh
source ./setup.sh
Background
The echo command can be used to print text. For example, try the following command in a test cell. It should print "Hello".
echo "Hello"
Your task
Write a command to print the phrase "Keyboard good, mouse bad". Make sure you print the phrase exactly (including correct capitalisation).
In [2]:
e1_answer(){
### BEGIN SOLUTION
echo "Keyboard good, mouse bad"
### END SOLUTION
}
In [3]:
test_e1
Background
First use the ls command to list the files in the current directory. You should see several files, including one called h_pylori.bed.
The head command can be used to print the first few lines of a file. For example, the following command will print the first 10 lines of h_pylori.bed.
head h_pylori.bed
The number of lines printed by head can be changed using the -n option.
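As a small illustration of the -n option, the sketch below builds a throwaway file of numbered lines and compares the default output of head with a shortened one. The file name demo_lines.txt and the temporary directory are made up for this example and are not part of the notebook's files.

```shell
# Create a scratch directory and a 20-line demo file (hypothetical name).
demo_dir=$(mktemp -d)
seq 1 20 > "$demo_dir/demo_lines.txt"

# With no options, head prints the first 10 lines.
default_lines=$(head "$demo_dir/demo_lines.txt")
echo "$default_lines"

# With -n 3, head prints only the first 3 lines.
first_three=$(head -n 3 "$demo_dir/demo_lines.txt")
echo "$first_three"

# Remove the scratch directory again.
rm -r "$demo_dir"
```

The same pattern works for any line count: the number after -n is how many lines are kept from the top of the file.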
Your task
Write a command to list the first 5 lines of the file h_pylori.bed
In [4]:
e2_answer(){
### BEGIN SOLUTION
head -n 5 h_pylori.bed
### END SOLUTION
}
In [5]:
test_e2
Background
The wc command counts lines, words and characters in text. By default it prints all three of these counts as well as the name of the file, for example:
wc h_pylori.bed
Your task
By using the appropriate option to the wc command, write a command that prints just the number of lines and the name of the file for the file h_pylori.bed.
In [6]:
e3_answer(){
### BEGIN SOLUTION
wc -l h_pylori.bed
### END SOLUTION
}
In [7]:
test_e3
Background
The pipe symbol, |, can be used to send the output of one command (on the left) to the input of another command (on the right). For example, I could pipe the output of echo to wc to count the number of characters that echo prints for the word "Pikachu" (the word itself plus the newline that echo appends).
echo "Pikachu" | wc -c
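To see exactly what is being counted in a pipeline like this, the following sketch compares echo with printf. This is only an illustration: the variable names are made up, and printf '%s' is used here simply because it writes its argument without a trailing newline.

```shell
# echo appends a newline, so wc -c counts the 7 letters plus 1 newline.
with_newline=$(echo "Pikachu" | wc -c)

# printf '%s' writes the word with no trailing newline, so only 7 characters are counted.
without_newline=$(printf '%s' "Pikachu" | wc -c)

echo "echo   | wc -c gives: $with_newline"
echo "printf | wc -c gives: $without_newline"
```

In both cases wc never sees a file: it reads whatever the command on the left of the pipe writes to its output.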
Your task
Write a command which prints just the number of characters in the first 10 lines of the file h_pylori.bed. (You will need to pipe head to wc.)
In [8]:
e4_answer(){
### BEGIN SOLUTION
head -n 10 h_pylori.bed | wc -c
### END SOLUTION
}
In [9]:
test_e4
Background
Use the ls command to list files in the current directory again. This time look for a file called h_pylori.faa.
Now use the head command to look at the first few lines of this file:
head h_pylori.faa
Your task
Answer the following question in the answer field below. This answer is not auto-graded and you may simply write your response as free form text.
What is the format of the file h_pylori.faa?
Note: Your answer should be a bioinformatics-specific format, not simply "text".
In [ ]:
Background
The grep command can be used to search for patterns in a stream of text. For example, the following command will find the header lines of all proteins with pathogen in the name within the file h_pylori.faa.
grep 'pathogen' h_pylori.faa
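The behaviour of grep can be tried out on a tiny made-up file before running it on real data. In the sketch below, demo.faa, its three records and the search word toxin are all invented for illustration. They are not taken from h_pylori.faa.

```shell
# Create a scratch directory holding a tiny invented protein file.
demo_dir=$(mktemp -d)
cat > "$demo_dir/demo.faa" <<'EOF'
>demo_1 hypothetical toxin protein
MKT
>demo_2 ribosomal protein
MAA
>demo_3 toxin transporter
MGG
EOF

# grep prints every whole line that contains the pattern.
# Here only the two header lines mentioning "toxin" match.
toxin_headers=$(grep 'toxin' "$demo_dir/demo.faa")
echo "$toxin_headers"

# Remove the scratch directory again.
rm -r "$demo_dir"
```

Note that grep reports matching lines, not matching words, so the full header line is printed each time.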
Your task
The file h_pylori.faa contains one entry for every protein encoded on the Helicobacter pylori genome.
Write a command that outputs just this number (i.e. the number of proteins encoded on the H. pylori genome).
In [10]:
e6_answer(){
### BEGIN SOLUTION
grep '>' h_pylori.faa | wc -l
### END SOLUTION
}
In [11]:
test_e6
Background
Use the grep command to find the definition line of the entry for the cag14 gene product in the file h_pylori.faa.
grep 'cag14' h_pylori.faa
At the start of this text you should see the accession number corresponding to this protein: NP_207331.1
Your task
Use grep to search the file h_pylori.bed to find the bed entry for the gene that encodes the cag14 protein. (You will need to use the information in the background above for this.)
What are the start and end positions of this gene?
Note that the bed format was described in workshop 4, 04_bedtools.
In [ ]:
Background
The sort command can be used to sort tabular files. By default it sorts lines starting from the first column, but this can be changed by specifying a column to sort on using -k <column>. By default sort also orders lines alphabetically, but this can be changed to a numerical sort using the -n option. Putting this together, we can sort the h_pylori.bed file according to start coordinate as follows:
sort -k 2 -n h_pylori.bed
In the command above, -k 2 specifies sorting on the second column (which contains start coordinates) and -n specifies a numerical sort.
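The difference that -n makes is easiest to see on a few rows where text order and numeric order disagree. The sketch below uses a three-line, tab-separated file with invented contents (demo.bed, chrDemo and the gene names are all hypothetical). LC_ALL=C is set for the text sort only so that the character-by-character ordering is predictable in this demo.

```shell
# Create a scratch directory with a tiny tab-separated demo file (invented data).
demo_dir=$(mktemp -d)
printf 'chrDemo\t100\t200\tgeneA\nchrDemo\t9\t50\tgeneB\nchrDemo\t25\t80\tgeneC\n' > "$demo_dir/demo.bed"

# Without -n, column 2 is compared as text, character by character:
# "100" comes before "25", which comes before "9", because '1' < '2' < '9'.
text_sorted=$(LC_ALL=C sort -k 2 "$demo_dir/demo.bed")
echo "$text_sorted"

# With -n, column 2 is compared as a number: 9, then 25, then 100.
numeric_sorted=$(sort -k 2 -n "$demo_dir/demo.bed")
echo "$numeric_sorted"

# Remove the scratch directory again.
rm -r "$demo_dir"
```

Forgetting -n on a coordinate column is a common source of subtly wrong results, since the output still looks sorted at a glance.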
Your task
Write a command using sort and head (combined with a pipe) that outputs only the bed entry corresponding to the gene with the smallest start coordinate in the H. pylori genome.
Note: You may see an error like this:
sort: write failed: 'standard output': Broken pipe
This is something you can safely ignore.
In [18]:
e8_answer(){
### BEGIN SOLUTION
cat h_pylori.bed | sort -k 2 -n | head -n 1
### END SOLUTION
}
In [19]:
test_e8
In [20]:
e9_answer(){
### BEGIN SOLUTION
cat h_pylori.bed | sort -k 3 -n -r | head -n 1
### END SOLUTION
}
In [21]:
test_e9
Background
The cut command extracts columns from tabular data. For example, to cut the name field from the first 10 lines of h_pylori.bed we would do the following:
head h_pylori.bed | cut -f 4
The uniq command removes repeated lines from text input. It only collapses duplicates that are adjacent to each other, so the input usually needs to be sorted first, and for that reason it is often combined with the sort command. For example, we could find only the unique entries in the first field of h_pylori.bed:
cat h_pylori.bed | cut -f 1 | sort | uniq
Note that this just produces a single line of output. This is because the H. pylori genome contains just a single chromosome.
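The reason uniq is normally paired with sort can be checked with a few lines of made-up input. The sketch below feeds the same four lines (invented for this example) to uniq with and without sorting first.

```shell
# Four input lines in which the duplicates are NOT next to each other.
# uniq alone leaves them untouched because it only compares neighbouring lines.
without_sort=$(printf 'b\na\nb\na\n' | uniq)
echo "$without_sort"

# Sorting first places identical lines side by side, so uniq can collapse them.
with_sort=$(printf 'b\na\nb\na\n' | sort | uniq)
echo "$with_sort"
```

Without the sort step all four lines come back unchanged. With it, only one copy of each distinct line remains.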
Your task
Write a command to count the number of uniquely named entries (using column 4, the name field) in h_pylori.bed
In [16]:
e10_answer(){
### BEGIN SOLUTION
cat h_pylori.bed | cut -f 4 | sort | uniq | wc -l
### END SOLUTION
}
In [17]:
test_e10