This notebook is part of your formal assessment for BC2023. It will not count toward your final mark; however, you must pass this notebook in order to pass BC2023. The pass mark is 5/10.
Complete and submit the notebook exactly as you have done for all previous notebooks. Before you submit you should make sure your answers are correct by running the test code for each. You can also use the Validate button to check the entire notebook.
All except two of the exercises have a test box to allow you to check your answer before you submit. Two questions require short textual answers and are not auto-graded.
If you are having trouble with a question, here are some troubleshooting steps:
Run the Setup Code
In order for this notebook to work properly you need to run the cell below before doing anything else. This will load custom functions and settings required to make the self assessment exercises work.
If you restart your kernel, you will also need to rerun the setup code.
Don't use the cd command
The answers to all self assessment exercises assume that you don't change your directory from the default. You shouldn't ever need to use the cd command to answer an exercise.
In [1]:
# Essential Setup Code : Must be run first.
wget "https://www.dropbox.com/s/uhua9bbsndfbcs8/setup.sh?dl=0" -O setup.sh
source ./setup.sh
Background
The echo command can be used to print text. For example, try the following command in a test cell. It should print "Hello".
echo "Hello"
Your task
Write a command to print the phrase "Keyboard good, mouse bad". Make sure you print the phrase exactly (including correct capitalisation).
In [2]:
e1_answer(){
### BEGIN SOLUTION
echo "Keyboard good, mouse bad"
### END SOLUTION
}
In [3]:
test_e1
Background
First use the ls command to list the files in the current directory. You should see several files, including one called h_pylori.bed
The head command can be used to print the first few lines of a file. For example the following command will list the first 10 lines of h_pylori.bed.
head h_pylori.bed
The number of lines printed by head can be changed using the -n option.
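As a minimal sketch of the -n option, the cell below builds a small throwaway file and prints only its first two lines. The file contents and the function name demo_head are made up for illustration and are not part of the course data.

```shell
# Hypothetical demo: create a four-line temporary file, then print its first two lines.
demo_head() {
    tmp_file=$(mktemp)                                   # throwaway file, not course data
    printf 'line1\nline2\nline3\nline4\n' > "$tmp_file"
    head -n 2 "$tmp_file"                                # -n 2 limits the output to two lines
    rm -f "$tmp_file"
}
demo_head
```

Changing the number after -n changes how many lines are printed.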
Your task
Write a command to list the first 5 lines of the file h_pylori.bed
In [4]:
e2_answer(){
### BEGIN SOLUTION
head -n 5 h_pylori.bed
### END SOLUTION
}
In [5]:
test_e2
Background
The wc command counts lines, words and characters (bytes) in text. By default it prints all three of these counts as well as the name of the file, for example:
wc h_pylori.bed
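To see what each of the three counts means, here is a small sketch that pipes a known piece of text into wc instead of reading a file (so no file name is printed). The sample text and the function names are made up for illustration.

```shell
# Made-up sample: two lines, three words ("a", "b", "c") and six bytes ("a b\n" is 4, "c\n" is 2).
wc_all() {
    printf 'a b\nc\n' | wc       # prints lines, words and bytes, in that order
}
wc_words_only() {
    printf 'a b\nc\n' | wc -w    # -w restricts the output to the word count
}
wc_all
wc_words_only
```

The exact spacing of the output differs between systems, but the numbers should be 2, 3 and 6 for the first command and 3 for the second.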
Your task
Using the appropriate option to the wc command, write a command that prints just the number of lines and the name of the file h_pylori.bed.
In [6]:
e3_answer(){
### BEGIN SOLUTION
wc -l h_pylori.bed
### END SOLUTION
}
In [7]:
test_e3
Background
The pipe symbol, |, can be used to send the output of one command (on the left) to the input of another command (on the right). For example, I could pipe the output of echo to wc to count the number of characters in the word "Pikachu".
echo "Pikachu" | wc -c
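One detail worth knowing when reading the result: echo adds a newline after the text, and wc -c counts it. The sketch below compares echo with printf, which does not add a newline when used this way. The function names are made up for illustration.

```shell
# echo appends a newline, which wc -c counts as one extra byte.
count_with_echo() {
    echo "Pikachu" | wc -c
}
# printf '%s' writes the word without a trailing newline.
count_with_printf() {
    printf '%s' "Pikachu" | wc -c
}
count_with_echo      # 8: seven letters plus the newline
count_with_printf    # 7: the letters only
```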
Your task
Write a command which prints just the number of characters in the first 10 lines of the file h_pylori.bed. (You will need to pipe head to wc).
In [8]:
e4_answer(){
### BEGIN SOLUTION
head -n 10 h_pylori.bed | wc -c
### END SOLUTION
}
In [9]:
test_e4
Background
Use the ls command to list files in the current directory again. This time look for a file called h_pylori.faa.
Now use the head command to look at the first few lines of this file
head h_pylori.faa
Your task
Answer the following question in the answer field below. This answer is not auto-graded and you may simply write your response as free form text.
What is the format of the file h_pylori.faa?
Note: Your answer should be a bioinformatics-specific format, not simply "text".
In [ ]:
Background
The grep command can be used to search for patterns in a stream of text. For example the following command will find the header lines from all proteins with pathogen in the name from within the file h_pylori.faa.
grep 'pathogen' h_pylori.faa
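grep prints every input line that contains the pattern and drops the rest. A minimal sketch on three made-up lines of text (these are illustrative and not taken from h_pylori.faa; the function name grep_demo is also hypothetical):

```shell
# Made-up description lines, not taken from h_pylori.faa.
grep_demo() {
    printf 'cag pathogenicity island protein\nflagellin A\nurease subunit alpha\n' \
        | grep 'pathogen'    # keeps only lines containing the text "pathogen"
}
grep_demo
```

Only the first of the three lines contains "pathogen", so only that line is printed.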
Your task
The file h_pylori.faa contains one entry for every protein encoded on the Helicobacter pylori genome.
Write a command that outputs just this number (i.e. the number of proteins encoded on the H. pylori genome).
In [10]:
e6_answer(){
### BEGIN SOLUTION
grep '>' h_pylori.faa | wc -l
### END SOLUTION
}
In [11]:
test_e6
Background
Use the grep command to find the definition line of the entry for the cag14 gene product in the file h_pylori.faa.
grep 'cag14' h_pylori.faa
At the start of this text, you should see the accession number corresponding to this protein; it is NP_207331.1.
Your task
Use grep to search the file h_pylori.bed to find the bed entry for the gene that encodes the cag14 protein. (You will need to use the information in the background above for this).
What are the start and end positions of this gene?
Note that the bed format was described in workshop 4 (04_bedtools).
In [ ]:
Background
The sort command can be used to sort tabular files. By default it sorts on the first column, but this can be changed by specifying a column to sort on using -k <column>. By default sort also orders lines alphabetically, but the -n option switches to a numerical sort. Putting this together, we can sort the h_pylori.bed file according to start coordinate as follows:
sort -k 2 -n h_pylori.bed
In the command above -k 2 specifies sorting on the second column (which contains start coordinates) and -n specifies a numerical sort.
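The difference between alphabetical and numerical sorting is easiest to see on a tiny example. The sketch below uses three made-up tab-separated rows (not real coordinates) and hypothetical function names. It also uses cut -f 2 purely to display the second column after sorting.

```shell
# Made-up three-column, tab-separated rows (not real H. pylori coordinates).
demo_rows() {
    printf 'chr\t100\t200\nchr\t9\t50\nchr\t25\t80\n'
}
# Alphabetical sort on column 2: "100" comes before "25", which comes before "9".
sort_alpha() {
    demo_rows | sort -k 2 | cut -f 2
}
# Numerical sort on column 2: 9, then 25, then 100.
sort_numeric() {
    demo_rows | sort -k 2 -n | cut -f 2
}
sort_alpha
sort_numeric
```

Without -n the values are compared character by character, which is why 100 sorts ahead of 9.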
Your task
Write a command using sort and head (combined with a pipe) that outputs only the bed entry corresponding to the gene with the smallest start coordinate in the H. pylori genome.
Note: You may see an error like this
sort: write failed: 'standard output': Broken pipe
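You can safely ignore this. It happens because head stops reading once it has printed the lines it needs, so sort can no longer write the rest of its output.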
In [18]:
e8_answer(){
### BEGIN SOLUTION
cat h_pylori.bed | sort -k 2 -n | head -n 1
### END SOLUTION
}
In [19]:
test_e8
In [20]:
e9_answer(){
### BEGIN SOLUTION
cat h_pylori.bed | sort -k 3 -n -r | head -n 1
### END SOLUTION
}
In [21]:
test_e9
Background
The cut command extracts columns from tabular data. For example, to cut the name field from the first 10 lines of h_pylori.bed we would do the following:
head h_pylori.bed | cut -f 4
The uniq command removes repeated lines from text input. It only compares adjacent lines, so all duplicates are removed only if the input is sorted first; for that reason it is often combined with the sort command. For example, we could find only the unique entries in the first field of h_pylori.bed:
cat h_pylori.bed | cut -f 1 | sort | uniq
Note that this just produces a single line of output. This is because the H. pylori genome contains just a single chromosome.
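To illustrate why sort comes before uniq, here is a minimal sketch on three made-up tab-separated rows in which the same name appears twice, but not on neighbouring lines. The data and function names are hypothetical.

```shell
# Made-up two-column, tab-separated rows; "geneB" appears twice but not on adjacent lines.
demo_names() {
    printf 'chr\tgeneB\nchr\tgeneA\nchr\tgeneB\n'
}
# Without sort, uniq sees no adjacent duplicates and keeps all three lines.
uniq_unsorted() {
    demo_names | cut -f 2 | uniq
}
# After sort, the two "geneB" lines are adjacent and collapse into one.
uniq_sorted() {
    demo_names | cut -f 2 | sort | uniq
}
uniq_unsorted
uniq_sorted
```

The first pipeline prints three lines and the second prints two, even though both read the same input.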
Your task
Write a command to count the number of uniquely named entries (using column 4, the name field) in h_pylori.bed
In [16]:
e10_answer(){
### BEGIN SOLUTION
cat h_pylori.bed | cut -f 4 | sort | uniq | wc -l
### END SOLUTION
}
In [17]:
test_e10