Midterm exam (EEEBGU4055)

3/13/2019

Instructions

Use Markdown or code to answer questions as instructed. The exam is open book, so you are allowed and encouraged to use any online resources available. However, you must answer questions on your own. You cannot use chat applications to share answers.

This exam is composed of 15 questions, of which you only need to answer 10. I recommend that you start by browsing the questions to choose which you plan to answer. For the questions you choose not to answer, please leave the answer cell empty.

The exam is due at 12:50pm. You must download the notebook in HTML format at that time and upload it to CourseWorks.

Question 1: Find the RefSeq reference genome of the Brown Rat (Rattus_norvegicus) on an online database.

(1) What are the scaffold and contig N50 size statistics for this genome?

(2) What does the N50 statistic measure?

(3) Why is the N50 scaffold size always larger than the N50 contig size?

In [1]:
# answer here
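
As background for the N50 statistic referenced in this question, the following is a minimal sketch of how N50 is commonly computed from a list of sequence lengths. The function name `n50` and the toy lengths are hypothetical and chosen only for illustration; the values are not taken from any real assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical, not from a real genome).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # 70 for these toy lengths
```

The same function applies to scaffolds or contigs; only the input list of lengths changes.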
Question 2:
(1) Briefly describe what a genome annotation is.

(2) Describe the format of a GFF file and how genome annotation information is represented in it.

(3) Name two types of evidence/data that are used to generate genome annotation information for a GFF file.

In [2]:
## answer here
Question 3: Write Python code to accomplish the following three tasks. Pay careful attention to exactly what each task is asking for.

(1) Write a Python dictionary that represents barcode information that could be used to demultiplex samples. The keys of the dictionary should be sample names (any string you choose) and the values of the dictionary should be unique six base barcode sequences as string objects.

(2) Write a Python function to return a random sequence of DNA of a given length determined by an argument provided to the function. The returned sequence should be a string object.

(3) Write a Python function to return the reverse complement of a sequence that is given to the function as an argument. The returned sequence should be a string object.

In [6]:
# code 1

In [7]:
# code 2

In [8]:
# code 3
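
One possible shape for the three tasks above is sketched below. It is an illustration under stated assumptions, not a definitive solution: the sample names, the barcode sequences, and the function names `random_dna` and `reverse_complement` are all hypothetical, and bases are drawn uniformly from A, C, G, and T.

```python
import random

# Task 1: sample names (hypothetical) mapped to unique six-base barcodes.
barcodes = {
    "sample_A": "ACGTAC",
    "sample_B": "TTGCAA",
    "sample_C": "GATCCG",
}


# Task 2: random DNA of a requested length, returned as a string.
def random_dna(length):
    """Return a random DNA string with `length` bases drawn uniformly."""
    return "".join(random.choice("ACGT") for _ in range(length))


# Task 3: reverse complement, returned as a string.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


print(barcodes)
print(random_dna(12))
print(reverse_complement("AACG"))  # CGTT
```

Using `str.maketrans` keeps the complement step to a single pass over the string; characters outside the table (for example `N`) pass through unchanged.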
Question 4: Write Python code to answer the following questions (this will be graded as part of your answer), using the file: /home/jovyan/ro-data/SRP021469/35236_rex_SRR1754731.fastq.gz as input.

(1) How many reads are in this fastq file?

(2) What are the lengths of reads in the file?

In [9]:
# code here

In [10]:
# markdown here
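
A minimal sketch of one way to count reads and tabulate read lengths in a gzipped FASTQ file follows. It assumes the standard four-line record layout (header, sequence, `+`, qualities) with no wrapped sequence lines; the function name `summarize_fastq` is hypothetical.

```python
import gzip
from collections import Counter


def summarize_fastq(path):
    """Return (number of reads, Counter mapping read length -> count).

    Assumes each record spans exactly four lines, so the sequence
    is always the second line of every group of four.
    """
    n_reads = 0
    lengths = Counter()
    with gzip.open(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # sequence line of the record
                n_reads += 1
                lengths[len(line.strip())] += 1
    return n_reads, lengths
```

It could be called as `summarize_fastq("/home/jovyan/ro-data/SRP021469/35236_rex_SRR1754731.fastq.gz")`, after which the keys of the returned `Counter` list the distinct read lengths present in the file.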
Question 5: A man and a woman who are both carriers of the same mutation that causes Tay-Sachs (an autosomal recessive disease) decide to have children.

(1) What is the probability that their first child will be born with Tay-Sachs?

(2) What is the probability that their second child will be born with Tay-Sachs?

(3) What is the probability, if they have four children, that at least one will have Tay-Sachs?

In [11]:
# answer
Question 6: Answer separately for each image below. What is the most likely mode of inheritance in each pedigree?

(1)

![](https://github.com/eaton-lab/eaton-lab.github.io/raw/master/data/midterm-pedigree1.png)

(2)

![](https://github.com/eaton-lab/eaton-lab.github.io/raw/master/data/midterm-pedigree2.png)

(3)

![](https://github.com/eaton-lab/eaton-lab.github.io/raw/master/data/midterm-pedigree3.png)

In [12]:
# answer
Question 7: Explain in detail how you would find the mutation that causes a completely penetrant autosomal dominant disease.

(1) What population, families, or individuals would you study?

(2) Which individuals would you sequence?

(3) Which sequencing technology would you use and to what coverage?

(4) What would you do with the data once you get it from the sequencer?

(5) What logic would you use to identify the causal mutation and gene?

In [76]:
# answer
Question 8: You were awarded a \$50K grant to sequence a reference genome for your study organism, a species of salamander. You don't know the genome size of your species ahead of time.

(1) What approach could you use to estimate genome size and heterozygosity?

(2) Why do genome size and heterozygosity matter for assembling a genome?

(3) Approximately what size range are salamander genomes, generally?

(4) Is this considered large or small?

(5) How would this affect your plan for spending your \$50K efficiently?

In [77]:
# answer
Question 9: You are interested in studying the evolution of gene A by examining the phylogenetic history of this gene in clade X, which includes many species that vary with respect to a phenotype affected by gene A.

(1) How/Where could you find published available data for gene A in clade X?

(2) What steps are involved in inferring a phylogeny from these data?

(3) If this gene was duplicated at some point during the evolution of clade X, describe the type of phylogenetic patterns that this could cause, while using the proper terms to refer to duplicated gene copies versus copies that are inherited from a common ancestor.

In [78]:
# answer
Question 10: You cross a·b/A·B x a·b/a·b and get 100 offspring: 25 a·B/a·b, 25 A·b/a·b, 25 a·b/a·b, 25 A·B/a·b.

(1) How many centimorgans are there between A(a) and B(b)?

(2) Are A(a) and B(b) from the same or different chromosomes?

Next you cross a·b/A·B x a·b/a·b and get 100 offspring: 10 a·B/a·b, 10 A·b/a·b, 40 a·b/a·b, and 40 A·B/a·b.

(3) How many centimorgans are there between A(a) and B(b)?

(4) Are A(a) and B(b) from the same or different chromosomes?

In [79]:
# answer
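
The arithmetic behind test-cross questions of this kind can be sketched as below. The function name `recombination_fraction` is hypothetical. The sketch assumes that a·b and A·B are the parental gamete classes (as the cross is written), and it uses the simple approximation that map distance in centimorgans is about 100 times the recombination fraction, which only holds for fairly short distances.

```python
def recombination_fraction(parental_counts, recombinant_counts):
    """Fraction of offspring that carry a recombinant gamete."""
    recombinants = sum(recombinant_counts)
    total = recombinants + sum(parental_counts)
    if total == 0:
        raise ValueError("no offspring counted")
    return recombinants / total


# Offspring counts from the two crosses in this question.
first = recombination_fraction([25, 25], [25, 25])
second = recombination_fraction([40, 40], [10, 10])

for label, rf in (("cross 1", first), ("cross 2", second)):
    # A fraction near 0.5 is what independent assortment produces, so it
    # cannot distinguish unlinked loci from loci far apart on one chromosome.
    print(label, "recombination fraction:", rf, "approx. cM:", 100 * rf)
```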
Question 11: In the context of de novo genome assembly:

(1) What is a *de Bruijn* graph and what do the edges and nodes of the graph represent?

(2) Why does a *de Bruijn* graph provide a more efficient way than Hamiltonian graphs for *de novo* assembly of short read data?

(3) Name two things that make *de novo* genome assembly difficult because they introduce ambiguities into the path connecting nodes in a graph.

(4) Name two reasons why *de novo* genome assemblies from short reads are typically not able to assemble full chromosome-scale contigs.

In [80]:
# answer
Question 12:
(1) How many bases do you need to sequence a human genome to a coverage of 30X? Explain how you calculated this.

(2) If the sequencing library of that human is made up of fragments that are all >300 bp and you sequence those fragments from both ends, how many reads (each end counts as a read) do you need to sequence to a coverage of 30X?

(3) If you are just sequencing the exome (the exons / coding sequence) of a human genome (1% of the genome) because that's where most of the mutations that cause disease are located, how many bases do you need to get 60X coverage?

In [81]:
# answer
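
Coverage calculations of this kind reduce to coverage = total sequenced bases / target size. The sketch below works through that arithmetic under two labeled assumptions that the question does not specify: a rounded haploid human genome size of 3 Gb and a read length of 150 bp. The function names are hypothetical.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: rounded haploid human genome size
READ_LENGTH_BP = 150            # assumption: read length is not given in the question


def bases_needed(target_size_bp, coverage):
    """Total bases to sequence for a given mean coverage of a target."""
    return target_size_bp * coverage


def reads_needed(target_size_bp, coverage, read_length_bp):
    """Number of reads (each end of a pair counted separately), rounded up."""
    total = bases_needed(target_size_bp, coverage)
    return -(-total // read_length_bp)  # integer ceiling division


wgs_bases = bases_needed(GENOME_SIZE_BP, 30)
wgs_reads = reads_needed(GENOME_SIZE_BP, 30, READ_LENGTH_BP)
exome_bases = bases_needed(GENOME_SIZE_BP // 100, 60)  # exome taken as 1% of the genome

print(wgs_bases, wgs_reads, exome_bases)
```

Changing either assumed constant rescales the results proportionally, so the same functions cover other genome sizes or read lengths.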
Question 13:
A male mouse of strain A and a female mouse of strain B are crossed, and the female F1 offspring is mated to the same male mouse of strain A to produce a backcrossed mouse.

(1) What is the probability that a given chromosome pair of the backcrossed mouse contains <=50% genetic material inherited from the male mouse of strain A?

(2) What is the probability that a given chromosome pair of the backcrossed mouse contains >50% genetic material inherited from the male mouse of strain A?

In [82]:
# answer
Question 14:
(1) Briefly describe what RAD-seq is.

(2) Describe an example type of study where RAD-seq would be useful.

(3) Approximately how many loci and SNPs would you expect RAD-seq to provide for this study?

(4) If you generated an initial RAD-seq data set and found that it yielded far fewer loci and SNPs than you were hoping, how could you modify the protocol to sample the genome more densely?

In [83]:
# answer
Question 15:
(1) At Metaphase I, one pair of sister chromatids contains genetic material inherited from: (A) only one parent, (B) both parents, (C) either one or both parents.

(2) How many recombination events does each chromosome pair have during meiosis? Give a reasonable range.

(3) What effect does the gene PRDM9 have on recombination in mice?

In [84]:
# answer