Simplify each of the following expressions and write in words the probability law or definition being used to go from the left-hand side to the right-hand side. There should be no intermediate steps. See the example question. Answer the following questions in Markdown.
$\DeclareMathOperator{\P}{P}$ $\DeclareMathOperator{\E}{E}$
$$\sum_{\mathcal{X}}\P(x|y) = $$
$$\sum_{\mathcal{Y}}\P(x|y)P(y) = $$
$$\sum_{\mathcal{Y}}\P(x,y) = $$
$$\frac{\P(x,y)}{\P(y)} = $$
If $y$ and $x$ are independent: $$\P(x|y) = $$
Using the path example from HW 1, answer the following questions symbolically. Recall that the sample space is pathways and see HW 1 Key for the list of paths. Given that:
Ebola virus disease (EVD) cases have been recently observed in many countries due to an ongoing West African epidemic. EVD is particularly dangerous due to its high fatality rate: about 50%. The standard clinical diagnosis technique, which detects viral RNA or human antibodies, requires 5 hours. You have developed a new blood test that is 95% accurate and only requires 2 minutes. Good job. Accuracy here means if the patient has EVD, your test says they have EVD 95% of the time. Similarly, if the patient doesn't have EVD, your test says they do not have EVD 95% of the time. Answer these questions first symbolically, then in Python.
[5] To test your EVD screening method, you visit a country where only 1% of the population has EVD and randomly test people on the street. Assuming that the probability someone has EVD is 1%, if your method returns a positive, what is the actual probability the person has EVD? Hint: You cannot compute the conditional which the question asks for directly. Look at your notes for how to rearrange conditionals.
[2] Your EVD screening method has passed preliminary clinical trials. Good job. Now, to reduce false positives you need to decide how many times you repeat the screening method before treating a patient for EVD. For there to be a 99.9% probability the patient has EVD, how many screenings must you do? Assume the screenings are conditionally independent and that you only consider someone to be EVD positive after $n$ positive trials. Hint: See the useful equation below, which holds if $X$ and $Y$ are conditionally independent given $Z$, even though neither $X$ nor $Y$ is independent of $Z$. This is true, for example, if you have multiple independent trials which are all conditioned on the same thing. $$P(X,Y,Z) = P(X,Y|Z)P(Z) = P(X | Y,Z) P(Y | Z) P(Z) = P(X | Z) P(Y | Z) P(Z)$$ Also, because the independence is only conditional, in general $P(X,Y) \neq P(X) P(Y)$, but $\sum_Z P(X | Z) P(Y | Z) P(Z) = P(X,Y)$ is true using the equation above.
[2] A patient was tested for EVD, tested positive, then was tested again and that test was negative. Per the protocol above, the patient was released. The local news picked up the story and the general public believes that one positive and one negative test indicates there was a 50/50 chance the patient had EVD. There is panic and protestors outside hospitals. What is the actual probability the patient had EVD? Note: There is actually a balance between questions 2 and 3 that we will revisit in a later homework.
[1] Is the assumption that repeat screenings are independent valid? To put another way, if we see a false positive or false negative on the first trial are we equally likely to see one on the second trial? Why or why not?
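The conditional rearrangement the hints point to is Bayes' theorem, $P(\textrm{EVD} | +) = P(+ | \textrm{EVD})P(\textrm{EVD}) / P(+)$, with $P(+)$ expanded by marginalizing over disease status. The following is a minimal sketch of that calculation, assuming the screenings are conditionally independent given disease status as the problem states. The 1% prior and 95% accuracy come from the problem; the function name `posterior` and its arguments are illustrative choices, not part of the assignment.

```python
# Sketch of Bayes' theorem for the screening questions.
# prior and accuracy are taken from the problem statement;
# the names below are illustrative, not prescribed by the assignment.
prior = 0.01      # P(EVD)
accuracy = 0.95   # P(+ | EVD) and also P(- | no EVD)


def posterior(n_pos, n_neg, prior=prior, accuracy=accuracy):
    """P(EVD | n_pos positive and n_neg negative screenings).

    Assumes the screenings are conditionally independent given
    whether or not the patient has EVD.
    """
    # Likelihood of the observed results if the patient has EVD
    like_evd = accuracy ** n_pos * (1 - accuracy) ** n_neg
    # Likelihood of the same results if the patient does not have EVD
    like_healthy = (1 - accuracy) ** n_pos * accuracy ** n_neg
    # Marginal probability of the results (sum over disease status)
    evidence = like_evd * prior + like_healthy * (1 - prior)
    return like_evd * prior / evidence


# One positive screening
single_positive = posterior(1, 0)
print(single_positive)

# Smallest number of consecutive positives giving at least 99.9%
n = 1
while posterior(n, 0) < 0.999:
    n += 1
print(n)

# One positive followed by one negative
mixed = posterior(1, 1)
print(mixed)
```

Because the test is equally accurate for both outcomes, one positive and one negative result cancel in the likelihood ratio, so the last value falls back to the prior rather than 50%.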
Answer the following problems in Python using booleans
Is $2^{16}$ greater than $10^4$?
Using the `frexp` function, show how the floating point number $0.1$ is represented and rebuild the number from its pieces. Use base 10
Is $0.1 + 0.2$ exactly equal to $(1.0 + 2.0) / 10.0$ with floats? Why not?
When creating permutations from a combination of elements where you can use each element once, the number of permutations is $n!$. For example, if you have three letters and you can use each letter only once, the number of permutations is $3\times 2 \times 1$. How many ways can a deck of cards be shuffled? Are there more stars in the universe ($10^{23}$) or deck shuffling permutations?
The next cell contains a joint probability mass function for $x$ and $y$. $x$ is the first number and $y$ is the second. You may access elements like this: `P[0,2]`. `P[0,2]` is the probability that $x=0$ and $y=2$. Demonstrate that the two random variables are not independent.
Calculate the marginal $\P(x=2)$, where $x$ is the first index using the next cell's joint probability mass function.
Calculate the conditional $\P(x=2 | y = 1)$, where $x$ is the first index using the next cell's joint probability mass function.
In [3]:
#This is loading the data for question 3.5 through 3.8.
#Execute this cell and use new cells below. Do not answer in this cell!
import numpy as np
P = np.zeros( (3,3) )
P[0,0] = 1. / 9
P[0,1] = 1. / 9
P[0,2] = 0.
P[1,0] = 1. / 3
P[1,1] = 0
P[1,2] = 1. / 6
P[2,0] = 1. / 9
P[2,1] = 1. / 18
P[2,2] = 1. / 9
In [6]:
#Example of how to calculate P(x = 0)
px0 = P[0,0] + P[0,1] + P[0,2]
print(px0)
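The same marginalization can be written for every value at once by summing the joint table along one axis. The sketch below is one possible approach, not the required one: it redefines the joint PMF from the data cell so that it runs on its own, and the names `px`, `py`, `independent`, and `p_x2_given_y1` are illustrative. It checks independence by comparing the joint against the product of its marginals and uses the definition $P(x | y) = P(x,y) / P(y)$ for the conditional.

```python
import numpy as np

# Same joint PMF as the data cell; rows index x, columns index y.
P = np.array([[1. / 9, 1. / 9, 0.],
              [1. / 3, 0., 1. / 6],
              [1. / 9, 1. / 18, 1. / 9]])

px = P.sum(axis=1)  # marginal of x: sum over y (across columns)
py = P.sum(axis=0)  # marginal of y: sum over x (down rows)

# Independence would require P(x, y) = P(x) P(y) for every pair (x, y)
independent = np.allclose(P, np.outer(px, py))
print(independent)

# Marginal P(x = 2)
print(px[2])

# Conditional P(x = 2 | y = 1) = P(x = 2, y = 1) / P(y = 1)
p_x2_given_y1 = P[2, 1] / py[1]
print(p_x2_given_y1)
```

A single counterexample is enough to rule out independence: `P[0,2]` is zero while both `px[0]` and `py[2]` are nonzero.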
In [3]:
#Question 3.1
print(2**16 > 10**4)
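The remaining boolean questions can be approached in the same style. The sketch below is illustrative rather than a model answer, and the variable names are assumptions. Note that `math.frexp` returns a mantissa and a base-2 exponent, so the number is rebuilt here with a power of 2 and the pieces are simply printed in decimal.

```python
import math

# frexp splits a float x into (m, e) such that x == m * 2**e,
# with 0.5 <= |m| < 1. The exponent is base 2, not base 10.
m, e = math.frexp(0.1)
print(m, e)

# Rebuild the number from its pieces; scaling by a power of 2 is exact
rebuilt = m * 2.0 ** e
print(rebuilt == 0.1)

# Neither 0.1 nor 0.2 has an exact binary representation, so their
# rounding errors accumulate differently than in (1.0 + 2.0) / 10.0
same = (0.1 + 0.2 == (1.0 + 2.0) / 10.0)
print(same)

# Python integers have arbitrary precision, so 52! is computed exactly
more_shuffles = math.factorial(52) > 10 ** 23
print(more_shuffles)
```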
Using Python, Markdown, and equations, describe a random variable for the sum of two dice when rolled and create a Python cell where a user may enter the sum and your cell prints the probability of that sample. For the Python portion, see the cell below for an example of the Python program and see here for an example of how your answer should look (yours should be much shorter though). Hint: There is a simple equation for the probability. Try writing out a few terms, like P(2), P(3), P(7), P(11), to derive the equation.
Rubric: [3] explanation/equation, [3] for the Python program.
In [11]:
roll = 3 #Enter the value of a single die roll here
P = 1 / 6. #State space is 6, uniform, and each roll has a single permutation.
print('The probability of rolling a {} is {}'.format(roll, P))
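For the sum of two dice, counting the ordered pairs that give each total suggests the closed form $P(s) = \frac{6 - |s - 7|}{36}$ for $s \in \{2, \ldots, 12\}$: there is one way to roll a 2, six ways to roll a 7, and the count falls off symmetrically on either side. The sketch below is one possible shape for the program, with `total` and `prob_sum` as illustrative names.

```python
total = 7  # Enter the sum of the two dice here (2 through 12)


def prob_sum(total):
    """Probability that two fair six-sided dice sum to `total`."""
    # Sums outside 2..12 cannot occur
    if total < 2 or total > 12:
        return 0.0
    # 6 - |total - 7| ordered pairs out of 36 give this sum
    return (6 - abs(total - 7)) / 36.0


print('The probability of rolling a sum of {} is {:.4f}'.format(
    total, prob_sum(total)))
```

Summing `prob_sum` over 2 through 12 gives 1, which is a quick check of the normalization.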
Proteins are made up of amino acids in a polymer chain. The list of amino acids in that chain is called a protein sequence. There are 20 amino acids and thus 20 possibilities at each sequence position. By comparing sequences of proteins from different species, we can predict the evolutionary history of proteins and organisms based on sequence similarity. Furthermore, by seeing which parts of a protein sequence do not change, we can predict which specific amino acids are related to the function of a protein. In this problem, we will build two models of a protein representing two possible evolutionary histories. Model 1 will have important amino acids at positions 23, 24 and 25. Model 2 will have important amino acids at positions 24, 25 and 99. Although we will not do this now, we can use these two models to assign sequences to different lineages.
Your task is to create two Python cells, similar to Problem 4, which output the probability of a sample entered at the top. $s$, the amino acid identity and $i$, the position, are entered at the top and your program should determine the probability of seeing that amino acid at that specific position. If the position is important (23,24,25), in model 1 the probability of amino acids V, A, I, and L is equal and four times the probability of each other amino acid, which have equal probabilities. In model 2, if the position is 24 or 25, the same rule applies. If the position is 99 in model 2, the probability of D and E is equal and one hundred times the probability of each other amino acid, which again have equal probability. If a position is unimportant, all amino acids have equal probability. Hint: Combine these algebraic statements ALONG with the normalization condition to get a probability mass function
Your answer should be two Markdown and two Python cells. The Markdown should contain a mathematical description of the two models with equations. The Python should be well-documented, contain conditionals, have an $s$ and $i$ variable at the top for a user to enter the values and it should print out a clear message repeating the $s$, $i$ variable along with the probability.
Rubric: [5] converting text into probability equations, [1] variable assignment, [2] for conditionals, [4] probabilities, [2] string formatting, [1] correct answer
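As a sketch of how the normalization condition turns the stated ratios into probabilities: if $k$ favored amino acids each have probability $wp$ and the remaining $20 - k$ each have probability $p$, then $kwp + (20 - k)p = 1$, so $p = 1 / (kw + 20 - k)$. With $k = 4$, $w = 4$ this gives $p = 1/32$, and with $k = 2$, $w = 100$ it gives $p = 1/218$. The code below combines both models in one block for brevity, whereas the assignment asks for two cells. The helper `weighted_prob`, the function names, and the one-letter alphabet string are illustrative assumptions.

```python
s = 'V'   # amino acid identity (one-letter code)
i = 24    # sequence position

# The 20 standard amino acids as one-letter codes (assumed alphabet)
AMINO_ACIDS = set('ACDEFGHIKLMNPQRSTVWY')


def weighted_prob(s, favored, weight):
    """Probability of s when each favored amino acid is `weight`
    times as likely as each of the others."""
    k = len(favored)
    # Normalization: k * weight * p + (20 - k) * p = 1
    p = 1.0 / (k * weight + (len(AMINO_ACIDS) - k))
    return weight * p if s in favored else p


def model1(s, i):
    """Model 1: positions 23, 24 and 25 favor V, A, I and L."""
    if i in (23, 24, 25):
        return weighted_prob(s, set('VAIL'), 4)
    return 1.0 / len(AMINO_ACIDS)  # unimportant position: uniform


def model2(s, i):
    """Model 2: positions 24 and 25 favor V, A, I, L; 99 favors D, E."""
    if i in (24, 25):
        return weighted_prob(s, set('VAIL'), 4)
    if i == 99:
        return weighted_prob(s, set('DE'), 100)
    return 1.0 / len(AMINO_ACIDS)  # unimportant position: uniform


print('Model 1: P(s={}, i={}) = {:.4f}'.format(s, i, model1(s, i)))
print('Model 2: P(s={}, i={}) = {:.4f}'.format(s, i, model2(s, i)))
```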
In [ ]: