Simplify the following expressions and write in words the probability law or definition being used to go from the left-hand side to the right-hand side. There should be no intermediate steps. See the example question. Answer the following questions in Markdown.
$\DeclareMathOperator{\P}{P}$ $\DeclareMathOperator{\E}{E}$
$$\sum_{\mathcal{X}}\P(x|y) = $$
$$\sum_{\mathcal{Y}}\P(x|y)P(y) = $$
$$\sum_{\mathcal{Y}}\P(x,y) = $$
$$\frac{\P(x,y)}{\P(y)} = $$
If $y$ and $x$ are independent: $$\P(x|y) = $$
1.1 $\sum_{\mathcal{X}}\P(x|y) = 1$, due to the Law of Total Probability.
1.2 $\sum_{\mathcal{Y}}\P(x|y)\P(y) = \P(x)$. Marginalization of conditional
1.3 $\sum_{\mathcal{Y}}\P(x,y) = \P(x)$. Definition of marginal
1.4 $\frac{\P(x,y)}{\P(y)} = \P(x|y)$. Definition of conditional
1.5 $\P(x|y) = \P(x)$. Definition of independence
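The five identities above can be checked numerically on a small joint distribution. The sketch below uses a made-up 2x3 joint table (the numbers and variable names are assumptions for illustration, not part of the homework) and verifies 1.1 through 1.5 with NumPy.

```python
import numpy as np

# Hypothetical joint pmf P(x, y): rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.05, 0.25]])

p_x = joint.sum(axis=1)        # 1.3: marginal P(x) = sum_y P(x, y)
p_y = joint.sum(axis=0)        # marginal P(y) = sum_x P(x, y)
cond_x_given_y = joint / p_y   # 1.4: P(x | y) = P(x, y) / P(y), column by column

col_sums = cond_x_given_y.sum(axis=0)                # 1.1: should be 1 for every y
recovered_p_x = (cond_x_given_y * p_y).sum(axis=1)   # 1.2: should equal P(x)

# 1.5: build a joint that is independent by construction, then condition on y.
independent_joint = np.outer(p_x, p_y)
cond_if_independent = independent_joint / p_y        # every column should equal P(x)

print('sum_x P(x|y) for each y: {}'.format(col_sums))
print('P(x) from the joint:     {}'.format(p_x))
print('sum_y P(x|y) P(y):       {}'.format(recovered_p_x))
print('P(x|y) when independent:\n{}'.format(cond_if_independent))
```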
Using the path example from HW 1, answer the following questions symbolically. Recall that the sample space is pathways and see HW 1 Key for the list of paths. Given that:
2.1 $$n=6,\: Q=18,\: \P=\frac{1}{3}$$
2.2 $$n=1,\: Q=4,\: \P=\frac{1}{4}$$
2.3 $$n=2,\: Q=18,\: \P=\frac{1}{9}$$
2.4 $$n=5,\: Q=11,\: \P=\frac{5}{11}$$
2.5 $$n=1,\: Q=2,\: \P=\frac{1}{2}$$
2.6 $$n=4,\: Q=18,\: \P=\frac{4}{18}$$
2.7
$\P(X = 2) = \frac{1}{3} \neq \P(X = 2\, | \, Y = A) \Rightarrow$ $X$ and $Y$ are not independent.
$\P(Z | Y = A) = 1 \neq \P(Z) \Rightarrow$ $Z$ and $Y$ are not independent.
By the transitive property $X$ and $Z$ are not independent.
Ebola virus disease (EVD) cases have been recently observed in many countries due to an ongoing West African epidemic. EVD is particularly dangerous due to its high fatality rate: about 50%. The standard clinical diagnosis technique which detects viral RNA or human antibodies requires 5 hours. You have developed a new blood test that is 95% accurate and only requires 2 minutes. Good job. Accuracy here means if the patient has EVD, your test says they have EVD 95% of the time. Similarly, if the patient doesn't have EVD, your test says they do not have EVD 95% of the time. Answer these questions first symbolically then in Python
[5] To test your EVD screening method, you visit a country where only 1% of the population has EVD and randomly test people on the street. Assuming that the probability someone has EVD is 1%, if your method returns a positive, what is the actual probability the person has EVD? Hint: You cannot compute the conditional which the question asks for directly. Look at your notes for how to rearrange conditionals.
[2] Your EVD screening method has passed preliminary clinical trials. Good job. Now, to reduce false positives you need to decide how many times you repeat the screening method before treating a patient for EVD. For there to be a 99.9% probability the patient has EVD, how many screenings must you do? Assume the screenings are conditionally independent and that you only consider someone to be EVD positive after $n$ positive trials. Hint: See the useful equation below, which holds if $X$ and $Y$ are independent of each other given $Z$, but neither $X$ and $Z$ nor $Y$ and $Z$ are independent (thus $X$ and $Y$ are conditionally independent). This is true, for example, if you have multiple independent trials which are conditioned on something: $$P(X,Y,Z) = P(X,Y|Z)P(Z) = P(X | Y,Z) P(Y | Z) P(Z) = $$ $$P(X | Z) P(Y | Z) P(Z)$$ Also, because the independence is only conditional, $P(X,Y) \neq P(X) P(Y)$, but $\sum_Z P(X | Z) P(Y | Z) P(Z) = P(X,Y)$ is true using the equation above.
[2] A patient was tested for EVD, tested positive, then was tested again and that test was negative. Per the protocol above, the patient was released. The local news picked up the story and the general public believes that passing and failing a test indicates there was a 50/50 chance the patient had EVD. There is panic and protestors outside hospitals. What is the actual probability the patient had EVD? Note: There is actually a balance between question 2 and 3 that we will revisit in a later homework.
[1] Is the assumption that repeat screenings are independent valid? To put another way, if we see a false positive or false negative on the first trial are we equally likely to see one on the second trial? Why or why not?
$\P(A=0)=0.99$ - probability that one does not have EVD
$\P(A=1)=0.01$ - probability that one has EVD
$\P(B=0)$ - probability that the test is negative
$\P(B=1)$ - probability that the test is positive
$\P(B=1|A=1)=0.95$ - probability that the test is positive given that a person has EVD
$\P(B=0|A=0)=0.95$ - probability that the test is negative given that a person does not have EVD
$\P(A=1|B=1)=?$ - probability that a person has EVD, given the test is positive
$\P(B=1)=\sum_{\mathcal{A}}\P(B=1|A)P(A)=P(A=1)P(B=1|A=1)+P(A=0)P(B=1|A=0)=0.01\times0.95+0.99\times(1-0.95)$
$$\P(A=1|B=1)=\frac{\P(B=1|A=1)\P(A=1)}{\P(B=1)}$$
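Plugging the numbers defined above into the two expressions gives the following worked arithmetic, which should agree with the Python cell's output up to rounding:

$$\P(B=1) = 0.01\times0.95 + 0.99\times0.05 = 0.0095 + 0.0495 = 0.059$$
$$\P(A=1|B=1) = \frac{0.95\times0.01}{0.059} = \frac{0.0095}{0.059} \approx 0.161$$

So even after a positive result, the probability of EVD is only about 16%, because the 1% base rate means most positives come from the much larger healthy population.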
In [1]:
marginal_evd = 0.01 #p of having EVD
cond_positive = 0.95 # p of having positive test given one has EVD
cond_negative = 0.95 # p of having negative test given one does not have EVD
marginal_positive = marginal_evd * cond_positive + (1 - cond_negative) * (1 - marginal_evd) # marginal probability of a positive test; the false-positive rate is 1 - cond_negative
cond_evd=cond_positive * marginal_evd / marginal_positive # p of having EVD given the test is positive
print 'The probability that the test is positive is equal to {:.4} and if the test is positive, the probability that a person has EVD is {:.4}'.format(marginal_positive,cond_evd)
In [2]:
n=4 # try different numbers of trials until 'cond_evd_z' is greater than 0.999
#The probability of having EVD given n positive tests
cond_evd_z=cond_positive**n * marginal_evd / ( (1 - cond_negative)**n * (1 - marginal_evd) + cond_positive**n * marginal_evd)
print 'Since {:.4} is greater than 0.999, the number of trials needed is {}.' .format(cond_evd_z, n)
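Rather than trying values of $n$ by hand, the search can be automated. The following is a minimal sketch under the same conditional-independence assumption; the function name and the parameter names `prior`, `sensitivity`, and `specificity` are hypothetical labels for the 1% prevalence and the two 95% accuracies, not names used elsewhere in this homework.

```python
def posterior_after_n_positives(n, prior=0.01, sensitivity=0.95, specificity=0.95):
    """P(EVD | n positive screenings), assuming conditionally independent tests."""
    true_pos = sensitivity ** n * prior                    # P(n positives | EVD) P(EVD)
    false_pos = (1.0 - specificity) ** n * (1.0 - prior)   # P(n positives | no EVD) P(no EVD)
    return true_pos / (true_pos + false_pos)

target = 0.999
n_needed = 1
# Increase the number of screenings until the posterior reaches the target.
while posterior_after_n_positives(n_needed) < target:
    n_needed += 1

print('{} screenings give a posterior of {:.5f}'.format(
    n_needed, posterior_after_n_positives(n_needed)))
```

Each additional positive multiplies the odds of EVD by $0.95/0.05 = 19$, which is why the posterior climbs so quickly from the 1% prior.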
$\P(B=0|A=1)=1-\P(B=1|A=1)=0.05$ - probability of a negative test given that one has EVD
$\P(A=1|B=0)=\frac{\P(B=0|A=1)\P(A=1)}{\P(B=0)}$ - probability of having EVD given that a single test is negative
$\P(A=0)=0.99$ - probability that one does not have EVD
$\P(A=1)=0.01$ - probability that one has EVD
$\P(B=0)=1-\P(B=1)=1-0.059=0.941$ - probability that the test is negative
$$P(A=1|B_0 = 1, B_1 = 0) = \frac{P(B_0 = 1| A = 1) P(B_1 = 0| A = 1) P(A = 1)}{P(B_0 = 1 | A = 1) P(B_1 = 0 | A = 1) P(A = 1) + P(B_0 = 1 | A = 0) P(B_1 = 0 | A = 0) P(A = 0)}$$
In [3]:
cond_evd_event = cond_positive * (1 - cond_positive) * marginal_evd / (cond_positive * (1 - cond_positive) * marginal_evd + (1 - cond_negative) * cond_negative * (1 - marginal_evd))
print 'The probability the patient has EVD is {:.5}'.format(cond_evd_event)
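Because the test is equally accurate for positives and negatives, the likelihood of one positive followed by one negative is the same whether or not the patient has EVD, so the two results cancel and the posterior falls back to the prior. The sketch below checks this with exact rational arithmetic; the variable names are illustrative assumptions.

```python
from fractions import Fraction

prior = Fraction(1, 100)       # P(A = 1)
accuracy = Fraction(95, 100)   # P(B=1 | A=1) and P(B=0 | A=0)

# Likelihood of (positive, then negative) under each hypothesis
like_evd = accuracy * (1 - accuracy)      # P(B0=1 | A=1) P(B1=0 | A=1)
like_no_evd = (1 - accuracy) * accuracy   # P(B0=1 | A=0) P(B1=0 | A=0)

posterior = like_evd * prior / (like_evd * prior + like_no_evd * (1 - prior))
print('Posterior after one positive and one negative test: {}'.format(posterior))
```

The result is far from the 50/50 the public assumed: the base rate dominates once the two test results offset each other.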
Answer the following problems in Python using booleans
Is $2^{16}$ greater than $10^4$?
Using the `frexp` function, show how the floating point number $0.1$ is represented and rebuild the number from its pieces. Use base 10
Is $0.1 + 0.2$ exactly equal to $(1.0 + 2.0) / 10.0$ with floats? Why not?
When creating permutations from a combination of elements where you can use each element once, the number of permutations is $n!$. For example, if you have three letters and you can use each letter only once, the number of permutations is $3\times 2 \times 1$. How many ways can a deck of cards be shuffled? Are there more stars in the universe ($10^{23}$) or deck shuffling permutations?
The next cell contains a joint probability mass function for $x$ and $y$. $x$ is the first number and $y$ is the second. You may access elements like this: `P[0,2]`. `P[0,2]` is the probability that $x=0$ and $y=2$. Demonstrate that the two random variables are not independent.
Calculate the marginal $\P(x=2)$, where $x$ is the first index using the next cell's joint probability mass function.
Calculate the conditional $\P(x=2 | y = 1)$, where $x$ is the first index using the next cell's joint probability mass function.
In [3]:
#This is loading the data for question 3.5 through 3.8.
#Execute this cell and use new cells below. Do not answer in this cell!
import numpy as np
P = np.zeros( (3,3) )
P[0,0] = 1. / 9
P[0,1] = 1. / 9
P[0,2] = 0.
P[1,0] = 1. / 3
P[1,1] = 0
P[1,2] = 1. / 6
P[2,0] = 1. / 9
P[2,1] = 1. / 18
P[2,2] = 1. / 9
In [4]:
#Example of how to calculate P(x = 0)
px0 = P[0,0] + P[0,1] + P[0,2]
print px0
In [32]:
#Question 4.1
print 2**16 > 10**4
In [48]:
#Question 4.2
from math import frexp
m,e=frexp(.1)
print '{:4}*2^({:})={}' .format(m,int(e), m * 2**e)
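As a complement to the cell above, the pieces returned by `frexp` can be put back together with `math.ldexp`, and `decimal.Decimal` can display the exact value that the float stores. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from decimal import Decimal
from math import frexp, ldexp

mantissa, exponent = frexp(0.1)       # 0.1 == mantissa * 2**exponent
rebuilt = ldexp(mantissa, exponent)   # inverse of frexp

print('mantissa = {!r}, exponent = {}'.format(mantissa, exponent))
print('rebuilt == 0.1: {}'.format(rebuilt == 0.1))
# Decimal(float) converts the binary value exactly, exposing the rounding error.
print('exact stored value: {}'.format(Decimal(0.1)))
```

The mantissa always lies in $[0.5, 1)$ for a nonzero finite float, and scaling by a power of two is exact, which is why the rebuilt value matches bit for bit.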
In [27]:
#Question 4.3
print 'It is', 0.1+0.2==(1.0+2.0)/10, 'that they are equal'
print 'They are not equal because these decimal values have no exact representation in base 2 mantissa/exponent format, so the two sides round differently'
print '0.1 + 0.2 is {:.20}'.format(0.1+0.2)
print '(1.0 + 2.0) / 10 is {:.20}'.format((1.0 + 2.0) / 10 )
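Since exact equality is unreliable for floats, a common workaround is to compare within a tolerance. The sketch below shows the idea; the tolerance value `1e-12` is an arbitrary assumption chosen for illustration and would need to suit the scale of the numbers in a real calculation.

```python
a = 0.1 + 0.2
b = (1.0 + 2.0) / 10.0
tolerance = 1e-12   # assumed tolerance, chosen only for this example

print('exactly equal: {}'.format(a == b))
print('difference: {!r}'.format(a - b))
print('equal within tolerance: {}'.format(abs(a - b) < tolerance))
```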
In [26]:
#Question 4.4
from math import factorial
factor=factorial(52) # a standard deck has 52 cards
print 'It is', factor>10**23, 'that there are more deck permutations than stars in the universe'
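To get a feel for how lopsided the comparison is, the sizes can be compared on a log scale. This sketch assumes a standard 52-card deck and takes the $10^{23}$ star count from the question as given.

```python
from math import factorial, log10

shuffles = factorial(52)   # orderings of a standard 52-card deck
stars = 10 ** 23           # estimate quoted in the question

print('52! has {} digits'.format(len(str(shuffles))))
print('52! exceeds the star count by about 10^{:.0f}'.format(log10(shuffles) - log10(stars)))
```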
In [31]:
#Question 4.5
print P[1,1], (P[1,1]+P[1,0]+P[1,2])*(P[0,1]+P[1,1]+P[2,1])
print 'Since these are not equal, P(X=1, Y=1) != P(X=1) P(Y=1), which violates independence'
In [33]:
#Question 4.6
print P[2,0]+P[2,1]+P[2,2]
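The independence check, the marginal, and the conditional can also be computed for the whole table at once with NumPy reductions. The sketch below rebuilds the same joint table under a different name (`P_joint`, an assumption made so the block is self-contained) and compares it with the outer product of its marginals, which is what the joint would be under independence.

```python
import numpy as np

# Same joint pmf as the data cell: rows are x, columns are y.
P_joint = np.array([[1. / 9, 1. / 9, 0.],
                    [1. / 3, 0., 1. / 6],
                    [1. / 9, 1. / 18, 1. / 9]])

p_x = P_joint.sum(axis=1)   # marginal of x: sum over y
p_y = P_joint.sum(axis=0)   # marginal of y: sum over x

# Independence would require P(x, y) == P(x) P(y) for every entry.
is_independent = np.allclose(P_joint, np.outer(p_x, p_y))
p_x2 = p_x[2]                            # P(x = 2)
p_x2_given_y1 = P_joint[2, 1] / p_y[1]   # P(x = 2 | y = 1)

print('independent: {}'.format(is_independent))
print('P(x=2) = {:.4f}'.format(p_x2))
print('P(x=2 | y=1) = {:.4f}'.format(p_x2_given_y1))
```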
In [34]:
#Question 4.7
P[2,1]/(P[0,1]+P[1,1]+P[2,1])
Using Python, Markdown, and equations, describe a random variable for the sum of two dice when rolled and create a Python cell where a user may enter the sum and your cell prints the probability of that sample. For the Python portion, see the cell below for an example of the python program and see here for an example of how your answer should look (yours should be much shorter though). Hint: There is a simple equation for the probability. Try writing out a few terms, like P(2), P(3), P(7), P(11), to derive the equation
Rubric: [3] explanation/equation, [3] for the python program.
In [11]:
roll = 3 #Enter the value of a single die roll here
P = 1 / 6. #State space is 6, uniform, and each roll has a single permutation.
print 'The probability of rolling a', roll, 'is', P
Consider a random variable, $D$, which represents the sum of two fair dice being rolled. Different $D$ values have different numbers of permutations. For example, $D=7$ has 6 permutations. The sample space has 36 equally likely outcomes, and the number of permutations decreases by one for each step away from the maximum at $D=7$, so the probability, $n/Q$, can be modeled as:
$$P(D=x) = \frac{6 - |x - 7|}{36} $$
The program below computes the probability of $D$ given an input.
In [37]:
dsum = 4 #The sum
print 'The probability of two dice summing to', dsum, 'after a roll is', (6. - abs(dsum - 7)) / 36 #equation defined above.
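The closed-form equation can be checked against a brute-force enumeration of all 36 ordered outcomes. This is a verification sketch only; the names `counts`, `die1`, and `die2` are illustrative.

```python
from itertools import product

# Count how many ordered (die1, die2) pairs produce each sum.
counts = {}
for die1, die2 in product(range(1, 7), repeat=2):
    total = die1 + die2
    counts[total] = counts.get(total, 0) + 1

for total in sorted(counts):
    enumerated = counts[total] / 36.
    formula = (6. - abs(total - 7)) / 36
    print('{:2d}: enumerated {:.4f}, formula {:.4f}'.format(total, enumerated, formula))
```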
Proteins are made up of amino acids in a polymer chain. The list of amino acids in that chain is called a protein sequence. There are 20 amino acids and thus 20 possibilities at each sequence position. By comparing sequences of proteins from different species, we can predict the evolutionary history of proteins and organisms based on sequence similarity. Furthermore, by seeing which parts of a protein sequence do not change, we can predict which specific amino acids are related to the function of a protein. In this problem, we will build two models of a protein representing two possible evolutionary histories. Model 1 will have important amino acids at positions 23, 24 and 25. Model 2 will have important amino acids at positions 24, 25 and 99. Although we will not do this now, we can use these two models to assign sequences to different lineages.
Your task is to create two Python cells, similar to Problem 4, which output the probability of a sample entered at the top. $s$, the amino acid identity, and $i$, the position, are entered at the top and your program should determine the probability of seeing that amino acid at that specific position. In model 1, if the position is important (23, 24, 25), the probabilities of amino acids V, A, I, and L are equal and four times the probability of each other amino acid, and the other amino acids have equal probabilities. In model 2, if the position is 24 or 25, the same rule applies. If the position is 99 in model 2, the probabilities of D and E are equal and one hundred times the probability of each other amino acid, and the others again have equal probability. If a position is unimportant, all amino acids have equal probability. Hint: Combine these algebraic statements ALONG with the normalization condition to get a probability mass function
Your answer should be two Markdown and two Python cells. The Markdown should contain a mathematical description of the two models with equations. The Python should be well-documented, contain conditionals, have an $s$ and $i$ variable at the top for a user to enter the values and it should print out a clear message repeating the $s$, $i$ variable along with the probability.
Rubric: [5] converting text into probability equations, [1] variable assignment, [2] for conditionals, [4] probabilities, [2] string formatting, [1] correct answer
$\newcommand{\Pa}[1]{\textrm{P}\left(\textrm{#1}\right)}$
$s$ - amino acid identity
$i$ - position
If position is important (23, 24, 25):
Let $x$ be the probability of finding an amino acid other than V, L, I and A. Then:
$$\Pa{V} = \Pa{A} = \Pa{I} = \Pa{L} = 4 \Pa{not V,L,I,A} = 4x$$
$$\Pa{V} + \Pa{A} + \Pa{I} + \Pa{L} + (20-4) \Pa{not V,L,I,A} = 1$$
$$4x+4x+4x+4x+(20-4)x=1$$
$$x=\frac{1}{32}$$
so the probability of each important amino acid is $4/32$
If position is unimportant:
$\Pr(\cdot)$ is constant
$$\Pr(\cdot)=\frac{1}{20}$$
In [1]:
# Model 1
position= 24 # position of the amino acid
identity= 'L' # The identity of the amino acid
probability=0.05 #probability of amino acids when position is not important.
if position==23 or position==24 or position==25: # position of 23, 24 and 25
    if identity=='L' or identity=='A' or identity=='I' or identity=='V': # identity of A, V, L , I
        probability=4.0/32
    else:
        probability=1.0/32
print 'The probability of finding amino acid {} at position {} is {:.4}' .format(identity,position,probability)
If position is 24 or 25: Same as Model 1
If position is 99:
Let $z$ be the probability for unimportant amino acids. Then:
$$\Pa{E} = \Pa{D} = 100 \Pa{not E,D} = 100z$$
$$\Pa{E} + \Pa{D} + (20-2) \Pa{not E,D} = 1$$
$$100z+100z+(20-2)z=1$$
$$z=\frac{1}{218}$$
If position is unimportant: Same as Model 1
In [44]:
#Model 2
position= 99 # position of the amino acid
identity= 'L' # The identity of the amino acid
probability=0.05 #probability of amino acids when position is not important.
if position==24 or position==25: # position of 24 and 25
    if identity=='L' or identity=='A' or identity=='I' or identity=='V': # identity of A, V, L , I
        probability=4.0/32
    else:
        probability=1.0/32
elif position == 99: # position 99 favors D and E
    if identity == 'D' or identity == 'E':
        probability = 100 * 1. / 218
    else:
        probability = 1. / 218
print 'The probability of finding amino acid {} at position {} is {:.4}' .format(identity,position,probability)
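Both models follow the same pattern: a set of favored amino acids is some multiple more likely than the rest, and normalization fixes the base probability. The sketch below generalizes that pattern and confirms each distribution sums to one. The helper names `weighted_pmf` and `model_probability` and the `AMINO_ACIDS` string of standard one-letter codes are assumptions introduced here, not part of the assignment.

```python
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'   # the 20 standard one-letter codes

def weighted_pmf(favored, weight):
    """Each favored residue is `weight` times as likely as each other residue."""
    n_other = len(AMINO_ACIDS) - len(favored)
    base = 1.0 / (weight * len(favored) + n_other)   # from the normalization condition
    return dict((aa, weight * base if aa in favored else base) for aa in AMINO_ACIDS)

def model_probability(model, identity, position):
    """Probability of amino acid `identity` at `position` under model 1 or 2."""
    if position in (24, 25) or (model == 1 and position == 23):
        pmf = weighted_pmf('VAIL', 4)     # base probability 1/32
    elif model == 2 and position == 99:
        pmf = weighted_pmf('DE', 100)     # base probability 1/218
    else:
        pmf = weighted_pmf('', 1)         # unimportant position: uniform 1/20
    return pmf[identity]

for model, identity, position in [(1, 'L', 24), (2, 'L', 99), (2, 'D', 99), (2, 'V', 23)]:
    print('Model {}: P({} at position {}) = {:.4f}'.format(
        model, identity, position, model_probability(model, identity, position)))
```

Writing the rule once makes the normalization explicit: the denominator is always the favored count times the weight plus the number of remaining amino acids.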