John Winter Data Challenge



In [621]:

    
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import scipy
from sklearn.decomposition import PCA


%matplotlib inline

Question 1.a



In [139]:

    
height=np.array([150,163,167,168,170,178])

By default numpy uses linear interpolation



In [140]:

    
print 'min',np.min(height)
print '1st', np.percentile(height,25)
print 'median',np.median(height)
print '3rd',np.percentile(height,75)
print 'max',np.max(height)









    



min 150
1st 164.0
median 167.5
3rd 169.5
max 178

We can also force numpy to use nearest values



In [164]:

    
print 'min',np.min(height)
print '1st', np.percentile(height,25, interpolation='lower')
print 'median',np.median(height)
print '3rd',np.percentile(height,75, interpolation='higher')
print 'max',np.max(height)









    



min 150
1st 163
median 167.5
3rd 170
max 178

Question 1.b



In [143]:

    
print 'mean',np.mean(height)









    



mean 166.0

Question 1.c



In [170]:

    
l_q75 = np.percentile(height,75)
l_q25 = np.percentile(height,25)
l_iqr = l_q75 - l_q25



In [171]:

    
print 'Linear interpolation IQR',l_iqr









    



Linear interpolation IQR 5.5



In [172]:

    
q75 = np.percentile(height,75, interpolation='higher')
q25 = np.percentile(height,25, interpolation='lower')
iqr = q75 - q25



In [173]:

    
print 'IQR',iqr









    



IQR 7

Question 1.d



In [174]:

    
l_q25-l_iqr*1.5









    Out[174]:





155.75



In [175]:

    
l_q75+l_iqr*1.5









    Out[175]:





177.75

150 and 178 are both possible outliers based on the IQR 'fence' definition using linear interpolation



In [177]:

    
q25-iqr*1.5









    Out[177]:





152.5



In [178]:

    
q75+iqr*1.5









    Out[178]:





180.5

150 is a possible outlier based on the IQR 'fence' definition using nearest value

Question 1.e



In [191]:

    
seaborn.boxplot(height,whis=1.5,vert='True')
seaborn.plt.title('Linear Interpolation - Height')









    Out[191]:





<matplotlib.text.Text at 0x113f7e8d0>



In [192]:

    
item = {}

item["label"] = 'box'

item["med"] = 167.5
item["q1"] = 163 
item["q3"] = 170

item["whislo"] = 163 
item["whishi"] = 178 
item["fliers"] = [] 

stats = [item]

fig, axes = plt.subplots(1, 1)
axes.bxp(stats)
axes.set_title('Nearest Values - Height')
y_axis = [150,163,167,168,170,178]
y_values = ['150','163','167','168','170','178']
plt.yticks(y_axis, y_values)









    Out[192]:





([<matplotlib.axis.YTick at 0x113c13fd0>,
  <matplotlib.axis.YTick at 0x113c13b10>,
  <matplotlib.axis.YTick at 0x113ef8850>,
  <matplotlib.axis.YTick at 0x113f03250>,
  <matplotlib.axis.YTick at 0x113f039d0>,
  <matplotlib.axis.YTick at 0x113f0d190>],
 <a list of 6 Text yticklabel objects>)

Question 1.f



In [198]:

    
print 'Variance',height.var()









    



Variance 71.6666666667



In [199]:

    
print "Standard Deviation", height.std()









    



Standard Deviation 8.4656167328

Question 2

i. Metric,Interval/Discrete

ii. Non-metric, Ordinal

iii. Non-metric, Nominal/Categorical

iv. Possibly in between. Without more information would categorize Non-metric, Ordinal

V. Possible to argue it is in between. I would categorize Non-metric, Ordinal

Question 3

Shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution with a width of one, two and three standard deviations.

Question 4.a

The 84th percentile. Using the rule above we can just do 50(mean)+68(1 std)/2

Question 4.b

Approximately 27% of people. I used a standard normal distribution z-table.



In [211]:

    
print 'z-score:',(90-100)/16.0









    



z-score: -0.625

Question 5

Probability of Silver Given Silver=P(Find S/SC)/(P(Find S/SC)+P(Find S/GC)+P(Find S/MC))

P(S|S)=1/(1+0+.5)=1/1.5=2/3

Question 6

In order for the longer piece to be more than twice the length of the shorter piece, the line must be cut at below 1/3 the length or above 2/3. Therefore the probability is the union of these two or 2/3rds.

Question 7

Probabilty of flu given positive test = (P(Positive test | given present)*P(Disease present in anyone))/(P(probability of anyone getting positive test result))

(P(B|A)*P(A))/(P(B))

Given:

P(A) = .1

P(B|~A) = .01

P(~B|A) = .03

Derived:

P(B|A) = 1-P(~B|A) =.9997

P(B) = P(B|A) x P(A) + P(B|~A) x P(~A)

P(B) = .9997.1 + .01.9 =.109

P(B|A) = .9997*.1/.109 = .917

Probabilty of flu given positive test is approximately 92%

Question 8

The mu over the time period is 100. I calculate the probability of critical failures in all the cases other than 100+ (0-99) and subtract the sum probability from 1



In [254]:

    
total=0
for x in range(0,100):
    total+=scipy.stats.distributions.poisson.pmf(x, 100)
   
print 'probability:', 1-total









    



probability: 0.513298798279

51% probability of 100 or more critical failures over the next 50 years.

Question 9.a



In [ ]:

    
SE = STD/sqrt(n)



In [260]:

    
.8/math.sqrt(100)









    Out[260]:





0.08



In [ ]:

    
**Question 9.b**

Assuming a normal distribution we use the z table to find the corresponding number of standard deviations.

The 95% confidence interval is composed of the following:

Lower = 1.6-.08*1.96 = 1.443

Upper = 1.6+.08*1.96 = 1.757

Question 9.c

U = Umbrellas/Apartment * Apartments

U = 12,800

Question 9.d

SE = STD/sqrt(n)



In [276]:

    
.8*8000/math.sqrt(100)









    Out[276]:





640.0

Question 9.d

Assuming a normal distribution we use the z table to find the corresponding number of standard deviations.

The 95% confidence interval is composed of the following:

Lower = 12,800-640*1.96 = 11,545.6

Upper = 12,800+640*1.96 = 14,054.4

Question 10

First I randomly sample from a normal distribution using numpy then count all values within the interval [0,1]



In [296]:

    
random_normal=np.random.normal(size=1000)



In [297]:

    
seaborn.distplot(random_normal)









    Out[297]:





<matplotlib.axes._subplots.AxesSubplot at 0x113d60e90>



In [298]:

    
within=((0 < x) & (x < 1)).sum()



In [299]:

    
within/1000.0









    Out[299]:





0.33300000000000002

33.3% of the randomly generated values were in the interval [0,1]

Question 11

There is not enough information available to determine if the promotion was effective. The month to month variance may be such that 350 is a typical occurance, and so the observation would have nothing to do with the promotion.

Question 12

To answer this question I used a ttest to find whether the samples were drawn from the same population. If they are drawn from the same population then the p-value will be large, which suggests it would be likey to randomly draw these samples from the same population.



In [371]:

    
t,p_value=scipy.stats.ttest_ind([79.98,80.04,80.02,80.04,80.03,80.03,80.04,79.97,80.05,80.03,80.02],[80.02,79.94,79.98,79.97,79.97,80.03,79.95,79.97])



In [372]:

    
p_value









    Out[372]:





0.0035356633089528265



In [370]:

    
1-p









    Out[370]:





0.99646433669104717

This p_value is quite small. It is outside a 99% confidence interval so we would reject the null hypothesis and say that the results from the two methods differ.

As a check I plot the two data sets and see for myself that distributions do in fact look significantly different.



In [338]:

    
data={'a':[79.98,80.04,80.02,80.04,80.03,80.03,80.04,79.97,80.05,80.03,80.02],'b':[80.02,79.94,79.98,79.97,79.97,80.03,79.95,79.97]}



In [366]:

    
ax=seaborn.boxplot(data['a'])
ax.set_xlim([79.94,80.05])









    Out[366]:





(79.94, 80.05)



In [367]:

    
ax=seaborn.boxplot(data['b'])
ax.set_xlim([79.94,80.05])









    Out[367]:





(79.94, 80.05)



In [ ]:

    
**Question 13**

First perform chisquared test. Also will look at a a visual check.



In [446]:

    
air=pd.DataFrame(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],[1668,1407,1370,1309,1341,1338,1406,1446,1332,1363,1410,1526])
air=air.reset_index()
air.columns=['guests','month']
air['expected']=[sum(air['guests'])/12]*12
scipy.stats.chisquare(air['guests'],air['expected'])









    Out[446]:





Power_divergenceResult(statistic=79.051809794180272, pvalue=2.2502167472646262e-12)

The low p-value indicates that it is extremely unlikey that the bookings are uniformly distributed. It is possible that there is a season pattern.



In [447]:

    
air.plot()









    Out[447]:





<matplotlib.axes._subplots.AxesSubplot at 0x116c42310>



In [448]:

    
air_season=pd.DataFrame(['Spring','Summer','Fall','Winter'],[(1370+1309+1341),(1338+1406+1446),(1332+1363+1410),(1526+1668+1407)])
air_season=air_season.reset_index()
air_season.columns=['guests','month']
air_season['expected']=[sum(air_season['guests'])/4]*4
air_season.plot()









    Out[448]:





<matplotlib.axes._subplots.AxesSubplot at 0x117a8aed0>

The visuals reinforce the evidence that the bookings are not uniformly distributed.



In [481]:

    
x=np.dot([[2,3],[2,1]],[[3],[2]])



In [ ]:

    
x=np.dot([[3,0,1],[-4,1,2],[-6,0,-2]],[[-1],[1],[3]])



In [478]:

    
x









    Out[478]:





array([[ 0],
       [11],
       [ 0]])



In [483]:

    
x/4









    Out[483]:





array([[3],
       [2]])

Question 14.a



In [485]:

    
q14=pd.DataFrame([16,12,13,11,10,9,8,7,5,3,2,0],[8,10,6,2,8,-1,4,6,-3,-1,-3,0])
q14=q14.reset_index()
q14.columns=['x2','x1']



In [560]:

    
np.corrcoef(q14['x1'],q14['x2'])









    Out[560]:





array([[ 1.       ,  0.7456203],
       [ 0.7456203,  1.       ]])



In [559]:

    
plt.scatter(q14['x1'],q14['x2'])
plt.axis('equal');

The correlation of X1 and X2 is .74 and PCA will provide information about the nature of the linear relationship. If the question is to whether we can use PCA to cut out data and only retain the component with the highest variance I would lean towards no, but it is difficult to say without background information on the data and the desired use.

Question 14.b



In [607]:

    
X=np.column_stack((q14['x1'],q14['x2']))



In [609]:

    
pca = PCA(n_components=2)



In [610]:

    
pca.fit(X)









    Out[610]:





PCA(copy=True, n_components=2, whiten=False)



In [615]:

    
pca.get_covariance() #Uses n vs. n-1









    Out[615]:





array([[ 21.16666667,  15.08333333],
       [ 15.08333333,  19.33333333]])



In [611]:

    
print(pca.components_) #Eigenvectors









    



[[-0.72823813 -0.68532417]
 [ 0.68532417 -0.72823813]]



In [ ]:

    
**Question 14.c**



In [612]:

    
print(pca.explained_variance_) #Eigenvalues, variance









    



[ 35.36116217   5.13883783]



In [619]:

    
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',linewidth=2,shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)
# plot data
plt.scatter(q14['x1'],q14['x2'], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 1 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)



In [ ]:

    
**Question 14.d**



In [620]:

    
print(pca.explained_variance_ratio_)









    



[ 0.87311512  0.12688488]

More than 87% of the variability is explained by a single component.