Exploration Example

Let's start by importing some plotting functions. (Don't worry about the deprecation warning below. We should really use something other than %pylab, but this is just easier for the time being.)



In [1]:

    
%pylab inline









    



/usr/local/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))






    



Populating the interactive namespace from numpy and matplotlib



In [5]:

    
windows = [625, 480, 621, 633]
mac = [647, 503, 559, 586]

Now let's calculate the mean of each sample. You can just use mean().
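A minimal sketch of this step, assuming NumPy is imported explicitly as `np` instead of relying on the names that `%pylab` injects. The variable names `mean_windows` and `mean_mac` are illustrative only.

```python
import numpy as np

# Samples from the cell above
windows = [625, 480, 621, 633]
mac = [647, 503, 559, 586]

# Arithmetic mean of each sample
mean_windows = np.mean(windows)
mean_mac = np.mean(mac)

print(mean_windows, mean_mac)  # 589.75 573.75
```

With `%pylab inline` active, plain `mean(windows)` gives the same number, because `mean` is then NumPy's function.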



In [ ]:

We can also plot the raw data.



In [ ]:

    
figure()
plot(windows)
plot(mac,'r')
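The same plot can be sketched without `%pylab`, assuming `matplotlib.pyplot` is imported as `plt`. The legend and axis labels are additions for readability; the notebook does not say what the x positions or the measured values represent, so the label texts below are placeholders.

```python
import matplotlib.pyplot as plt

# Samples from the cell above
windows = [625, 480, 621, 633]
mac = [647, 503, 559, 586]

fig, ax = plt.subplots()
ax.plot(windows, label="windows")    # default color
ax.plot(mac, "r", label="mac")       # red, as in the cell above
ax.set_xlabel("observation index")   # placeholder label (assumption)
ax.set_ylabel("measured value")      # placeholder label (assumption)
ax.legend()
```

In a plain script (outside a notebook), a final `plt.show()` would be needed to display the figure.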

Apply a t-test to check for significance.



In [53]:

    
from scipy.stats import ttest_ind
from scipy.stats import ttest_rel
import scipy.stats as stats
# independent two-sample t-test (two-sided by default)
ttest_ind(mac, windows)
# the paired (related-samples) t-test, ttest_rel, is run in the next cell









    Out[53]:





Ttest_indResult(statistic=-0.33810258358163742, pvalue=0.74680239857459685)



In [54]:

    
ttest_rel(mac,windows)









    Out[54]:





Ttest_relResult(statistic=-0.71305042851453826, pvalue=0.52727545422603439)

Let's say we get more data.



In [41]:

    
more_win = [625, 480, 621, 633, 694, 599, 505, 527, 651, 505]
more_mac = [647, 503, 559, 586, 458, 380, 477, 409, 589, 472]
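With the larger samples, the paired test from before can be rerun. This is a minimal sketch and assumes, as the earlier `ttest_rel` call does, that the i-th entries of the two lists belong to the same pair; the names `t_stat` and `p_value` are illustrative.

```python
from scipy.stats import ttest_rel

# Larger samples from the cell above
more_win = [625, 480, 621, 633, 694, 599, 505, 527, 651, 505]
more_mac = [647, 503, 559, 586, 458, 380, 477, 409, 589, 472]

# Paired (related-samples) t-test, two-sided by default
result = ttest_rel(more_mac, more_win)
t_stat, p_value = result.statistic, result.pvalue

print(t_stat, p_value)  # approx. -2.676 and 0.0254
```

With ten pairs instead of four, the p-value falls below the conventional 0.05 threshold, whereas the four-pair test did not come close.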

What if we have more than two groups? Then use an ANOVA; in Python, that is stats.f_oneway().



In [51]:

    
more_bottom = [485, 436, 512, 564, 560, 587, 391, 488, 555, 446]









    Out[51]:





Ttest_relResult(statistic=-2.6758056901941498, pvalue=0.025379874652221083)
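A minimal sketch of the one-way ANOVA across the three samples defined above (`more_win`, `more_mac`, `more_bottom`). It treats the three lists as independent groups, which is what `stats.f_oneway()` assumes; the names `f_stat` and `p_value` are illustrative.

```python
from scipy import stats

# The three samples from the cells above
more_win = [625, 480, 621, 633, 694, 599, 505, 527, 651, 505]
more_mac = [647, 503, 559, 586, 458, 380, 477, 409, 589, 472]
more_bottom = [485, 436, 512, 564, 560, 587, 391, 488, 555, 446]

# One-way ANOVA: tests whether all group means are equal
f_stat, p_value = stats.f_oneway(more_win, more_mac, more_bottom)

print(f_stat, p_value)  # approx. 3.691 and 0.0383
```

A significant result only says that at least one group mean differs from the others; it does not say which one.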

Anscombe's quartet

Let's take a look at another data set (and actually import the data from a file).



In [25]:

    
import pandas as pd
aq=pd.read_csv('data/anscombesQuartet.csv')



In [28]:

    
aq



In [29]:

    
mean(aq['I_y'])









    Out[29]:





7.5009090909090927

Again, calculate the means for all x.

Calculate the variance for x.

Then calculate the variance for y.
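One way to do all three steps at once is sketched below. It assumes the frame has the eight columns `I_x` to `IV_y` shown in this notebook's table; the data is rebuilt inline from that table so the example runs without `data/anscombesQuartet.csv`. The helper names `x_cols`, `y_cols`, `x_means`, `x_vars` and `y_vars` are illustrative.

```python
import pandas as pd

# Data sets I-III share the same x values
shared_x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]

aq = pd.DataFrame({
    "I_x": shared_x,
    "I_y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II_x": shared_x,
    "II_y": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III_x": shared_x,
    "III_y": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV_x": [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0],
    "IV_y": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
})

# Split the columns into x and y by their suffix
x_cols = [c for c in aq.columns if c.endswith("_x")]
y_cols = [c for c in aq.columns if c.endswith("_y")]

x_means = aq[x_cols].mean()
x_vars = aq[x_cols].var()  # pandas uses the sample variance (ddof=1)
y_vars = aq[y_cols].var()

print(x_means)
print(x_vars)
print(y_vars)
```

All four x columns have mean 9.0 and variance 11.0, and the four y variances agree to within about 0.01 of each other, even though the data sets look very different when plotted.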



In [36]:












In [35]:



In [38]:









    Out[38]:





F_onewayResult(statistic=3.6909472287945335, pvalue=0.038278814395010151)



In [46]:



In [ ]:

	I_x	I_y	II_x	II_y	III_x	III_y	IV_x	IV_y
0	10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
1	8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
2	13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
3	9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
4	11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
5	14.0	9.96	14.0	8.10	14.0	8.84	8.0	7.04
6	6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
7	4.0	4.26	4.0	3.10	4.0	5.39	19.0	12.50
8	12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
9	7.0	4.82	7.0	7.26	7.0	6.42	8.0	7.91
10	5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89