Importing and Exploring our Data


In [1]:
# import pandas and enable inline matplotlib plotting
import pandas as pd
%matplotlib inline

In [54]:
# read the data with pandas read_csv(): ';'-separated fields, first column as index, '.' for missing values
data = pd.read_csv('https://www.scipy-lectures.org/_downloads/brain_size.csv', 
                   delimiter=';', index_col=0, na_values='.')

In [11]:
# check out our data using pandas df.head() function
data.head()


Out[11]:
   Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
1  Female   133  132  124   118.0    64.5     816932
2    Male   140  150  124     NaN    72.5    1001121
3    Male   139  123  150   143.0    73.3    1038437
4    Male   133  129  128   172.0    68.8     965353
5  Female   137  132  134   147.0    65.0     951545

In [12]:
# how many observations do we have? use pandas df.shape attribute
data.shape


Out[12]:
(40, 7)
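
For a fuller first look, pandas' describe() summarizes every numeric column; a minimal sketch, not executed in the original notebook:


In [ ]:
# count, mean, std, min, quartiles, and max for each numeric column
data.describe()
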

In [14]:
# check out one column with df['column name'] or df.column_name
data.Gender.head()


Out[14]:
1    Female
2      Male
3      Male
4      Male
5    Female
Name: Gender, dtype: object
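
Before grouping, it can help to check how many observations fall in each category; a quick sketch:


In [ ]:
# number of observations per gender
data['Gender'].value_counts()
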

In [15]:
# make a groupby object on the dataframe
groupby_gender = data.groupby('Gender')

In [16]:
# take the mean of the groupby object across all measures using the .mean() method
groupby_gender.mean()


Out[16]:
         FSIQ     VIQ     PIQ      Weight     Height  MRI_Count
Gender
Female  111.9  109.45  110.45  137.200000  65.765000   862654.6
Male    115.0  115.25  111.60  166.444444  71.431579   954855.4
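
The means hide the spread within each group; one quick way to see it is pandas' built-in per-group boxplot (a sketch, not executed here):


In [ ]:
# distribution of VIQ within each gender group
data.boxplot(column='VIQ', by='Gender');
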

In [24]:
# take a look at our data distributions and pair-wise correlations
from pandas import plotting
plotting.scatter_matrix(data[['Weight', 'Height', 'MRI_Count']]);



In [25]:
plotting.scatter_matrix(data[['PIQ', 'VIQ', 'FSIQ']]);



Distributions and Cumulative Distributions

In [123]:
# compare the distributions of the three IQ measures with kernel density estimates
import seaborn as sns

sns.kdeplot(data['FSIQ'])
sns.kdeplot(data['PIQ'])
sns.kdeplot(data['VIQ']);



In [124]:
# the same three measures as cumulative distributions
sns.kdeplot(data['FSIQ'], cumulative=True)
sns.kdeplot(data['PIQ'], cumulative=True)
sns.kdeplot(data['VIQ'], cumulative=True);




Hypothesis Testing


In [28]:
from scipy import stats

1-sample t-test: testing the value of a population mean

scipy.stats.ttest_1samp() tests whether the population mean of the data is likely to
be equal to a given value (technically, whether the observations are drawn from a
Gaussian distribution with the given population mean). It returns the t statistic
and the p-value (see the function's help).


In [29]:
# run a 1-sample t-test
stats.ttest_1samp(data['VIQ'], 0)


Out[29]:
Ttest_1sampResult(statistic=30.088099970849328, pvalue=1.3289196468728067e-28)
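
Testing against 0 is not very informative on an IQ scale; as a sketch (the reference value of 100 is an assumption, not part of the original notebook), we could instead test against the conventional population mean:


In [ ]:
# is the mean VIQ of our sample different from the conventional mean of 100?
stats.ttest_1samp(data['VIQ'], 100)
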

2-sample t-test: testing for difference across populations

We saw above that the mean VIQ differs between the male and female groups. To test
whether this difference is significant, we use a 2-sample t-test with
scipy.stats.ttest_ind().


In [30]:
female_viq = data[data['Gender'] == 'Female']['VIQ']
male_viq = data[data['Gender'] == 'Male']['VIQ']
stats.ttest_ind(female_viq, male_viq)


Out[30]:
Ttest_indResult(statistic=-0.7726161723275011, pvalue=0.44452876778583217)

Paired tests: repeated measurements on the same individuals

PIQ, VIQ, and FSIQ are three different measures of IQ taken on the same individual.
We can first test whether FSIQ and PIQ differ using a 2-sample t-test.


In [32]:
stats.ttest_ind(data['FSIQ'], data['PIQ'])


Out[32]:
Ttest_indResult(statistic=0.465637596380964, pvalue=0.6427725009414841)

However, this doesn't account for individual differences, which also contribute to
the variance of the data. A paired t-test, or "repeated measures" test, accounts for them.


In [33]:
stats.ttest_rel(data['FSIQ'], data['PIQ'])


Out[33]:
Ttest_relResult(statistic=1.7842019405859857, pvalue=0.08217263818364236)

This is actually equivalent to doing a 1-sample t-test on the difference of the two measures.


In [35]:
stats.ttest_1samp(data['FSIQ'] - data['PIQ'], 0)


Out[35]:
Ttest_1sampResult(statistic=1.7842019405859857, pvalue=0.08217263818364236)

These tests assume the data are normally distributed. A non-parametric alternative is the Wilcoxon signed-rank test, scipy.stats.wilcoxon().


In [36]:
stats.wilcoxon(data['FSIQ'], data['PIQ'])


Out[36]:
WilcoxonResult(statistic=274.5, pvalue=0.10659492713506856)

Note: The corresponding test in the non paired case is the Mann–Whitney U test, scipy.stats.mannwhitneyu().
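
As a minimal sketch of that non-paired alternative, reusing the female_viq and male_viq series from above:


In [ ]:
# non-parametric test for a difference in VIQ between the gender groups
stats.mannwhitneyu(female_viq, male_viq, alternative='two-sided')
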

Linear models, multiple factors, and analysis of variance

Given two sets of observations, x and y, we want to test the hypothesis that y is a linear function of x. We will use the statsmodels module to:

  1. Fit a linear model. We will use the simplest strategy, ordinary least squares (OLS).
  2. Test that the coefficient is non-zero.

In [52]:
# Let's simulate some data according to the model y = -5 + 3x + noise
import numpy as np

x = np.linspace(-5, 5, 20)
np.random.seed(1)

# normally distributed noise
y = -5 + 3*x + 4 * np.random.normal(size=x.shape)

# Create a data frame containing all the relevant variables
sim_data = pd.DataFrame({'x': x, 'y': y})

In [53]:
# Specify an OLS model and fit it
from statsmodels.formula.api import ols
model = ols("y ~ x", sim_data).fit()

In [46]:
# Inspect the results of the model fit
print(model.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.804
Model:                            OLS   Adj. R-squared:                  0.794
Method:                 Least Squares   F-statistic:                     74.03
Date:                Thu, 06 Dec 2018   Prob (F-statistic):           8.56e-08
Time:                        11:51:32   Log-Likelihood:                -57.988
No. Observations:                  20   AIC:                             120.0
Df Residuals:                      18   BIC:                             122.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -5.5335      1.036     -5.342      0.000      -7.710      -3.357
x              2.9369      0.341      8.604      0.000       2.220       3.654
==============================================================================
Omnibus:                        0.100   Durbin-Watson:                   2.956
Prob(Omnibus):                  0.951   Jarque-Bera (JB):                0.322
Skew:                          -0.058   Prob(JB):                        0.851
Kurtosis:                       2.390   Cond. No.                         3.03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [48]:
# Retrieve the model params, note tab completion
model.params


Out[48]:
Intercept   -5.533459
x            2.936888
dtype: float64
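
With the parameters in hand we can visualize the fit; a minimal sketch that overlays the fitted line on the simulated points:


In [ ]:
import matplotlib.pyplot as plt

# scatter the simulated data and draw the OLS fit on top
plt.scatter(sim_data['x'], sim_data['y'], label='data')
plt.plot(sim_data['x'], model.predict(sim_data), 'r-', label='OLS fit')
plt.legend();
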

Categorical variables: comparing groups or multiple categories


In [55]:
data.head()


Out[55]:
   Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
1  Female   133  132  124   118.0    64.5     816932
2    Male   140  150  124     NaN    72.5    1001121
3    Male   139  123  150   143.0    73.3    1038437
4    Male   133  129  128   172.0    68.8     965353
5  Female   137  132  134   147.0    65.0     951545

In [62]:
# We can express a comparison between male and female VIQ as a linear model:

model = ols("VIQ ~ Gender", data).fit()

# ols automatically detects that Gender is categorical; to make it explicit:
# model = ols('VIQ ~ C(Gender)', data).fit()

print(model.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    VIQ   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                    0.5969
Date:                Thu, 06 Dec 2018   Prob (F-statistic):              0.445
Time:                        11:56:48   Log-Likelihood:                -182.42
No. Observations:                  40   AIC:                             368.8
Df Residuals:                      38   BIC:                             372.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept        109.4500      5.308     20.619      0.000      98.704     120.196
Gender[T.Male]     5.8000      7.507      0.773      0.445      -9.397      20.997
==============================================================================
Omnibus:                       26.188   Durbin-Watson:                   1.709
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                3.703
Skew:                           0.010   Prob(JB):                        0.157
Kurtosis:                       1.510   Cond. No.                         2.62
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
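
The Gender[T.Male] coefficient (5.8) is simply the difference between the two group means from our earlier groupby, and its t and p values match the 2-sample t-test above. A quick sketch to confirm:


In [ ]:
# the dummy-variable coefficient equals the difference of the group means
means = data.groupby('Gender')['VIQ'].mean()
means['Male'] - means['Female']
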

To compare different types of IQ, we need to create a “long-form” table,
listing IQs, where the type of IQ is indicated by a categorical variable:


In [79]:
iq_melt = pd.melt(data, value_vars=['FSIQ', 'PIQ'], value_name='iq', var_name="type")

In [80]:
iq_melt.head()


Out[80]:
   type   iq
0  FSIQ  133
1  FSIQ  140
2  FSIQ  139
3  FSIQ  133
4  FSIQ  137

In [81]:
model = ols("iq ~ type", iq_melt).fit()
print(model.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     iq   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                    0.2168
Date:                Thu, 06 Dec 2018   Prob (F-statistic):              0.643
Time:                        12:05:30   Log-Likelihood:                -364.35
No. Observations:                  80   AIC:                             732.7
Df Residuals:                      78   BIC:                             737.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept     113.4500      3.683     30.807      0.000     106.119     120.781
type[T.PIQ]    -2.4250      5.208     -0.466      0.643     -12.793       7.943
==============================================================================
Omnibus:                      164.598   Durbin-Watson:                   1.531
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                8.062
Skew:                          -0.110   Prob(JB):                       0.0178
Kurtosis:                       1.461   Cond. No.                         2.62
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Note that the t statistic and p-value for type[T.PIQ] reproduce the independent 2-sample t-test of FSIQ against PIQ from above (up to sign). We can run the same kind of independent t-test on any other pairing, e.g. VIQ against PIQ:


In [82]:
stats.ttest_ind(data['VIQ'], data['PIQ'])


Out[82]:
Ttest_indResult(statistic=0.2570673685607642, pvalue=0.7978036781233253)

Multiple Regression: including multiple factors


In [85]:
iris = pd.read_csv('https://www.scipy-lectures.org/_downloads/iris.csv')

In [93]:
import seaborn as sns
sns.pairplot(iris, hue='name');


Sepal and petal size tend to be related: bigger flowers are bigger! But is there, in addition, a systematic effect of species?


In [95]:
model = ols('sepal_width ~ name + petal_length', iris).fit()
print(model.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:            sepal_width   R-squared:                       0.478
Model:                            OLS   Adj. R-squared:                  0.468
Method:                 Least Squares   F-statistic:                     44.63
Date:                Thu, 06 Dec 2018   Prob (F-statistic):           1.58e-20
Time:                        12:10:56   Log-Likelihood:                -38.185
No. Observations:                 150   AIC:                             84.37
Df Residuals:                     146   BIC:                             96.41
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              2.9813      0.099     29.989      0.000       2.785       3.178
name[T.versicolor]    -1.4821      0.181     -8.190      0.000      -1.840      -1.124
name[T.virginica]     -1.6635      0.256     -6.502      0.000      -2.169      -1.158
petal_length           0.2983      0.061      4.920      0.000       0.178       0.418
==============================================================================
Omnibus:                        2.868   Durbin-Watson:                   1.753
Prob(Omnibus):                  0.238   Jarque-Bera (JB):                2.885
Skew:                          -0.082   Prob(JB):                        0.236
Kurtosis:                       3.659   Cond. No.                         54.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
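
The summary shows separate coefficients for versicolor and virginica. To test whether those two species effects differ from each other, we can state a contrast on the fitted model; a sketch using statsmodels' f_test with the coefficient names from the summary above:


In [ ]:
# do versicolor and virginica have the same effect on sepal width?
print(model.f_test("name[T.versicolor] = name[T.virginica]"))
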

Analysis of Variance (ANOVA)

One-way ANOVA



In [109]:
# compare sepal_length across the three species with a boxplot
sns.boxplot(x='name', y='sepal_length', data=iris);



In [111]:
# split sepal_length into one series per species
setosa = iris[iris['name'] == 'setosa']['sepal_length']
versicolor = iris[iris['name'] == 'versicolor']['sepal_length']
virginica = iris[iris['name'] == 'virginica']['sepal_length']

In [112]:
# one-way ANOVA: are the three species' mean sepal lengths equal?
f_value, p_value = stats.f_oneway(setosa, versicolor, virginica)

In [117]:
print(f_value, p_value)


119.26450218450468 1.6696691907693826e-31

In [134]:
# back to the brain-size data: note the NaN in the Weight column
data.head()


Out[134]:
   Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
1  Female   133  132  124   118.0    64.5     816932
2    Male   140  150  124     NaN    72.5    1001121
3    Male   139  123  150   143.0    73.3    1038437
4    Male   133  129  128   172.0    68.8     965353
5  Female   137  132  134   147.0    65.0     951545

In [136]:
# drop any row containing at least one missing value
data = data.dropna(axis=0, how='any')

In [141]:
# confirm no missing values remain
data.isna().sum()


Out[141]:
Gender       0
FSIQ         0
VIQ          0
PIQ          0
Weight       0
Height       0
MRI_Count    0
dtype: int64
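
Re-checking the shape would confirm the drop: the row count is now the original 40 minus however many rows contained a missing value (a sketch, not executed here):


In [ ]:
# fewer rows than the original (40, 7) if any row had a NaN
data.shape
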

In [147]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# fit the one-way model with a formula and produce an ANOVA table
mod = ols('sepal_length ~ name', data=iris).fit()

aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)


             sum_sq     df           F        PR(>F)
name      63.212133    2.0  119.264502  1.669669e-31
Residual  38.956200  147.0         NaN           NaN
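
The ANOVA tells us the three species means are not all equal, but not which pairs differ. A common follow-up is Tukey's HSD post-hoc test; a sketch using statsmodels.stats.multicomp (not part of the original notebook):


In [ ]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# pairwise comparisons of sepal_length between the three species
print(pairwise_tukeyhsd(iris['sepal_length'], iris['name']))
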
