https://www.scipy-lectures.org/packages/statistics/index.html
In [1]:
# import pandas and use the %matplotlib inline magic so plots render in the notebook
import pandas as pd
%matplotlib inline
In [54]:
# import our data using pandas read_csv() function where delimiter = ';', index_col = 0, na_values = '.'
data = pd.read_csv('https://www.scipy-lectures.org/_downloads/brain_size.csv',
delimiter=';', index_col=0, na_values='.')
In [11]:
# check out our data using pandas df.head() function
data.head()
Out[11]:
In [12]:
# how many observations do we have? use pandas df.shape attribute
data.shape
Out[12]:
In [14]:
# check out one column with df['column name'] or df.column_name
data.Gender.head()
Out[14]:
In [15]:
# make a groupby object on the dataframe
groupby_gender = data.groupby('Gender')
In [16]:
# take the mean of the groupby object across all measures using the .mean() method
groupby_gender.mean()
Out[16]:
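A groupby object can also be indexed by column before aggregating. As a quick sketch (not part of the original notebook), the per-gender mean of just the VIQ column:
In [ ]:
# illustrative: mean VIQ within each Gender group
groupby_gender['VIQ'].mean()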
In [24]:
# take a look at our data distributions and pair-wise correlations
from pandas import plotting
plotting.scatter_matrix(data[['Weight', 'Height', 'MRI_Count']]);
In [25]:
plotting.scatter_matrix(data[['PIQ', 'VIQ', 'FSIQ']]);
In [ ]:
## Distributions and cumulative distributions
In [123]:
# seaborn is needed for the kernel density plots below
import seaborn as sns
sns.kdeplot(data['FSIQ'])
sns.kdeplot(data['PIQ'])
sns.kdeplot(data['VIQ'])
Out[123]:
In [124]:
sns.kdeplot(data['FSIQ'], cumulative=True)
sns.kdeplot(data['PIQ'], cumulative=True)
sns.kdeplot(data['VIQ'], cumulative=True)
Out[124]:
In [28]:
from scipy import stats
scipy.stats.ttest_1samp() tests whether the population mean of the data is likely to
be equal to a given value (technically, whether the observations are drawn from a Gaussian
distribution with a given population mean). It returns the T statistic and the
p-value (see the function's help).
In [29]:
# run a 1-sample t-test
stats.ttest_1samp(data['VIQ'], 0)
Out[29]:
In [30]:
female_viq = data[data['Gender'] == 'Female']['VIQ']
male_viq = data[data['Gender'] == 'Male']['VIQ']
stats.ttest_ind(female_viq, male_viq)
Out[30]:
The PIQ, VIQ and FSIQ are three different measures of IQ in the same individual.
We can first check whether FSIQ and PIQ differ using a 2-sample t-test.
In [32]:
stats.ttest_ind(data['FSIQ'], data['PIQ'])
Out[32]:
However, this doesn't account for individual differences contributing to the variance in the data.
We can use a paired t-test (a repeated-measures test) to account for these individual differences.
In [33]:
stats.ttest_rel(data['FSIQ'], data['PIQ'])
Out[33]:
This is actually equivalent to doing a 1-sample t-test on the difference of the two measures.
In [35]:
stats.ttest_1samp(data['FSIQ'] - data['PIQ'], 0)
Out[35]:
These tests assume normality of the data. A non-parametric alternative is the Wilcoxon signed-rank test.
In [36]:
stats.wilcoxon(data['FSIQ'], data['PIQ'])
Out[36]:
Note:
The corresponding test in the non-paired case is the Mann–Whitney U test, scipy.stats.mannwhitneyu().
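As a hedged sketch (not in the original notebook), the non-paired, non-parametric comparison of the female and male VIQ samples defined above could look like:
In [ ]:
# illustrative: Mann-Whitney U test on the female/male VIQ samples from earlier
stats.mannwhitneyu(female_viq, male_viq, alternative='two-sided')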
Given two sets of observations, x and y, we want to test the hypothesis that y is a linear function of x. We will use the statsmodels module to specify and fit an ordinary least squares (OLS) model.
In [52]:
# Let's simulate some data according to the model
import numpy as np
x = np.linspace(-5, 5, 20)
np.random.seed(1)
# normally distributed noise
y = -5 + 3*x + 4 * np.random.normal(size=x.shape)
# Create a data frame containing all the relevant variables
sim_data = pd.DataFrame({'x': x, 'y': y})
In [53]:
# Specify an OLS model and fit it
from statsmodels.formula.api import ols
model = ols("y ~ x", sim_data).fit()
In [46]:
# Inspect the results of the model fit
print(model.summary())
In [48]:
# Retrieve the model params, note tab completion
model.params
Out[48]:
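Because model.params is a pandas Series indexed by term name, individual coefficients can be pulled out directly, and statsmodels fits also expose confidence intervals. A small illustrative sketch:
In [ ]:
# illustrative: slope estimate for x and 95% confidence intervals of the fit
print(model.params['x'])
model.conf_int()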
In [55]:
data.head()
Out[55]:
In [62]:
# We can write the comparison between male and female IQ as a linear model:
model = ols("VIQ ~ Gender", data).fit()
# ols automatically treats Gender (a string column) as categorical; an explicit equivalent is:
# model = ols('VIQ ~ C(Gender)', data).fit()
print(model.summary())
In [79]:
iq_melt = pd.melt(data, value_vars=['FSIQ', 'PIQ'], value_name='iq', var_name="type")
In [80]:
iq_melt.head()
Out[80]:
In [81]:
model = ols("iq ~ type", iq_melt).fit()
print(model.summary())
Note that this gives the same result as the independent t-test between the two measures.
In [82]:
stats.ttest_ind(data['FSIQ'], data['PIQ'])
Out[82]:
In [85]:
iris = pd.read_csv('https://www.scipy-lectures.org/_downloads/iris.csv')
In [93]:
import seaborn as sns
sns.pairplot(iris, hue='name');
Sepal and petal size tend to be related: bigger flowers are bigger!
But is there in addition a systematic effect of species?
In [95]:
model = ols('sepal_width ~ name + petal_length', iris).fit()
print(model.summary())
In [103]:
data.head()
Out[103]:
In [109]:
sns.boxplot(x='name', y='sepal_length', data=iris)
Out[109]:
In [111]:
setosa = iris[iris['name'] == 'setosa']['sepal_length']
versicolor = iris[iris['name'] == 'versicolor']['sepal_length']
virginica = iris[iris['name'] == 'virginica']['sepal_length']
In [112]:
f_value, p_value = stats.f_oneway(setosa, versicolor, virginica)
In [117]:
print(f_value, p_value)
In [134]:
data.head()
Out[134]:
In [136]:
# drop rows (axis=0) that contain any missing values
data = data.dropna(axis=0, how='any')
In [141]:
# confirm that no missing values remain in any column
data.isna().sum()
Out[141]:
In [147]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# one-way ANOVA via a linear model: the same question as stats.f_oneway above
mod = ols('sepal_length ~ name', data=iris).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)
In [ ]: