https://www.scipy-lectures.org/packages/statistics/index.html
In [23]:
#import pandas and use magic function for plotting
import pandas as pd
%matplotlib inline
In [25]:
# import our data using pandas read_csv() function where
# delimiter = ';', index_col = 0, na_values = '.'
data = pd.read_csv('https://www.scipy-lectures.org/_downloads/brain_size.csv',
delimiter=';', index_col=0, na_values='.')
In [26]:
# check out our data using pandas df.head() function
data.head()
Out[26]:
In [29]:
data.describe()
Out[29]:
In [30]:
# how many observations do we have? use pandas df.shape attribute
data.shape
Out[30]:
In [32]:
# check out one column with df['column name'] or df.column_name
#data['Gender']
data.Gender
Out[32]:
In [34]:
# make a groupby object on the dataframe
data_grouped = data.groupby("Gender")
In [44]:
# look at mean weight per group in data_grouped
data_grouped['Height'].min()
Out[44]:
In [46]:
# import plotting from pandas
from pandas import plotting
# take a look at our data distributions and pair-wise correlations
# using plotting.scatter_matrix() on Weight, Height, and MRI_Count
plotting.scatter_matrix(data[['Weight', 'Height', 'MRI_Count']]);
In [47]:
# scatter_matrix on PIQ VIQ FSIQ
plotting.scatter_matrix(data[['PIQ', 'VIQ', 'FSIQ']]);
In [51]:
# import seaborn plotting package
import seaborn as sns
# use the kdeplot() function on FSIQ, PIQ and VIQ
sns.kdeplot(data['FSIQ']);
sns.kdeplot(data['PIQ']);
sns.kdeplot(data['VIQ']);
In [52]:
sns.kdeplot(data['FSIQ'], cumulative=True);
sns.kdeplot(data['PIQ'], cumulative=True);
sns.kdeplot(data['VIQ'], cumulative=True);
# cumulative = True
In [53]:
# import stats from scipy
from scipy import stats
scipy.stats.ttest_1samp() tests if the population mean of data is likely to be equal to a given value (technically if observations are drawn from a Gaussian distributions of given population mean). It returns the T statistic, and the p-value (see the function’s help)
In [54]:
# run a 1 sample t-test on VIQ against 0
stats.ttest_1samp(data['VIQ'], 0)
Out[54]:
In [55]:
# set up two series for female and male VIQ
female = data[data.Gender == 'Female']['VIQ']
male = data[data.Gender == 'Male']['VIQ']
In [56]:
# use stats.ttest_ind() on the two series
stats.ttest_ind(female, male)
Out[56]:
The PIQ, VIQ and FSIQ are three different measures of IQ in the same individual.
We can first look if FSIQ and PIQ are different using the 2-sample t-test.
In [57]:
# run a 2-sample t-test on FSIQ and PIQ
stats.ttest_ind(data['FSIQ'], data['PIQ'])
Out[57]:
However this doesn't account for individual differences contributing to variance in data.
We can use a paired t-test or repeated measures test to account for these individual differences.
In [60]:
# use the ttest_rel() to do a paired t-test on FSIQ and PIQ
stats.ttest_rel(data['FSIQ'], data['PIQ'])
Out[60]:
This is actually equivalent to doing a 1-sample t-test on the difference of the two measures.
In [61]:
# 1-sample ttest on the difference
stats.ttest_1samp(data['FSIQ'] - data['PIQ'], 0)
Out[61]:
These tests assume normality in the data. A non-parametric alternative is the Wilcoxian signed rank test
In [62]:
# for a non-parametric version of this test, use stats.wilcoxon()
stats.wilcoxon(data['FSIQ'], data['PIQ'])
Out[62]:
Note:
The corresponding test in the non paired case is the Mann–Whitney U test, scipy.stats.mannwhitneyu().
Given two set of observations, x and y, we want to test the hypothesis that y is a linear function of x. We will use the statsmodels module to:
In [65]:
# Let's simulate some data according to the model
# import numpy
import numpy as np
# create range of 20 x values over -5 to 5 using np.linspace()
x = np.linspace(-5,5,20)
# set random seed to 1
np.random.seed(1)
# normal distributed noise of y = -5 + 3x + 4 * np.random.normal(size=x.shape)
y = -5 + 3*x + 4 * np.random.normal(size=x.shape)
# Create a DataFrame named sim_data containing all the relevant variables
fake_data = pd.DataFrame({"x": x, "y": y})
In [66]:
# import ols from statsmodels.formmula.api
from statsmodels.formula.api import ols
# Specify an OLS model (y ~ x) and fit it using the .fit() method
model = ols('y ~ x', fake_data).fit()
In [68]:
# Inspect the results of the model using model.summary()
model.summary()
print(model.params)
In [ ]:
# Retrieve the model params, note tab completion
In [69]:
# remind ourselves of our data using head()
data.head()
Out[69]:
In [70]:
# We can write a comparison between VIQ of male and female
# (VIQ ~ Gender) using a linear model:
model = ols("VIQ ~ Gender", data).fit()
# ols automatically detects Gender as categorical
# model = ols('VIQ ~ C(Gender)', data).fit()
# view results of the model using .summary()
model.summary()
Out[70]:
In [ ]:
# read in our iris data https://www.scipy-lectures.org/_downloads/iris.csv
In [ ]:
# use seaborns pairplot() function and hue='name', use ; to remove output
Sepal and petal size tend to be related: bigger flowers are bigger!
But is there in addition a systematic effect of species?
In [ ]:
# let's make a model that takes into account this potential interaction of name (sepal_width ~ name + petal_length)
# view the results of the model
In [71]:
# remind ourselves of our iris data
iris = pd.read_csv("https://www.scipy-lectures.org/_downloads/iris.csv")
In [72]:
iris.head()
Out[72]:
In [74]:
# let's visualize the difference species and their sepal_length in
# a box plot using seaborns box_plot()
sns.boxplot(x="name", y="sepal_length", data=iris);
In [76]:
# we need to split our data to run an anova using stats.f_oneway()
# make a variable for setosa, versicolor and virginica on sepal_length
setosa = iris[iris['name'] == 'setosa']['sepal_length']
versicolor = iris[iris['name'] == 'versicolor']['sepal_length']
virginica = iris[iris['name'] == 'virginica']['sepal_length']
In [78]:
# use stats.f_oneway() on the series you made for each species and
# get an f_value and p_value
f_value, p_value = stats.f_oneway(setosa, versicolor, virginica)
# print the results
print(f_value, p_value)
In [82]:
# another way to do it that prevents having to split your data
# import sm from statsmodels.api and import ols from statsmodels.formula.api
import statsmodels.api as sm
from statsmodels.formula.api import ols
# specify the model (sepal_length ~ name) and fit
model = ols('sepal_length ~ name', iris).fit()
# make an aov table using sm.stats.anova_lm() using type=2 since we don't have an interaction term
aov_table = sm.stats.anova_lm(model, typ=2)
# print the aov_table
print(aov_table)
In [83]:
model = ols('FSIQ ~ Gender + MRI_Count + Gender:MRI_Count', data).fit()
In [84]:
aov_table = sm.stats.anova_lm(model, typ=1)
In [85]:
print(aov_table)
In [ ]: