Javier Garcia-Bernardo garcia@uva.nl
- Compare one group vs one value
- Compare two groups
- Compare two paired groups
- ANOVA
- Multiple comparison (Tukey correction)
- Linear regression
- Logistic regression
- Machine learning
In [10]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell
#Print the plots in this screen
%matplotlib inline
#Be able to plot images saved in the hard drive
from IPython.display import Image
#Make the notebook wider
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
import seaborn as sns
import pylab as plt
import pandas as pd
import numpy as np
import scipy.stats
import statsmodels.formula.api as smf
In [11]:
def qq_plot(x):
    (osm, osr), (slope, intercept, r) = scipy.stats.probplot(x, dist='norm', plot=None)
    plt.plot(osm, osr, '.', osm, slope*osm + intercept)
    plt.xlabel('Quantiles', fontsize=14)
    plt.ylabel('Quantiles Obs', fontsize=14)
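A quick way to see what qq_plot is measuring (a sketch with simulated data; the variable names are mine): probplot also returns r, the correlation between the observed and theoretical quantiles, which is close to 1 when the data are normal and drops when they are not.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(0)
normal_sample = rng.normal(size=1000)
skewed_sample = rng.exponential(size=1000)

# r is the correlation between the sample quantiles and the normal quantiles
(_, _), (_, _, r_normal) = scipy.stats.probplot(normal_sample, dist='norm')
(_, _), (_, _, r_skewed) = scipy.stats.probplot(skewed_sample, dist='norm')

print(r_normal, r_skewed)  # r_normal should be much closer to 1
```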
Assumptions --> not easy to check without a prior understanding of statistics: http://statsmodels.sourceforge.net/devel/diagnostic.html
Regression using dummy variables for each category ~ ANOVA
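That equivalence can be checked numerically (a sketch with simulated data; the names are mine): an OLS with one dummy per category yields the same F statistic for the group effect as a one-way ANOVA, because the fitted values of the dummy regression are exactly the group means.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(42)
# three groups with slightly different means
g1 = rng.normal(0.0, 1, 50)
g2 = rng.normal(0.5, 1, 50)
g3 = rng.normal(1.0, 1, 50)
y = np.concatenate([g1, g2, g3])

# design matrix: intercept + dummies for groups 2 and 3 (group 1 is the baseline)
X = np.column_stack([np.ones(150),
                     np.r_[np.zeros(50), np.ones(50), np.zeros(50)],
                     np.r_[np.zeros(100), np.ones(50)]])

beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
rss_full = np.sum((y - X @ beta) ** 2)   # full model (fitted values = group means)
rss_null = np.sum((y - y.mean()) ** 2)   # intercept-only model

# F test for the joint effect of the two dummies
F_reg = ((rss_null - rss_full) / 2) / (rss_full / (150 - 3))
F_anova, _ = scipy.stats.f_oneway(g1, g2, g3)
print(F_reg, F_anova)  # the two F statistics coincide
```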
In [375]:
df = pd.read_csv("data/big3_position.csv",sep="\t")
df.head()
Out[375]:
mod = smf.ols(formula='MarketCap ~ Revenue + Employees + Revenue*Assets + C(TypeEnt)', data=df)
The * in the formula means: also calculate the interaction effect.
In [235]:
#How to run a regression (be careful, this is wrong)
mod = smf.ols(formula='MarketCap ~ Revenue + Employees', data=df)
res = mod.fit()
print(res.summary())
Why is it wrong?
In [12]:
## Assumptions
#- Normality
qq_plot(df["MarketCap"])
In [376]:
##Let's fix that
df["logMarketcap"] = np.log(df["MarketCap"])
df["logRevenue"] = np.log(df["Revenue"])
df["logEmployees"] = np.log(df["Employees"])
df["logAssets"] = np.log(df["Assets"])
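Why logs help here (a sketch with simulated lognormal data, mimicking the heavy right skew of financial variables such as market cap and revenue): the log transform pulls the long right tail in, bringing the distribution much closer to normal.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(0)
# lognormal data: heavily right-skewed, like most size variables in finance
raw = rng.lognormal(mean=10, sigma=1.5, size=5000)

skew_raw = scipy.stats.skew(raw)
skew_log = scipy.stats.skew(np.log(raw))
print(skew_raw, skew_log)  # the log-transformed data is far less skewed
```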
In [377]:
## Assumptions
#- Normality
qq_plot(df["logMarketcap"])
In [237]:
#Keep only finite values (otherwise we'd get a linear algebra error in our regression)
df = df.replace([np.inf, -np.inf], np.nan)
df_nonans = df.dropna(subset=["logMarketcap","logRevenue","logEmployees","logAssets"])
#Keep only three types of entities
df_nonans = df_nonans.loc[df_nonans["TypeEnt"].isin(["Financial company","Industrial company","Bank"])]
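What the cleaning step guards against (a minimal sketch with a toy frame): log of zero gives -inf and log of a negative number gives NaN, and either breaks the linear algebra inside the regression fit.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Revenue": [100.0, 0.0, 50.0, -5.0]})
# log(0) -> -inf; log of a negative number -> NaN (numpy emits a RuntimeWarning)
toy["logRevenue"] = np.log(toy["Revenue"])

# same recipe as above: turn infinities into NaN, then drop incomplete rows
toy = toy.replace([np.inf, -np.inf], np.nan)
clean = toy.dropna(subset=["logRevenue"])
print(clean)  # only the rows with positive Revenue survive
```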
In [238]:
#Make sure all our values are fine
df_nonans.describe()
Out[238]:
In [239]:
#Now our cond. number is okay
mod1 = smf.ols(formula='logMarketcap ~ logAssets', data=df_nonans)
res1 = mod1.fit()
print(res1.summary())
So the normality test in the summary is telling us that the residuals are not normally distributed
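The normality check reported in the OLS summary is the Jarque-Bera statistic, which grows with the skewness and excess kurtosis of the residuals; a sketch on simulated residuals:

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(0)
resid_normal = rng.normal(size=500)
resid_heavy = rng.standard_t(df=2, size=500)   # heavy-tailed residuals

jb_normal, p_normal = scipy.stats.jarque_bera(resid_normal)
jb_heavy, p_heavy = scipy.stats.jarque_bera(resid_heavy)
print(jb_normal, jb_heavy)  # heavy tails inflate the statistic enormously
```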
In [78]:
#Check this --> only a problem in the tails
qq_plot(res1.resid)
In [226]:
#Residuals look fine except in the tails
df_nonans["residuals"] = res1.resid
df_nonans["predicted"] = res1.predict()
sns.lmplot(x="predicted",y="residuals",fit_reg=False,data=df_nonans)
Out[226]:
In [379]:
df_nonans["predicted"] = res1.predict()
sns.lmplot(x="logMarketcap", y="predicted", data=df_nonans)
plt.xlim((5,22))
Out[379]:
In [228]:
#Add the type of entity as a categorical predictor
mod2 = smf.ols(formula='logMarketcap ~ logAssets + C(TypeEnt)', data=df_nonans)
res2 = mod2.fit()
print(res2.summary())
In [82]:
#Check this --> only a problem in the tails
qq_plot(res2.resid)
In [259]:
df_nonans["predicted"] = res2.predict()
sns.lmplot(x="logMarketcap", y="predicted", data=df_nonans)
Out[259]:
In [260]:
#Add an interaction term between logAssets and logEmployees
mod3 = smf.ols(formula='logMarketcap ~ logAssets*logEmployees + C(TypeEnt)', data=df_nonans)
res3 = mod3.fit()
print(res3.summary())
In [261]:
df_nonans["predicted"] = res3.predict()
sns.lmplot(x="logMarketcap", y="predicted", data=df_nonans)
Out[261]:
In [230]:
from statsmodels.iolib.summary2 import summary_col
dfoutput = summary_col([res1,res2,res3],stars=True)
dfoutput
Out[230]:
More info: http://nbviewer.jupyter.org/urls/umich.box.com/shared/static/6tfc1e0q6jincsv5pgfa.ipynb
More info: http://nbviewer.jupyter.org/urls/umich.box.com/shared/static/lc6uf6dmabmitjbup3yt.ipynb
Combine fixed and random effects:
For instance, suppose we want to measure the relationship between unemployment and productivity for different regions in many countries. Each country may have its own "structural" unemployment; we can get rid of this variability by using country as a random effect.
Another example: we measure the relationship between two variables at different years. We don't really care about how they change in time, only about how one variable affects the other. We can use time as the random effect.
To allow for random slopes, use the re_formula parameter (see the first notebook in the "more info" links above).
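The grouping intuition can be reproduced without mixedlm (a sketch with simulated data, a Simpson's-paradox-like situation): each group has a negative within-group slope, but the groups' baselines shift upward so that pooling everything yields a positive slope. The random intercept is exactly what absorbs those baseline shifts.

```python
import numpy as np

rng = np.random.RandomState(0)
slopes_within, xs, ys = [], [], []
for group in range(3):
    # each group sits at a higher x AND a higher baseline y...
    x = rng.uniform(0, 2, 100) + 3 * group
    # ...but the slope within every group is negative
    y = 5 * group - 1.0 * x + rng.normal(0, 0.3, 100)
    slopes_within.append(np.polyfit(x, y, 1)[0])
    xs.append(x)
    ys.append(y)

pooled_slope = np.polyfit(np.concatenate(xs), np.concatenate(ys), 1)[0]
print(pooled_slope, slopes_within)  # pooled slope positive, within-group slopes negative
```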
In [241]:
Image(url="http://zoonek2.free.fr/UNIX/48_R/g1096.png")
Out[241]:
In [307]:
#https://hlplab.wordpress.com/tag/regression/
print("Random slopes")
Image(url="https://hlplab.files.wordpress.com/2011/05/simpsons-paradox.png",width=800)
Out[307]:
In [289]:
#Random intercepts for the different types of entities. We could also add random slopes
mod = smf.mixedlm(formula='logMarketcap ~ logAssets*logEmployees',groups="TypeEnt", data=df_nonans)
res4 = mod.fit()
print(res4.summary())
In [273]:
res4.random_effects
Out[273]:
In [302]:
#Fixed-effects part of the prediction, using the coefficients from the summary above
df_nonans["pred"] = 3.258 + 0.674*df_nonans["logAssets"] - 0.248*df_nonans["logEmployees"] + 0.023*df_nonans["logAssets"]*df_nonans["logEmployees"]
sns.lmplot(x= "logMarketcap",y="pred",data=df_nonans,fit_reg=False,hue="TypeEnt",scatter_kws={"alpha":0.5})
Out[302]:
In [381]:
df = pd.read_csv("data/it_es.csv",sep="\t")
df = df.loc[df["Year"]>2004]
df.head()
Out[381]:
In [370]:
#Random intercept per country, with Year as a random slope (re_formula)
mod = smf.mixedlm(formula='LABOUR_PRODUCTIVITY ~ UNEMP_R',re_formula="Year",groups="Country", data=df)
res4 = mod.fit()
print(res4.summary())
In [371]:
res4.random_effects
Out[371]:
In [372]:
df["CityYear"] = df["Metropolitan areas"] + df["Year"].astype(str)
sns.lmplot(x="UNEMP_R", y="LABOUR_PRODUCTIVITY", data=df, hue="CityYear", fit_reg=False, legend=False)
Out[372]:
In [385]:
a = df.groupby(["Country","Metropolitan areas"]).mean().reset_index()
sns.lmplot(x="UNEMP_R", y="LABOUR_PRODUCTIVITY", data=a, hue="Country", fit_reg=False)
#mod = smf.ols(formula='LABOUR_PRODUCTIVITY ~ UNEMP_R', data=a)
#print(mod.fit().summary())
Out[385]:
In [93]:
Image(url="https://upload.wikimedia.org/wikipedia/commons/6/6d/Exam_pass_logistic_curve.jpeg")
Out[93]:
In [134]:
df = pd.read_csv("data/Titanic.csv")
df = df.dropna()
df.head()
Out[134]:
In [205]:
#Logistic regression: survival as a function of passenger class and sex
mod = smf.logit(formula='Survived ~ C(PClass) + C(Sex)', data=df)
res2 = mod.fit()
print(res2.summary())
In [206]:
##Odds ratios
print(np.exp(res2.params))
In [216]:
def calc_prob(odds):
    return odds/(odds+1)
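A worked check of the conversion, using the baseline odds exp(intercept) = 8.695166 printed above (female, 1st class, the model's reference category):

```python
def calc_prob(odds):
    return odds/(odds+1)

baseline_odds = 8.695166       # exp(intercept): female in 1st class
p = calc_prob(baseline_odds)
print(round(p, 3))             # about 0.897: ~90% predicted survival for this group
```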
In [220]:
#Female in first class
print("Female, 1st class")
c = df.loc[(df["Sex"]=="female") & (df["PClass"]=="1st")]["Survived"]
print("Predicted: ", calc_prob(8.695166), "Real: ", np.sum(c)/len(c), "\n")
#Female in second class
print("Female, 2nd class")
c = df.loc[(df["Sex"]=="female") & (df["PClass"]=="2nd")]["Survived"]
print("Predicted: ", calc_prob(8.695166*0.452182), "Real: ", np.sum(c)/len(c), "\n")
#Female in third class
print("Female, 3rd class")
c = df.loc[(df["Sex"]=="female") & (df["PClass"]=="3rd")]["Survived"]
print("Predicted: ", calc_prob(8.695166*0.155614), "Real: ", np.sum(c)/len(c), "\n")
In [219]:
#Male in first class
print("Male, 1st class")
c = df.loc[(df["Sex"]=="male") & (df["PClass"]=="1st")]["Survived"]
print("Predicted: ", calc_prob(8.695166*0.072710), "Real: ", np.sum(c)/len(c), "\n")
#Male in second class
print("Male, 2nd class")
c = df.loc[(df["Sex"]=="male") & (df["PClass"]=="2nd")]["Survived"]
print("Predicted: ", calc_prob(8.695166*0.452182*0.072710), "Real: ", np.sum(c)/len(c), "\n")
#Male in third class
print("Male, 3rd class")
c = df.loc[(df["Sex"]=="male") & (df["PClass"]=="3rd")]["Survived"]
print("Predicted: ", calc_prob(8.695166*0.155614*0.072710), "Real: ", np.sum(c)/len(c), "\n")
In [128]:
print("For instance, here we have 10 variables measuring the same thing; blue = SVR, red = linear regression. The weights are not good for linear regression")
Image(url="http://www.alivelearn.net/wp-content/uploads/2012/11/weight.jpg")
Out[128]:
In [249]:
from sklearn import svm
y = df_nonans["logMarketcap"]
X = df_nonans.loc[:, ["logAssets","logEmployees","logRevenue"]]
clf = svm.SVR(kernel="linear")
clf.fit(X, y)
predicted_values = clf.predict(X)
df_nonans["predicted"] = predicted_values
print(clf.intercept_)
print(clf.coef_)
In [250]:
sns.lmplot(x="logMarketcap", y="predicted", data=df_nonans)
Out[250]:
In [243]:
#Compare: an OLS with the same predictors as the SVR above
mod = smf.ols(formula='logMarketcap ~ logAssets + logEmployees + logRevenue', data=df_nonans)
res2 = mod.fit()
print(res2.summary())
In [ ]: