Javier Garcia-Bernardo garcia@uva.nl
In [1]:
import pandas as pd
import numpy as np
import pylab as plt
from scipy.stats import chi2_contingency,ttest_ind
#This allows us to use R
%load_ext rpy2.ipython
#Visualize in line
%matplotlib inline
#Be able to plot images saved in the hard drive
from IPython.display import Image,display
#Make the notebook wider
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
To learn about statistics well, I recommend this course: https://www.coursera.org/specializations/social-science
In [58]:
print("look-elsewhere effect")
Image(url="http://www.tylervigen.com/chart-pngs/13.png",width=1000)
Out[58]:
In [4]:
#Create some totally random data
import numpy as np
df = pd.DataFrame(np.random.random((100,51)))
cols_x = []
for i in range(50):
cols_x.append("x"+str(i))
df.columns = ["y"] + cols_x
print(cols_x)
df.head()
Out[4]:
In [5]:
#Fit a regression
import statsmodels.formula.api as smf
mod = smf.ols(formula='y ~ {}'.format("+".join(cols_x)), data=df)
res = mod.fit()
print(res.summary())
We're in a replication crisis.
Number of failed replications: of another author (your own papers)
`
chemistry: 90% (60%),
biology: 80% (60%),
physics and engineering: 70% (50%),
medicine: 70% (60%),
Earth and environment science: 60% (40%).
`
In [4]:
Image(url="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/01bec95ec63634b9062de57edde1ecf7/replicationbypvalue.png")
Out[4]:
In probability theory and statistics, the binomial distribution with parameters x, n and p is the discrete probability distribution of x
number of successes in a sequence of n
independent yes/no experiments, each of which yields success with probability p
The Relative Risk (RR) is the probability of the events happening in your sample of interest vs the probability of the events happening in the control group.
What's the probability of getting 40 heads out of 100 tosses given that a coin is fair? (p-value)
In [8]:
import scipy.stats
import statsmodels.stats.proportion
#probability of heads: p = 0.5 (50%)
#number of succeses: 40
#number of trials: n = 100
#control group = unbiased dice -> p=0.5
#sampe = this possible biased die -> p=0.4
#RR = 0.4/0.5 = 4:5 = 0.8
#p-value
pvalue = scipy.stats.binom_test(40, n=100, p=0.5)
#confidence interavls
conf = statsmodels.stats.proportion.proportion_confint(40,nobs=100,method="beta")
print(pvalue)
print(conf) #they usually do not include 0.5 (our control probability) if not significant.
What's the probability that 10 journalists out of 1 million people get killed if the chances of getting killed are 1 in 1 million for the entire country? (p-value)
In [15]:
#Code here
#number of succeses: 10
#number of trials: 1000000
#control group = unbiased dice -> p = 1/1000000
#sampe = this possible biased die -> p= 10/1000000
#RR = 10
pvalue = scipy.stats.binom_test(10,1000000,p=1/1000000)
conf = statsmodels.stats.proportion.proportion_confint(10,nobs=1000000,method="beta",alpha=0.001)
print(pvalue)
print(conf)
conf[0]*1000000,conf[1]*1000000
Out[15]:
In [72]:
#just plotting a binomial distribution, no worries about the code
x = np.linspace(0,10,11)
pmf = scipy.stats.binom.pmf(x,10,0.2)
plt.plot(x,pmf,".-")
plt.xlabel("Number of successes (out of 10)")
plt.ylabel("Frequency")
Out[72]:
A note on one-sided test vs two-sided tests
In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
(wikipedia)
Example 1: We will use this class as a sample
`Is the probability of having dark eyes independent of the gender?`
In [22]:
df_eyes = pd.DataFrame(
[
[10,5],
[5,10]
],columns=["Dark","Clear"],index=["Male","Female"])
df_eyes
Out[22]:
In [24]:
import scipy.stats
chi,p,dof,expected = scipy.stats.chi2_contingency(df_eyes)
print(p)
display(expected)
In [25]:
#Ratios
ratio_real_vs_expected = df_eyes/expected
ratio_real_vs_expected
Out[25]:
In [73]:
#just plotting a chi-square distribution, no worries
x = np.linspace(0,10,11)
pmf = scipy.stats.chi2.pdf(x,df=1)
plt.plot(x,pmf,".-")
plt.xlabel("Chi-2 value")
plt.ylabel("Frequency")
Out[73]:
Using LAPOP (survey data for Latin America): http://datasets.americasbarometer.org/database-login/usersearch.php?year=2014 We are going to see if there is any relationship between the best method to finish the conflict in Colombia and the belief that the conflict will end in the next year
In [48]:
#Read the data from colombia in "data/colombia.dta", it's a stata file
df = pd.read_stata("data/colombia.dta")
df.to_csv("data/colombia.csv",index=None)
df = pd.read_csv("data/colombia.csv")
In [49]:
df.groupby("upm").mean().reset_index()
Out[49]:
In [27]:
x_variable = df["colpaz1a"] #What is the best method to continue in the conflict
other_variables =[df["colpaz2a"]] #What are the chances that peace happens within one year
In [28]:
#Create a contingency table (pd.crosstab) between the x_variable and the other_variables
col_crosstab = pd.crosstab(x_variable,other_variables)
col_crosstab
Out[28]:
In [29]:
#Calculate the chi-square test
chi,p,dof,expected = chi,p,dof,expected = scipy.stats.chi2_contingency(col_crosstab)
print(p)
display(expected)
In [30]:
col_crosstab/expected
Out[30]:
Is it significant?
In [31]:
##Now let's add gender
x_variable = df["colpaz1a"] #What is the best method to continue
other_variables =[df["q1"],df["colpaz2a"]] #Gender and what are the chances that peace happens within one year
In [35]:
#Create a contingency table (pd.crosstab) between the x_variable and the other_variables
col_crosstab = pd.crosstab(x_variable,other_variables)
col_crosstab
Out[35]:
In [36]:
#Calculate the chi-square test
chi,p,dof,expected = chi,p,dof,expected = scipy.stats.chi2_contingency(col_crosstab)
print(p)
display(expected)
In [37]:
col_crosstab/expected
Out[37]:
Is it significant?
Let's use R, Fisher test for big tables do not exist in Python
In [ ]:
#Install R (already installed)
!conda install -c r r-essentials
#Install link between R and python (already installed)
!pip install rpy2
In [38]:
%load_ext rpy2.ipython
In [39]:
%%R -i col_crosstab
fisher.test(col_crosstab,simulate.p.value=TRUE,B=1e6)