Working with data 2017. Class 4

Contact

Javier Garcia-Bernardo garcia@uva.nl

0. Structure

Stats
- Definitions
- What's a p-value?
- One-tailed test vs two-tailed test
- Count vs expected count (binomial test)
- Independence between factors: ($\chi^2$ test)
In-class exercises to melt, pivot, concat, merge, groupby and plot.
Read data from websites
Time series



In [1]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency,ttest_ind

#This allows us to use R
%load_ext rpy2.ipython

#Visualize in line
%matplotlib inline

#Be able to plot images saved in the hard drive
from IPython.display import Image,display,HTML

#Make the notebook wider
display(HTML("<style>.container { width:90% !important; }</style>"))

6 Statistics I

To learn about statistics well, I recommend this course: https://www.coursera.org/specializations/social-science

6.0 Definitions

6.0.1 Population vs sample

Population: The entire set of possible observations (all people in a country in a country-level survey)
Sample: The observations we actually have

6.0.2 Parameter vs statistic

Parameter: The true values that define the population ($\sigma$ and $\mu$ for a normal distribution)
Statistic: The values that we calculate using our sample (STD and MEAN for a normal distribution)
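
A minimal sketch of the difference, assuming a hypothetical normal population with $\mu$ = 170 and $\sigma$ = 10 (made-up numbers, not from any dataset): the statistics calculated from a sample of 50 observations are close to the parameters, but not identical.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, only for reproducibility

# Hypothetical population parameters (assumed for illustration)
mu, sigma = 170.0, 10.0

# The sample: the 50 observations we actually have
sample = rng.normal(mu, sigma, size=50)

# Statistics: calculated from the sample, they estimate the parameters
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 divides by n-1 (sample standard deviation)

print("parameters:", mu, sigma)
print("statistics:", sample_mean, sample_std)
```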

6.0.3 Probability

Probability: The proportion of times where the measured event occurs in the long run.
For instance, the probability of heads in a fair coin toss is 0.5, which means that if you toss the coin 1 million times you will get roughly 500k heads and 500k tails.
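
The "long run" idea can be checked with a small simulation. This is only a sketch with simulated tosses of a fair coin: the proportion of heads gets closer to 0.5 as the number of tosses grows.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

proportions = {}
for n_tosses in [10, 1_000, 1_000_000]:
    # 1 = heads, 0 = tails, each with probability 0.5
    tosses = rng.integers(0, 2, size=n_tosses)
    proportions[n_tosses] = tosses.mean()
    print(n_tosses, "tosses -> proportion of heads:", proportions[n_tosses])
```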

6.0.4 Null hypothesis:

The hypothesis that there is no effect, i.e. that our value is not different from what we would expect by chance.
For instance, our value can be the difference between two groups, or the difference between one group and zero.
It is assumed to be true and we try to disprove it.
It is rejected ("disproved") if the probability of obtaining a result at least as extreme as ours, assuming it is true, is lower than 5%.

6.0.5 Alternative hypothesis:

The hypothesis that there is an effect, i.e. that our value is different from what we would expect by chance.
Accepted after rejecting the null hypothesis.

6.0.6 Types of error:

Type I ($\alpha$): Rejecting the null hypothesis when it is actually true (saying we have something we don't have). As a rule we set $\alpha$ to 0.05, and if the p-value is below it we reject the null hypothesis and accept the alternative.
Type II ($\beta$): Failing to reject the null hypothesis when it is actually false (saying we don't have something we actually have). This is usually considered the less serious error.

6.0.7 p-value

The p-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, when the null hypothesis is true.
"Commonly misused and misinterpreted" --> a p-value of 0.01 does not mean that there is a 1% chance that you are wrong!
A low p-value can have two explanations:
- The null is true but your sample was unusual.
- The null is false.
With a p-value of 0.05, the probability of incorrectly rejecting a true null hypothesis has been estimated at around 23%, not 5% (mainly because of other biases).
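
To see what a p-value does and does not tell us, here is a sketch under assumed conditions (two groups of 30 simulated observations drawn from the same normal distribution, so the null is true by construction). About 5% of the experiments still give p < 0.05: those are the "unusual samples".

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

n_experiments = 2000
pvalues = np.empty(n_experiments)
for i in range(n_experiments):
    # Both groups come from the same distribution: the null hypothesis is true
    group_a = rng.normal(0, 1, size=30)
    group_b = rng.normal(0, 1, size=30)
    pvalues[i] = ttest_ind(group_a, group_b).pvalue

# Fraction of experiments that look "significant" even though there is no effect
false_positive_rate = (pvalues < 0.05).mean()
print(false_positive_rate)
```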

6.0.8 Confidence intervals

The range of values within which the true value of a parameter is likely to lie.
Its width depends on your significance level (usually 0.05, which gives a 95% confidence interval).
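
A minimal sketch of how such an interval can be calculated for a mean, assuming a hypothetical normal sample and a t-based interval (one common choice, not the only one). The names `sample`, `low` and `high` are just for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed, only for reproducibility

# Hypothetical sample of 40 observations (made-up parameters)
sample = rng.normal(loc=5.0, scale=2.0, size=40)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
alpha = 0.05             # significance level -> 95% confidence interval

# Interval from the t distribution with n-1 degrees of freedom
low, high = stats.t.interval(1 - alpha, len(sample) - 1, loc=mean, scale=sem)
print(mean, (low, high))
```

A smaller `alpha` (for instance 0.01) gives a wider interval.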

6.0.9 Effect size:

The difference between the two groups (or between one group and zero) divided by the standard deviation.
Always report it together with its confidence interval.
Good summary: https://www.leeds.ac.uk/educol/documents/00002182.htm
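
As a sketch, the effect size for two groups can be calculated as below. The notes only say "the standard deviation"; using the pooled standard deviation (Cohen's d) is an assumption here, and the two groups are made-up numbers.

```python
import numpy as np

def cohens_d(a, b):
    """Difference between the means divided by the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical measurements for two groups
group_a = [5.1, 4.9, 6.2, 5.8, 5.5]
group_b = [4.2, 4.8, 4.5, 5.0, 4.1]

d = cohens_d(group_a, group_b)
print(d)
```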
"For example, an AIDS vaccine study in Thailand obtained a P value of 0.039. Great! This was the first time that an AIDS vaccine had positive results. However, the confidence interval for effectiveness ranged from 1% to 52%. That’s not so impressive...the vaccine may work virtually none of the time up to half the time. The effectiveness is both low and imprecisely estimated." quote: http://blog.minitab.com/blog/adventures-in-statistics-2/five-guidelines-for-using-p-values

6.0.10 Correlation

Dependence or association is any statistical relationship, whether causal or not, between two random variables or two sets of data. Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other.

6.0.11 In general:

Larger sample: We can detect smaller differences.
Smaller variability within groups: We can detect smaller differences.
Large differences between groups (effect size): Unlikely that it is due to noise.
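
These three rules can be illustrated with a small simulation. It is a sketch under assumed conditions (normal data, a t-test, made-up sample sizes and differences): it counts how often a real difference between two groups is detected at p < 0.05.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(11)  # arbitrary seed, only for reproducibility

def detection_rate(n, diff, sd, n_experiments=500):
    """Fraction of simulated experiments in which the difference is significant."""
    hits = 0
    for _ in range(n_experiments):
        group_a = rng.normal(0, sd, size=n)
        group_b = rng.normal(diff, sd, size=n)
        if ttest_ind(group_a, group_b).pvalue < 0.05:
            hits += 1
    return hits / n_experiments

baseline = detection_rate(n=20, diff=0.5, sd=1.0)
larger_sample = detection_rate(n=200, diff=0.5, sd=1.0)   # more observations
less_noise = detection_rate(n=20, diff=0.5, sd=0.25)      # smaller variability
bigger_effect = detection_rate(n=20, diff=2.0, sd=1.0)    # larger difference

print(baseline, larger_sample, less_noise, bigger_effect)
```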

6.1 Biases

6.1.1 Cherry-picking (yourself)

Using individual cases or data that seem to confirm a particular position, while ignoring a significant portion of related cases or data that may contradict that position.
Cherry picking may be committed intentionally or unintentionally.

6.1.2 Look-elsewhere effect (sample size)

If you try many things, one of them will be significant just by chance.
"With a large enough sample, any outrageous thing is likely to happen" (Persi Diaconis and Frederick Mosteller)
If you have a database, scan the values to see if there is something interesting, and then test only that, you are cheating. Your p-value threshold shouldn't be 0.05, it should be 0.05/(number of variables scanned beforehand).

6.1.3 Optional stopping (data collection)

It is a well-known fact of null-hypothesis significance testing (NHST) that when there is "optional stopping" of data collection with testing at every new datum (a procedure also called "sequential testing" or "data peeking"), then the null hypothesis will eventually be rejected even when it is true. With enough random sampling from the null hypothesis, eventually there will be some accidental coincidence of outlying values so that p < .05 (conditionalizing on the current sample size). Anscombe (1954) called this phenomenon, "sampling to reach a foregone conclusion." (from: http://doingbayesiandataanalysis.blogspot.nl/2013/11/optional-stopping-in-data-collection-p.html)
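
A sketch of optional stopping with simulated data where the null is true (mean 0). To keep it fast we peek every 5 observations instead of at every new datum, which is an assumption of this example; the sample sizes are also made up. Peeking gives many more false positives than a single test with a fixed sample size.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)  # arbitrary seed, only for reproducibility

n_experiments = 200
max_n = 200
rejected_with_peeking = 0
rejected_fixed_n = 0

for _ in range(n_experiments):
    # The null hypothesis is true: the data have mean 0
    data = rng.normal(0, 1, size=max_n)

    # Optional stopping: test after every 5 new observations, stop at the first p < 0.05
    for n in range(10, max_n + 1, 5):
        if ttest_1samp(data[:n], 0).pvalue < 0.05:
            rejected_with_peeking += 1
            break

    # Honest procedure: one single test with the sample size fixed in advance
    if ttest_1samp(data, 0).pvalue < 0.05:
        rejected_fixed_n += 1

print("false positives with peeking:", rejected_with_peeking / n_experiments)
print("false positives with fixed n:", rejected_fixed_n / n_experiments)
```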



In [58]:

    
print("look-elsewhere effect")
Image(url="http://www.tylervigen.com/chart-pngs/13.png",width=1000)









    



look-elsewhere effect






    Out[58]:

More on the look-elsewhere effect

If you try too many things, one of them is going to be significant.
You need adjustments (more on that another time)
We create a completely random dataset with 100 observations: one outcome (y) and 50 predictors (x0 to x49).
We then try to fit a linear model. Since everything is noise, any "significant" coefficient is a false positive.



In [4]:

    
#Create some totally random data
import numpy as np
df = pd.DataFrame(np.random.random((100,51)))

cols_x = []
for i in range(50):
    cols_x.append("x"+str(i))
df.columns = ["y"] + cols_x


print(cols_x)
df.head()









    



['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42', 'x43', 'x44', 'x45', 'x46', 'x47', 'x48', 'x49']






    Out[4]:






  
    
      
          y        x0        x1        x2        x3        x4        x5        x6        x7        x8  ...       x40       x41       x42       x43       x44       x45       x46       x47       x48       x49
0  0.213572  0.221967  0.713657  0.983606  0.681643  0.156636  0.660690  0.293840  0.154051  0.398285  ...  0.138762  0.672604  0.615953  0.782186  0.286070  0.024790  0.244314  0.800681  0.823869  0.076661
1  0.143471  0.396877  0.565624  0.307485  0.601993  0.377495  0.718098  0.231881  0.761632  0.065392  ...  0.343435  0.293349  0.116651  0.631342  0.183914  0.260053  0.764687  0.095851  0.304441  0.393638
2  0.396554  0.613201  0.244101  0.813583  0.590953  0.461041  0.020353  0.025964  0.500359  0.563320  ...  0.079348  0.617747  0.498122  0.395370  0.802521  0.284240  0.416556  0.290213  0.872692  0.324670
3  0.533959  0.746970  0.593383  0.672295  0.989462  0.441439  0.660520  0.541208  0.588104  0.686654  ...  0.101289  0.300024  0.683742  0.867609  0.579942  0.453146  0.049929  0.290257  0.147547  0.367648
4  0.245794  0.053775  0.948439  0.980810  0.626739  0.049619  0.310976  0.996501  0.442603  0.740117  ...  0.668109  0.681462  0.681842  0.633676  0.116509  0.314008  0.657629  0.651648  0.677378  0.796769

5 rows × 51 columns



In [5]:

    
#Fit a regression
import statsmodels.formula.api as smf

mod = smf.ols(formula='y ~ {}'.format("+".join(cols_x)), data=df)
res = mod.fit()
print(res.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.149
Method:                 Least Squares   F-statistic:                     1.347
Date:                Thu, 19 Jan 2017   Prob (F-statistic):              0.150
Time:                        09:27:01   Log-Likelihood:                 34.320
No. Observations:                 100   AIC:                             33.36
Df Residuals:                      49   BIC:                             166.2
Df Model:                          50                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.2051      0.447     -0.459      0.648        -1.103     0.693
x0             0.0762      0.127      0.597      0.553        -0.180     0.332
x1             0.0611      0.130      0.469      0.641        -0.200     0.323
x2             0.2445      0.119      2.060      0.045         0.006     0.483
x3            -0.0472      0.132     -0.359      0.721        -0.312     0.217
x4             0.1674      0.125      1.344      0.185        -0.083     0.418
x5             0.3456      0.140      2.460      0.017         0.063     0.628
x6             0.2241      0.114      1.962      0.055        -0.005     0.454
x7             0.2255      0.121      1.861      0.069        -0.018     0.469
x8            -0.0215      0.109     -0.197      0.844        -0.241     0.198
x9             0.4157      0.126      3.293      0.002         0.162     0.669
x10           -0.0130      0.128     -0.102      0.919        -0.270     0.244
x11           -0.0573      0.128     -0.447      0.657        -0.315     0.200
x12           -0.0704      0.117     -0.601      0.551        -0.306     0.165
x13            0.0641      0.114      0.563      0.576        -0.165     0.293
x14            0.0532      0.114      0.467      0.643        -0.176     0.282
x15            0.0510      0.130      0.392      0.697        -0.210     0.312
x16            0.0888      0.129      0.691      0.493        -0.170     0.347
x17           -0.2135      0.153     -1.399      0.168        -0.520     0.093
x18           -0.1751      0.125     -1.403      0.167        -0.426     0.076
x19           -0.1182      0.133     -0.891      0.377        -0.385     0.148
x20            0.1372      0.144      0.956      0.344        -0.151     0.426
x21            0.0829      0.116      0.713      0.479        -0.151     0.316
x22           -0.1616      0.136     -1.186      0.241        -0.435     0.112
x23            0.0143      0.110      0.129      0.898        -0.207     0.236
x24           -0.0345      0.116     -0.296      0.768        -0.269     0.200
x25           -0.0819      0.127     -0.644      0.523        -0.337     0.174
x26           -0.0806      0.119     -0.678      0.501        -0.320     0.158
x27            0.1311      0.112      1.166      0.249        -0.095     0.357
x28            0.0863      0.121      0.711      0.480        -0.158     0.330
x29            0.1581      0.111      1.427      0.160        -0.064     0.381
x30           -0.0581      0.103     -0.561      0.577        -0.266     0.150
x31           -0.0510      0.135     -0.377      0.708        -0.323     0.221
x32            0.0615      0.130      0.472      0.639        -0.200     0.323
x33            0.0064      0.107      0.060      0.952        -0.209     0.222
x34            0.1733      0.124      1.402      0.167        -0.075     0.422
x35            0.0711      0.119      0.597      0.554        -0.168     0.311
x36           -0.1664      0.131     -1.269      0.210        -0.430     0.097
x37           -0.1245      0.122     -1.019      0.313        -0.370     0.121
x38           -0.0702      0.121     -0.578      0.566        -0.314     0.174
x39           -0.1367      0.118     -1.154      0.254        -0.375     0.101
x40            0.0432      0.135      0.321      0.750        -0.227     0.314
x41           -0.0305      0.122     -0.249      0.804        -0.276     0.216
x42            0.0315      0.122      0.258      0.797        -0.213     0.276
x43           -0.0721      0.113     -0.640      0.525        -0.299     0.154
x44            0.0884      0.133      0.666      0.508        -0.178     0.355
x45            0.1152      0.126      0.911      0.367        -0.139     0.369
x46           -0.3248      0.145     -2.237      0.030        -0.616    -0.033
x47            0.1478      0.120      1.228      0.225        -0.094     0.390
x48            0.1136      0.099      1.153      0.255        -0.084     0.312
x49            0.0418      0.120      0.349      0.728        -0.199     0.282
==============================================================================
Omnibus:                        1.486   Durbin-Watson:                   2.050
Prob(Omnibus):                  0.476   Jarque-Bera (JB):                1.356
Skew:                          -0.283   Prob(JB):                        0.508
Kurtosis:                       2.925   Cond. No.                         74.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

6.3 Don't do bad science

We're in a replication crisis.

ALWAYS give all the information needed to replicate your results (including all the parameters of your models and your data unless restricted by licenses).
Be aware of the biases and try to correct for them.
Use Bonferroni correction: if you try 10 things, your p-value should be lower than 0.05/10 to be significant
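
As a sketch, we can apply the Bonferroni correction to the regression on random data above. The four p-values are copied from that summary (the coefficients with p < 0.05), and we tried 50 variables, so the threshold becomes 0.05/50.

```python
# p-values below 0.05 in the regression summary above (x2, x5, x9, x46)
pvalues = {"x2": 0.045, "x5": 0.017, "x9": 0.002, "x46": 0.030}

n_tests = 50   # number of variables we tried
alpha = 0.05

# Bonferroni: divide the significance level by the number of tests
threshold = alpha / n_tests

survivors = [name for name, p in pvalues.items() if p < threshold]
print(threshold, survivors)
```

None of the four "significant" coefficients survives the correction, which is what we expect from pure noise.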

Share of researchers who have failed to reproduce an experiment: by another author (their own)

chemistry: 90% (60%),
biology: 80% (60%),
physics and engineering: 70% (50%),
medicine: 70% (60%),
Earth and environment science: 60% (40%).

http://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970?WT.mc_id=SFB_NNEWS_1508_RHBox



In [4]:

    
Image(url="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/01bec95ec63634b9062de57edde1ecf7/replicationbypvalue.png")









    Out[4]:

6.4 Binomial test

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes x in a sequence of n independent yes/no experiments, each of which yields success with probability p.

The Relative Risk (RR) is the probability of the events happening in your sample of interest vs the probability of the events happening in the control group.

What's the probability of getting a result at least as extreme as 40 heads out of 100 tosses, given that the coin is fair? (p-value)



In [8]:

    
import scipy.stats
import statsmodels.stats.proportion

#probability of heads under the null: p = 0.5 (50%)
#number of successes: 40
#number of trials: n = 100

#control group = fair coin -> p=0.5
#sample = this possibly biased coin -> p=0.4
#RR = 0.4/0.5 = 4:5 = 0.8

#p-value (binomtest replaces the deprecated binom_test, needs SciPy >= 1.7)
pvalue = scipy.stats.binomtest(40, n=100, p=0.5).pvalue

#confidence intervals
conf = statsmodels.stats.proportion.proportion_confint(40,nobs=100,method="beta")

print(pvalue)
print(conf) #the interval usually includes 0.5 (our control probability) when the result is not significant.









    



0.056887933641
(0.30329476870287736, 0.50279084957766518)

In class exercise:

What's the probability that 10 journalists out of 1 million people get killed if the chances of getting killed are 1 in 1 million for the entire country? (p-value)

What's our p for the sample and control group?
How much more often do journalists get killed? (Calculate the RR)
What's the null hypothesis?
What's the p-value associated?
What are the confidence intervals of our p? (give it in people killed per 1 million people)
What can we say?



In [15]:

    
#Code here

#number of successes: 10
#number of trials: 1000000

#control group = the entire country -> p = 1/1000000
#sample = journalists -> p = 10/1000000
#RR = 10

#p-value (binomtest replaces the deprecated binom_test, needs SciPy >= 1.7)
pvalue = scipy.stats.binomtest(10, n=1000000, p=1/1000000).pvalue

#alpha=0.001 gives a 99.9% confidence interval
conf = statsmodels.stats.proportion.proportion_confint(10,nobs=1000000,method="beta",alpha=0.001)
print(pvalue)
print(conf)
#confidence interval in people killed per 1 million people
conf[0]*1000000,conf[1]*1000000









    



1.1142142324e-07
(2.6990416095007298e-06, 2.5255366737031437e-05)






    Out[15]:





(2.6990416095007297, 25.255366737031437)

How does it actually work?

You have these fail/success trials and they are independent.
The probability of success is p.
The probability of getting exactly x successes out of n trials is given by a binomial distribution. Adding up these probabilities, we can calculate the probability of getting a result at least as extreme as x out of n trials.
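
A sketch of that calculation for the coin example above (40 heads out of 100 tosses, p = 0.5). The rule used for "at least as extreme" (adding up all outcomes that are no more likely than the observed one) is the usual two-sided convention, stated here as an assumption. The small tolerance only protects against rounding errors.

```python
import numpy as np
from scipy.stats import binom

n, p, observed = 100, 0.5, 40

# Probability of each possible number of heads: 0, 1, ..., 100
k = np.arange(n + 1)
pmf = binom.pmf(k, n, p)

# Two-sided p-value: add up all outcomes that are no more likely than the observed one
as_extreme = pmf <= pmf[observed] * (1 + 1e-7)
pvalue = pmf[as_extreme].sum()

print(pvalue)  # very close to the 0.0569 printed by the binomial test above
```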



In [72]:

    
#just plotting a binomial distribution, no worries about the code
x = np.arange(11) #0, 1, ..., 10 successes
pmf = scipy.stats.binom.pmf(x,10,0.2)
plt.plot(x,pmf,".-")
plt.xlabel("Number of successes (out of 10)")
plt.ylabel("Probability")









    Out[72]:





<matplotlib.text.Text at 0x7f49375d3518>

A note on one-sided test vs two-sided tests

Always use a two-sided test (the default) unless you really understand what you are doing.
An example of an acceptable situation for a one-sided test: you are walking around the city and say, "oh, I bet journalists in India have higher murder rates than politicians", before looking at any data. Then you can use a one-sided test.
An example of an unacceptable situation: you are checking the murder rates of different groups and say, "oh, this one seems higher, I'm going to check if it is significant".
Most statistical test functions have an argument that lets you choose the test, with values such as "one-tail", "one-sided", "greater", or something like that.
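
For instance, in `scipy.stats.binomtest` (SciPy >= 1.7) the argument is called `alternative`. A sketch with the coin example (40 heads out of 100 tosses): the one-sided p-value is half the two-sided one here, which is exactly why choosing the side after looking at the data is cheating.

```python
from scipy.stats import binomtest

# Two-sided (default): is the coin different from fair?
two_sided = binomtest(40, n=100, p=0.5, alternative="two-sided").pvalue

# One-sided: does the coin give FEWER heads than a fair one?
# Only valid if we decided on this direction before seeing the data
one_sided = binomtest(40, n=100, p=0.5, alternative="less").pvalue

print(two_sided, one_sided)
```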

6.5 $\chi^2$ (chi-square) test

6.5.1 Independence between variables using contingency tables:

In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. (wikipedia)

Visualize interactions between categorical data.
We can use the $\chi^2$ (chi-square) test to see if the interactions are significant.

It has a limitation: the expected (not the observed) count in each cell needs to be greater than 3-5.
If this is not the case, we can use Fisher's exact test.
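
A sketch of that fallback, assuming a hypothetical 2x2 table with very small counts (made-up numbers). The expected counts are all below 5, so we use `scipy.stats.fisher_exact` instead of the $\chi^2$ test.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 contingency table with small counts
table = np.array([[3, 1],
                  [1, 3]])

# Expected counts under independence (here all equal to 2, too small for chi-square)
_, _, _, expected = chi2_contingency(table)
print(expected)

# Fisher's exact test works with small counts
oddsratio, pvalue = fisher_exact(table)
print(oddsratio, pvalue)
```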

Example 1: We will use this class as a sample

`Is the probability of having dark eyes independent of gender?`

Why should it not be?
Why could it be?



In [22]:

    
df_eyes = pd.DataFrame(
    [
        [10,5],
        [5,10]
    ],columns=["Dark","Clear"],index=["Male","Female"])

df_eyes



In [24]:

    
import scipy.stats
chi,p,dof,expected = scipy.stats.chi2_contingency(df_eyes)
print(p)
display(expected)









    



0.144127034816






    





array([[ 7.5,  7.5],
       [ 7.5,  7.5]])



In [25]:

    
#Ratios
ratio_real_vs_expected = df_eyes/expected
ratio_real_vs_expected

How does it actually work?

For each cell, the program calculates $(Observed-Expected)^2/Expected$ and adds these values up: $\chi^2 = \sum (Observed-Expected)^2/Expected$.
The degrees of freedom (a parameter) is $(\#rows-1)\cdot(\#columns-1)$.
The probability of getting a value higher than chi_stat is given by a $\chi^2$ distribution, which means we can calculate the probability of getting a result at least as extreme as the observed one if the two variables were independent.
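
A sketch of the calculation by hand for the eye color table of Example 1, where the statistic adds up (observed - expected)^2 / expected over all cells. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, which is why the cell above printed 0.144; without the correction it matches the manual result.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts (rows: Male, Female; columns: Dark, Clear)
observed = np.array([[10, 5],
                     [5, 10]])

# Expected counts if rows and columns were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic and degrees of freedom
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Probability of a value at least as large as chi2_stat
pvalue = chi2.sf(chi2_stat, dof)
print(chi2_stat, dof, pvalue)

# Same result from scipy when the Yates correction is switched off
stat_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed, correction=False)
print(stat_scipy, p_scipy)
```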



In [73]:

    
#just plotting a chi-square distribution, no worries
x = np.linspace(0.05,10,200) #start above 0, where the density with df=1 is infinite
pdf = scipy.stats.chi2.pdf(x,df=1)
plt.plot(x,pdf,"-")
plt.xlabel("Chi-2 value")
plt.ylabel("Probability density")









    Out[73]:





<matplotlib.text.Text at 0x7f49375ba400>

In-class exercise

Using LAPOP (survey data for Latin America: http://datasets.americasbarometer.org/database-login/usersearch.php?year=2014), we are going to see if there is any relationship between the preferred method to end the conflict in Colombia and the belief that the conflict will end within the next year.



In [48]:

    
#Read the data from colombia in "data/colombia.dta", it's a stata file
df = pd.read_stata("data/colombia.dta")
df.to_csv("data/colombia.csv",index=None)
df = pd.read_csv("data/colombia.csv")



In [49]:

    
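#numeric_only=True skips the text columns (recent pandas versions raise an error otherwise)
df.groupby("upm").mean(numeric_only=True).reset_index()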









    Out[49]:






  
    
      
           upm     cluster         fecha   wt          q2y         q2  vic1exta      q12c     q12bn         gi7
0       5154.0  110.500000  1.397938e+18  1.0  1978.166667  35.833333  2.333333  4.166667  1.583333   29.785714
1       5400.0  114.500000  1.397012e+18  1.0  1975.375000  38.625000  1.000000  4.750000  0.666667   35.250000
2       5440.0  118.500000  1.397326e+18  1.0  1974.833333  39.166667  1.000000  4.333333  1.000000         NaN
3       5665.0  122.500000  1.397182e+18  1.0  1976.625000  37.375000       NaN  4.541667  1.695652   99.142857
4       5686.0  126.500000  1.397434e+18  1.0  1976.208333  37.791667  1.200000  5.000000  1.166667   34.750000
5       8436.0   10.500000  1.396613e+18  1.0  1977.125000  36.875000  6.000000  5.583333  1.458333   60.400000
6      13001.0   14.500000  1.400393e+18  1.0  1976.500000  37.500000  2.333333  4.541667  1.250000   56.818182
7      13430.0   18.500000  1.397218e+18  1.0  1977.791667  36.208333  1.800000  4.625000  1.416667   59.125000
8      15176.0  154.500000  1.401394e+18  1.0  1978.583333  35.416667  2.000000  4.541667  1.625000   56.600000
9      15469.0  158.500000  1.398859e+18  1.0  1972.500000  41.500000  1.375000  3.708333  0.916667   27.562500
10     17001.0  130.500000  1.400537e+18  1.0  1975.208333  38.791667  1.400000  4.416667  1.208333  100.083333
11     17174.0  134.500000  1.397340e+18  1.0  1978.833333  35.166667  2.000000  3.416667  0.750000   63.266667
12     17272.0  138.500000  1.397333e+18  1.0  1977.458333  36.541667  2.000000  3.916667  0.791667   49.000000
13     18001.0  246.500000  1.404130e+18  1.0  1975.416667  38.583333  1.000000  3.916667  0.791667  169.000000
14     19001.0  202.500000  1.400332e+18  1.0  1973.916667  40.083333  1.200000  4.500000  1.000000   27.800000
15     19585.0  206.500000  1.397376e+18  1.0  1976.041667  37.958333  1.000000  4.125000  1.041667         NaN
16     20228.0   22.500000  1.397131e+18  1.0  1977.583333  36.416667  1.250000  5.166667  1.500000   82.500000
17     23001.0   26.500000  1.400414e+18  1.0  1976.375000  37.625000  1.500000  4.583333  1.166667  141.500000
18     23189.0   30.500000  1.396836e+18  1.0  1976.666667  37.333333  1.500000  5.708333  1.416667   62.333333
19     23466.0   34.500000  1.397308e+18  1.0  1977.333333  36.666667  2.000000  4.333333  1.333333   62.666667
20     25269.0  163.000000  1.398384e+18  1.0  1977.222222  36.777778  1.400000  4.388889  0.944444   45.363636
21     25307.0  166.500000  1.400864e+18  1.0  1977.291667  36.708333  1.500000  4.125000  1.083333   16.846154
22     25489.0  170.500000  1.397513e+18  1.0  1974.500000  39.500000       NaN  4.125000  0.875000   82.714286
23     25878.0  174.500000  1.398334e+18  1.0  1975.291667  38.708333  1.500000  4.250000  1.000000   63.200000
24     41001.0  142.500000  1.400465e+18  1.0  1975.291667  38.708333  1.142857  4.375000  1.166667   65.333333
25     47001.0   38.500000  1.400760e+18  1.0  1978.250000  35.750000  1.000000  4.708333  1.000000   93.400000
26     50001.0  178.500000  1.401494e+18  1.0  1978.375000  35.625000  1.800000  3.583333  0.750000   86.214286
27     50577.0  182.500000  1.398377e+18  1.0  1977.625000  36.375000  1.500000  4.208333  1.000000   50.333333
28     52227.0  210.500000  1.396922e+18  1.0  1977.541667  36.458333  2.333333  5.125000  1.478261  119.000000
29     52356.0  214.500000  1.397023e+18  1.0  1975.791667  38.208333  2.285714  4.416667  0.958333   46.428571
...        ...         ...           ...  ...          ...        ...       ...       ...       ...         ...
33     68001.0  190.500000  1.400522e+18  1.0  1977.041667  36.958333  1.500000  4.166667  1.208333   68.600000
34     68081.0  194.500000  1.400256e+18  1.0  1977.541667  36.458333  1.000000  3.791667  0.625000  139.000000
35     68689.0  198.500000  1.396886e+18  1.0  1978.083333  35.916667  1.000000  3.791667  0.875000  220.000000
36     70001.0   42.500000  1.400508e+18  1.0  1974.916667  39.083333  1.500000  4.833333  1.125000   93.000000
37     70670.0   46.500000  1.397304e+18  1.0  1975.416667  38.583333       NaN  5.916667  1.333333  117.000000
38     73001.0  150.478261  1.400803e+18  1.0  1973.304348  40.695652  1.333333  4.130435  0.826087   50.928571
39     76109.0  238.500000  1.404245e+18  1.0  1974.043478  39.956522  2.000000  4.625000  1.333333  103.142857
40     76275.0  218.500000  1.396836e+18  1.0  1974.708333  39.291667  1.000000  4.291667  0.916667   81.000000
41     76306.0  242.500000  1.397729e+18  1.0  1977.208333  36.791667  1.250000  4.541667  1.041667   33.000000
42     86568.0  250.500000  1.396037e+18  1.0  1977.875000  36.125000  1.000000  4.666667  1.666667   22.833333
43    110011.0   90.500000  1.398665e+18  1.0  1975.541667  38.458333  1.285714  2.478261  0.260870  121.375000
44    110014.0   61.250000  1.397563e+18  1.0  1974.625000  39.375000  1.857143  5.041667  1.583333   49.166667
45    110015.0   56.500000  1.397556e+18  1.0  1976.541667  37.458333  2.090909  4.875000  0.958333   44.000000
46    110019.0   61.739130  1.397223e+18  1.0  1978.086957  35.913043  2.600000  4.173913  1.173913   83.000000
47    500125.0   95.000000  1.403640e+18  1.0  1976.291667  37.708333  3.833333  4.333333  0.958333   60.142857
48    500127.0  102.500000  1.397340e+18  1.0  1976.708333  37.291667  1.375000  3.916667  0.625000   23.888889
49    500128.0  102.500000  1.397840e+18  1.0  1974.375000  39.625000  1.428571  3.625000  0.500000   43.888889
50    500129.0  102.000000  1.396728e+18  1.0  1975.833333  38.166667  1.000000  5.291667  1.000000   48.294118
51    800120.0    2.500000  1.403712e+18  1.0  1975.291667  38.708333  2.200000  4.291667  0.833333   75.333333
52    800122.0    6.500000  1.397444e+18  1.0  1974.583333  39.416667  1.333333  5.375000  0.791667   75.909091
53   1100110.0   80.500000  1.397984e+18  1.0  1977.833333  36.166667  1.600000  4.291667  0.791667   72.263158
54   1100112.0   86.428571  1.407444e+18  1.0  1974.142857  39.857143  1.800000  4.190476  0.476190   95.333333
55   1100113.0   80.272727  1.399704e+18  1.0  1974.000000  40.000000  2.833333  3.863636  0.636364   91.444444
56   1100115.0   74.500000  1.397110e+18  1.0  1976.625000  37.375000  1.888889  4.333333  0.958333   37.909091
57   1100116.0   50.500000  1.407168e+18  1.0  1977.541667  36.458333  1.625000  3.750000  1.000000   67.083333
58   1100118.0   68.500000  1.398074e+18  1.0  1972.500000  41.500000  1.900000  3.750000  0.708333   41.833333
59   1100119.0   64.826087  1.397802e+18  1.0  1972.782609  41.217391  2.500000  4.130435  0.608696   78.454545
60   7600130.0  226.521739  1.403508e+18  1.0  1976.695652  37.304348  2.000000  4.434783  1.086957  186.000000
61   7600131.0  230.478261  1.396581e+18  1.0  1976.782609  37.217391  1.500000  4.227273  1.090909   38.000000
62   7600132.0  234.500000  1.398251e+18  1.0  1975.416667  38.583333  1.600000  4.375000  0.666667   67.142857

63 rows × 10 columns



In [27]:

    
x_variable = df["colpaz1a"] #What is the best way to proceed in the conflict (negotiation, use of military force, or both)
other_variables = [df["colpaz2a"]] #How likely is it that peace happens within one year (from "very possible" to "not possible at all")



In [28]:

    
#Create a contingency table (pd.crosstab) between the x_variable and the other_variables
col_crosstab = pd.crosstab(x_variable,other_variables)
col_crosstab









    Out[28]:






  
    
colpaz2a                  Muy posible  Posible  Poco posible  Nada posible
colpaz1a
Negociación                        58      239           358           165
Uso de la fuerza militar            6       32           144           327
[No leer] Ambas                     2       13            34            40
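
To make the mechanics of `pd.crosstab` concrete, here is a minimal sketch on made-up data. The `toy` DataFrame, its column names and its answers are hypothetical and are not taken from the survey; the only point is that each cell of the result counts the respondents who gave that combination of answers.

```python
import pandas as pd

# Hypothetical toy answers (NOT the survey data): one row per respondent
toy = pd.DataFrame({
    "method": ["Negotiation", "Negotiation", "Force", "Force", "Force", "Negotiation"],
    "peace":  ["Likely", "Unlikely", "Unlikely", "Unlikely", "Likely", "Likely"],
})

# Rows come from the first argument, columns from the second.
# Each cell counts the respondents with that pair of answers.
toy_crosstab = pd.crosstab(toy["method"], toy["peace"])
print(toy_crosstab)
```

The counts in the table always add up to the number of rows in the original data, which is a quick sanity check after building a contingency table.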



In [29]:

    
#Calculate the chi-square test
chi,p,dof,expected = chi2_contingency(col_crosstab)
print(p)
display(expected)









    



4.36219476812e-60






    





array([[  38.16643159,  164.23131171,  309.95768688,  307.64456982],
       [  23.69111425,  101.94358251,  192.40056417,  190.96473907],
       [   4.14245416,   17.82510578,   33.64174894,   33.39069111]])
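
The expected counts returned by `chi2_contingency` can be reproduced by hand, which may help to see what the test compares. The sketch below copies the observed counts from the contingency table above into a NumPy array (the variable names are illustrative) and assumes the usual definition for a test of independence: expected count = row total × column total / N.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table above
observed = np.array([[58, 239, 358, 165],
                     [ 6,  32, 144, 327],
                     [ 2,  13,  34,  40]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 4)
n = observed.sum()

# Under independence: expected = row total * column total / N
expected_by_hand = row_totals * col_totals / n

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()

# Compare with scipy (no continuity correction is applied when dof > 1)
chi, p, dof, expected = chi2_contingency(observed)
print(np.allclose(expected, expected_by_hand), np.isclose(chi, chi_by_hand), dof)
```

The degrees of freedom are (rows - 1) × (columns - 1), which is 2 × 3 = 6 for this table.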



In [30]:

    
col_crosstab/expected









    Out[30]:






  
    
colpaz2a                  Muy posible   Posible  Poco posible  Nada posible
colpaz1a
Negociación                  1.519660  1.455265      1.154996      0.536333
Uso de la fuerza militar     0.253260  0.313899      0.748439      1.712358
[No leer] Ambas              0.482806  0.729308      1.010649      1.197939

Is it significant?
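
One way to approach this question is sketched below, using the 5% threshold from section 6.0.4 and the observed counts copied from the table above. The Pearson residuals, (observed - expected) / sqrt(expected), are an addition that does not appear in the original notebook; they are a common way to see which cells deviate most from independence, with absolute values above roughly 2 usually read as noteworthy (a rule of thumb, not a formal test).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table above
observed = np.array([[58, 239, 358, 165],
                     [ 6,  32, 144, 327],
                     [ 2,  13,  34,  40]])

chi, p, dof, expected = chi2_contingency(observed)

alpha = 0.05  # significance threshold used in section 6.0.4
print("Reject independence:", p < alpha)

# Pearson residuals: positive = more answers than expected, negative = fewer
residuals = (observed - expected) / np.sqrt(expected)
print(np.round(residuals, 1))
```

A ratio such as `col_crosstab/expected` shows the direction of the deviation, while the residuals also take into account how many answers each cell contains.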



In [31]:

    
#Now let's add gender
x_variable = df["colpaz1a"] #What is the best way to proceed in the conflict
other_variables = [df["q1"],df["colpaz2a"]] #Gender (Hombre = man, Mujer = woman) and how likely is it that peace happens within one year



In [35]:

    
#Create a contingency table (pd.crosstab) between the x_variable and the other_variables
col_crosstab = pd.crosstab(x_variable,other_variables)
col_crosstab









    Out[35]:






  
    
q1                        Hombre                                            Mujer
colpaz2a                  Muy posible  Posible  Poco posible  Nada posible  Muy posible  Posible  Poco posible  Nada posible
colpaz1a
Negociación                        32      121           169            71           26      118           189            94
Uso de la fuerza militar            3       22            74           186            3       10            70           141
[No leer] Ambas                     0        4            17            22            2        9            17            18



In [36]:

    
#Calculate the chi-square test
chi,p,dof,expected = chi2_contingency(col_crosstab)
print(p)
display(expected)









    



1.31700597089e-56






    





array([[  20.23977433,   85.00705219,  150.35260931,  161.33991537,
          17.92665726,   79.22425952,  159.60507757,  146.30465444],
       [  12.56346968,   52.76657264,   93.32863188,  100.14880113,
          11.12764457,   49.17700987,   99.0719323 ,   90.81593794],
       [   2.19675599,    9.22637518,   16.31875882,   17.5112835 ,
           1.94569817,    8.59873061,   17.32299013,   15.87940762]])



In [37]:

    
col_crosstab/expected









    Out[37]:






  
    
q1                        Hombre                                             Mujer
colpaz2a                  Muy posible   Posible  Poco posible  Nada posible  Muy posible   Posible  Poco posible  Nada posible
colpaz1a
Negociación                  1.581045  1.423411      1.124024      0.440065     1.450354  1.489443      1.184173      0.642495
Uso de la fuerza militar     0.238788  0.416931      0.792897      1.857236     0.269599  0.203347      0.706557      1.552591
[No leer] Ambas              0.000000  0.433540      1.041746      1.256333     1.027909  1.046666      0.981355      1.133544

Is it significant?
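
The single test above treats the eight gender-by-answer combinations as one set of columns, so a small p-value does not by itself say whether the pattern holds for men and women separately. A possible follow-up, sketched here under the assumption that the counts are copied correctly from the table above, is to select each gender's sub-table from the hierarchical columns and test it on its own. The names `counts` and `results` are illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

peace = ["Muy posible", "Posible", "Poco posible", "Nada posible"]
method = ["Negociación", "Uso de la fuerza militar", "[No leer] Ambas"]

# Rebuild the two-level column index (gender, answer) of the crosstab
columns = pd.MultiIndex.from_product([["Hombre", "Mujer"], peace],
                                     names=["q1", "colpaz2a"])
counts = pd.DataFrame([[32, 121, 169,  71, 26, 118, 189,  94],
                       [ 3,  22,  74, 186,  3,  10,  70, 141],
                       [ 0,   4,  17,  22,  2,   9,  17,  18]],
                      index=pd.Index(method, name="colpaz1a"),
                      columns=columns)

results = {}
for gender in ["Hombre", "Mujer"]:
    sub_table = counts[gender]  # 3x4 sub-table for one gender
    chi, p, dof, expected = chi2_contingency(sub_table)
    results[gender] = (chi, p, dof)
    print(gender, "dof =", dof, "p =", p)
```

Some expected counts in these sub-tables are small (the "[No leer] Ambas" row has few answers), which is one reason to consider an exact or simulation-based test.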

Let's use R, because Fisher's exact test for big tables (larger than 2x2) does not exist in Python.



In [ ]:

    
#Install R (already installed)
!conda install -c r r-essentials

#Install link between R and python (already installed)
!pip install rpy2



In [38]:

    
%load_ext rpy2.ipython









    



The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython



In [39]:

    
%%R -i col_crosstab
fisher.test(col_crosstab,simulate.p.value=TRUE,B=1e6)









    





	Fisher's Exact Test for Count Data with simulated p-value (based on
	1e+06 replicates)

data:  col_crosstab
p-value = 1e-06
alternative hypothesis: two.sided
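
For comparison, a simulation-based test in the same spirit can be sketched in plain Python. This is not Fisher's exact test and it is not claimed to be what R does internally: it is a permutation test that uses the chi-square statistic, shuffling the column answers across respondents so that both margins stay fixed. The function names, `n_perm` and `seed` are illustrative choices, and the counts are copied from the gender contingency table above.

```python
import numpy as np

def chi2_stat(table):
    """Pearson chi-square statistic of a table of counts."""
    expected = (table.sum(axis=1, keepdims=True)
                * table.sum(axis=0, keepdims=True) / table.sum())
    return ((table - expected) ** 2 / expected).sum()

def permutation_pvalue(table, n_perm=2000, seed=0):
    """Monte Carlo p-value obtained by shuffling the column labels."""
    table = np.asarray(table)
    n_rows, n_cols = table.shape
    # Rebuild one (row label, column label) pair per respondent
    rows = np.repeat(np.arange(n_rows), table.sum(axis=1))
    cols = np.concatenate([np.repeat(np.arange(n_cols), r) for r in table])
    rng = np.random.default_rng(seed)
    observed_stat = chi2_stat(table)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(cols)
        # Count the shuffled pairs again to get a simulated table
        sim = np.bincount(rows * n_cols + shuffled,
                          minlength=n_rows * n_cols).reshape(n_rows, n_cols)
        hits += chi2_stat(sim) >= observed_stat
    # Add 1 to numerator and denominator so the estimate is never exactly 0
    return (1 + hits) / (1 + n_perm)

observed = np.array([[32, 121, 169,  71, 26, 118, 189,  94],
                     [ 3,  22,  74, 186,  3,  10,  70, 141],
                     [ 0,   4,  17,  22,  2,   9,  17,  18]])
p_sim = permutation_pvalue(observed)
print(p_sim)
```

With `n_perm` shuffles, the smallest value this estimator can return is 1/(n_perm + 1), so a result at that floor only says that no shuffled table was as extreme as the observed one. The R output above (p-value = 1e-06 with 1e+06 replicates) appears to sit at the same kind of floor.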

