Working with data 2017. Class 4

Contact

Javier Garcia-Bernardo garcia@uva.nl

0. Structure

  1. Stats
    • Definitions
    • What's a p-value?
    • One-tailed test vs two-tailed test
    • Count vs expected count (binomial test)
    • Independence between factors: ($\chi^2$ test)
  2. In-class exercises to melt, pivot, concat, merge, groupby and plot.
  3. Read data from websites
  4. Time series

In [1]:
import pandas as pd
import numpy as np
import pylab as plt
from scipy.stats import chi2_contingency,ttest_ind

#This allows us to use R
%load_ext rpy2.ipython

#Visualize in line
%matplotlib inline

#Be able to plot images saved in the hard drive
from IPython.display import Image,display

#Make the notebook wider
from IPython.core.display import display, HTML 
display(HTML("<style>.container { width:90% !important; }</style>"))


6 Statistics I

To learn about statistics well, I recommend this course: https://www.coursera.org/specializations/social-science

6.0 Definitions

6.0.1 Population vs sample

  • Population: The entire set of possible observations (e.g., all the people in a country for a country-level survey)
  • Sample: The observations we actually have

6.0.2 Parameter vs statistic

  • Parameter: The true values that define the population ($\sigma$ and $\mu$ for a normal distribution)
  • Statistic: The values that we calculate using our sample (STD and MEAN for a normal distribution)

6.0.3 Probability

  • Probability: The proportion of times the event of interest occurs in the long run.
  • For instance, the probability of heads in a fair coin toss is 0.5, which means that if you toss the coin 1 million times you will get roughly 500k heads and 500k tails.

6.0.4 Null hypothesis:

  • The hypothesis that there is no real effect (our value is not significant).
  • For instance, our value can be the difference between two groups, or the difference between one group and zero.
  • It is assumed to be true and we try to disprove it.
  • It is "disproved" (rejected) if the probability of obtaining our result by chance alone is lower than 5%.

6.0.5 Alternative hypothesis:

  • The hypothesis that our value is significant (there is a real effect).
  • Accepted after rejecting the null hypothesis.

6.0.6 Types of error:

  • Type I ($\alpha$): Rejecting the null hypothesis when it is actually true (saying we have something we don't have). As a rule we set it to 0.05 and, if the p-value is below it, we accept the alternative hypothesis.
  • Type II ($\beta$): Accepting the null hypothesis when it is actually false (saying we don't have something we actually have). This is usually considered the less serious error.

6.0.7 p-value

  • The p-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, assuming the null hypothesis is true.
  • It is "commonly misused and misinterpreted" --> a p-value of 0.01 does not mean that there is a 1% chance that you are wrong!
  • A low p-value can occur for two reasons:
    • The null hypothesis is true but your sample was unusual.
    • The null hypothesis is false.
  • In practice, a result with a p-value of 0.05 can correspond to a false-positive rate of roughly 23% (mainly because of other biases). The simulation sketch below illustrates the definition.
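A quick way to see what the definition means is to simulate the null hypothesis directly. The cell below is a minimal sketch (not part of the original class code): it assumes a fair coin as the null and 40 heads out of 100 tosses as the observed result, and estimates the two-sided p-value by brute force.

In [ ]:
#A small simulation illustrating the p-value definition (sketch).
#Null hypothesis: the coin is fair (p = 0.5). Observed: 40 heads out of 100.
import numpy as np

rng = np.random.RandomState(0)

#simulate 100 tosses of a fair coin, 100000 times
heads = rng.binomial(n=100, p=0.5, size=100000)

#fraction of simulated experiments at least as extreme as the observation
#(two-sided: 40 or fewer heads, or 60 or more)
p_sim = np.mean((heads <= 40) | (heads >= 60))
print(p_sim)  #close to the exact binomial test result (~0.057, see section 6.4)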

6.0.8 Confidence intervals

  • The range of values within which the true value of the parameter is likely to lie.
  • Its width depends on your significance level (usually 0.05); see the sketch below.
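As a minimal sketch (made-up, roughly normal data; not part of the original class code), a 95% confidence interval for a sample mean can be computed with scipy's t-interval:

In [ ]:
#Sketch: 95% confidence interval for a sample mean (assumes roughly normal data)
import numpy as np
import scipy.stats

rng = np.random.RandomState(2)
sample = rng.normal(loc=10, scale=2, size=50)   #a made-up sample

mean = sample.mean()
sem = scipy.stats.sem(sample)                   #standard error of the mean
ci = scipy.stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(mean, ci)                                 #the true mean (10) usually falls inside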

6.0.9 Effect size:

  • The magnitude of the difference or relationship (e.g., the difference between two group means), independent of the sample size.

6.0.10 Correlation

  • Dependence or association is any statistical relationship, whether causal or not, between two random variables or two sets of data. Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other.
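A minimal sketch (made-up data, not part of the original class code): Pearson's correlation coefficient measures how close a relationship is to a straight line.

In [ ]:
#Sketch: Pearson correlation on a noisy linear relationship (made-up data)
import numpy as np
import scipy.stats

rng = np.random.RandomState(3)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)   #y depends linearly on x, plus noise

r, p = scipy.stats.pearsonr(x, y)
print(r, p)                        #r close to 1 -> strong positive linear association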

6.0.11 In general:

  • Larger sample size: we can detect smaller differences.
  • Smaller variability within groups: we can detect smaller differences.
  • Larger differences between groups (larger effect size): less likely to be due to noise.

6.1 Biases

6.1.1 Cherry-picking (yourself)

  • Using individual cases or data that seem to confirm a particular position, while ignoring a significant portion of related cases or data that may contradict that position.
  • Cherry picking may be committed intentionally or unintentionally.

6.1.2 Look-elsewhere effect (sample size)

  • If you try many things, one of them will come out significant by chance.
  • With a large enough sample size, any outrageous thing is likely to happen (Persi Diaconis and Frederick Mosteller).
  • If you scan the values of a database to see if there is something interesting and then test that, you are cheating: your significance threshold shouldn't be 0.05, it should be 0.05/(number of variables scanned beforehand).

6.1.3 Optional stopping (data collection)

  • It is a well-known fact of null-hypothesis significance testing (NHST) that when there is "optional stopping" of data collection, with testing at every new datum (a procedure also called "sequential testing" or "data peeking"), the null hypothesis will eventually be rejected even when it is true. With enough random sampling from the null hypothesis, eventually there will be some accidental coincidence of outlying values so that p < .05 (conditionalizing on the current sample size). Anscombe (1954) called this phenomenon "sampling to reach a foregone conclusion." (From: http://doingbayesiandataanalysis.blogspot.nl/2013/11/optional-stopping-in-data-collection-p.html) The simulation sketch below illustrates this.
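A small simulation sketch of optional stopping (made-up data, not part of the original class code): the null hypothesis is true (the true mean is 0), yet if we peek at the p-value after every new observation it often dips below 0.05 at some point.

In [ ]:
#Sketch: optional stopping / data peeking under a true null hypothesis
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.RandomState(4)
data = list(rng.normal(size=10))     #start with a few observations (true mean = 0)

pvalues = []
for _ in range(2000):
    data.append(rng.normal())        #"collect" one more data point
    pvalues.append(ttest_1samp(data, 0).pvalue)   #peek at the test after every datum

print("smallest p-value seen while peeking:", min(pvalues))
print("number of peeks with p < 0.05:", sum(p < 0.05 for p in pvalues))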

In [58]:
print("look-elsewhere effect")
Image(url="http://www.tylervigen.com/chart-pngs/13.png",width=1000)


look-elsewhere effect
Out[58]:

More on the look-elsewhere effect

  • If you try too many things, one of them is going to be significant.
  • You need adjustments (more on that another time).
  • We create a completely random dataset with 100 observations and 50 explanatory variables.
  • We then fit a linear model and see whether any coefficients come out significant purely by chance.

In [4]:
#Create some totally random data
import numpy as np
df = pd.DataFrame(np.random.random((100,51)))

cols_x = []
for i in range(50):
    cols_x.append("x"+str(i))
df.columns = ["y"] + cols_x


print(cols_x)
df.head()


['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42', 'x43', 'x44', 'x45', 'x46', 'x47', 'x48', 'x49']
Out[4]:
y x0 x1 x2 x3 x4 x5 x6 x7 x8 ... x40 x41 x42 x43 x44 x45 x46 x47 x48 x49
0 0.213572 0.221967 0.713657 0.983606 0.681643 0.156636 0.660690 0.293840 0.154051 0.398285 ... 0.138762 0.672604 0.615953 0.782186 0.286070 0.024790 0.244314 0.800681 0.823869 0.076661
1 0.143471 0.396877 0.565624 0.307485 0.601993 0.377495 0.718098 0.231881 0.761632 0.065392 ... 0.343435 0.293349 0.116651 0.631342 0.183914 0.260053 0.764687 0.095851 0.304441 0.393638
2 0.396554 0.613201 0.244101 0.813583 0.590953 0.461041 0.020353 0.025964 0.500359 0.563320 ... 0.079348 0.617747 0.498122 0.395370 0.802521 0.284240 0.416556 0.290213 0.872692 0.324670
3 0.533959 0.746970 0.593383 0.672295 0.989462 0.441439 0.660520 0.541208 0.588104 0.686654 ... 0.101289 0.300024 0.683742 0.867609 0.579942 0.453146 0.049929 0.290257 0.147547 0.367648
4 0.245794 0.053775 0.948439 0.980810 0.626739 0.049619 0.310976 0.996501 0.442603 0.740117 ... 0.668109 0.681462 0.681842 0.633676 0.116509 0.314008 0.657629 0.651648 0.677378 0.796769

5 rows × 51 columns


In [5]:
#Fit a regression
import statsmodels.formula.api as smf

mod = smf.ols(formula='y ~ {}'.format("+".join(cols_x)), data=df)
res = mod.fit()
print(res.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.149
Method:                 Least Squares   F-statistic:                     1.347
Date:                Thu, 19 Jan 2017   Prob (F-statistic):              0.150
Time:                        09:27:01   Log-Likelihood:                 34.320
No. Observations:                 100   AIC:                             33.36
Df Residuals:                      49   BIC:                             166.2
Df Model:                          50                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.2051      0.447     -0.459      0.648        -1.103     0.693
x0             0.0762      0.127      0.597      0.553        -0.180     0.332
x1             0.0611      0.130      0.469      0.641        -0.200     0.323
x2             0.2445      0.119      2.060      0.045         0.006     0.483
x3            -0.0472      0.132     -0.359      0.721        -0.312     0.217
x4             0.1674      0.125      1.344      0.185        -0.083     0.418
x5             0.3456      0.140      2.460      0.017         0.063     0.628
x6             0.2241      0.114      1.962      0.055        -0.005     0.454
x7             0.2255      0.121      1.861      0.069        -0.018     0.469
x8            -0.0215      0.109     -0.197      0.844        -0.241     0.198
x9             0.4157      0.126      3.293      0.002         0.162     0.669
x10           -0.0130      0.128     -0.102      0.919        -0.270     0.244
x11           -0.0573      0.128     -0.447      0.657        -0.315     0.200
x12           -0.0704      0.117     -0.601      0.551        -0.306     0.165
x13            0.0641      0.114      0.563      0.576        -0.165     0.293
x14            0.0532      0.114      0.467      0.643        -0.176     0.282
x15            0.0510      0.130      0.392      0.697        -0.210     0.312
x16            0.0888      0.129      0.691      0.493        -0.170     0.347
x17           -0.2135      0.153     -1.399      0.168        -0.520     0.093
x18           -0.1751      0.125     -1.403      0.167        -0.426     0.076
x19           -0.1182      0.133     -0.891      0.377        -0.385     0.148
x20            0.1372      0.144      0.956      0.344        -0.151     0.426
x21            0.0829      0.116      0.713      0.479        -0.151     0.316
x22           -0.1616      0.136     -1.186      0.241        -0.435     0.112
x23            0.0143      0.110      0.129      0.898        -0.207     0.236
x24           -0.0345      0.116     -0.296      0.768        -0.269     0.200
x25           -0.0819      0.127     -0.644      0.523        -0.337     0.174
x26           -0.0806      0.119     -0.678      0.501        -0.320     0.158
x27            0.1311      0.112      1.166      0.249        -0.095     0.357
x28            0.0863      0.121      0.711      0.480        -0.158     0.330
x29            0.1581      0.111      1.427      0.160        -0.064     0.381
x30           -0.0581      0.103     -0.561      0.577        -0.266     0.150
x31           -0.0510      0.135     -0.377      0.708        -0.323     0.221
x32            0.0615      0.130      0.472      0.639        -0.200     0.323
x33            0.0064      0.107      0.060      0.952        -0.209     0.222
x34            0.1733      0.124      1.402      0.167        -0.075     0.422
x35            0.0711      0.119      0.597      0.554        -0.168     0.311
x36           -0.1664      0.131     -1.269      0.210        -0.430     0.097
x37           -0.1245      0.122     -1.019      0.313        -0.370     0.121
x38           -0.0702      0.121     -0.578      0.566        -0.314     0.174
x39           -0.1367      0.118     -1.154      0.254        -0.375     0.101
x40            0.0432      0.135      0.321      0.750        -0.227     0.314
x41           -0.0305      0.122     -0.249      0.804        -0.276     0.216
x42            0.0315      0.122      0.258      0.797        -0.213     0.276
x43           -0.0721      0.113     -0.640      0.525        -0.299     0.154
x44            0.0884      0.133      0.666      0.508        -0.178     0.355
x45            0.1152      0.126      0.911      0.367        -0.139     0.369
x46           -0.3248      0.145     -2.237      0.030        -0.616    -0.033
x47            0.1478      0.120      1.228      0.225        -0.094     0.390
x48            0.1136      0.099      1.153      0.255        -0.084     0.312
x49            0.0418      0.120      0.349      0.728        -0.199     0.282
==============================================================================
Omnibus:                        1.486   Durbin-Watson:                   2.050
Prob(Omnibus):                  0.476   Jarque-Bera (JB):                1.356
Skew:                          -0.283   Prob(JB):                        0.508
Kurtosis:                       2.925   Cond. No.                         74.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
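Note how a handful of coefficients (x2, x5, x9, x46) come out "significant" at the 0.05 level even though the data is pure noise. A short follow-up sketch (using the `res` object fitted above) counts them:

In [ ]:
#Count how many of the 50 random predictors have p < 0.05 purely by chance
n_sig = (res.pvalues.drop("Intercept") < 0.05).sum()
print(n_sig, "out of 50 coefficients are 'significant' at 0.05, on pure noise")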

6.3 Don't do bad science

We're in a replication crisis.

  • ALWAYS give all the information needed to replicate your results (including all the parameters of your models, and your data unless restricted by licenses).
  • Be aware of the biases and try to correct for them.
  • Use the Bonferroni correction: if you try 10 things, your p-value should be lower than 0.05/10 = 0.005 to be significant (see the sketch below).
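A small sketch of why the correction matters (made-up data, not part of the original class code): run many t-tests on pure noise and compare how many pass the raw 0.05 threshold versus the Bonferroni-corrected one.

In [ ]:
#Sketch: 50 t-tests on pure noise, raw vs Bonferroni-corrected threshold
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.RandomState(1)
n_tests = 50

pvalues = []
for _ in range(n_tests):
    a = rng.normal(size=30)
    b = rng.normal(size=30)        #same distribution: the null is true
    pvalues.append(ttest_ind(a, b).pvalue)
pvalues = np.array(pvalues)

print("significant at 0.05          :", (pvalues < 0.05).sum())
print("significant at 0.05/n_tests  :", (pvalues < 0.05 / n_tests).sum())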

Share of researchers who have failed to replicate another author's results (failed to replicate their own results in parentheses):

  • chemistry: 90% (60%)
  • biology: 80% (60%)
  • physics and engineering: 70% (50%)
  • medicine: 70% (60%)
  • Earth and environment science: 60% (40%)

http://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970?WT.mc_id=SFB_NNEWS_1508_RHBox


In [4]:
Image(url="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/01bec95ec63634b9062de57edde1ecf7/replicationbypvalue.png")


Out[4]:

6.4 Binomial test

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes x in a sequence of n independent yes/no experiments, each of which yields success with probability p.

The Relative Risk (RR) is the probability of the event occurring in your group of interest divided by the probability of the event occurring in the control group.

What's the probability of getting a result as extreme as 40 heads out of 100 tosses, given that the coin is fair? (p-value)


In [8]:
import scipy.stats
import statsmodels.stats.proportion

#probability of heads: p = 0.5 (50%)
#number of successes: 40
#number of trials: n = 100

#control group = a fair coin -> p = 0.5
#sample = this possibly biased coin -> p = 0.4
#RR = 0.4/0.5 = 0.8

#p-value
pvalue = scipy.stats.binom_test(40, n=100, p=0.5)

#confidence intervals
conf = statsmodels.stats.proportion.proportion_confint(40, nobs=100, method="beta")

print(pvalue)
print(conf) #the interval does not include 0.5 (our control probability) if the result is significant


0.056887933641
(0.30329476870287736, 0.50279084957766518)

In class exercise:

What's the probability that 10 out of 1 million journalists get killed, if the chance of getting killed is 1 in 1 million for the entire country? (p-value)

  • What's our p for the sample and control group?
  • How much more often do journalists get killed? (Calculate the RR)
  • What's the null hypothesis?
  • What's the p-value associated?
  • What are the confidence intervals of our p? (give it in people killed per 1 million people)
  • What can we say?

In [15]:
#Code here

#number of successes: 10
#number of trials: 1000000

#control group = the whole country -> p = 1/1000000
#sample = journalists -> p = 10/1000000
#RR = (10/1000000)/(1/1000000) = 10

pvalue = scipy.stats.binom_test(10,1000000,p=1/1000000)


conf = statsmodels.stats.proportion.proportion_confint(10,nobs=1000000,method="beta",alpha=0.001)
print(pvalue)
print(conf)
conf[0]*1000000,conf[1]*1000000


1.1142142324e-07
(2.6990416095007298e-06, 2.5255366737031437e-05)
Out[15]:
(2.6990416095007297, 25.255366737031437)

How does it actually work?

  • You have a series of independent success/fail trials.
  • The probability of success in each trial is p.
  • The probability of getting exactly x successes out of n trials is given by the binomial distribution: $P(X=x) = \binom{n}{x}p^x(1-p)^{n-x}$. This means we can calculate the probability of getting a result at least as extreme as x successes out of n trials (see the sketch below).
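A minimal sketch of the mechanics (assuming the coin example above: 40 heads in 100 tosses, fair coin under the null): the two-sided p-value adds up the probability of every outcome that is at least as unlikely as the observed one.

In [ ]:
#Sketch: the two-sided binomial test "by hand" for 40 heads out of 100 tosses
import numpy as np
import scipy.stats

k = np.arange(0, 101)                     #all possible numbers of heads
pmf = scipy.stats.binom.pmf(k, 100, 0.5)  #their probabilities under the null

#sum over every outcome at least as unlikely as the observed one
#(the small tolerance avoids floating-point ties)
p_two_sided = pmf[pmf <= pmf[40] * (1 + 1e-7)].sum()
print(p_two_sided)                        #~0.0569, matching scipy.stats.binom_test(40, 100, 0.5)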

In [72]:
#just plotting a binomial distribution, no worries about the code
x = np.linspace(0,10,11)
pmf = scipy.stats.binom.pmf(x,10,0.2)
plt.plot(x,pmf,".-")
plt.xlabel("Number of successes (out of 10)")
plt.ylabel("Frequency")


Out[72]:
<matplotlib.text.Text at 0x7f49375d3518>

A note on one-sided vs two-sided tests

  • Always use a two-sided test (the default) unless you really understand what you are doing.
  • An example of an acceptable situation for a one-sided test: you are walking around the city and say, "oh, I bet journalists in India have higher murder rates than politicians". Then you can use a one-sided test.
  • An example of an unacceptable way to do it: you are checking the murder rates of different groups and say, "oh, this one seems higher, I'm going to check if it is significant".
  • Every statistical test function has an argument that lets you choose "one-tailed", "one-sided", "greater", or something similar (see the example below).
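For example, scipy's binomial test takes an `alternative` argument (shown here with the coin example from section 6.4; the one-sided version only counts the "too few heads" tail):

In [ ]:
#Two-sided (default) vs one-sided binomial test for 40 heads out of 100
import scipy.stats
print(scipy.stats.binom_test(40, n=100, p=0.5))                      #two-sided, ~0.057
print(scipy.stats.binom_test(40, n=100, p=0.5, alternative="less"))  #one-sided, ~0.028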

6.5 $\chi^2$ (chi-square) test

6.5.1 Independence between variables using contingency tables:

In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. (wikipedia)

  • Visualize interactions between categorical variables.
  • We can use the $\chi^2$ (chi-square) test to see if the interactions are significant.
  • It has a limitation: the expected (not the observed) count in each cell needs to be greater than about 3-5.
  • If this is not the case we can use Fisher's exact test.

Example 1: We will use this class as a sample

`Is the probability of having dark eyes independent of gender?`

  • Why might it not be?
  • Why might it be?

In [22]:
df_eyes = pd.DataFrame(
    [
        [10,5],
        [5,10]
    ],columns=["Dark","Clear"],index=["Male","Female"])

df_eyes


Out[22]:
Dark Clear
Male 10 5
Female 5 10

In [24]:
import scipy.stats
chi,p,dof,expected = scipy.stats.chi2_contingency(df_eyes)
print(p)
display(expected)


0.144127034816
array([[ 7.5,  7.5],
       [ 7.5,  7.5]])

In [25]:
#Ratios
ratio_real_vs_expected = df_eyes/expected
ratio_real_vs_expected


Out[25]:
Dark Clear
Male 1.333333 0.666667
Female 0.666667 1.333333

How does it actually work?

  • For each cell the program calculates $\frac{(O - E)^2}{E}$, where $O$ is the observed count and $E$ the expected count, and sums over all cells: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$.
  • The degrees of freedom (a parameter) is $(\#rows-1)\cdot(\#columns-1)$.
  • The probability of getting a value higher than the observed $\chi^2$ statistic is given by a $\chi^2$ distribution, which means we can calculate the probability of getting a result at least as extreme as the one observed (see the sketch below).
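A minimal sketch of the computation by hand (using the df_eyes counts from above). Note that for 2×2 tables `scipy.stats.chi2_contingency` applies Yates' continuity correction by default, which is why its p-value (0.144) is larger than this uncorrected one:

In [ ]:
#Sketch: chi-square statistic by hand for the eye-colour table (no continuity correction)
import numpy as np
import scipy.stats

observed = np.array([[10, 5],
                     [5, 10]])

#expected counts under independence: row_total * column_total / grand_total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

#p-value: probability of a chi-square value at least this large
p = scipy.stats.chi2.sf(chi2, df=dof)
print(chi2, p)   #uncorrected: chi2 ~ 3.33, p ~ 0.068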

In [73]:
#just plotting a chi-square distribution, no worries
x = np.linspace(0,10,11)
pmf = scipy.stats.chi2.pdf(x,df=1)
plt.plot(x,pmf,".-")
plt.xlabel("Chi-2 value")
plt.ylabel("Frequency")


Out[73]:
<matplotlib.text.Text at 0x7f49375ba400>

In-class exercise

Using LAPOP (survey data for Latin America): http://datasets.americasbarometer.org/database-login/usersearch.php?year=2014 we are going to see whether there is any relationship between the best method to end the conflict in Colombia and the belief that the conflict will end within the next year.


In [48]:
#Read the data from Colombia in "data/colombia.dta" (a Stata file),
#then round-trip it through CSV so the Stata categorical columns become plain strings
df = pd.read_stata("data/colombia.dta")
df.to_csv("data/colombia.csv",index=None)
df = pd.read_csv("data/colombia.csv")

In [49]:
df.groupby("upm").mean().reset_index()


Out[49]:
upm cluster fecha wt q2y q2 vic1exta q12c q12bn gi7
0 5154.0 110.500000 1.397938e+18 1.0 1978.166667 35.833333 2.333333 4.166667 1.583333 29.785714
1 5400.0 114.500000 1.397012e+18 1.0 1975.375000 38.625000 1.000000 4.750000 0.666667 35.250000
2 5440.0 118.500000 1.397326e+18 1.0 1974.833333 39.166667 1.000000 4.333333 1.000000 NaN
3 5665.0 122.500000 1.397182e+18 1.0 1976.625000 37.375000 NaN 4.541667 1.695652 99.142857
4 5686.0 126.500000 1.397434e+18 1.0 1976.208333 37.791667 1.200000 5.000000 1.166667 34.750000
5 8436.0 10.500000 1.396613e+18 1.0 1977.125000 36.875000 6.000000 5.583333 1.458333 60.400000
6 13001.0 14.500000 1.400393e+18 1.0 1976.500000 37.500000 2.333333 4.541667 1.250000 56.818182
7 13430.0 18.500000 1.397218e+18 1.0 1977.791667 36.208333 1.800000 4.625000 1.416667 59.125000
8 15176.0 154.500000 1.401394e+18 1.0 1978.583333 35.416667 2.000000 4.541667 1.625000 56.600000
9 15469.0 158.500000 1.398859e+18 1.0 1972.500000 41.500000 1.375000 3.708333 0.916667 27.562500
10 17001.0 130.500000 1.400537e+18 1.0 1975.208333 38.791667 1.400000 4.416667 1.208333 100.083333
11 17174.0 134.500000 1.397340e+18 1.0 1978.833333 35.166667 2.000000 3.416667 0.750000 63.266667
12 17272.0 138.500000 1.397333e+18 1.0 1977.458333 36.541667 2.000000 3.916667 0.791667 49.000000
13 18001.0 246.500000 1.404130e+18 1.0 1975.416667 38.583333 1.000000 3.916667 0.791667 169.000000
14 19001.0 202.500000 1.400332e+18 1.0 1973.916667 40.083333 1.200000 4.500000 1.000000 27.800000
15 19585.0 206.500000 1.397376e+18 1.0 1976.041667 37.958333 1.000000 4.125000 1.041667 NaN
16 20228.0 22.500000 1.397131e+18 1.0 1977.583333 36.416667 1.250000 5.166667 1.500000 82.500000
17 23001.0 26.500000 1.400414e+18 1.0 1976.375000 37.625000 1.500000 4.583333 1.166667 141.500000
18 23189.0 30.500000 1.396836e+18 1.0 1976.666667 37.333333 1.500000 5.708333 1.416667 62.333333
19 23466.0 34.500000 1.397308e+18 1.0 1977.333333 36.666667 2.000000 4.333333 1.333333 62.666667
20 25269.0 163.000000 1.398384e+18 1.0 1977.222222 36.777778 1.400000 4.388889 0.944444 45.363636
21 25307.0 166.500000 1.400864e+18 1.0 1977.291667 36.708333 1.500000 4.125000 1.083333 16.846154
22 25489.0 170.500000 1.397513e+18 1.0 1974.500000 39.500000 NaN 4.125000 0.875000 82.714286
23 25878.0 174.500000 1.398334e+18 1.0 1975.291667 38.708333 1.500000 4.250000 1.000000 63.200000
24 41001.0 142.500000 1.400465e+18 1.0 1975.291667 38.708333 1.142857 4.375000 1.166667 65.333333
25 47001.0 38.500000 1.400760e+18 1.0 1978.250000 35.750000 1.000000 4.708333 1.000000 93.400000
26 50001.0 178.500000 1.401494e+18 1.0 1978.375000 35.625000 1.800000 3.583333 0.750000 86.214286
27 50577.0 182.500000 1.398377e+18 1.0 1977.625000 36.375000 1.500000 4.208333 1.000000 50.333333
28 52227.0 210.500000 1.396922e+18 1.0 1977.541667 36.458333 2.333333 5.125000 1.478261 119.000000
29 52356.0 214.500000 1.397023e+18 1.0 1975.791667 38.208333 2.285714 4.416667 0.958333 46.428571
... ... ... ... ... ... ... ... ... ... ...
33 68001.0 190.500000 1.400522e+18 1.0 1977.041667 36.958333 1.500000 4.166667 1.208333 68.600000
34 68081.0 194.500000 1.400256e+18 1.0 1977.541667 36.458333 1.000000 3.791667 0.625000 139.000000
35 68689.0 198.500000 1.396886e+18 1.0 1978.083333 35.916667 1.000000 3.791667 0.875000 220.000000
36 70001.0 42.500000 1.400508e+18 1.0 1974.916667 39.083333 1.500000 4.833333 1.125000 93.000000
37 70670.0 46.500000 1.397304e+18 1.0 1975.416667 38.583333 NaN 5.916667 1.333333 117.000000
38 73001.0 150.478261 1.400803e+18 1.0 1973.304348 40.695652 1.333333 4.130435 0.826087 50.928571
39 76109.0 238.500000 1.404245e+18 1.0 1974.043478 39.956522 2.000000 4.625000 1.333333 103.142857
40 76275.0 218.500000 1.396836e+18 1.0 1974.708333 39.291667 1.000000 4.291667 0.916667 81.000000
41 76306.0 242.500000 1.397729e+18 1.0 1977.208333 36.791667 1.250000 4.541667 1.041667 33.000000
42 86568.0 250.500000 1.396037e+18 1.0 1977.875000 36.125000 1.000000 4.666667 1.666667 22.833333
43 110011.0 90.500000 1.398665e+18 1.0 1975.541667 38.458333 1.285714 2.478261 0.260870 121.375000
44 110014.0 61.250000 1.397563e+18 1.0 1974.625000 39.375000 1.857143 5.041667 1.583333 49.166667
45 110015.0 56.500000 1.397556e+18 1.0 1976.541667 37.458333 2.090909 4.875000 0.958333 44.000000
46 110019.0 61.739130 1.397223e+18 1.0 1978.086957 35.913043 2.600000 4.173913 1.173913 83.000000
47 500125.0 95.000000 1.403640e+18 1.0 1976.291667 37.708333 3.833333 4.333333 0.958333 60.142857
48 500127.0 102.500000 1.397340e+18 1.0 1976.708333 37.291667 1.375000 3.916667 0.625000 23.888889
49 500128.0 102.500000 1.397840e+18 1.0 1974.375000 39.625000 1.428571 3.625000 0.500000 43.888889
50 500129.0 102.000000 1.396728e+18 1.0 1975.833333 38.166667 1.000000 5.291667 1.000000 48.294118
51 800120.0 2.500000 1.403712e+18 1.0 1975.291667 38.708333 2.200000 4.291667 0.833333 75.333333
52 800122.0 6.500000 1.397444e+18 1.0 1974.583333 39.416667 1.333333 5.375000 0.791667 75.909091
53 1100110.0 80.500000 1.397984e+18 1.0 1977.833333 36.166667 1.600000 4.291667 0.791667 72.263158
54 1100112.0 86.428571 1.407444e+18 1.0 1974.142857 39.857143 1.800000 4.190476 0.476190 95.333333
55 1100113.0 80.272727 1.399704e+18 1.0 1974.000000 40.000000 2.833333 3.863636 0.636364 91.444444
56 1100115.0 74.500000 1.397110e+18 1.0 1976.625000 37.375000 1.888889 4.333333 0.958333 37.909091
57 1100116.0 50.500000 1.407168e+18 1.0 1977.541667 36.458333 1.625000 3.750000 1.000000 67.083333
58 1100118.0 68.500000 1.398074e+18 1.0 1972.500000 41.500000 1.900000 3.750000 0.708333 41.833333
59 1100119.0 64.826087 1.397802e+18 1.0 1972.782609 41.217391 2.500000 4.130435 0.608696 78.454545
60 7600130.0 226.521739 1.403508e+18 1.0 1976.695652 37.304348 2.000000 4.434783 1.086957 186.000000
61 7600131.0 230.478261 1.396581e+18 1.0 1976.782609 37.217391 1.500000 4.227273 1.090909 38.000000
62 7600132.0 234.500000 1.398251e+18 1.0 1975.416667 38.583333 1.600000 4.375000 0.666667 67.142857

63 rows × 10 columns


In [27]:
x_variable = df["colpaz1a"] #What is the best method to continue in the conflict
other_variables =[df["colpaz2a"]] #What are the chances that peace happens within one year

In [28]:
#Create a contingency table (pd.crosstab) between the x_variable and the other_variables
col_crosstab = pd.crosstab(x_variable,other_variables)
col_crosstab


Out[28]:
colpaz2a Muy posible Posible Poco posible Nada posible
colpaz1a
Negociación 58 239 358 165
Uso de la fuerza militar 6 32 144 327
[No leer] Ambas 2 13 34 40

In [29]:
#Calculate the chi-square test
chi, p, dof, expected = scipy.stats.chi2_contingency(col_crosstab)
print(p)
display(expected)


4.36219476812e-60
array([[  38.16643159,  164.23131171,  309.95768688,  307.64456982],
       [  23.69111425,  101.94358251,  192.40056417,  190.96473907],
       [   4.14245416,   17.82510578,   33.64174894,   33.39069111]])

In [30]:
col_crosstab/expected


Out[30]:
colpaz2a Muy posible Posible Poco posible Nada posible
colpaz1a
Negociación 1.519660 1.455265 1.154996 0.536333
Uso de la fuerza militar 0.253260 0.313899 0.748439 1.712358
[No leer] Ambas 0.482806 0.729308 1.010649 1.197939

Is it significant?


In [31]:
##Now let's add gender
x_variable = df["colpaz1a"] #What is the best method to continue
other_variables =[df["q1"],df["colpaz2a"]] #Gender and what are the chances that peace happens within one year

In [35]:
#Create a contingency table (pd.crosstab) between the x_variable and the other_variables
col_crosstab = pd.crosstab(x_variable,other_variables)
col_crosstab


Out[35]:
q1 Hombre Mujer
colpaz2a Muy posible Posible Poco posible Nada posible Muy posible Posible Poco posible Nada posible
colpaz1a
Negociación 32 121 169 71 26 118 189 94
Uso de la fuerza militar 3 22 74 186 3 10 70 141
[No leer] Ambas 0 4 17 22 2 9 17 18

In [36]:
#Calculate the chi-square test
chi, p, dof, expected = scipy.stats.chi2_contingency(col_crosstab)
print(p)
display(expected)


1.31700597089e-56
array([[  20.23977433,   85.00705219,  150.35260931,  161.33991537,
          17.92665726,   79.22425952,  159.60507757,  146.30465444],
       [  12.56346968,   52.76657264,   93.32863188,  100.14880113,
          11.12764457,   49.17700987,   99.0719323 ,   90.81593794],
       [   2.19675599,    9.22637518,   16.31875882,   17.5112835 ,
           1.94569817,    8.59873061,   17.32299013,   15.87940762]])

In [37]:
col_crosstab/expected


Out[37]:
q1 Hombre Mujer
colpaz2a Muy posible Posible Poco posible Nada posible Muy posible Posible Poco posible Nada posible
colpaz1a
Negociación 1.581045 1.423411 1.124024 0.440065 1.450354 1.489443 1.184173 0.642495
Uso de la fuerza militar 0.238788 0.416931 0.792897 1.857236 0.269599 0.203347 0.706557 1.552591
[No leer] Ambas 0.000000 0.433540 1.041746 1.256333 1.027909 1.046666 0.981355 1.133544

Is it significant?

Let's use R: Fisher's exact test for tables larger than 2×2 does not exist in Python (scipy only implements the 2×2 case)
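For 2×2 tables Python does have Fisher's exact test; a quick sketch using the df_eyes counts from the example above (for the larger crosstab we switch to R below):

In [ ]:
#Sketch: Fisher's exact test for a 2x2 table (the eye-colour example)
import scipy.stats
oddsratio, p = scipy.stats.fisher_exact([[10, 5], [5, 10]])
print(oddsratio, p)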


In [ ]:
#Install R (already installed)
!conda install -c r r-essentials

#Install link between R and python (already installed)
!pip install rpy2

In [38]:
%load_ext rpy2.ipython


The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

In [39]:
%%R -i col_crosstab
fisher.test(col_crosstab,simulate.p.value=TRUE,B=1e6)


	Fisher's Exact Test for Count Data with simulated p-value (based on
	1e+06 replicates)

data:  col_crosstab
p-value = 1e-06
alternative hypothesis: two.sided