In [ ]:

    
# Homework 3
## Due date: Wednesday 24th 23:59
- Write your own code in the blanks. It is okay to collaborate with other students, but **both students must write their own code** (no copy and paste!) and write the name of the other student in this cell. If you adapt code from other sources, you must also credit the author (a comment with a link to the source suffices).
- Complete the blanks, adding comments to explain what you are doing
- Each plot must have labels

Collaborated with:



In [2]:

    
##Setup code to run at the beginning of the file, so that images can be shown in the notebook
##Don't worry about this cell

#Show the plots inline in the notebook
%matplotlib inline

#Display images (from the hard drive or a URL) and HTML in the notebook
from IPython.display import Image, display, HTML

#Make the notebook wider
display(HTML("<style>.container { width:90% !important; }</style>"))

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Assignment 1: Data visualization

Explain what you think is wrong with the following figures and what kind of plot you would use instead. There are many correct ways to answer.

Figure 1

What do you think this plot shows?

answer here

What is wrong with the style?

answer here

What is wrong (or can be improved) with the type of plot?

answer here

What type of plot would you use and why?

answer here

Figure 2

What do you think this plot shows?

answer here

What is wrong with the message it gives?

answer here

What is wrong (or can be improved) with the type of plot? (https://en.wikipedia.org/wiki/Bar_chart)

answer here



In [4]:

    
print("Figure 1")
display(Image(url="http://www.datavis.ca/gallery/images/galvanic-3D.png",width=600))
print("Figure 2")
display(Image(url="http://www.econoclass.com/images/statdrivers.gif"))









    



Figure 1
[image: galvanic-3D.png]

Figure 2
[image: statdrivers.gif]

Assignment 2: Binomial test

Test whether the poorest people are more likely to negotiate (df["colpaz1a"] == "Negociación") than the population as a whole.
The variable "colpaz1a" has three options: "Negociación" (negotiation), "Uso de la fuerza militar" (use of military force) and "[No leer] Ambas" ([do not read out] both). Please discard the rows with "[No leer] Ambas" before the test.
The income column is "q10new". The lowest income group is "Menos de 160.000" (less than 160,000).
Answer the questions:
- How many people in the income group "Menos de 160.000" want to negotiate?
- How many people in the income group "Menos de 160.000" want to fight?
- What is the probability that a random person in the group "Menos de 160.000" wants to negotiate? -> This is our "p"
- How many people in total want to negotiate?
- How many people in total want to fight?
- What is the probability that a random person wants to negotiate? -> This is the "p" of the entire population.
- Calculate the RR
- What is the null hypothesis?
- What is the alternative hypothesis?
- What is the associated p-value?
- What is the confidence interval of our p?
- What can we say?
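
The steps in the list above can be sketched with the standard library alone. The counts below are hypothetical placeholders, not values from the survey. The sketch makes three assumptions: the test is one-sided (the question asks whether the group is *more* likely to negotiate), the interval is a Wilson score interval (one common choice among several), and RR means the relative risk, that is, the group's p divided by the population's p.

```python
from math import comb, sqrt

def binom_pvalue_greater(k, n, p0):
    """One-sided exact binomial test: P(X >= k) when X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# HYPOTHETICAL counts, for illustration only (not taken from colombia.dta)
successes = 60       # people in the group who want to negotiate
trials = 80          # people in the group (negotiate + fight)
p_population = 0.6   # share of the whole population that wants to negotiate

p_group = successes / trials
rr = p_group / p_population   # assumed meaning of RR: relative risk
p_value = binom_pvalue_greater(successes, trials, p_population)
ci_low, ci_high = wilson_interval(successes, trials)

print(f"p = {p_group:.3f}, RR = {rr:.3f}, p-value = {p_value:.4f}")
print(f"95% CI for p: ({ci_low:.3f}, {ci_high:.3f})")
```

With real data, `scipy.stats.binomtest(successes, trials, p_population, alternative="greater")` (SciPy 1.7 or later) should give the same p-value.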



In [5]:

    
#Read data
#Keep the result in a variable so the following cells can use it
df = pd.read_stata("data/colombia.dta")
df.head()



In [ ]:

    
#Discard the rows with "[No leer] Ambas"



In [ ]:

    
#How many people in the income group "Menos de 160.000" want to negotiate? -> Our successes

#How many people in the income group "Menos de 160.000" want to fight?

#How many people are in the income group "Menos de 160.000"? -> Our number of trials

#What is the probability that a random person in the group "Menos de 160.000" wants to negotiate? -> This is our "p"

#How many people in total want to negotiate?

#How many people in total want to fight?

#What is the probability that a random person wants to negotiate? -> This is the "p" of the entire population.

#Calculate the RR

#What is the null hypothesis?

#What is the alternative hypothesis?

#What is the associated p-value? -> Do the stats

#What is the confidence interval of our p?

#What can we say?

Assignment 3: Chi-square test

Test whether there is an association between income and how likely people are to prefer negotiation.
The variable "colpaz1a" has three options: "Negociación", "Uso de la fuerza militar" and "[No leer] Ambas". Please discard the rows with "[No leer] Ambas" before the test.
The income column is "q10new".
Answer the questions:
- Can we see a trend in the crosstab?
- What does the Chi-square test say?
- What are the ratios of observed/expected?
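
What the chi-square test compares can be shown on a small table. The sketch below uses hypothetical counts (not the survey data) and plain Python. Under independence, each expected count is row total × column total / grand total, and the ratio observed/expected shows which cells are over- or under-represented.

```python
# HYPOTHETICAL 2x3 contingency table (rows: answer, columns: income group)
observed = [
    [30, 40, 50],   # e.g. "Negociación"
    [20, 40, 70],   # e.g. "Uso de la fuerza militar"
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Ratio > 1: more cases than independence predicts; ratio < 1: fewer
ratios = [[o / e for o, e in zip(obs_row, exp_row)]
          for obs_row, exp_row in zip(observed, expected)]

# Pearson chi-square statistic and its degrees of freedom
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
dof = (len(observed) - 1) * (len(observed[0]) - 1)

for row in ratios:
    print([round(x, 3) for x in row])
print("chi2 =", round(chi2, 3), "dof =", dof)
```

`scipy.stats.chi2_contingency`, used later in this notebook, returns the statistic, the p-value, the degrees of freedom and the expected table in one call.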



In [ ]:

    
#Read data



In [16]:

    
df = pd.read_stata("data/colombia.dta")
df = df.loc[df["colpaz1a"] != "[No leer] Ambas"]



In [44]:

    
df = df.loc[df["q10new"] != "Ningún ingreso"]



In [ ]:

    
#Discard the rows with "[No leer] Ambas"



In [23]:

    
df = df.dropna(subset=["colpaz1a","q10new"])



In [ ]:

    
#remove_unused_categories returns a new Series, so assign it back to the column
df["colpaz1a"] = df["colpaz1a"].cat.remove_unused_categories()



In [52]:

    
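import scipy.stats

#Can we see a trend in the crosstab?
#a: observed counts (rows = answer, columns = income group)
a = pd.crosstab(df["colpaz1a"].astype(str),df["q10new"].astype(str))
#What does the Chi-square test say?
#a2: chi-square statistic, p: p-value, b: degrees of freedom, exp: expected counts under independence
a2,p,b,exp = scipy.stats.chi2_contingency(a)
print(a2,p,b)
#What are the ratios of observed/expected? -> computed in the next cell as a/exp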









    



19.355273069 0.198083659002 15



In [53]:

    
a/exp









    Out[53]:

Ratios of observed/expected counts (rows: q10new, columns: colpaz1a)

| q10new | Negociación | Uso de la fuerza militar |
|---|---|---|
| Entre 1.100.001  1.400.000 | 1.019584 | 0.967607 |
| Entre 1.400.001  1.900.000 | 0.875232 | 1.206367 |
| Entre 1.900.001  3.200.000 | 0.939273 | 1.100443 |
| Entre 160.000  250.000 | 1.203444 | 0.663502 |
| Entre 250.001  340.000 | 1.136586 | 0.774086 |
| Entre 340.001  420.000 | 1.053762 | 0.911078 |
| Entre 420.001  480.000 | 1.085459 | 0.858650 |
| Entre 480.001  540.000 | 1.088830 | 0.853074 |
| Entre 540.001  590.000 | 0.980584 | 1.032114 |
| Entre 590.001  650.000 | 1.023218 | 0.961597 |
| Entre 650.001  720.000 | 0.937336 | 1.103647 |
| Entre 720.001  810.000 | 0.978410 | 1.035711 |
| Entre 810.001  960.000 | 0.992313 | 1.012714 |
| Entre 960.001  1.100.000 | 0.851416 | 1.245759 |
| Menos de 160.000 | 1.146137 | 0.758288 |
| Más de 3.200.000 | 0.899544 | 1.166155 |



In [ ]:

    
pd.read_csv("../class3/data/world_bank/data.csv")

Assignment 4: Read, melt, pivot, groupby

1. Download your dataset for the quantitative design.
2. Explain what types of variables you have.
3. Fix the format to convert it into tidy data (melt/pivot).
4. Save your dataset to a file (so you don't have to repeat the previous steps every time).
5. Describe the data and visualize the relationship between all variables (or 10 of them if there are too many) with a scatter plot matrix or a correlation matrix.
6. Do some other cool plot.

Note: If your data is already tidy before step 3, perform steps 3 and 4 on this dataset instead: "data/cities.csv", a dataset with the distances between 11 cities.
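
As a minimal sketch of step 3, a wide distance matrix can be turned into tidy data with `melt`. The three cities and their distances below are made up for illustration, and the column names (`origin`, `destination`, `distance_km`) are assumptions; the real layout of "data/cities.csv" may differ.

```python
import pandas as pd

# HYPOTHETICAL wide table: one row per origin, one column per destination
wide = pd.DataFrame({
    "origin": ["A", "B", "C"],
    "A": [0, 10, 25],
    "B": [10, 0, 18],
    "C": [25, 18, 0],
})

# Melt: every (origin, destination) pair becomes one row
tidy = wide.melt(id_vars="origin", var_name="destination", value_name="distance_km")

# Step 4 would save the result, e.g. tidy.to_csv("cities_tidy.csv", index=False)

# Pivot goes back from the tidy format to the wide one
back = tidy.pivot(index="origin", columns="destination", values="distance_km")

print(tidy.head())
```

The tidy version has one observation per row, which is the shape that `groupby` and most plotting functions expect.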



In [ ]:

    
#1. Download your dataset for the quantitative design

#2. Explain what type of variables you have

#3. Fix the format to convert it into tidy data (Melt/Pivot). 

#4. Save your dataset into a file (so you don't have to do the other things every time)

#5. Describe the data and visualize the relationship between all (or 10 if there are too many) variables with a scatter plot matrix or a correlation matrix

#6. Do some other cool plot.
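
For step 5, a sketch with synthetic data (random numbers, not a real dataset; the column names are placeholders): `describe` summarizes each variable and `corr` gives the correlation matrix.

```python
import numpy as np
import pandas as pd

# SYNTHETIC data for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=200),  # strongly related to x
    "z": rng.normal(size=200),                     # unrelated noise
})

summary = data.describe()   # count, mean, std, min, quartiles, max
corr = data.corr()          # pairwise Pearson correlations

print(summary.round(2))
print(corr.round(2))
```

To draw these, `pd.plotting.scatter_matrix(data)` or `sns.heatmap(corr, annot=True)` are two possible options; remember that each plot must have labels.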
