In [ ]:
# Homework 3
## Due date: Wednesday 24th 23:59
- Write your own code in the blanks. It is okay to collaborate with other students, but **both students must write their own code** (no copy and paste!) and write the name of the other student in this cell. If you adapt code from other sources, you must also credit the author (a comment with the link to the source suffices).
- Complete the blanks, adding comments to explain what you are doing
- Each plot must have labels

Collaborated with:

In [2]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell

#Show the plots inline in the notebook
%matplotlib inline

#Display images stored on disk or at a URL
from IPython.display import Image

#Make the notebook wider
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


Assignment 1: Data visualization

  • Explain what you think is wrong with the following figures and what kind of plot you would use instead. There are many correct ways to answer.

Figure 1

  • What do you think this plot shows?

answer here

  • What is wrong with the style?

answer here

  • What is wrong (or can be improved) with the type of plot?

answer here

  • What type of plot would you use and why?

answer here

Figure 2

  • What do you think this plot shows?

answer here

  • What is wrong with the message it gives?

answer here

answer here


In [4]:
print("Figure 1")
display(Image(url="http://www.datavis.ca/gallery/images/galvanic-3D.png",width=600))
print("Figure 2")
display(Image(url="http://www.econoclass.com/images/statdrivers.gif"))


Figure 1
Figure 2

Assignment 2: Binomial test

  • Test if the poorest people are more likely to negotiate (df["colpaz1a"] == "Negociación"), when compared with the entire population.
  • There are three options for the variable "colpaz1a": "Negociación", "Uso de la fuerza militar" and "[No leer] Ambas". Please discard the rows with "[No leer] Ambas" before the test.
  • The column with the income is "q10new". The group with the lowest income is "Menos de 160.000"

  • Answer the questions:

    • How many people in the income group "Menos de 160.000" want to negotiate?
    • How many people in the income group "Menos de 160.000" want to fight?
    • What is the probability that a random person in the group "Menos de 160.000" wants to negotiate? -> This is our "p"
    • How many people in total want to negotiate?
    • How many people in total want to fight?
    • What is the probability that a random person wants to negotiate? -> This is the "p" of the entire population.
    • Calculate the RR
    • What's the null hypothesis?
    • What's the alternative hypothesis?
    • What's the p-value associated?
    • What are the confidence intervals of our p?
    • What can we say?
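
The questions above can be sketched on made-up counts. The numbers below are hypothetical and are not taken from "data/colombia.dta". The sketch assumes SciPy 1.7 or later (where `scipy.stats.binomtest` is available), reads "RR" as the ratio between the group's p and the population's p, and uses a one-sided alternative; all three are assumptions, not part of the assignment text.

```python
from scipy.stats import binomtest

# Hypothetical counts (NOT from data/colombia.dta), for illustration only
successes_group = 60    # people in the group who want to negotiate
trials_group = 80       # people in the group
successes_total = 600   # people in the whole sample who want to negotiate
trials_total = 1000     # people in the whole sample

p_group = successes_group / trials_group   # our "p"
p_total = successes_total / trials_total   # the "p" of the entire population

# Ratio of the two probabilities (assumed meaning of "RR")
rr = p_group / p_total

# H0: the group negotiates with probability p_total
# H1: the group negotiates with a higher probability (one-sided test)
result = binomtest(successes_group, trials_group, p=p_total, alternative="greater")

# 95% confidence interval for the group's p (two-sided, exact method)
ci = binomtest(successes_group, trials_group, p=p_total).proportion_ci(confidence_level=0.95)

print(p_group, p_total, rr)
print(result.pvalue)
print(ci.low, ci.high)
```

If the population's p falls outside the confidence interval of the group's p, that points in the same direction as a small p-value.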

In [5]:
#Read data
df = pd.read_stata("data/colombia.dta")

In [ ]:
#Discard the rows with "[No leer] Ambas"

In [ ]:
#How many people in the income group "Menos de 160.000" want to negotiate? -> Our successes

#How many people in the income group "Menos de 160.000" want to fight?

#How many people are in the income group "Menos de 160.000"? -> Our number of trials

#What is the probability that a random person in the group "Menos de 160.000" wants to negotiate? -> This is our "p"

#How many people in total want to negotiate?

#How many people in total want to fight?

#What is the probability that a random person wants to negotiate? -> This is the "p" of the entire population.

#Calculate the RR

#What's the null hypothesis?

#What's the alternative hypothesis?

#What's the p-value associated? -> Do the stats

#What are the confidence intervals of our p?

#What can we say?

Assignment 3: Chi-square test

  • Test whether income is associated with how likely people are to negotiate.
  • There are three options for the variable "colpaz1a", "Negociación", "Uso de la fuerza militar" and "[No leer] Ambas". Please discard the rows with "[No leer] Ambas" before the test.
  • The column with the income is "q10new".

  • Answer the questions:

    • Can we see a trend in the crosstab?
    • What does the Chi-square test say?
    • What are the ratios of observed/expected?
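
As a rough illustration of these three questions, here is a minimal sketch on a toy table. The column names `income` and `choice` and all the counts are hypothetical and are not taken from "data/colombia.dta".

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data (NOT from data/colombia.dta)
toy = pd.DataFrame({
    "income": ["low"] * 50 + ["mid"] * 50 + ["high"] * 50,
    "choice": (["negotiate"] * 40 + ["fight"] * 10
               + ["negotiate"] * 30 + ["fight"] * 20
               + ["negotiate"] * 20 + ["fight"] * 30),
})

# Crosstab: rows are the choices, columns are the income groups
observed = pd.crosstab(toy["choice"], toy["income"])

# Chi-square test of independence; "expected" holds the counts under independence
chi2, p_value, dof, expected = chi2_contingency(observed)

# Ratios above 1 mean more people than expected in that cell, below 1 fewer
ratios = observed / expected

print(observed)
print(chi2, p_value, dof)
print(ratios.round(2))
```

A ratio table like this makes the direction of a trend easier to read than the raw crosstab, since it removes the effect of unequal group sizes.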

In [ ]:
#Read data

In [16]:
df = pd.read_stata("data/colombia.dta")
df = df.loc[df["colpaz1a"] != "[No leer] Ambas"]

In [44]:
df = df.loc[df["q10new"] != "Ningún ingreso"]

In [ ]:
#Discard the rows with "[No leer] Ambas"

In [23]:
df = df.dropna(subset=["colpaz1a","q10new"])

In [ ]:
#remove_unused_categories returns a new Series, so assign it back to the column
df["colpaz1a"] = df["colpaz1a"].cat.remove_unused_categories()

In [52]:
#Can we see a trend in the crosstab?
a = pd.crosstab(df["colpaz1a"].astype(str),df["q10new"].astype(str))
#What does the Chi-square test say?
import scipy.stats
#a2: chi-square statistic, p: p-value, b: degrees of freedom, exp: expected counts
a2,p,b,exp = scipy.stats.chi2_contingency(a)
print(a2,p,b)
#What are the ratios of observed/expected? -> See the next cell (a/exp)


19.355273069 0.198083659002 15

In [53]:
a/exp


Out[53]:
| colpaz1a (rows) / q10new (columns) | Entre 1.100.001 – 1.400.000 | Entre 1.400.001 – 1.900.000 | Entre 1.900.001 – 3.200.000 | Entre 160.000 – 250.000 | Entre 250.001 – 340.000 | Entre 340.001 – 420.000 | Entre 420.001 – 480.000 | Entre 480.001 – 540.000 | Entre 540.001 – 590.000 | Entre 590.001 – 650.000 | Entre 650.001 – 720.000 | Entre 720.001 – 810.000 | Entre 810.001 – 960.000 | Entre 960.001 – 1.100.000 | Menos de 160.000 | Más de 3.200.000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Negociación | 1.019584 | 0.875232 | 0.939273 | 1.203444 | 1.136586 | 1.053762 | 1.085459 | 1.088830 | 0.980584 | 1.023218 | 0.937336 | 0.978410 | 0.992313 | 0.851416 | 1.146137 | 0.899544 |
| Uso de la fuerza militar | 0.967607 | 1.206367 | 1.100443 | 0.663502 | 0.774086 | 0.911078 | 0.858650 | 0.853074 | 1.032114 | 0.961597 | 1.103647 | 1.035711 | 1.012714 | 1.245759 | 0.758288 | 1.166155 |

In [ ]:
pd.read_csv("../class3/data/world_bank/data.csv")

Assignment 4: Read, melt, pivot, groupby

  1. Download your dataset for the quantitative design
  2. Explain what type of variables you have
  3. Fix the format to convert it into tidy data (Melt/Pivot).
  4. Save your dataset into a file (so you don't have to do the other things every time)
  5. Describe the data and visualize the relationship between all (or 10 if there are too many) variables with a scatter plot matrix or a correlation matrix
  6. Do some other cool plot.

Note: If your data is already tidy before step 3, perform steps 3 and 4 in this dataset: "data/cities.csv", a dataset with the distances between 11 cities.
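
Steps 3 and 4 can be sketched on a small distance table. The table below is hypothetical (three made-up cities "A", "B", "C") and is not the contents of "data/cities.csv"; the column names `city`, `destination` and `distance` are also assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical wide table (NOT the contents of data/cities.csv):
# one row per city, one column per destination
wide = pd.DataFrame({
    "city": ["A", "B", "C"],
    "A": [0, 5, 9],
    "B": [5, 0, 4],
    "C": [9, 4, 0],
})

# Melt: one row per (city, destination) pair, which is the tidy layout
tidy = wide.melt(id_vars="city", var_name="destination", value_name="distance")

# Pivot: back to the wide layout, with cities as the index
wide_again = tidy.pivot(index="city", columns="destination", values="distance")

# Groupby: mean distance from each city to the other cities
others = tidy[tidy["city"] != tidy["destination"]]
mean_distance = others.groupby("city")["distance"].mean()

print(tidy)
print(mean_distance)
```

Once the data is tidy, `tidy.to_csv(...)` with a file name of your choice would cover step 4, and the saved file can be read back with `pd.read_csv`.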


In [ ]:
#1. Download your dataset for the quantitative design

#2. Explain what type of variables you have

#3. Fix the format to convert it into tidy data (Melt/Pivot). 

#4. Save your dataset into a file (so you don't have to do the other things every time)

#5. Describe the data and visualize the relationship between all (or 10 if there are too many) variables with a scatter plot matrix or a correlation matrix

#6. Do some other cool plot.