In [ ]:
# Homework 3
## Due date: Wednesday 24th 23:59
- Write your own code in the blanks. You may collaborate with other students, but **each student must write their own code** (no copy-pasting!) and name their collaborators in this cell. If you adapt code from other sources, you must also credit the author (a comment with a link to the source suffices).
- Complete the blanks, adding comments to explain what you are doing
- Each plot must have labels
Collaborated with:
In [2]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell
#Print the plots in this screen
%matplotlib inline
#Be able to plot images saved in the hard drive
from IPython.display import Image
#Make the notebook wider
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
answer here
In [4]:
print("Figure 1")
display(Image(url="http://www.datavis.ca/gallery/images/galvanic-3D.png",width=600))
print("Figure 2")
display(Image(url="http://www.econoclass.com/images/statdrivers.gif"))
The column with the income is "q10new". The group with the lowest income is "Menos de 160.000"
Answer the questions:
In [5]:
#Read data
df = pd.read_stata("data/colombia.dta")
In [ ]:
#Discard the rows with "[No leer] Ambas"
In [ ]:
#How many people in the income group "Menos de 160.000" want to negotiate? -> Our successes
#How many people in the income group "Menos de 160.000" want to fight?
#How many people are in the income group "Menos de 160.000"? -> Our number of trials
#What is the probability that a random person in the group "Menos de 160.000" wants to negotiate? -> This is our "p"
#How many people in total want to negotiate?
#How many people in total want to fight?
#What is the probability that a random person wants to negotiate? -> This is the "p" of the entire population.
#Calculate the RR
#What's the null hypothesis?
#What's the alternative hypothesis?
#What's the p-value associated? -> Do the stats
#What are the confidence intervals of our p?
#What can we say?
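The questions above can be sketched as a one-sample binomial test. This is a minimal sketch, not the expected solution: the counts are made up rather than taken from `colombia.dta`, the variable names are hypothetical, and RR is assumed here to mean the proportion in the group divided by the proportion in the whole sample. It uses `scipy.stats.binomtest`, which requires SciPy 1.7 or later.

```python
from scipy.stats import binomtest

# Hypothetical counts, NOT taken from colombia.dta
group_negotiate = 60    # successes in the lowest-income group
group_fight = 40
total_negotiate = 700   # whole sample
total_fight = 300

n_group = group_negotiate + group_fight   # number of trials
p_group = group_negotiate / n_group       # observed "p" in the group
p_population = total_negotiate / (total_negotiate + total_fight)

# Assumed definition of RR: share in the group relative to the share overall
rr = p_group / p_population

# H0: the true proportion in the group equals p_population
# H1: the true proportion in the group differs from p_population (two-sided)
result = binomtest(group_negotiate, n_group, p=p_population, alternative="two-sided")

# Exact (Clopper-Pearson) interval for the group's proportion
ci = result.proportion_ci(confidence_level=0.95)

print(f"p_group={p_group:.3f}, p_population={p_population:.3f}, RR={rr:.3f}")
print(f"p-value={result.pvalue:.4f}, 95% CI=({ci.low:.3f}, {ci.high:.3f})")
```

If the p-value is below the chosen significance level, or equivalently if the confidence interval excludes the population "p", the group's proportion is unlikely to match the population's by chance alone.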
The column with the income is "q10new".
Answer the questions:
In [ ]:
#Read data
In [16]:
df = pd.read_stata("data/colombia.dta")
df = df.loc[df["colpaz1a"] != "[No leer] Ambas"]
In [44]:
df = df.loc[df["q10new"] != "Ningún ingreso"]
In [ ]:
#Discard the rows with "[No leer] Ambas"
In [23]:
df = df.dropna(subset=["colpaz1a","q10new"])
In [ ]:
df["colpaz1a"] = df["colpaz1a"].cat.remove_unused_categories()
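The cleaning cells above can be tried on a toy frame to see why the unused category has to be removed explicitly. This is a sketch on made-up rows: only the column names and the labels `"[No leer] Ambas"`, `"Menos de 160.000"` and `"Ningún ingreso"` come from the notebook, while the answer labels `"negotiate"` and `"fight"` are hypothetical stand-ins.

```python
import pandas as pd

# Toy data standing in for the survey; the rows are made up
toy = pd.DataFrame({
    "colpaz1a": pd.Categorical(
        ["negotiate", "fight", "[No leer] Ambas", None, "negotiate"]
    ),
    "q10new": ["Menos de 160.000", "Ningún ingreso", "Menos de 160.000",
               "Menos de 160.000", None],
})

# Drop the "both" answers, then the rows with a missing value in either column
toy = toy.loc[toy["colpaz1a"] != "[No leer] Ambas"]
toy = toy.dropna(subset=["colpaz1a", "q10new"]).copy()

# Filtering rows does not remove the category label itself
before = list(toy["colpaz1a"].cat.categories)

# remove_unused_categories returns a new Series, so assign it back
toy["colpaz1a"] = toy["colpaz1a"].cat.remove_unused_categories()
after = list(toy["colpaz1a"].cat.categories)

print(before)
print(after)
```

Without the assignment the call has no lasting effect, and the empty category can still show up in group-by results or plots.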
In [52]:
#Can we see a trend in the crosstab?
a = pd.crosstab(df["colpaz1a"].astype(str),df["q10new"].astype(str))
#What does the Chi-square test say?
import scipy.stats
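#a2 = chi-square statistic, p = p-value, b = degrees of freedom, exp = expected counts
a2,p,b,exp = scipy.stats.chi2_contingency(a)
print(a2,p,b)
#What are the ratios of observed/expected? (computed in the next cell)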
In [53]:
a/exp
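To make the observed/expected ratio easier to read, here is a small sketch on made-up counts. The table, its labels and the variable names are hypothetical; only the call to `scipy.stats.chi2_contingency` mirrors the cell above.

```python
import pandas as pd
import scipy.stats

# Made-up counts; rows are answers, columns are income groups
observed = pd.DataFrame(
    {"low": [30, 20], "middle": [40, 40], "high": [50, 70]},
    index=["negotiate", "fight"],
)

chi2, p_value, dof, expected = scipy.stats.chi2_contingency(observed)

# expected is a plain NumPy array, so wrap it to keep the row and column labels
expected = pd.DataFrame(expected, index=observed.index, columns=observed.columns)

# Ratios above 1 mark cells with more people than independence would predict
ratio = observed / expected

print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
print(ratio.round(2))
```

A trend across income groups would show up as ratios that rise or fall steadily from one column to the next within a row.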
In [ ]:
pd.read_csv("../class3/data/world_bank/data.csv")
Note: If your data is already tidy before step 3, perform steps 3 and 4 in this dataset: "data/cities.csv", a dataset with the distances between 11 cities.
In [ ]:
#1. Download your dataset for the quantitative design
#2. Explain what type of variables you have
#3. Fix the format to convert it into tidy data (Melt/Pivot).
#4. Save your dataset into a file (so you don't have to do the other things every time)
#5. Describe the data and visualize the relationship between all (or 10 if there are too many) variables with a scatter plot matrix or a correlation matrix
#6. Do some other cool plot.
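Steps 3 to 5 can be sketched on a small wide table. This is only an illustration under stated assumptions: the table layout (one column per year, one row per country and indicator), the column names and the file name `tidy.csv` are all hypothetical and are not taken from the World Bank file or `data/cities.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical wide table: one row per country and indicator, one column per year
wide = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "indicator": ["gdp", "pop", "gdp", "pop"],
    "2000": [1.0, 10.0, 2.0, 20.0],
    "2001": [1.5, 11.0, 2.5, 21.0],
})

# Step 3a: melt the year columns into a single "year" variable
long = wide.melt(id_vars=["country", "indicator"], var_name="year", value_name="value")
long["year"] = long["year"].astype(int)

# Step 3b: pivot so that each indicator becomes its own column (one row per observation)
tidy = long.pivot_table(index=["country", "year"], columns="indicator", values="value")
tidy = tidy.reset_index()
tidy.columns.name = None

# Step 4: save the tidy table so the reshaping does not have to be repeated
path = os.path.join(tempfile.mkdtemp(), "tidy.csv")
tidy.to_csv(path, index=False)
reloaded = pd.read_csv(path)

# Step 5: describe the data and compute the correlation matrix of the numeric variables
summary = tidy.describe()
corr = tidy[["gdp", "pop"]].corr()

print(tidy)
print(corr)
```

For the visual part of step 5, `sns.pairplot(tidy)` or `pd.plotting.scatter_matrix(tidy)` draws a scatter plot matrix, and `sns.heatmap(corr, annot=True)` displays the correlation matrix.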