In [ ]:
# Homework 1
## Due date: Friday 13th
- Write your own code in the blanks. You may collaborate with other students, but each student must write their own code and list the names of their collaborators in this cell. If you adapt code from other sources, you must also credit the author (a comment with a link to the source suffices)
- Complete the blanks, adding comments to explain what you are doing
- Assignment 4 will be weighted more

Collaborated with:

First run this cell:


In [52]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell but run it

#Show the plots inline in the notebook
%matplotlib inline

#Be able to show images saved on the hard drive and to render HTML
from IPython.display import Image, display, HTML

#Make the notebook wider
display(HTML("<style>.container { width:90% !important; }</style>"))


Assignment 1 (ungraded but important). Read some tutorials

  • Basic programming: This is a good video: https://www.youtube.com/watch?v=N4mEzFDjqtA (minutes 2:30 -- 30:00)
    • We covered 2:30 -- 15:45 in the first class; we'll cover most of the rest in the second class
    • He doesn't use Jupyter Notebook but another editor. He writes code on the left and the output appears on the right.

Next, run the cell below to download some tutorials on pandas.

After downloading them, go to the Jupyter dashboard, where you'll see a folder named pandas-cookbook. Inside it there is a folder named cookbook, which contains the tutorials. Please play with these:

- A quick tour of the IPython Notebook: Shows off IPython’s awesome tab completion and magic functions.
- Chapter 1: Reading your data into pandas is pretty much the easiest thing. Even when the encoding is wrong!
- Chapter 2: It’s not totally obvious how to select data from a pandas dataframe. Here we explain the basics (how to take slices and get columns)
- Chapter 3: Here we get into serious slicing and dicing and learn how to filter dataframes in complicated ways, really fast.

In [2]:
!git clone https://github.com/jvns/pandas-cookbook.git


Cloning into 'pandas-cookbook'...
remote: Counting objects: 410, done.
remote: Total 410 (delta 0), reused 0 (delta 0), pack-reused 410
Receiving objects: 100% (410/410), 10.53 MiB | 3.15 MiB/s, done.
Resolving deltas: 100% (207/207), done.

Assignment 2

  • Create a list named spam with the numbers from 1 to 100

  • Calculate the mean of the list

  • Calculate the mean of the first half of the list

  • Check if the first element is larger than the last element
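
The operations listed above can be sketched on a small toy list. This is only an illustration under assumptions: the list `temperatures` and all variable names are hypothetical and are not part of the homework data. Note that `range(start, stop)` stops one short of `stop`.

```python
# range() excludes its stop value: range(1, 6) yields 1, 2, 3, 4, 5
small = list(range(1, 6))

# Hypothetical toy list (not the homework's spam list)
temperatures = [12, 15, 11, 18, 20, 16]

# Mean = sum of the elements divided by the number of elements
mean_all = sum(temperatures) / len(temperatures)

# Slice the first half, then take the mean of that slice
half = len(temperatures) // 2
first_half = temperatures[:half]
mean_first_half = sum(first_half) / len(first_half)

# Index 0 is the first element and index -1 is the last one
is_first_larger = temperatures[0] > temperatures[-1]

# A tuple is an immutable copy of the list
as_tuple = tuple(temperatures)

print(small, mean_all, mean_first_half, is_first_larger, as_tuple)
```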


In [ ]:
## Assignment 2 (20)
#Create a list of numbers between 1 and 100 (tip: use range(); the stop value is excluded)
spam = list(range(1,101))
print(spam)

#Calculate the mean of the list (tip: use sum() and len()) 
mean = 
print(mean)

#Calculate the mean of the first half of the list (tip: slice the list) 
mean_first_half = spam[:50]
print(mean_first_half)

#Check if the first element is larger than or equal to the last element (>=)
first_element = spam[0]
last_element = spam[-1]
print() #this line needs to be completed, printing the result of the comparison

#Convert the list spam to a tuple
tuple_spam =
print()

Assignment 3

  • import numpy as np
  • Create a numpy array of numbers between 1 and 10000 (tip: use np.arange())

  • Calculate the mean of the array (tip: np.mean())

  • Filter the even numbers (tip: for an even number n, n % 2 == 0) and save the results
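
Filtering a numpy array with a condition works through a boolean mask. The sketch below uses a hypothetical small array and a different condition (multiples of 3) just to show the mechanics; the names `values`, `mask` and `multiples_of_three` are made up for this example.

```python
import numpy as np

# np.arange(start, stop) excludes stop, so this gives 1, 2, ..., 10
values = np.arange(1, 11)

# Comparing an array with a number returns an array of True/False values
mask = values % 3 == 0

# Indexing with the mask keeps only the positions where the mask is True
multiples_of_three = values[mask]

# np.mean works directly on the array
mean_value = np.mean(values)

print(mask)
print(multiples_of_three)
print(mean_value)
```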


In [ ]:
#import numpy as np
import numpy as np

#Create a numpy array of numbers between 1 and 10000 (tip: use np.arange())
array =
print(array)


#Calculate the mean of the array (tip: np.mean())
mean =
print(mean)


#Filter the even numbers (tip: the remainder of dividing an even number by 2 is 0) and save the results to array_even
filter_condition = 
array_even = array[filter_condition]
print(array_even)

Assignment 4

  • Reading data

    • a standard csv
    • a csv using "\t" as the separator
    • a csv using a different encoding (try several until one works)
    • a csv having no header
    • a csv using "m" to mark missing values (instead of "nan" or "" or "-9")
  • Read a toy dataset on alcohol consumption with four columns:

    • adults: number of adults in household
    • kids: number of kids in household
    • income: average income
    • consume: consume alcohol
  • Print the top of the file (.head()), get descriptive statistics (.describe()) and make a contingency table showing the relationship between number of adults and number of kids (pd.crosstab())

  • Visualize the relationship between number of adults, alcohol consumption and income using the right type of plot (among the ones we learned). Think about which variable should be y, which should be x and which should be the hue.

  • What plot would you use to model the number of adults vs income?
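
The reading options mentioned above correspond to keyword arguments of `pd.read_csv`. The following is a minimal sketch on hypothetical in-memory text instead of the homework files, so the column names, values and variable names are all assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents, kept in memory so the example runs without the data folder
tab_text = "city\tvisits\nOslo\t3\nLima\t5\n"
no_header_text = "Oslo,3\nLima,5\n"
na_text = "city,visits\nOslo,m\nLima,5\n"
latin1_bytes = "city,visits\nMálaga,2\n".encode("iso-8859-1")

# sep: the character that separates the columns
toy_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# header=None: the first row is data; names supplies the column labels
toy_no_header = pd.read_csv(io.StringIO(no_header_text), header=None,
                            names=["city", "visits"])

# na_values: extra strings that should be read as missing values (NaN)
toy_na = pd.read_csv(io.StringIO(na_text), na_values=["m"])

# encoding: how the bytes in the file map to characters
toy_enc = pd.read_csv(io.BytesIO(latin1_bytes), encoding="iso-8859-1")

print(toy_tab)
print(toy_no_header)
print(toy_na)
print(toy_enc)
```

When the encoding is unknown, trying a few common ones (for example "utf-8", "iso-8859-1", "cp1252") until the text looks right is a reasonable approach.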


In [ ]:
#import pandas
import pandas as pd

In [ ]:
#Read a standard csv, no strange things
filename = "data/hw1_csv_st.csv"
df_st = 
df_st.head()

In [ ]:
#Read a csv; careful, the file uses tabs ("\t") as separators
filename = "data/hw1_csv_tab.csv"
df_tab = 
df_tab.head()

In [11]:
#Read a csv; careful, the file uses a different encoding (not UTF-8)
filename = "data/hw1_csv_enc.csv"
df_enc = pd.read_csv(filename,encoding="iso-8859-1")
df_enc.head()


Out[11]:
Unnamed: 0 index person year treatment score
0 0 0 1 2000 1 4
1 1 1 2 2000 1 3
2 2 2 3 2000 2 6
3 3 3 4 2000 2 4
4 4 4 1 2005 1 8

In [ ]:
#Read a csv; careful, the file does not have a header
filename = "data/hw1_csv_header.csv"
df_hea = 
df_hea.head()

In [ ]:
#Read a csv; careful, the file uses "m" to indicate a missing value
filename = "data/hw1_csv_weird_na.csv"
df_na = 
df_na.head()


In [ ]:
#Read the Stata file "data/alcohol.dta"
df = 

#print top of the file to explore it (.head())

What type of variables are in the dataset?

For help check: https://www.boundless.com/statistics/textbooks/boundless-statistics-textbook/visualizing-data-3/the-histogram-18/types-of-variables-87-4406/

answer here

  • adults:
  • kids:
  • income:
  • consume:

In [ ]:
#print descriptive statistics and interpret them (.describe())
df.

In [ ]:
#keep only the households without kids and use this dataset for the rest of the assignment
filtering_condition =
df_nokids = df.loc[filtering_condition]

In [49]:
df = pd.read_stata("data/alcohol.dta")

In [31]:
#visualize the relationship between number of adults, alcohol consumption and income using the right type of plot

In [ ]:
#save the plot as a pdf with the name "hw1_plot.pdf"

Interpret the plot you just made

answer here

What plot(s) would you use to model the number of adults vs income? Why?

answer here


In [ ]:
#Visualize the relationship between number of adults and number of kids with a contingency table, using pd.crosstab(df[x],df[y])

Interpret the table you just made

answer here

Data visualization

  • We'll talk about data visualization theory in another class but as a general rule remember that SIMPLE IS BETTER (play the video in the cell below)

In [4]:
%%html
<!-- TODO -->
<iframe width="560" height="315" src="https://zippy.gfycat.com/ImprobableFemaleBasenji.webm" frameborder="0" allowfullscreen></iframe>



In [7]:
#The only honest pie chart
Image(url="http://www.datavis.ca/gallery/images/pies/PiesIHaveEaten.png")


Out[7]:

Assignment 5: Explain a figure

  • All the plots below have the same correlation (R^2 = 0.67)
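
As a rough check of the R^2 value quoted above, the squared Pearson correlation can be computed by hand for two of the four datasets. The values below are the first two datasets of Anscombe's quartet typed in directly; the helper `r_squared` is a hypothetical name introduced only for this sketch.

```python
import numpy as np

# First two datasets of Anscombe's quartet (they share the same x values)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def r_squared(x, y):
    """Square of the Pearson correlation between x and y."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

r2_first = r_squared(x, y1)
r2_second = r_squared(x, y2)

# Both values should be close to the 0.67 quoted above
print(round(r2_first, 2), round(r2_second, 2))
```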

Edit the cell below to answer the questions


  • What can we say from these plots?

answer here

  • Now change the "alpha" parameter in the code below from 0 to 1 and run the cell.

answer here

  • What can we say now about the plots?

answer here


In [8]:
import seaborn as sns
sns.set(style="ticks")

# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe")

# Show the results of a linear regression within each dataset
# Note: recent seaborn versions use "height" where older versions used "size"
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 0})


Out[8]:
<seaborn.axisgrid.FacetGrid at 0x7f2f3dade048>

Assignment 6: Now it's your turn

  • Download a csv from here: https://vincentarelbundock.github.io/Rdatasets/datasets.html
  • Upload it to the data folder
  • Read the file
  • Check its head and describe it
  • Write a couple of lines about what types of variables we have in the csv (e.g. quantitative, qualitative...)
  • Visualize the relationship between 2 or 3 variables with a scatter plot, a line plot or a boxplot
  • Explain what the plot shows

In [ ]: