In [ ]:

    
# Homework 1
## Due date: Friday 13th
- Write your own code in the blanks. You may collaborate with other students, but each student must write their own code and name their collaborators in this cell. If you adapt code from other sources, you must also credit the author (a comment with a link to the source suffices).
- Complete the blanks, adding comments to explain what you are doing.
- Assignment 4 will be weighted more heavily.

Collaborated with:

First run this cell:



In [52]:

    
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell, but do run it

#Show the plots inside the notebook
%matplotlib inline

#Be able to display images saved on the hard drive
#(Image, display and HTML all live in IPython.display; IPython.core.display is deprecated)
from IPython.display import Image, display, HTML

#Make the notebook wider
display(HTML("<style>.container { width:90% !important; }</style>"))

Assignment 1 (ungraded but important). Read some tutorials

Basic programming: This is a good video: https://www.youtube.com/watch?v=N4mEzFDjqtA (minutes 2:30 -- 30:00)
- We covered 2:30 -- 15:45 in the first class; we'll cover most of the rest in the second class.
- He doesn't use Jupyter Notebook but another editor. He writes code on the left and the output appears on the right.

Then run the cell below to download some tutorials on pandas.

After downloading them, go to the Jupyter dashboard, where you'll see a folder named pandas-cookbook. Inside it is a folder named cookbook, which contains the tutorials. Please play with these ones:

- A quick tour of the IPython Notebook: Shows off IPython’s awesome tab completion and magic functions.
- Chapter 1: Reading your data into pandas is pretty much the easiest thing. Even when the encoding is wrong!
- Chapter 2: It’s not totally obvious how to select data from a pandas dataframe. Here we explain the basics (how to take slices and get columns)
- Chapter 3: Here we get into serious slicing and dicing and learn how to filter dataframes in complicated ways, really fast.



In [2]:

    
!git clone https://github.com/jvns/pandas-cookbook.git









    



Cloning into 'pandas-cookbook'...
remote: Counting objects: 410, done.
remote: Total 410 (delta 0), reused 0 (delta 0), pack-reused 410
Receiving objects: 100% (410/410), 10.53 MiB | 3.15 MiB/s, done.
Resolving deltas: 100% (207/207), done.

Assignment 2

Create a list named spam with the numbers from 1 to 100
Calculate the mean of the list
Calculate the mean of the first half of the list
Check if the first element is larger than the last element
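
As a warm-up, the same operations can be sketched on a smaller toy list. The name `values` and the range 1 to 10 are assumptions chosen for illustration; they are not part of the assignment, and the blanks below still need your own code.

```python
# Toy list, deliberately different from the assignment's `spam`.
# range() excludes its stop value, so range(1, 11) yields 1..10.
values = list(range(1, 11))

# Mean of the whole list: total divided by the number of elements.
mean_all = sum(values) / len(values)

# Slice out the first half, then take the mean of that slice.
half = len(values) // 2
first_half = values[:half]
mean_first_half = sum(first_half) / len(first_half)

# Compare the first element (index 0) with the last one (index -1).
first_is_larger = values[0] > values[-1]

print(mean_all, mean_first_half, first_is_larger)
```

Note that slicing alone returns a list, not a number; the mean still has to be computed from the slice.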



In [ ]:

    
## Assignment 2 (20)
#Create a list of numbers between 1 and 100 (tip: use range(); the stop value is excluded)
spam = list(range(1, 101))
print(spam)

#Calculate the mean of the list (tip: use sum() and len()) 
mean = 
print(mean)

#Calculate the mean of the first half of the list (tip: slice the list) 
mean_first_half = spam[:50]
print(mean_first_half)

#Check if the first element is larger than or equal to the last element (>=)
first_element = spam[0]
last_element = spam[-1]
print() #this line needs to be completed, printing the result of the comparison

#Convert the list spam to a tuple
tuple_spam =
print()

Assignment 3

import numpy as np
Create a numpy array of numbers between 1 and 10000 (tip: use np.arange())
Calculate the mean of the array (tip: np.mean())
Filter the even numbers (tip: even number % 2 == 0) and save the results
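
Filtering a numpy array with a condition works through a boolean mask. The sketch below uses a toy array and keeps multiples of 3 rather than even numbers; the names `numbers`, `mask` and `multiples_of_three` are assumptions made up for this example.

```python
import numpy as np

# Toy array 1..10 (np.arange, like range, excludes the stop value).
numbers = np.arange(1, 11)

# Comparing an array with a value gives an array of True/False, one per element.
mask = numbers % 3 == 0

# Indexing with the mask keeps only the elements where the mask is True.
multiples_of_three = numbers[mask]

print(np.mean(numbers))
print(multiples_of_three)
```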



In [ ]:

    
#import numpy as np
import numpy as np

#Create a numpy array of numbers between 1 and 10000 (tip: use np.arange())
array =
print(array)


#Calculate the mean of the array (tip: np.mean())
mean =
print(mean)


#Filter the even numbers (tip: the remainder of dividing an even number by 2 is 0) and save the results to array_even
filter_condition = 
array_even = array[filter_condition]
print(array_even)

Assignment 4

Reading data
- a standard csv
- a csv using "\t"
- a csv using a different encoding (try several until one works)
- a csv having no header
- a csv using "m" as missing values (instead of "nan" or "" or "-9")
Read a toy dataset on alcohol consumption with four columns:
- adults: number of adults in household
- kids: number of kids in household
- income: average income
- consume: consume alcohol
Print the top of the file (.head()), get descriptive statistics (.describe()) and make a contingency table visualizing the relationship between number of adults and number of kids (pd.crosstab())
Visualize the relationship between number of adults, alcohol consumption and income using the right type of plot (among the ones we learned). Think about what should be the y variable, the x variable and the hue.
What plot would you use to model the number of adults vs income?
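
The reading options listed above are all arguments of `pd.read_csv`. Here is a minimal sketch on made-up, in-memory data, so it does not touch the homework files. The text in `raw`, the name `toy` and the numbers are assumptions for illustration only. A different encoding would be handled the same way, through the `encoding` argument.

```python
import io

import pandas as pd

# Made-up data: tab-separated, no header row, "m" marks a missing value.
raw = (
    "2\t0\t31000\n"
    "1\t1\tm\n"
    "2\t1\t27000\n"
    "2\t0\t35000\n"
    "1\t0\t22000\n"
)

toy = pd.read_csv(
    io.StringIO(raw),                    # a file path would go here instead
    sep="\t",                            # columns are separated by tabs
    header=None,                         # the first row is data, not column names
    names=["adults", "kids", "income"],  # so we supply the names ourselves
    na_values=["m"],                     # treat "m" as missing (NaN)
)

print(toy.head())
print(toy.describe())

# Contingency table: how many households have each adults/kids combination.
table = pd.crosstab(toy["adults"], toy["kids"])
print(table)
```

Because "m" is turned into NaN, the income column is read as numeric, and `.describe()` simply skips the missing entry.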



In [ ]:

    
#import pandas
import pandas as pd



In [ ]:

    
#Read a standard csv, no strange things
filename = "data/hw1_csv_st.csv"
df_st = 
df_st.head()



In [ ]:

    
#Read a standard csv, careful, the file uses tabs ("\t") as separators
filename = "data/hw1_csv_tab.csv"
df_tab = 
df_tab.head()



In [11]:

    
#Read a standard csv, careful, the file uses a different encoding (not UTF-8)
filename = "data/hw1_csv_enc.csv"
df_enc = pd.read_csv(filename,encoding="iso-8859-1")
df_enc.head()



In [ ]:

    
#Read a standard csv, careful, the file does not have a header
filename = "data/hw1_csv_header.csv"
df_hea = 
df_hea.head()



In [ ]:

    
#Read a standard csv, careful, the file uses "m" to indicate a missing value
filename = "data/hw1_csv_weird_na.csv"
df_na = 
df_na.head()



In [ ]:

    
#read stata file (the file is in "data/alcohol.dta" and it is a stata file)
df = 

#print top of the file to explore it (.head())

What type of variables are in the dataset?

For help check: https://www.boundless.com/statistics/textbooks/boundless-statistics-textbook/visualizing-data-3/the-histogram-18/types-of-variables-87-4406/

answer here

adults:
kids:
income:
consume:



In [ ]:

    
#print descriptive statistics and interpret them (.describe())
df.



In [ ]:

    
#keep only the households without kids and use this dataset for the rest of the assignment
filtering_condition =
df_nokids = df.loc[filtering_condition]



In [49]:

    
df = pd.read_stata("data/alcohol.dta")



In [31]:

    
#visualize the relationship between number of adults, alcohol consumption and income using the right type of plot



In [ ]:

    
#save the plot as a pdf with the name "hw1_plot.pdf"

Interpret the plot you just made

answer here

What plot(s) would you use to model the number of adults vs income? Why?

answer here



In [ ]:

    
#Visualize the relationship between number of adults and number of kids using a contingency table using pd.crosstab(df[x],df[y])

Interpret the table you just made

answer here

Data visualization

We'll talk about data visualization theory in another class but as a general rule remember that SIMPLE IS BETTER (play the video in the cell below)



In [4]:

    
%%html
<!-- TODO -->
<iframe width="560" height="315" src="https://zippy.gfycat.com/ImprobableFemaleBasenji.webm" frameborder="0" allowfullscreen></iframe>



In [7]:

    
#The only honest pie chart
Image(url="http://www.datavis.ca/gallery/images/pies/PiesIHaveEaten.png")









    Out[7]:

Assignment 5: Explain a figure

All the plots below have the same correlation (r ≈ 0.82, R^2 ≈ 0.67)

Edit the cell below to answer the questions

What can we say from these plots?

answer here

Now change the "alpha" parameter in the code below from 0 to 1 and run the cell.

answer here

What can we say now about the plots?

answer here



In [8]:

    
import seaborn as sns
sns.set(style="ticks")

# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe")

# Show the results of a linear regression within each dataset
# Note: `height` replaces the older `size` argument (renamed in seaborn 0.9)
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 0})









    Out[8]:





<seaborn.axisgrid.FacetGrid at 0x7f2f3dade048>

Assignment 6: Now it's your turn

Download a csv from here: https://vincentarelbundock.github.io/Rdatasets/datasets.html
Upload it to the data folder
Read the file with pandas
Check its head and describe it
Write a couple of lines about what types of variables we have in the csv (e.g. quantitative, qualitative...)
Visualize the relationship between 2 or 3 variables with a scatter plot, a line plot or a boxplot
Explain what the plot shows



In [ ]:

	Unnamed: 0	index	person	year	treatment	score
0	0	0	1	2000	1	4
1	1	1	2	2000	1	3
2	2	2	3	2000	2	6
3	3	3	4	2000	2	4
4	4	4	1	2005	1	8