In [ ]:
# Homework 1
## Due date: Friday 13th
- Write your own code in the blanks. Collaborating with other students is fine, but each student must write their own code and name their collaborators in this cell. If you adapt code from other sources, you must also credit the author (a comment with a link to the source suffices)
- Complete the blanks, adding comments to explain what you are doing
- Assignment 4 will be weighted more
Collaborated with:
First run this cell:
In [52]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell but run it
#Print the plots in this screen
%matplotlib inline
#Be able to plot images saved in the hard drive
from IPython.display import Image, display, HTML
#Make the notebook wider
display(HTML("<style>.container { width:90% !important; }</style>"))
Next, run the next cell to download some tutorials on pandas:
After downloading them, go to the Jupyter dashboard and you'll see a folder pandas-cookbook. Inside it is a folder cookbook, where the tutorials are placed. Please play with these ones:
- A quick tour of the IPython Notebook: Shows off IPython’s awesome tab completion and magic functions.
- Chapter 1: Reading your data into pandas is pretty much the easiest thing. Even when the encoding is wrong!
- Chapter 2: It’s not totally obvious how to select data from a pandas dataframe. Here we explain the basics (how to take slices and get columns)
- Chapter 3: Here we get into serious slicing and dicing and learn how to filter dataframes in complicated ways, really fast.
In [2]:
!git clone https://github.com/jvns/pandas-cookbook.git
In [ ]:
## Assignment 2 (20)
#Create a list of numbers between 1 and 100 (tip: use range(); the stop value is excluded)
spam = list(range(1,101))
print(spam)
#Calculate the mean of the list (tip: use sum() and len())
mean =
print(mean)
#Calculate the mean of the first half of the list (tip: slice the list)
mean_first_half = spam[:50]
print(mean_first_half)
#Check if the first element is greater than or equal to the last element (>=)
first_element = spam[0]
last_element = spam[-1]
print() #this line needs to be completed, printing the result of the comparison
#Convert the list spam to a tuple
tuple_spam =
print()
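As a hint for the cell above, here is a minimal sketch of the same operations on a small made-up list. The name `toy` and its values are assumptions for illustration, not part of the assignment.

```python
# Hypothetical toy list, used only to illustrate the operations
toy = [2, 4, 6, 8]

# Mean = sum of the elements divided by the number of elements
toy_mean = sum(toy) / len(toy)

# Slicing: toy[:n] keeps the first n elements
half = len(toy) // 2
toy_first_half = toy[:half]

# A comparison evaluates to a boolean (True or False)
first_ge_last = toy[0] >= toy[-1]

# tuple() builds an immutable copy of the list
toy_tuple = tuple(toy)

print(toy_mean, toy_first_half, first_ge_last, toy_tuple)
```

Note that `sum(...) / len(...)` returns a float in Python 3, even when the inputs are integers.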
In [ ]:
#import numpy as np
import numpy as np
#Create a numpy array of numbers between 1 and 10000 (tip: use np.arange())
array =
print(array)
#Calculate the mean of the array (tip: np.mean())
mean =
print(mean)
#Filter the even numbers (tip: the remainder of dividing an even number by 2 is 0) and save the results to array_even
filter_condition =
array_even = array[filter_condition]
print(array_even)
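Filtering with a condition works through a boolean mask. The sketch below shows the idea on a small made-up array, keeping multiples of 3 rather than even numbers. The names `toy_array` and `is_multiple_of_3` are assumptions for illustration.

```python
import numpy as np

# Hypothetical small array: np.arange(start, stop) excludes stop
toy_array = np.arange(1, 11)  # 1, 2, ..., 10

# Comparing an array element-wise returns an array of booleans (the mask)
is_multiple_of_3 = toy_array % 3 == 0

# Indexing with the mask keeps only the positions where it is True
multiples_of_3 = toy_array[is_multiple_of_3]

print(is_multiple_of_3)
print(multiples_of_3)
```

The mask has the same length as the array, so each element gets its own True or False.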
Reading data
Read a toy dataset on alcohol consumption with four columns.
Print the top of the file (.head()), get descriptive statistics (.describe()) and make a contingency table visualizing the relationship between number of adults and number of kids (pd.crosstab())
Visualize the relationship between number of adults, alcohol consumption and income using the right type of plot (among the ones we learned). Think about what should be the y variable, the x variable and the hue.
What plot would you use to model the number of adults vs income?
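To see what `.head()`, `.describe()` and `pd.crosstab()` return before using them on the real data, here is a minimal sketch on a made-up DataFrame. The column names `adults` and `kids` and all the values are assumptions, not the columns of the homework dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up
toy = pd.DataFrame({
    "adults": [1, 2, 2, 1, 2],
    "kids":   [0, 0, 1, 2, 1],
})

print(toy.head(3))     # first 3 rows
print(toy.describe())  # count, mean, std, min, quartiles and max per column

# Rows = values of the first argument, columns = values of the second,
# cells = number of rows in toy with that combination
table = pd.crosstab(toy["adults"], toy["kids"])
print(table)
```

Combinations that never occur in the data show up as 0 in the table.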
In [ ]:
#import pandas
import pandas as pd
In [ ]:
#Read a standard csv, no strange things
filename = "data/hw1_csv_st.csv"
df_st =
df_st.head()
In [ ]:
#Read a standard csv, careful, the file uses tabs ("\t") as separators
filename = "data/hw1_csv_tab.csv"
df_tab =
df_tab.head()
In [11]:
#Read a standard csv, careful, the file uses a different encoding (not UTF-8)
filename = "data/hw1_csv_enc.csv"
df_enc = pd.read_csv(filename,encoding="iso-8859-1")
df_enc.head()
Out[11]:
In [ ]:
#Read a standard csv, careful, the file does not have a header
filename = "data/hw1_csv_header.csv"
df_hea =
df_hea.head()
In [ ]:
#Read a standard csv, careful, the file uses "m" to indicate a missing value
filename = "data/hw1_csv_weird_na.csv"
df_na =
df_na.head()
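The cases above (separator, missing header, custom missing-value marker) are all handled by arguments of `pd.read_csv`. Here is a minimal sketch that combines them on a made-up in-memory file, so it runs without the homework data. The contents of `raw` and the column names `a`, `b`, `c` are assumptions.

```python
import io
import pandas as pd

# Hypothetical in-memory file: tab-separated, no header row, "m" marks missing values
raw = "1\t2\tm\n3\tm\t4\n"

toy = pd.read_csv(
    io.StringIO(raw),
    sep="\t",               # column separator
    header=None,            # the first row is data, not column names
    names=["a", "b", "c"],  # hypothetical column names
    na_values=["m"],        # extra strings to treat as NaN
)

print(toy)
print(toy.isna().sum())     # number of missing values per column
```

Each real file in this section needs only one of these arguments, so check the raw file first to see which one applies.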
In [ ]:
#read stata file (the file is in "data/alcohol.dta" and it is a stata file)
df =
#print top of the file to explore it (.head())
What type of variables are in the dataset?
For help check: https://www.boundless.com/statistics/textbooks/boundless-statistics-textbook/visualizing-data-3/the-histogram-18/types-of-variables-87-4406/
answer here
In [ ]:
#print descriptive statistics and interpret them (.describe())
df.
In [ ]:
#keep only the households without kids and use this dataset for the rest of the assignment
filtering_condition =
df_nokids = df.loc[filtering_condition]
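Filtering rows with `.loc` uses the same boolean-mask idea as numpy. A minimal sketch on a made-up DataFrame follows. The columns `group` and `value` are assumptions and do not exist in the homework dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up
toy = pd.DataFrame({
    "group": ["a", "b", "a", "c"],
    "value": [1, 2, 3, 4],
})

# Comparing a column returns a boolean Series, one True/False per row
condition = toy["group"] == "a"

# .loc with a boolean Series keeps only the rows where it is True
toy_a = toy.loc[condition]
print(toy_a)
```

The original index labels are kept, so `toy_a` here has index 0 and 2.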
In [49]:
df = pd.read_stata("data/alcohol.dta")
In [31]:
#visualize the relationship between number of adults, alcohol consumption and income using the right type of plot
In [ ]:
#save the plot as a pdf with the name "hw1_plot.pdf"
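One way to save a figure is matplotlib's `savefig`, where the file extension selects the output format. The sketch below uses made-up data and writes to a temporary folder under the hypothetical name `toy_plot.pdf`, so it does not touch your homework files.

```python
import os
import tempfile
import matplotlib.pyplot as plt

# Hypothetical data, only to have something to draw
fig, ax = plt.subplots()
ax.scatter([1, 2, 3], [2, 4, 3])
ax.set_xlabel("x")
ax.set_ylabel("y")

# The ".pdf" extension selects the format; bbox_inches="tight" trims the margins
out_path = os.path.join(tempfile.gettempdir(), "toy_plot.pdf")  # hypothetical file name
fig.savefig(out_path, bbox_inches="tight")
plt.close(fig)

print(out_path)
```

In a notebook, save the figure in the same cell that draws it, otherwise the saved file may be empty.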
Interpret the plot you just made
answer here
What plot(s) would you use to model the number of adults vs income? Why?
answer here
In [ ]:
#Visualize the relationship between number of adults and number of kids with a contingency table, using pd.crosstab(df[x],df[y])
Interpret the table you just made
answer here
In [4]:
%%html
<!-- TODO -->
<iframe width="560" height="315" src="https://zippy.gfycat.com/ImprobableFemaleBasenji.webm" frameborder="0" allowfullscreen></iframe>
In [7]:
#The only honest pie chart
Image(url="http://www.datavis.ca/gallery/images/pies/PiesIHaveEaten.png")
Out[7]:
answer here
answer here
answer here
In [8]:
import seaborn as sns
sns.set(style="ticks")
# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe")
# Show the results of a linear regression within each dataset
# Note: the "size" argument was renamed to "height" in seaborn 0.9
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 0})
Out[8]:
In [ ]: