Homework 2

Due date: Thursday 16th 23:59

  • Write your own code in the blanks. It is okay to collaborate with other students, but both students must write their own code and name the other student in this cell. If you adapt code from other sources, you must also give the author credit (a comment with the link to the source suffices)
  • Complete the blanks, adding comments to explain what you are doing
  • Each plot must have labels

Collaborated with:



In [159]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell, but run it

#Print the plots in this screen
%matplotlib inline

#The libraries we use throughout the notebook
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Be able to show images saved on the hard drive and to render HTML
from IPython.display import Image, display, HTML

#Make the notebook wider
display(HTML("<style>.container { width:90% !important; }</style>"))


Assignment 1 (ungraded but important). Read some tutorials

Please play with these notebooks (you should have them in the class1 folder if you downloaded them during hw1):

  • Chapter 4 - Find out on which weekday people bike the most with groupby and aggregate.ipynb
  • Chapter 5 - Combining dataframes and scraping Canadian weather data.ipynb
  • Chapter 6 - String Operations- Which month was the snowiest.ipynb
  • Chapter 7 - Cleaning up messy data.ipynb

Read about outliers: http://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/

Assignment 2

  • Create a dictionary of names and heights using two lists

  • Get the heights of "Alice" and "Bob"

  • Add the height of "Holly"


In [ ]:
#Create a dictionary of names and heights using the following two lists (tip: use dict() and zip())
names = ["Alice","Bob","Chris","Dylan","Esther","Fred","Greg"]
heights = [1.74,1.9,1.6,1.8,1.6,1.8,1.7]
d_names2heights = 

#Get the heights of "Alice" and "Bob" 


#Add the height of "Holly" (1.98)
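
One possible way to fill in the blanks above (a sketch, not the only valid solution):

In [ ]:
#zip() pairs each name with its height; dict() turns those pairs into a dictionary
d_names2heights = dict(zip(names, heights))

#get the heights of "Alice" and "Bob" by key lookup
print(d_names2heights["Alice"])
print(d_names2heights["Bob"])

#add the height of "Holly" with a new key
d_names2heights["Holly"] = 1.98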

Assignment 3

  • Create your own function, called mean_column_stata(), with the following inputs and outputs
input: 
filename: the filename of a stata file to read
column: the name of the column to calculate the mean of

output: the mean of the values in that column
  • This function must use our own function (mean_ours())

  • Use this function to calculate the mean of the "income" column in the dataset "data/alcohol.dta"

    • i.e. mean_column_stata(filename="data/alcohol.dta",column="income") must return 649.528

In [189]:
## Our own functions
def mean_ours(list_numbers): #list_numbers is the argument
    """
    This is called the docstring; it is a comment describing the function. In this case, the function calculates the mean of a list of numbers.
    
    input 
    list_numbers: a list of numbers
    
    output: the mean of the input
    
    """
    #calculate the mean and give it back
    m_list = sum(list_numbers)/len(list_numbers)
    return m_list

In [190]:
m_list = mean_ours([1,3,2])

In [191]:
print(m_list)


2.0

In [ ]:
#define the function, with the arguments filename and column
def mean_column_stata(filename,column):
    """
    input: 
    filename: the filename of a stata file to read
    column: the name of the column to calculate the mean of

    output: the mean of the values in that column
    """
    #read stata file
    df = 
    
    #keep the values of the column 
    values_column = 
    
    #calculate the mean of that column using mean_ours
    mean_column = 
    
    #return that mean
    return

mean = mean_column_stata(filename="data/alcohol.dta",column="income")
print(mean)

mean = mean_column_stata(filename="data/alcohol2.dta",column="income")
print(mean)
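
For reference, a minimal sketch of how the blanks above could be filled. It assumes pandas is available as pd and that the column contains no missing values, since mean_ours() does not skip NaN:

In [ ]:
#a possible solution sketch
def mean_column_stata(filename,column):
    """
    input: 
    filename: the filename of a stata file to read
    column: the name of the column to calculate the mean of

    output: the mean of the values in that column
    """
    #read the stata file into a dataframe
    df = pd.read_stata(filename)
    
    #keep the values of the column as a list
    values_column = list(df[column])
    
    #calculate the mean of that column using mean_ours
    mean_column = mean_ours(values_column)
    
    #return that mean
    return mean_column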

Assignment 4

  • Read the two dataframes: "data/green_area_pc.csv" and "data/pop_dens.csv"
  • Check their heads
  • Merge them using as an argument how="outer" (this will keep missing values)
  • Create a new variable "fraction_green" measuring the fraction of green area (m^2 green)/(m^2 city).
  • Sort the dataset by this new variable (.sort_values(by= ,ascending=)) to find the city with the largest fraction of green area.
  • Sort the dataset by "fraction_green" (.sort_values(by= ,ascending=)) to find the city with the lowest fraction of green area. Use the argument "na_position". To see its possible values, run: df.sort_values?
  • Make a column with the country code
  • Keep the metropolitan areas in the Netherlands and Italy
  • Make a plot (choose the appropriate type) of "GREEN_AREA_PC" vs "fraction_green", colouring by country
  • Make a jointplot of "GREEN_AREA_PC" vs "fraction_green" in Italy, of the type hex
  • Plot the distribution of "fraction_green" for both countries


In [ ]:
#read the two dataframes

#print their heads

#merge the two files using as an argument how="outer" (this will keep missing values)

#Create a new variable "fraction_green" measuring the fraction of green area (m^2 green)/(m^2 city)

#Sort the dataset by "fraction_green" (.sort_values(by= ,ascending=)) to find the city with the largest fraction of green area.

#Sort the dataset by "fraction_green" (.sort_values(by= ,ascending=)) to find the city with the lowest fraction of green area. Use the argument "na_position" to get rid of the na values. 

#Make a column with the country code

#Keep the metropolitan areas in the Netherlands and Italy

#Make a plot (choose the appropriate type) of "GREEN_AREA_PC" vs "fraction_green", colouring by country

#Make a jointplot of "GREEN_AREA_PC" vs "fraction_green" in Italy, of the type hex

#Plot the distribution of "fraction_green" for both countries
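
A possible sketch of these steps is below. The column names "GREEN_AREA_PC", "POP_DENS" and "METRO_ID", and the unit assumptions (GREEN_AREA_PC in m^2 of green per inhabitant, POP_DENS in inhabitants per km^2), are guesses and must be checked against the actual heads of the files:

In [ ]:
#a possible sketch; check the column names against the real files
df_green = pd.read_csv("data/green_area_pc.csv")
df_dens = pd.read_csv("data/pop_dens.csv")

#print their heads
print(df_green.head())
print(df_dens.head())

#an outer merge keeps rows that appear in only one of the files (with NaNs)
df_m = pd.merge(df_green, df_dens, how="outer")

#if GREEN_AREA_PC is m^2 of green per person and POP_DENS is persons per km^2,
#their product is m^2 of green per km^2 of city; dividing by 10**6 gives a fraction
df_m["fraction_green"] = df_m["GREEN_AREA_PC"] * df_m["POP_DENS"] / 10**6

#largest fraction of green area first
print(df_m.sort_values(by="fraction_green", ascending=False).head(1))

#lowest fraction first, pushing the NaN values to the end
print(df_m.sort_values(by="fraction_green", ascending=True, na_position="last").head(1))

#first two characters of METRO_ID as the country code
df_m["country"] = df_m["METRO_ID"].str[:2]

#keep the metropolitan areas in the Netherlands and Italy
df_ni = df_m.loc[df_m["country"].isin(["NL","IT"])]

#scatter plot coloured by country
sns.lmplot(x="GREEN_AREA_PC", y="fraction_green", hue="country", data=df_ni, fit_reg=False)
plt.show()

#hexbin jointplot for Italy only
sns.jointplot(x="GREEN_AREA_PC", y="fraction_green",
              data=df_ni.loc[df_ni["country"]=="IT"], kind="hex")
plt.show()

#distribution of fraction_green for each country (distplot is histplot in newer seaborn)
for country in ["NL","IT"]:
    sns.distplot(df_ni.loc[df_ni["country"]==country, "fraction_green"].dropna(), label=country)
plt.xlabel("fraction_green")
plt.legend()
plt.show()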

What are the plots telling us?

answer here

Assignment 5

  • Read the dataset on cities using read_our_csv() (don't forget to run it first)
  • Make a column with the country code
  • Keep the metropolitan areas in the following countries keep_countries = ['UK', 'AU', 'IT', 'NL', 'FR', 'PL', 'GR', 'HU', 'PO', 'PT', 'AT', 'DK', 'NO', 'EE', 'CZ', 'IE', 'DE', 'FI', 'ES', 'SW', 'CH', 'SK', 'US', 'GB', 'BE']
  • Keep the years 2000 and 2010
  • Create a dictionary matching those countries to a region: "Anglosphere", "South_EU", "Eastern_EU", "Scandinavia","Rest_EU"
  • Make a column with the region of the country
  • Make a plot visualizing "GDP_PC" vs "Year" for each region. The plot must show the variability within region
  • Make a factorplot visualizing "UNEMP_R" vs "Year" for each region. Use ci=10 for the confidence intervals.

In [ ]:
def read_our_csv():
    #reading the raw data from oecd
    df = pd.read_csv("data/CITIES_19122016195113034.csv",sep="\t")

    #fixing the columns (the raw file names the first one '"METRO_ID"', with extra quote characters, instead of METRO_ID)
    cols = list(df.columns)
    cols[0] = "METRO_ID"
    df.columns = cols
    
    #pivot the table
    column_with_values = "Value"
    column_to_split = ["VAR"]
    variables_already_present = ["METRO_ID","Metropolitan areas","Year"]
    df_fixed = df.pivot_table(column_with_values,
                 variables_already_present,
                 column_to_split).reset_index()
    
    return df_fixed

In [ ]:
#read the dataset on cities using read_our_csv()
df = read_our_csv()
#Make a column with the country code

#Keep the metropolitan areas in the following countries
keep_countries = ['UK', 'AU', 'IT', 'NL', 'FR', 'PL', 'GR', 'HU', 'PO', 'PT', 'AT', 'DK', 'NO', 'EE', 'CZ', 'IE', 'DE', 'FI', 'ES', 'SW', 'CH', 'SK', 'US', 'GB', 'BE']

#Keep the years 2000 and 2010

#Create a dictionary matching those countries to a region: "Anglosphere", "South_EU", "Eastern_EU", "Scandinavia", "Rest_EU"
d_country2region = {"UK": "Anglosphere",
                    "AU": "Anglosphere",
                    ...}

#Make a column with the region of the country

#Make a plot visualizing "GDP_PC" vs "Year" for each region. The plot must show the variability within region

#Make a factorplot visualizing "UNEMP_R" vs "Year" for each region. Use ci=10 for the confidence intervals.
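
A sketch of how the steps above could look. It assumes, as in read_our_csv(), that METRO_ID starts with the two-letter country code; d_country2region must first be completed so that it covers every country in keep_countries:

In [ ]:
#a possible sketch (complete d_country2region before running it)
#the first two characters of METRO_ID are the country code
df["country"] = df["METRO_ID"].str[:2]

#keep only the selected countries and the years 2000 and 2010
df = df.loc[df["country"].isin(keep_countries)]
df = df.loc[df["Year"].isin([2000, 2010])]

#map each country to its region
df["region"] = df["country"].map(d_country2region)

#boxplots show the variability of GDP_PC within each region, split by year
sns.boxplot(x="region", y="GDP_PC", hue="Year", data=df)
plt.ylabel("GDP_PC")
plt.show()

#factorplot of unemployment vs year by region (called catplot in newer seaborn)
sns.factorplot(x="Year", y="UNEMP_R", hue="region", data=df, ci=10)
plt.show()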

In [178]:
df.drop_duplicates("Metropolitan areas").loc[:,["Metropolitan areas","CO2_PC","GREEN_AREA_PC"]].head(10)


Out[178]:
VAR Metropolitan areas CO2_PC GREEN_AREA_PC
0 Vienna 12.77 255.17
15 Graz 10.23 997.39
30 Linz 14.25 1791.92
45 Australia 19.52 NaN
60 Sydney 9.37 55.38
75 Melbourne 11.22 44.00
90 Brisbane 11.03 697.75
105 Perth 7.79 54.60
120 Adelaide 8.15 32.61
135 Gold Coast-Tweed Heads 4.25 285.41

What are the plots telling us?

answer here

Assignment 6

We want to plot labour productivity vs unemployment as a function of the year. For this, we use a custom scatter plot where the color indicates the year.

  • Complete the cell starting with #use read_our_csv() to read the csv.
  • Complete the cell starting with def plot_group(cities,data):

Then answer the following questions

  • Is there any relationship between unemployment and productivity?
  • Can you think of any circumstances where high unemployment causes high productivity the next year?
  • Can you think of any circumstances where low unemployment causes high productivity the next year?
  • Can you think of any circumstances where high productivity causes high unemployment the next year?
  • Can you think of any circumstances where low productivity causes high unemployment the next year?
  • Can you think of any way to distinguish the four alternatives? (don't dwell too much on this question)
  • Can you think of any reason for the differences between the Spanish and Italian cities?

In [169]:
def plot_group(cities,data):
    """
    plots the labour productivity vs unemployment as a function of the year
    
    input
    cities: list of cities
    data: dataframe with the columns "LABOUR_PRODUCTIVITY","UNEMP_R","Year"
    
    """
    cmaps = ["Blues","Greens","Reds","Oranges","Purples"]

    for i,city in enumerate(cities):
        #keep the rows where the metropolitan areas == city
        gr = data.loc[data["Metropolitan areas"]==city]

        #make the plot
        plt.plot(gr["LABOUR_PRODUCTIVITY"],gr["UNEMP_R"],color="gray")
        plt.scatter(gr["LABOUR_PRODUCTIVITY"],gr["UNEMP_R"],c=gr["Year"],edgecolor="none",cmap=cmaps[i])

    #Add labels
    plt.xlabel("LABOUR_PRODUCTIVITY")
    plt.ylabel("UNEMP_R")

    plt.colorbar(label="Year")
    #show the plot
    plt.show()


In [170]:
#use read_our_csv() to read the csv
df = read_our_csv()

#keep the years greater than 2004 and save it with the name data
data = df.loc[df["Year"]>2004]

plot_group(["Valencia","Madrid","Barcelona","Zaragoza","Las Palmas"],data)
plot_group(["Rome","Bari","Genova","Turin"],data)


Assignment 7: Data visualization

  • Explain what you think is wrong with the following figure. Would you use a different type of visualization? There are many correct ways to answer, and I'm not looking for a perfect critique since we haven't covered data visualization theory yet.

Edit the cell below with the answer

This figure is interesting: based on it, NASA decided that there was no relationship between temperature and the failure of one of the components of the Challenger. There were also 17 launches with no incidents between 65 and 75 degrees. Because of this "no relationship", they decided to launch even when the temperature was 31F (0C); the component failed and the shuttle exploded: https://en.wikipedia.org/wiki/Space_Shuttle_Challenger

  • What do you think this plot shows?

answer here


  • What is wrong (or can be improved) with the type of plot?

answer here


In [93]:
display(Image(url="http://www.datavis.ca/gallery/images/nasa01.gif"))