Homework 2

Due date: Thursday 16th 23:59

  • Write your own code in the blanks. It is okay to collaborate with other students, but both students must write their own code and name the other student in this cell. If you adapt code from other sources, you must also give the author credit (a comment with the link to the source suffices)
  • Complete the blanks, adding comments to explain what you are doing
  • Each plot must have labels

Collaborated with:



In [159]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell, but run it

#Print the plots in this screen
%matplotlib inline

#The libraries we use throughout the notebook
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Be able to show images saved on the hard drive and to render HTML
from IPython.display import Image, display, HTML

#Make the notebook wider
display(HTML("<style>.container { width:90% !important; }</style>"))


Assignment 1 (ungraded but important). Read some tutorials

Please play with these notebooks (you should have them in the class1 folder if you downloaded them during hw1):

  • Chapter 4 - Find out on which weekday people bike the most with groupby and aggregate.ipynb
  • Chapter 5 - Combining dataframes and scraping Canadian weather data.ipynb
  • Chapter 6 - String Operations- Which month was the snowiest.ipynb
  • Chapter 7 - Cleaning up messy data.ipynb

Read about outliers: http://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/

Assignment 2

  • Create a dictionary of names and heights using two lists

  • Get the heights of "Alice" and "Bob"

  • Add the height of "Holly"


In [ ]:
#Create a dictionary of names and heights using the following two lists (tip: use dict() and zip())
names = ["Alice","Bob","Chris","Dylan","Esther","Fred","Greg"]
heights = [1.74,1.9,1.6,1.8,1.6,1.8,1.7]
d_names2heights = 

#Get the heights of "Alice" and "Bob" 


#Add the height of "Holly" (1.98)
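
One possible way to fill in the blanks above (a sketch, not the only valid solution):

In [ ]:
#zip() pairs each name with its height; dict() turns those pairs into a dictionary
d_names2heights = dict(zip(names, heights))

#get the heights of "Alice" and "Bob" by key lookup
print(d_names2heights["Alice"])
print(d_names2heights["Bob"])

#add the height of "Holly" with a new key
d_names2heights["Holly"] = 1.98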

Assignment 3

  • Create your own function, called mean_column_stata(), with the following inputs and outputs
input: 
filename: the filename of a stata file to read
column: the name of the column to calculate the mean of

output: the mean of the values in that column
  • This function must use our own function (mean_ours())

  • Use this function to calculate the mean of the "income" column in the dataset "data/alcohol.dta"

    • i.e. mean_column_stata(filename="data/alcohol.dta",column="income") must return 649.528

In [189]:
## Our own functions
def mean_ours(list_numbers): #list_numbers is the argument
    """
    This is called the docstring; it is a comment describing the function. In this case, the function calculates the mean of a list of numbers.
    
    input 
    list_numbers: a list of numbers
    
    output: the mean of the input
    
    """
    #calculate the mean and give it back
    m_list = sum(list_numbers)/len(list_numbers)
    return m_list

In [190]:
m_list = mean_ours([1,3,2])

In [191]:
print(m_list)


2.0

In [ ]:
#define the function, with the arguments filename and column
def mean_column_stata(filename,column):
    """
    input: 
    filename: the filename of a stata file to read
    column: the name of the column to calculate the mean of

    output: the mean of the values in that column
    """
    #read stata file
    df = 
    
    #keep the values of the column 
    values_column = 
    
    #calculate the mean of that column using mean_ours
    mean_column = 
    
    #return that mean
    return

mean = mean_column_stata(filename="data/alcohol.dta",column="income")
print(mean)

mean = mean_column_stata(filename="data/alcohol2.dta",column="income")
print(mean)
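
For reference, a minimal sketch of how the blanks above could be filled. It assumes pandas is available as pd and that the column contains no missing values, since mean_ours() does not skip NaN:

In [ ]:
#a possible solution sketch
def mean_column_stata(filename,column):
    """
    input: 
    filename: the filename of a stata file to read
    column: the name of the column to calculate the mean of

    output: the mean of the values in that column
    """
    #read the stata file into a dataframe
    df = pd.read_stata(filename)
    
    #keep the values of the column as a list
    values_column = list(df[column])
    
    #calculate the mean of that column using mean_ours
    mean_column = mean_ours(values_column)
    
    #return that mean
    return mean_column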

Assignment 4

  • Read the two dataframes: "data/green_area_pc.csv" and "data/pop_dens.csv"
  • Check their heads
  • Merge them using as an argument how="outer" (this will keep missing values)
  • Create a new variable "fraction_green" measuring the fraction of green area (m^2 green)/(m^2 city).
  • Sort the dataset by this new variable (.sort_values(by= ,ascending=)) to find the city with the largest fraction of green area.
  • Sort the dataset by "fraction_green" (.sort_values(by= ,ascending=)) to find the city with the lowest fraction of green area. Use the argument "na_position". To see its possible values, run: df.sort_values?
  • Make a column with the country code
  • Keep the metropolitan areas in the Netherlands and Italy
  • Make a plot (choose the appropriate type) of "GREEN_AREA_PC" vs "fraction_green", colouring by country
  • Make a jointplot of "GREEN_AREA_PC" vs "fraction_green" in Italy, of the type hex
  • Plot the distribution of "fraction_green" for both countries


In [ ]:
#read the two dataframes

#print their heads

#merge the two files using as an argument how="outer" (this will keep missing values)

#Create a new variable "fraction_green" measuring the fraction of green area (m^2 green)/(m^2 city)

#Sort the dataset by "fraction_green" (.sort_values(by= ,ascending=)) to find the city with the largest fraction of green area.

#Sort the dataset by "fraction_green" (.sort_values(by= ,ascending=)) to find the city with the lowest fraction of green area. Use the argument "na_position" to get rid of the na values. 

#Make a column with the country code

#Keep the metropolitan areas in the Netherlands and Italy

#Make a plot (choose the appropriate type) of "GREEN_AREA_PC" vs "fraction_green", colouring by country

#Make a jointplot of "GREEN_AREA_PC" vs "fraction_green" in Italy, of the type hex

#Plot the distribution of "fraction_green" for both countries
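
A possible sketch of these steps is below. The column names "GREEN_AREA_PC", "POP_DENS" and "METRO_ID", and the unit assumptions (GREEN_AREA_PC in m^2 of green per inhabitant, POP_DENS in inhabitants per km^2), are guesses and must be checked against the actual heads of the files:

In [ ]:
#a possible sketch; check the column names against the real files
df_green = pd.read_csv("data/green_area_pc.csv")
df_dens = pd.read_csv("data/pop_dens.csv")

#print their heads
print(df_green.head())
print(df_dens.head())

#an outer merge keeps rows that appear in only one of the files (with NaNs)
df_m = pd.merge(df_green, df_dens, how="outer")

#if GREEN_AREA_PC is m^2 of green per person and POP_DENS is persons per km^2,
#their product is m^2 of green per km^2 of city; dividing by 10**6 gives a fraction
df_m["fraction_green"] = df_m["GREEN_AREA_PC"] * df_m["POP_DENS"] / 10**6

#largest fraction of green area first
print(df_m.sort_values(by="fraction_green", ascending=False).head(1))

#lowest fraction first, pushing the NaN values to the end
print(df_m.sort_values(by="fraction_green", ascending=True, na_position="last").head(1))

#first two characters of METRO_ID as the country code
df_m["country"] = df_m["METRO_ID"].str[:2]

#keep the metropolitan areas in the Netherlands and Italy
df_ni = df_m.loc[df_m["country"].isin(["NL","IT"])]

#scatter plot coloured by country
sns.lmplot(x="GREEN_AREA_PC", y="fraction_green", hue="country", data=df_ni, fit_reg=False)
plt.show()

#hexbin jointplot for Italy only
sns.jointplot(x="GREEN_AREA_PC", y="fraction_green",
              data=df_ni.loc[df_ni["country"]=="IT"], kind="hex")
plt.show()

#distribution of fraction_green for each country (distplot is histplot in newer seaborn)
for country in ["NL","IT"]:
    sns.distplot(df_ni.loc[df_ni["country"]==country, "fraction_green"].dropna(), label=country)
plt.xlabel("fraction_green")
plt.legend()
plt.show()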

What are the plots telling us?

answer here

Assignment 5

  • Read the dataset on cities using read_our_csv() (don't forget to run it first)
  • Make a column with the country code
  • Keep the metropolitan areas in the following countries keep_countries = ['UK', 'AU', 'IT', 'NL', 'FR', 'PL', 'GR', 'HU', 'PO', 'PT', 'AT', 'DK', 'NO', 'EE', 'CZ', 'IE', 'DE', 'FI', 'ES', 'SW', 'CH', 'SK', 'US', 'GB', 'BE']
  • Keep the years 2000 and 2010
  • Create a dictionary matching those countries to a region: "Anglosphere", "South_EU", "Eastern_EU", "Scandinavia","Rest_EU"
  • Make a column with the region of the country
  • Make a plot visualizing "GDP_PC" vs "Year" for each region. The plot must show the variability within region
  • Make a factorplot visualizing "UNEMP_R" vs "Year" for each region. Use ci=10 for the confidence intervals.

In [ ]:
def read_our_csv():
    #reading the raw data from oecd
    df = pd.read_csv("data/CITIES_19122016195113034.csv",sep="\t")

    #fixing the columns (the raw file names the first one '"METRO_ID"', with extra quote characters, instead of METRO_ID)
    cols = list(df.columns)
    cols[0] = "METRO_ID"
    df.columns = cols
    
    #pivot the table
    column_with_values = "Value"
    column_to_split = ["VAR"]
    variables_already_present = ["METRO_ID","Metropolitan areas","Year"]
    df_fixed = df.pivot_table(column_with_values,
                 variables_already_present,
                 column_to_split).reset_index()
    
    return df_fixed

In [ ]:
#read the dataset on cities using read_our_csv()
df = read_our_csv()
#Make a column with the country code

#Keep the metropolitan areas in the following countries
keep_countries = ['UK', 'AU', 'IT', 'NL', 'FR', 'PL', 'GR', 'HU', 'PO', 'PT', 'AT', 'DK', 'NO', 'EE', 'CZ', 'IE', 'DE', 'FI', 'ES', 'SW', 'CH', 'SK', 'US', 'GB', 'BE']

#Keep the years 2000 and 2010

#Create a dictionary matching those countries to a region: "Anglosphere", "South_EU", "Eastern_EU", "Scandinavia", "Rest_EU"
d_country2region = {"UK": "Anglosphere",
                    "AU": "Anglosphere",
                    ...}

#Make a column with the region of the country

#Make a plot visualizing "GDP_PC" vs "Year" for each region. The plot must show the variability within region

#Make a factorplot visualizing "UNEMP_R" vs "Year" for each region. Use ci=10 for the confidence intervals.
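
A sketch of how the steps above could look. It assumes, as in read_our_csv(), that METRO_ID starts with the two-letter country code; d_country2region must first be completed so that it covers every country in keep_countries:

In [ ]:
#a possible sketch (complete d_country2region before running it)
#the first two characters of METRO_ID are the country code
df["country"] = df["METRO_ID"].str[:2]

#keep only the selected countries and the years 2000 and 2010
df = df.loc[df["country"].isin(keep_countries)]
df = df.loc[df["Year"].isin([2000, 2010])]

#map each country to its region
df["region"] = df["country"].map(d_country2region)

#boxplots show the variability of GDP_PC within each region, split by year
sns.boxplot(x="region", y="GDP_PC", hue="Year", data=df)
plt.ylabel("GDP_PC")
plt.show()

#factorplot of unemployment vs year by region (called catplot in newer seaborn)
sns.factorplot(x="Year", y="UNEMP_R", hue="region", data=df, ci=10)
plt.show()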

In [178]:
df.drop_duplicates("Metropolitan areas").loc[:,["Metropolitan areas","CO2_PC","GREEN_AREA_PC"]].head(10)


Out[178]:
VAR Metropolitan areas CO2_PC GREEN_AREA_PC
0 Vienna 12.77 255.17
15 Graz 10.23 997.39
30 Linz 14.25 1791.92
45 Australia 19.52 NaN
60 Sydney 9.37 55.38
75 Melbourne 11.22 44.00
90 Brisbane 11.03 697.75
105 Perth 7.79 54.60
120 Adelaide 8.15 32.61
135 Gold Coast-Tweed Heads 4.25 285.41

What are the plots telling us?

answer here

Assignment 6

We want to plot labour productivity vs unemployment as a function of the year. For this, we use a custom scatter plot where the color indicates the year.

  • Complete the cell starting with #use read_our_csv() to read the csv.
  • Complete the cell starting with def plot_group(cities,data):

Then answer the following questions

  • Is there any relationship between unemployment and productivity?
  • Can you think of any circumstances where high unemployment causes high productivity the next year?
  • Can you think of any circumstances where low unemployment causes high productivity the next year?
  • Can you think of any circumstances where high productivity causes high unemployment the next year?
  • Can you think of any circumstances where low productivity causes high unemployment the next year?
  • Can you think of any way to distinguish the four alternatives? (don't dwell too much on this question)
  • Can you think of any reason for the differences between the Spanish and Italian cities?

In [169]:
def plot_group(cities,data):
    """
    plots the labour productivity vs unemployment as a function of the year
    
    input
    cities: list of cities
    data: dataframe with the columns "LABOUR_PRODUCTIVITY","UNEMP_R","Year"
    
    """
    cmaps = ["Blues","Greens","Reds","Oranges","Purples"]

    for i,city in enumerate(cities):
        #keep the rows where the metropolitan areas == city
        gr = data.loc[data["Metropolitan areas"]==city]

        #make the plot
        plt.plot(gr["LABOUR_PRODUCTIVITY"],gr["UNEMP_R"],color="gray")
        plt.scatter(gr["LABOUR_PRODUCTIVITY"],gr["UNEMP_R"],c=gr["Year"],edgecolor="none",cmap=cmaps[i])

    #Add labels
    plt.xlabel("LABOUR_PRODUCTIVITY")
    plt.ylabel("UNEMP_R")

    plt.colorbar(label="Year")
    #show the plot
    plt.show()


In [170]:
#use read_our_csv() to read the csv
df = read_our_csv()

#keep the years greater than 2004 and save it with the name data
data = df.loc[df["Year"]>2004]

plot_group(["Valencia","Madrid","Barcelona","Zaragoza","Las Palmas"],data)
plot_group(["Rome","Bari","Genova","Turin"],data)


Assignment 7: Data visualization

  • Explain what you think is wrong with the following figure. Would you use a different type of visualization? There are many correct ways to answer, and I'm not looking for a perfect critique since we haven't covered data visualization theory yet.

Edit the cell below with the answer

This figure is interesting: based on it, NASA decided that there was no relationship between temperature and the failure of one of the components of the Challenger. There were also 17 launches with no incidents between 65 and 75 degrees. Because of this "no relationship", they decided to launch even when the temperature was 31F (0C); the component failed and the shuttle exploded: https://en.wikipedia.org/wiki/Space_Shuttle_Challenger

  • What do you think this plot shows?

answer here


  • What is wrong (or can be improved) with the type of plot?

answer here


In [93]:
display(Image(url="http://www.datavis.ca/gallery/images/nasa01.gif"))