Working with data 2017. Class 2

Contact

Javier Garcia-Bernardo garcia@uva.nl

0. Structure

  1. Data types, structures and code II
  2. Tidy data: Pivoting and melting dataframes
  3. Merging and concatenating dataframes
  4. Working with missing data
  5. My second plots

In [330]:
#Normal inputs
import pandas as pd
import numpy as np
import seaborn as sns
import pylab as plt
%matplotlib inline

from IPython.display import Image, display

#Make the notebook wider
from IPython.core.display import display, HTML 
display(HTML("<style>.container { width:90% !important; }</style>"))



In [328]:
#Create a toy dataframe
df = pd.DataFrame([[1,2,3],[1,2,3]],columns=["a","b","c"])
df


Out[328]:
a b c
0 1 2 3
1 1 2 3

Save CSV to disk (df.to_csv())

  • path_or_buf = filename
  • sep = separator
  • na_rep (default "") = how do you want to represent missing values
  • header = do you want to save the column names?
  • index (default True) = is there an index to save? usually False

In [263]:
#Saving the toy dataframe above
df.to_csv("data/toy_df.csv",sep="\t",index=False)
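
To check that the file looks right, we can read it straight back; a minimal sketch (it only assumes the cell above has just written data/toy_df.csv):

In [ ]:
#Read the file back; sep must match the separator used in to_csv
pd.read_csv("data/toy_df.csv", sep="\t")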

2. Tidy data

2.1 Observations and variables

  • A variable is any characteristic, number, or quantity that can be measured or counted: e.g. income, height, happiness

    • Numeric: Numbers

      • Continuous: Area, height, etc
      • Discrete: The rolls of a die, many answers in surveys, etc
    • Categorical: Categories

      • Ordinal: Can be ordered. E.g. "high", "middle" and "low" income countries
      • Nominal: Cannot be ordered. E.g. "blue", "red" and "yellow"
  • An observation is each of the values of a variable: e.g. 800, 1.93, 8, "middle"
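
A minimal sketch of how these types can be represented in pandas (the toy column names below are made up for illustration):

In [ ]:
#Toy dataframe with the four kinds of variables
types_df = pd.DataFrame({
    "height": [1.93, 1.70, 1.85],                        #numeric, continuous
    "die_roll": [4, 2, 6],                               #numeric, discrete
    "color": pd.Categorical(["blue", "red", "yellow"]),  #categorical, nominal (no order)
    "income_group": pd.Categorical(["low", "high", "middle"],
                                   categories=["low", "middle", "high"],
                                   ordered=True)         #categorical, ordinal (ordered)
})
types_df.dtypes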


In [264]:
Image(url="http://www.abs.gov.au/websitedbs/a3121120.nsf/4a256353001af3ed4b2562bb00121564/5869990518bbb7feca257b55002261e0/Body/3.109A!OpenElement&FieldElemFormat=gif")


Out[264]:

2.2 What is tidy data?

Based on: http://vita.had.co.nz/papers/tidy-data.pdf

Tidy data is a standard way of mapping the meaning of a dataset to its structure. [...] In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Advantages: Facilitate manipulation, visualisation and modelling


In [265]:
%%html
<h3>There are three variables in this data set. What are they? </h3> 
<table>
    <tr> <th></th> <th>2010</th> <th>2015</th> </tr>
    <tr> <td>Male</td> <td>0</td> <td>5</td> </tr>
    <tr> <td>Female</td> <td>1</td> <td>4</td> </tr>
</table>


There are three variables in this data set. What are they?

2010 2015
Male 0 5
Female 1 4

In [268]:
%%html
<h3>Tidy data</h3> 
<table>
    <tr> <th>year</th> <th>sex</th> <th>count</th> </tr>
    <tr> <td>2010</td> <td>female</td> <td>1</td> </tr>
    <tr> <td>2010</td> <td>male</td> <td>0</td> </tr>
    <tr> <td>2015</td> <td>female</td> <td>4</td> </tr>
    <tr> <td>2015</td> <td>male</td> <td>5</td> </tr>
</table>


Tidy data

year sex count
2010 female 1
2010 male 0
2015 female 4
2015 male 5
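
As a preview of the tools introduced in 2.3, a minimal sketch of how the untidy table above can be reshaped into the tidy one with pandas (the variable names are our own):

In [ ]:
#The untidy version: the column headers 2010 and 2015 are values of the variable "year"
untidy = pd.DataFrame({"sex": ["male", "female"], "2010": [0, 1], "2015": [5, 4]})
#Melt the year columns into rows to obtain one observation per row
pd.melt(untidy, id_vars=["sex"], value_vars=["2010", "2015"],
        var_name="year", value_name="count")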

2.3 Tidying messy datasets

Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable. While occasionally you do get a dataset that you can start analysing immediately, this is the exception, not the rule:

  • Column headers are values, not variable names --> Melt
  • Multiple variables are stored in one column --> Pivot
  • Variables are stored in both rows and columns --> Melt then pivot

In [269]:
#Example of melting (columns to rows)
print("Melting")
Image("figures/melt.png",width=400)


Melting
Out[269]:

In [270]:
#Example of pivoting (rows to columns)
print("Pivoting")
Image("figures/pivot.png",width=800)


Pivoting
Out[270]:

Data to study: variations of https://stats.oecd.org/Index.aspx?DataSetCode=CITIES (the original dataset has problem 2: multiple variables stored in one column)

2.4 Column headers are values, not variable names: MELT


In [331]:
#Data on population density by year
df = pd.read_csv("data/columns_headers.csv")
df.head()


Out[331]:
Unnamed: 0 METRO_ID Metropolitan areas 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
0 0 AT001 Vienna 267.27 269.90 272.56 275.26 277.99 280.75 283.54 286.38 289.24 292.15 295.09 298.07 301.08 304.14 307.23
1 1 AT002 Graz 179.92 181.59 183.28 185.01 186.76 188.55 190.36 192.20 194.07 195.97 197.91 199.87 201.87 203.90 205.96
2 2 AT003 Linz 165.29 165.94 166.60 167.27 167.94 168.62 169.31 170.00 170.70 171.41 172.12 172.84 173.57 174.31 175.05
3 3 AUS Australia 2.47 2.50 2.53 2.56 2.59 2.62 2.65 2.70 2.76 2.82 2.86 2.90 2.95 3.00 3.05
4 4 AUS01 Sydney 332.15 335.48 338.18 340.38 342.20 344.88 348.04 353.71 360.58 367.35 372.51 376.88 382.38 388.94 395.83

In [332]:
### Fix

#Variables present in columns already
variables_already_presents = ['METRO_ID', 'Metropolitan areas']

#Variables to combine in one
columns_combine = ['2000', '2001', '2002','2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011','2012', '2013', '2014']

#Fix
fixed = pd.melt(df,id_vars=variables_already_presents,
           value_vars=columns_combine,
           var_name="Year",
           value_name="POP_DENS")
fixed.head(10)


Out[332]:
METRO_ID Metropolitan areas Year POP_DENS
0 AT001 Vienna 2000 267.27
1 AT002 Graz 2000 179.92
2 AT003 Linz 2000 165.29
3 AUS Australia 2000 2.47
4 AUS01 Sydney 2000 332.15
5 AUS02 Melbourne 2000 342.35
6 AUS03 Brisbane 2000 107.48
7 AUS04 Perth 2000 226.80
8 AUS05 Adelaide 2000 351.60
9 AUS06 Gold Coast-Tweed Heads 2000 210.14

2.5 Multiple variables are stored in one column: PIVOT_TABLE


In [334]:
#Actual data from OECD
df = pd.read_csv("data/multiple_variables.csv")
df.head(2)


Out[334]:
Unnamed: 0 METRO_ID Metropolitan areas VAR Variables TIME Year Unit Code Unit PowerCode Code PowerCode Reference Period Code Reference Period Value Flag Codes Flags
0 0 AT001 Vienna POP_DENS Population density (persons per km2) 2000 2000 RATIO Ratio 0 Units NaN NaN 267.27 NaN NaN
1 1 AT001 Vienna POP_DENS Population density (persons per km2) 2001 2001 RATIO Ratio 0 Units NaN NaN 269.90 NaN NaN

In [340]:
### Fix

#Columns already present
variables_already_present = ["METRO_ID","Metropolitan areas","Year"]

#Column with the variables
column_to_split = ["VAR"]

#Column with the values
column_with_values = "Value"

#Fix
df_fixed = df.pivot_table(column_with_values,
               variables_already_present,
               column_to_split)

df_fixed.head()


Out[340]:
VAR GREEN_AREA_PC POP_DENS
METRO_ID Metropolitan areas Year
AT001 Vienna 2000 255.17 267.27
2001 252.68 269.90
2002 250.22 272.56
2003 247.77 275.26
2004 245.34 277.99

In [341]:
#use .reset_index() to convert index to columns
df_fixed.reset_index()


Out[341]:
VAR METRO_ID Metropolitan areas Year GREEN_AREA_PC POP_DENS
0 AT001 Vienna 2000 255.17 267.27
1 AT001 Vienna 2001 252.68 269.90
2 AT001 Vienna 2002 250.22 272.56
3 AT001 Vienna 2003 247.77 275.26
4 AT001 Vienna 2004 245.34 277.99
5 AT001 Vienna 2005 242.92 280.75
6 AT001 Vienna 2006 240.53 283.54
7 AT001 Vienna 2007 238.15 286.38
8 AT001 Vienna 2008 235.79 289.24
9 AT001 Vienna 2009 233.44 292.15
10 AT001 Vienna 2010 231.12 295.09
11 AT001 Vienna 2011 228.81 298.07
12 AT001 Vienna 2012 226.52 301.08
13 AT001 Vienna 2013 224.24 304.14
14 AT001 Vienna 2014 221.99 307.23
15 AT002 Graz 2000 997.39 179.92
16 AT002 Graz 2001 988.23 181.59
17 AT002 Graz 2002 979.10 183.28
18 AT002 Graz 2003 969.96 185.01
19 AT002 Graz 2004 960.85 186.76
20 AT002 Graz 2005 951.75 188.55
21 AT002 Graz 2006 942.72 190.36
22 AT002 Graz 2007 933.67 192.20
23 AT002 Graz 2008 924.67 194.07
24 AT002 Graz 2009 915.69 195.97
25 AT002 Graz 2010 906.73 197.91
26 AT002 Graz 2011 897.83 199.87
27 AT002 Graz 2012 888.95 201.87
28 AT002 Graz 2013 880.11 203.90
29 AT002 Graz 2014 871.29 205.96
... ... ... ... ... ...
4620 US261 Mcallen 2000 190.36 138.83
4621 US261 Mcallen 2001 184.59 143.17
4622 US261 Mcallen 2002 178.99 147.64
4623 US261 Mcallen 2003 173.57 152.26
4624 US261 Mcallen 2004 168.30 157.02
4625 US261 Mcallen 2005 163.20 161.93
4626 US261 Mcallen 2006 158.25 166.99
4627 US261 Mcallen 2007 153.45 172.21
4628 US261 Mcallen 2008 148.80 177.60
4629 US261 Mcallen 2009 144.29 183.15
4630 US261 Mcallen 2010 139.92 188.88
4631 US261 Mcallen 2011 135.67 194.78
4632 US261 Mcallen 2012 131.56 200.87
4633 US261 Mcallen 2013 127.57 207.15
4634 US261 Mcallen 2014 123.70 213.63
4635 USA United States 2000 NaN 30.80
4636 USA United States 2001 NaN 31.10
4637 USA United States 2002 NaN 31.39
4638 USA United States 2003 NaN 31.66
4639 USA United States 2004 NaN 31.96
4640 USA United States 2005 NaN 32.25
4641 USA United States 2006 NaN 32.57
4642 USA United States 2007 NaN 32.88
4643 USA United States 2008 NaN 33.19
4644 USA United States 2009 NaN 33.48
4645 USA United States 2010 NaN 33.76
4646 USA United States 2011 NaN 34.02
4647 USA United States 2012 NaN 34.28
4648 USA United States 2013 NaN 34.54
4649 USA United States 2014 NaN 34.80

4650 rows × 5 columns
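
One detail worth knowing about pivot_table: when the same (METRO_ID, Metropolitan areas, Year, VAR) combination appears more than once, the values are aggregated, and the default aggregation is the mean. A minimal sketch with toy data making that explicit:

In [ ]:
#Toy example: city A has two rows for the same variable, so they are averaged
toy = pd.DataFrame({"city": ["A", "A", "B"],
                    "VAR": ["POP_DENS", "POP_DENS", "POP_DENS"],
                    "Value": [10, 20, 5]})
toy.pivot_table("Value", "city", "VAR", aggfunc="mean")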

2.6 Variables are stored in both rows and columns: MELT then PIVOT



In [297]:
df = pd.read_csv("data/multiple_variables_and_headers.csv")
df.head()


Out[297]:
Unnamed: 0 METRO_ID Metropolitan areas VAR 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
0 0 AT001 Vienna GREEN_AREA_PC 255.17 252.68 250.22 247.77 245.34 242.92 240.53 238.15 235.79 233.44 231.12 228.81 226.52 224.24 221.99
1 1 AT001 Vienna POP_DENS 267.27 269.90 272.56 275.26 277.99 280.75 283.54 286.38 289.24 292.15 295.09 298.07 301.08 304.14 307.23
2 2 AT002 Graz GREEN_AREA_PC 997.39 988.23 979.10 969.96 960.85 951.75 942.72 933.67 924.67 915.69 906.73 897.83 888.95 880.11 871.29
3 3 AT002 Graz POP_DENS 179.92 181.59 183.28 185.01 186.76 188.55 190.36 192.20 194.07 195.97 197.91 199.87 201.87 203.90 205.96
4 4 AT003 Linz GREEN_AREA_PC 1791.92 1784.87 1777.80 1770.71 1763.59 1756.49 1749.38 1742.23 1735.10 1727.91 1720.77 1713.59 1706.40 1699.20 1691.99

In [298]:
print(list(df.columns)[4:])


['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014']

In [299]:
cols = []
for column in df.columns:
    #keep the column if its name can be converted to an integer (i.e. it is a year)
    try: 
        int(column)
        cols.append(column)
    except ValueError:
        pass
print(cols)


['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014']
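
An equivalent one-liner (assuming the year columns are the only headers made up entirely of digits) uses a list comprehension with str.isdigit():

In [ ]:
#Keep only the columns whose names consist of digits (the years)
cols = [column for column in df.columns if column.isdigit()]
print(cols)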

In [300]:
### Fix step 1
variables_already_presents = ['METRO_ID', 'Metropolitan areas','VAR']
columns_combine = cols
df = pd.melt(df,
           id_vars=variables_already_presents,
           value_vars=columns_combine,
           var_name="Year",
           value_name="Value")
df.head()


Out[300]:
METRO_ID Metropolitan areas VAR Year Value
0 AT001 Vienna GREEN_AREA_PC 2000 255.17
1 AT001 Vienna POP_DENS 2000 267.27
2 AT002 Graz GREEN_AREA_PC 2000 997.39
3 AT002 Graz POP_DENS 2000 179.92
4 AT003 Linz GREEN_AREA_PC 2000 1791.92

In [301]:
### Fix step 2
column_with_values = "Value"
column_to_split = ["VAR"]
variables_already_present = ["METRO_ID","Metropolitan areas","Year"]
df.pivot_table(column_with_values,
               variables_already_present,
               column_to_split).reset_index().head()


Out[301]:
VAR METRO_ID Metropolitan areas Year GREEN_AREA_PC POP_DENS
0 AT001 Vienna 2000 255.17 267.27
1 AT001 Vienna 2001 252.68 269.90
2 AT001 Vienna 2002 250.22 272.56
3 AT001 Vienna 2003 247.77 275.26
4 AT001 Vienna 2004 245.34 277.99

3. Merging and concatenating datasets

Concatenate:

  • Sometimes you obtain more data (with the same variables or different ones)

Merging:

  • Often you have several datasets that you have to combine according to a shared variable (a key)

3.1 Concatenate files (add more observations)


In [302]:
#Add more observations
Image(url="http://pandas.pydata.org/pandas-docs/stable/_images/merging_concat_basic.png")


Out[302]:

In [304]:
madrid = pd.read_csv("data/mad_twovariables.csv").head(2)
amsterdam = pd.read_csv("data/ams_twovariables.csv").head(2)
display(madrid)
display(amsterdam)


METRO_ID Metropolitan areas Year GREEN_AREA_PC POP_DENS
0 ES001 Madrid 2000 31.5 471.88
1 ES001 Madrid 2001 31.0 479.60
METRO_ID Metropolitan areas Year GREEN_AREA_PC POP_DENS
0 NL002 Amsterdam 2000 235.63 764.71
1 NL002 Amsterdam 2001 233.57 771.45

In [305]:
combined = pd.concat([madrid,amsterdam])
combined


Out[305]:
METRO_ID Metropolitan areas Year GREEN_AREA_PC POP_DENS
0 ES001 Madrid 2000 31.50 471.88
1 ES001 Madrid 2001 31.00 479.60
0 NL002 Amsterdam 2000 235.63 764.71
1 NL002 Amsterdam 2001 233.57 771.45

Note that it's not a problem if the variables are not all the same

But in that case you'll have missing values


In [306]:
a = pd.DataFrame([[1,2],[3,4]],columns=["a","b"])
b = pd.DataFrame([[1,2],[3,4]],columns=["c","b"])
display(a)
display(b)


a b
0 1 2
1 3 4
c b
0 1 2
1 3 4

In [307]:
pd.concat([a,b])


Out[307]:
a b c
0 1.0 2 NaN
1 3.0 4 NaN
0 NaN 2 1.0
1 NaN 4 3.0

3.2 Merge datasets


In [308]:
Image(url="http://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key.png",width=700)


Out[308]:

In [309]:
ams_green = pd.read_csv("data/ams_green.csv")
ams_pop = pd.read_csv("data/ams_popdens.csv")
display(ams_green.head(2))
display(ams_pop.head(2))


METRO_ID Metropolitan areas Year GREEN_AREA_PC
0 NL002 Amsterdam 2000 235.63
1 NL002 Amsterdam 2001 233.57
METRO_ID Metropolitan areas Year POP_DENS
0 NL002 Amsterdam 2000 764.71
1 NL002 Amsterdam 2001 771.45

In [343]:
combined = pd.merge(ams_green,ams_pop,
         left_on=["METRO_ID","Metropolitan areas","Year"],
         right_on=["METRO_ID","Metropolitan areas","Year"],
         how="inner")
combined.head()


Out[343]:
METRO_ID Metropolitan areas Year GREEN_AREA_PC POP_DENS
0 NL002 Amsterdam 2000 235.63 764.71
1 NL002 Amsterdam 2001 233.57 771.45
2 NL002 Amsterdam 2002 231.51 778.30
3 NL002 Amsterdam 2003 229.46 785.26
4 NL002 Amsterdam 2004 227.41 792.34

In [311]:
print("Inner join pd.merge(left_df,right_df,how=\"inner\")")
display(Image(url="http://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_inner.png"))
print("Right join pd.merge(left_df,right_df,how=\"right\")", "The same with left")
display(Image(url="http://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_right.png"))
print("Outer join pd.merge(left_df,right_df,how=\"outer\")")
display(Image(url="http://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_outer.png"))


Inner join pd.merge(left_df,right_df,how="inner")
Right join pd.merge(left_df,right_df,how="right") The same with left
Outer join pd.merge(left_df,right_df,how="outer")
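
A minimal sketch of how the how= argument changes the result, on two toy tables (our own example, not the OECD data):

In [ ]:
#Two toy tables that only share the key "K1"
left = pd.DataFrame({"key": ["K0", "K1"], "x": [1, 2]})
right = pd.DataFrame({"key": ["K1", "K2"], "y": [3, 4]})
#Inner join keeps only the shared keys; outer keeps all keys and fills the gaps with NaN
display(pd.merge(left, right, on="key", how="inner"))
display(pd.merge(left, right, on="key", how="outer"))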

4. Changing columns, replacing values and applying functions

  • Sometimes we want to replace categories with numbers

  • This is especially important with text (we'll talk more about text later in the course)
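
A minimal sketch of one way to do this, using Series.replace() with a dictionary (an alternative to the .apply() approach developed in 4.2 below):

In [ ]:
#Toy data: income is stored as a category and we want a number instead
toy = pd.DataFrame({"income": ["<40", "40-100", ">100", "40-100"]})
#Replace each category by a representative number
toy["income_number"] = toy["income"].replace({"<40": 40, "40-100": 70, ">100": 100})
toy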


In [350]:
def mean_(some_numbers):
    #A tiny example of writing our own function: it simply wraps np.mean
    return np.mean(some_numbers)

In [351]:
a = mean_([1,2,3])
print(a)


2.0

In [ ]:
np.mean([1,2,3])


In [357]:
def read_our_csv():
    #reading the raw data from oecd
    df = pd.read_csv("data/CITIES_19122016195113034.csv",sep="\t")

    #fixing the first column name (in the raw file it contains extra quotation marks around METRO_ID)
    cols = list(df.columns)
    cols[0] = "METRO_ID"
    df.columns = cols
    
    #pivot the table
    column_with_values = "Value"
    column_to_split = ["VAR"]
    variables_already_present = ["METRO_ID","Metropolitan areas","Year"]
    df_fixed = df.pivot_table(column_with_values,
                 variables_already_present,
                 column_to_split).reset_index()
    
    return df_fixed

In [352]:
df_fixed = read_our_csv()
df_fixed.head()


Out[352]:
VAR METRO_ID Metropolitan areas Year CO2_PC ENTROPY_1000M EQU_HOU_DISP_INC GDP_PC GINI_INC GREEN_AREA_PC LABOUR_PRODUCTIVITY PCT_INTENSITY POP_DENS SPRAWL UNEMP_R
0 AT001 Vienna 2000 12.77 NaN NaN 45434.48 NaN 255.17 94254.60 0.90 267.27 NaN 4.77
1 AT001 Vienna 2001 NaN NaN NaN 45479.15 NaN 252.68 94805.82 0.97 269.90 NaN 4.83
2 AT001 Vienna 2002 NaN NaN NaN 46065.50 NaN 250.22 100463.10 1.06 272.56 NaN 5.82
3 AT001 Vienna 2003 NaN NaN NaN 45470.92 NaN 247.77 97851.61 1.15 275.26 NaN 6.24
4 AT001 Vienna 2004 NaN NaN NaN 46148.38 NaN 245.34 102259.13 1.34 277.99 NaN 7.20

4.1 Multiply columns


In [316]:
#Read and fix the data
df_fixed = read_our_csv()
df_fixed = df_fixed.loc[df_fixed["Year"]==2000,["GREEN_AREA_PC","POP_DENS"]]    
df_fixed.head()


Out[316]:
VAR GREEN_AREA_PC POP_DENS
0 255.17 267.27
15 997.39 179.92
30 1791.92 165.29
45 NaN 2.47
60 55.38 332.15

In [317]:
plt.figure(figsize=(6,4))
plt.scatter(df_fixed["GREEN_AREA_PC"],df_fixed["POP_DENS"],edgecolor="none",color=(70/255,140/255,210/255))
plt.yscale("log")
plt.xscale("log")
plt.ylim(10,10000)
plt.xlim(10,10000)
plt.xlabel("Green area (m^2/million people)")
plt.ylabel("Density population (people/km^s)")


Out[317]:
<matplotlib.text.Text at 0x7f62440ff5f8>

Is there a relationship between green area per capita and population density? Why?


Imagine we want to calculate the fraction of green area

$$\frac{m^2_{green}}{person} \cdot \frac{persons}{km^2_{city}} \cdot \frac{1\ km^2_{green}}{10^{6}\ m^2_{green}} = \frac{km^2_{green}}{km^2_{city}}$$

In [318]:
#fraction of green area
df_fixed["fraction_green"] = df_fixed["GREEN_AREA_PC"]*df_fixed["POP_DENS"]/1E6

In [319]:
#make the figure
plt.figure(figsize=(6,4))
plt.scatter(df_fixed["fraction_green"],df_fixed["POP_DENS"],edgecolor="none",color=(70/255,140/255,210/255))
plt.yscale("log")
plt.xscale("log")
plt.ylim(10,10000)
plt.xlim(0.0001,1)
plt.xlabel("Percentage green (m^2 green / km^2 land)")
plt.ylabel("Density population (population/km^2)")


Out[319]:
<matplotlib.text.Text at 0x7f6237eb3e80>

4.2 Categories to numbers


In [358]:
#Imagine income is recorded as a category and we want to convert it to a number
df = pd.DataFrame({"kids": [2,3,3,1,3,1,4,4,2], "income": [">100",">100","<40","<40","40-100","<40","40-100",">100","40-100"]})
df


Out[358]:
income kids
0 >100 2
1 >100 3
2 <40 3
3 <40 1
4 40-100 3
5 <40 1
6 40-100 4
7 >100 4
8 40-100 2

In [359]:
d_cat2income = {">100": 100, 
                "40-100": 70, 
                "<40": 40}
d_cat2income


Out[359]:
{'40-100': 70, '<40': 40, '>100': 100}

In [376]:
#.get() returns None (instead of raising an error) when the key is not in the dictionary
d_cat2income.get(">90")


In [374]:
df["income_number"] = df["income"].apply(lambda x: 
                                         x[0])

In [375]:
df.head()


Out[375]:
income kids income_number
0 >100 2 >
1 >100 3 >
2 <40 3 <
3 <40 1 <
4 40-100 3 4

In [321]:
#First we need a dictionary
d_cat2income = {">100": 100, "40-100": 70, "<40": 40}

In [322]:
def category2income(value):
    return d_cat2income.get(value)

df = pd.DataFrame({"kids": [2,3,3,1,3,1,4,4,2], "income": [">100",">100","40-100","<40","40-100","<40","40-100",">100","40-100"]})
df["income"] = df["income"].apply(category2income)
df.head()


Out[322]:
income kids
0 100 2
1 100 3
2 70 3
3 40 1
4 70 3

In [323]:
#Or without creating a new function
df = pd.DataFrame({"kids": [2,3,3,1,3,1,4,4,2], "income": [">100",">100","40-100","<40","40-100","<40","40-100",">100","40-100"]})
df["income"] = df["income"].apply(lambda x: d_cat2income.get(x))
df.head()


Out[323]:
income kids
0 100 2
1 100 3
2 70 3
3 40 1
4 70 3

4.3 Keeping part of the values in a column (text)

We'll talk more about this when we talk about text analysis


In [324]:
df_fixed = read_our_csv().loc[:,["METRO_ID","Year"]]
df_fixed.head()


Out[324]:
VAR METRO_ID Year
0 AT001 2000
1 AT001 2001
2 AT001 2002
3 AT001 2003
4 AT001 2004

In [325]:
def keep_2_letters(value):
    return value[:2]

df_fixed["Country"] = df_fixed["METRO_ID"].apply(keep_2_letters)
df_fixed.head()


Out[325]:
VAR METRO_ID Year Country
0 AT001 2000 AT
1 AT001 2001 AT
2 AT001 2002 AT
3 AT001 2003 AT
4 AT001 2004 AT

In [326]:
#or in one line
df_fixed["Country"] = df_fixed["METRO_ID"].apply(lambda x: x[:2])
df_fixed.head()


Out[326]:
VAR METRO_ID Year Country
0 AT001 2000 AT
1 AT001 2001 AT
2 AT001 2002 AT
3 AT001 2003 AT
4 AT001 2004 AT

4.4 Doing math with a column


In [327]:
#For example, taking the log of a column. This is useful when the distribution is lognormal (such as income or GDP).
#A possibly confusing point: the distribution of the income of many people IS LOGNORMAL,
#but the distribution of the mean income across cities IS NOT necessarily LOGNORMAL.
df_fixed = read_our_csv().loc[:,["METRO_ID","Year","GDP_PC"]]
df_fixed["GDP_PC_log"] = np.log10(df_fixed["GDP_PC"])
df_fixed.head()


Out[327]:
VAR METRO_ID Year GDP_PC GDP_PC_log
0 AT001 2000 45434.48 4.657386
1 AT001 2001 45479.15 4.657812
2 AT001 2002 46065.50 4.663376
3 AT001 2003 45470.92 4.657734
4 AT001 2004 46148.38 4.664156

5. Dealing with missing data

Let's imagine we have data from a survey, with age, income, education, political ideas, location, and whether the respondent will vote for Trump or Hillary.

We may have some missing values. These missing values can be:

  • MCAR (missing completely at random), which means that we still have a representative sample.
    • This could happen, for example, if there were IT problems during the survey collection.
    • It is a strong assumption, but it is usually made.
    • However, imagine that young Trump voters are less likely to answer --> then your data is MAR.
    • Usually in surveys you make sure the ages and locations of the people you ask correspond to the real population. But you may be missing an important variable (for example, US pollsters did not sample representatively in terms of education).
    • We can either ignore the rows or impute the values based on the people who did answer.
  • MAR (missing at random), which means that we don't have a representative sample, but we can use another column to impute missing values.
    • We can correct the data by using other people who did answer. For instance, two people living in the same area, with the same age, income, education and political ideas are likely to vote similarly, so if you only know how one of them intends to vote you can assume that the other one will vote the same (there are methods for this, don't do it by hand!).
  • MNAR (missing not at random), which means that we don't have a representative sample, and imputation is very, very hard. This can happen, for example, if Trump voters are less likely to open the door: then they are not even in your sample.
    • We are in trouble and the methods to correct for this are way beyond the scope of the class.

What are the strategies to correct for missing values?

  • 5.1 Ignore those values (only if your data is MCAR)
  • Impute those values (always better but more complicated)
    • 5.2 Use the mean/median/mode as the value (only works well for MCAR)
    • 5.3 Use similar values -> fancyimpute (another time)

In [ ]:
#Read and fix the data
df_fixed = read_our_csv()

#Remove rows with missing values
cols = ["LABOUR_PRODUCTIVITY","UNEMP_R","GDP_PC"]
df_fixed = df_fixed.dropna(subset=cols)

#Creating a column for country
df_fixed["C"] = df_fixed["METRO_ID"].apply(lambda x: x[:2])

#Keep only Italy
df_fixed = df_fixed.loc[df_fixed["C"]=="IT",["C","METRO_ID","Metropolitan areas"] +cols]

#We are going to normalize values dividing by the mean (so new values have a mean of 1)
df_fixed.loc[:,cols] = df_fixed[cols]/np.nanmean(df_fixed[cols],0)

#Make a copy of the data
df_original = df_fixed.copy()

#Take a random sample of 20 values of productivity
sample = set(df_fixed.loc[:,"LABOUR_PRODUCTIVITY"].sample(20))

#Deleting those values (saying that they are np.NaN (missing))
df_fixed.loc[df_fixed["LABOUR_PRODUCTIVITY"].isin(sample),"LABOUR_PRODUCTIVITY"] = np.NaN
df_fixed.head()

5.1 Ignoring data


In [ ]:
#How to fix by ignoring the rows
ignoring = df_fixed.dropna(subset=["LABOUR_PRODUCTIVITY"])

5.2 Imputing with the mean/median/mode


In [ ]:
#How to fix by imputing with the mean/median/mode
mean_imputed = df_fixed.fillna(df_fixed.mean())

5.3 Imputing using neighbors

  • We may go over this another day, it's not an easy topic.
  • But the basic idea is that you are probably similar to your neighbors (variables are correlated)
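
A minimal sketch of the neighbour idea using scikit-learn's KNNImputer (one option among several; the course also mentions fancyimpute, and this is not necessarily how the figures below were produced). It assumes a reasonably recent scikit-learn:

In [ ]:
#Minimal sketch of neighbour-based imputation (assumes scikit-learn >= 0.22 is installed)
from sklearn.impute import KNNImputer

toy = pd.DataFrame({"UNEMP_R":             [5.0, 6.0, 5.5, 20.0],
                    "LABOUR_PRODUCTIVITY": [1.1, 1.0, np.nan, 0.6]})
#Each missing value is filled with the mean of its 2 nearest rows (judged by the other columns)
imputer = KNNImputer(n_neighbors=2)
pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)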

In [ ]:
#The idea: points with similar unemployment tend to have similar productivity (the variables are correlated)
sns.lmplot(x="LABOUR_PRODUCTIVITY",y="UNEMP_R",data=df_fixed,fit_reg=False)

In [ ]:
Image("figures/labels.png")

In [ ]:
print("Using a random sample => MCAR DATA")
Image("figures/kmeans_vs_mean.png")

In [ ]:
print("Using a biasad sample => MAR DATA")
Image("figures/kmeans_vs_mean_worst_case.png")

But first of all we need to make sure that missing values are actually read as missing, and not as "-9" or other placeholder strings

  • The way to do it is to explore the file first and then pass those codes to read_csv:

pd.read_csv(filename, ..., na_values=[list of the values that represent missing data])
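
For example, a minimal runnable sketch (the tiny CSV is made up; in practice you would pass the codes listed in your codebook):

In [ ]:
#Hypothetical example: in this tiny CSV the value -9 codes a missing answer
from io import StringIO
raw = StringIO("age,income\n34,-9\n51,60000")
pd.read_csv(raw, na_values=["-9"])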

5. Dealing with outliers

In short:

  • If the outlier is obviously wrong, you should treat it as missing data. For instance, a recorded value of 19 on a 0-10 scale (it was probably 1 or 9, but we don't know which).
  • If the outlier is *not obviously wrong*, you should keep it. For instance, if most people say 3 on a 0-10 scale but a couple say 9, you need to keep those values. You can use robust statistics, which are less affected by outliers (see the example below).

More on this another time, but read this: http://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/
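
A quick numeric illustration of why robust statistics help: with a couple of outliers, the mean shifts noticeably while the median barely moves (toy numbers):

In [ ]:
#Nine answers around 3 on a 0-10 scale, plus two genuine outliers at 9
answers = [3, 3, 2, 4, 3, 3, 2, 4, 3, 9, 9]
print("mean:  ", np.mean(answers))   #pulled up by the outliers
print("median:", np.median(answers)) #barely affected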

6. Plotting

  • Another scatter plot
  • Histogram, violinplot and jointplot
  • Parallel coordinates
  • Factorplot

In [ ]:
#Read and fix the data. also drop missing values
df = read_our_csv().loc[:,["METRO_ID","Year","Metropolitan areas","LABOUR_PRODUCTIVITY","UNEMP_R"]]
#Drop rows with missing values
df = df.dropna()
#Keep years 2000 and 2012
df = df.loc[df["Year"].isin([2000,2012])]
#Add a column for country
df["C"] = df["METRO_ID"].apply(lambda x: x[:2] if isinstance(x,str) else "")
df.head()

6.0 SCATTER using sns.lmplot

Used to plot the relationship between two quantitative variables (and one qualitative variable using color)

  • x = x axis
  • y = y axis
  • hue = qualitative variable used to separate by color
  • data = dataframe
  • fit_reg = True to fit regression lines
  • col = separate by another qualitative variable
  • row = separate by another qualitative variable
  • size = size of the figure
  • aspect = aspect ratio (1 = square)
  • sharex = share x axis
  • sharey = share y axis

In [ ]:
sns.lmplot?

In [ ]:
keep_countries = ['IT', 'NL',  'DE',  'ES', 'GR']

In [ ]:
sns.lmplot(x="LABOUR_PRODUCTIVITY",y="UNEMP_R",hue="C",col="Year"
           ,data=df.loc[df["C"].isin(keep_countries)],fit_reg=False,
          size=3,aspect=1.4,sharex=True,sharey=True)
plt.show()

6.1a HISTOGRAM: sns.distplot

Creates a histogram. Important to see the distribution of your data, to find outliers, etc.

  • x: required, the data
  • rug (default False): plot small ticks indicating where the values are
  • kde (default True): use a kernel density estimate to plot the shape of the distribution
  • hist_kws={"edgecolor":"none"}: options passed to the histogram
  • kde_kws={"cut":0}: options passed to the kernel density estimate
  • fit_kws={"cut":0,"color":"gray"}: options passed to the fitted distribution

What's a histogram?

  • A representation of the distribution of a variable
  • For example, if you roll two dice you will get a lot of 7s, many 6s and 8s, some 5s and 9s, a few 4s and 10s, very few 3s and 11s, and almost no 2s and 12s.

In [ ]:
#Roll two dice 1000 times
dice_rolls = np.random.randint(1,7,1000) + np.random.randint(1,7,1000)
dice_rolls

In [ ]:
#Count the number of each element to create the distribution
from collections import Counter
Counter(dice_rolls)

In [ ]:
#And we can visualize it with a histogram
sns.distplot(dice_rolls, kde=False, rug=False)

A more useful example using OECD data


In [ ]:
#Read and fix the data. also drop missing values
df = read_our_csv().loc[:,["METRO_ID","Year","Metropolitan areas","LABOUR_PRODUCTIVITY","UNEMP_R"]]
#Drop rows with missing values
df = df.dropna()
#Keep years 2000 and 2012
df = df.loc[df["Year"].isin([2000,2012])]
#Add a column for country
df["C"] = df["METRO_ID"].apply(lambda x: x[:2] if isinstance(x,str) else "")
df.head()

In [ ]:
#Keep x and y values
x = df["LABOUR_PRODUCTIVITY"]

#Make plot
plt.figure(figsize=(6,4))
sns.distplot(x, kde=True, rug=False,hist_kws={"edgecolor":"none"},kde_kws={"cut":0})
plt.ylabel("Frequency")
plt.xlabel("Labour productivity")

In [ ]:
#fitting a distribution
from scipy.stats import norm,lognorm,expon
#Keep x and y values
x = df["LABOUR_PRODUCTIVITY"]

#Make plot
plt.figure(figsize=(6,4))
sns.distplot(x, kde=False, fit=norm, rug=False, 
             hist_kws={"edgecolor":"none","normed":True},fit_kws={"cut":0,"color":"gray"})
plt.ylabel("Frequency")
plt.xlabel("Labour productivity")

6.1b TWO HISTOGRAMS: sns.jointplot

Creates two histograms, one plotted against the other. Important to see the distribution of your data, to find outliers, etc. Also useful when plotting events on a map.

  • x: required, the data
  • y: required, the other data
  • kind (default "scatter"): kind of plot to draw, scatter and hex are nice
  • marginal_kws=dict(bins=20, rug=True, kde=True) -> to modify the marginal (outside) histograms, e.g. the number of bins
  • gridsize = 10 -> to modify the number of hexagonal bins inside (when kind="hex")

http://seaborn.pydata.org/generated/seaborn.jointplot.html


In [ ]:
sns.jointplot?

In [ ]:
#Make plot
sns.jointplot(x="LABOUR_PRODUCTIVITY", y="UNEMP_R", data=df,size=6)
plt.ylabel("Frequency")
plt.xlabel("Labour productivity")
plt.show()

In [ ]:
sns.jointplot(x="LABOUR_PRODUCTIVITY", y="UNEMP_R", data=df,kind="hex",
              marginal_kws=dict(bins=20, rug=False, kde=True, kde_kws={"cut":0}), gridsize = 15,size=6)

6.1c sns.violinplot

Plots several distributions side by side

  • x = x scale
  • y = variable to measure
  • hue (optional) = divide violin in two parts
  • data = data
  • width = size of violin
  • inner = show quartiles or boxplot or all data

http://seaborn.pydata.org/generated/seaborn.violinplot.html


In [ ]:
cols_get = ["METRO_ID","Metropolitan areas","Year","UNEMP_R",'GDP_PC',"LABOUR_PRODUCTIVITY"]
df_fixed = read_our_csv().loc[:,cols_get]
df_fixed["C"] = df_fixed["METRO_ID"].apply(lambda x: x[:2] if isinstance(x,str) else "")
south = df_fixed.loc[df_fixed["C"].isin(["IT","ES","GR","DE"])]
south.head()


plt.figure(figsize=(6,4))
sns.violinplot(x="C", y="LABOUR_PRODUCTIVITY", data=south,width=1,inner="quartiles")

6.2 parallel coordinates

To plot how several quantitative variables relate to each other.

  • frame: the data
  • class_column: column containing the labels (for different colors)
  • cols: The list of columns to use
  • colormap: colormap to use

In [ ]:
#in newer versions of pandas this function lives in pandas.plotting instead of pandas.tools.plotting
from pandas.tools.plotting import parallel_coordinates

In [ ]:
cols_get = ["METRO_ID","Metropolitan areas","Year","UNEMP_R",'GDP_PC',"LABOUR_PRODUCTIVITY"]
df_fixed = read_our_csv().loc[:,cols_get]
df_fixed["C"] = df_fixed["METRO_ID"].apply(lambda x: x[:2] if isinstance(x,str) else "")
italy = df_fixed.loc[df_fixed["C"].isin(["IT"])]
italy = italy.loc[italy["Year"].isin([2001])]
italy.head()

We are going to normalize values using standard scores

https://en.wikipedia.org/wiki/Standard_score

  • A series of values has a mean m and a standard deviation d
  • If you subtract m from every value and divide by d ($\frac{x - mean(x)}{std(x)}$), the new values have a mean of 0 and a standard deviation of 1

In [ ]:
#Normalize the variables so they are comparable (mean 0, standard deviation 1)
cols = ["UNEMP_R",'GDP_PC',"LABOUR_PRODUCTIVITY"]
for c in cols: 
    italy[c] = (italy[c]- np.mean(italy[c]))/np.std(italy[c]) 
    
parallel_coordinates(italy.loc[(italy["Metropolitan areas"] !="Italy")],"Year",cols=cols,colormap="Paired")

In [ ]:
d_city2regio = {"Bari": "S",
"Bologna": "C",
"Catania": "S",
"Florence": "C",
"Genova": "N",
"Milan": "N",
"Naples": "S",
"Palermo": "S",
"Rome": "C",
"Turin": "N",
"Venice": "N"}
italy["Area"] = italy["Metropolitan areas"].apply(lambda x: d_city2regio.get(x))

parallel_coordinates(italy.loc[(italy["Metropolitan areas"] !="Italy")],"Area",cols=cols,colormap="Paired")

6.3 Factorplot

To visualize interactions between variables, as well as many distributions: http://seaborn.pydata.org/generated/seaborn.factorplot.html

  • x = x axis
  • y = y axis
  • data = dataframe
  • hue = divide into several lines
  • size = size of figure
  • aspect = aspect of figure (1 = square)
  • ci = size of confidence intervals (0 for no confidence intervals)

  • See the documentation for examples


In [ ]:
cols_get = ["METRO_ID","Metropolitan areas","Year","UNEMP_R",'GDP_PC',"LABOUR_PRODUCTIVITY"]
df_fixed = read_our_csv().loc[:,cols_get]
df_fixed["C"] = df_fixed["METRO_ID"].apply(lambda x: x[:2] if isinstance(x,str) else "")
south = df_fixed.loc[df_fixed["C"].isin(["IT","ES","GR","DE"])]
south.head()

In [ ]:
sns.factorplot(x="Year", y="UNEMP_R", hue="C", size=4,aspect=1.8,data=south,ci=95)

In [ ]:
sns.factorplot(x="Year", y="UNEMP_R", hue="C", size=4,aspect=3,
               data=south.loc[south["Year"].isin([2000,2010])],kind="violin",width=0.9,inner="quartiles")

Plan for next class

  • Learn how to group our dataframe by a variable without iterating.
  • Learn about data visualization theory (which plot works for what kind of data, what to do, what not to do)
  • Learn Inkscape
  • Learn some statistics