Data Preprocessing with Pandas

Import Modules



In [18]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Import data

Data is generated from Australia Bureau of statistics, some cells are removed (set to NaN) manually in order to serve this notebook.



In [19]:

    
df = pd.read_csv('./asset/Median Price of Established House Transfers.txt', sep='\t') # row 3 has a null value
df.head()









    Out[19]:






  
    
      
      City
      Quarter
      Price
    
  
  
    
      0
      Sydney
      Q1
      700.0
    
    
      1
      Rest of NSW
      Q1
      370.0
    
    
      2
      Melbourne
      Q1
      525.0
    
    
      3
      Rest of Vic.
      Q1
      NaN
    
    
      4
      Brisbane
      Q1
      460.0

Finding and Handling Missing Data



In [20]:

    
# find rows that Price is null
df[pd.isnull(df['Price'])]









    Out[20]:






  
    
      
      City
      Quarter
      Price
    
  
  
    
      3
      Rest of Vic.
      Q1
      NaN
    
    
      13
      Rest of NT
      Q1
      NaN
    
    
      21
      Adelaide
      Q2
      NaN
    
    
      27
      Darwin
      Q2
      NaN
    
    
      40
      Hobart
      Q3
      NaN
    
    
      49
      Brisbane
      Q4
      NaN



In [21]:

    
index_with_null = df[pd.isnull(df['Price'])].index
index_with_null









    Out[21]:





Int64Index([3, 13, 21, 27, 40, 49], dtype='int64')

We can specify a value (e.g., 0) to replace those null values, through the fillna() method



In [22]:

    
df2 = df.fillna(0) # price value of row 3 is set to 0.0
df2.ix[index_with_null]
# df2.index in index_with_null









    Out[22]:






  
    
      
      City
      Quarter
      Price
    
  
  
    
      3
      Rest of Vic.
      Q1
      0.0
    
    
      13
      Rest of NT
      Q1
      0.0
    
    
      21
      Adelaide
      Q2
      0.0
    
    
      27
      Darwin
      Q2
      0.0
    
    
      40
      Hobart
      Q3
      0.0
    
    
      49
      Brisbane
      Q4
      0.0

We can also propagate non-null values forward or backward



In [23]:

    
df2 = df.fillna(method='pad', axis=0)
df2.head() # The price of row 3 is the same as that of row 2









    Out[23]:






  
    
      
      City
      Quarter
      Price
    
  
  
    
      0
      Sydney
      Q1
      700.0
    
    
      1
      Rest of NSW
      Q1
      370.0
    
    
      2
      Melbourne
      Q1
      525.0
    
    
      3
      Rest of Vic.
      Q1
      525.0
    
    
      4
      Brisbane
      Q1
      460.0

We can even drop the rows (or columns) with null values



In [24]:

    
df2 = df.dropna(axis=0) # if axis = 1 then the column will be dropped
df2.head() # Note that row 3 is dropped









    Out[24]:






  
    
      
      City
      Quarter
      Price
    
  
  
    
      0
      Sydney
      Q1
      700.0
    
    
      1
      Rest of NSW
      Q1
      370.0
    
    
      2
      Melbourne
      Q1
      525.0
    
    
      4
      Brisbane
      Q1
      460.0
    
    
      5
      Rest of Qld.
      Q1
      400.0

Obviously, none of the above solutions are appropriate.

A better way to deal with the null value is to replace them with the mean value of the prices of the corresponding city over the whole year.



In [25]:

    
df["Price"] = df.groupby("City").transform(lambda x: x.fillna(x.mean()))
df.ix[index_with_null]









    Out[25]:






  
    
      
      City
      Quarter
      Price
    
  
  
    
      3
      Rest of Vic.
      Q1
      294.666667
    
    
      13
      Rest of NT
      Q1
      422.500000
    
    
      21
      Adelaide
      Q2
      420.433333
    
    
      27
      Darwin
      Q2
      579.166667
    
    
      40
      Hobart
      Q3
      359.000000
    
    
      49
      Brisbane
      Q4
      468.633333

Binning

Equal-width Partitioning

We use the table with all null values filled



In [26]:

    
pd.cut(df['Price'],5).head() # equally partition Price into 5 bins









    Out[26]:





0        (597, 716]
1        (359, 478]
2        (478, 597]
3    (239.405, 359]
4        (359, 478]
Name: Price, dtype: category
Categories (5, object): [(239.405, 359] < (359, 478] < (478, 597] < (597, 716] < (716, 835]]



In [27]:

    
# We could label the bins and add new column
df['Bin'] = pd.cut(df['Price'],5,labels=["Very Low","Low","Medium","High","Very High"])
df.head()









    Out[27]:






  
    
      
      City
      Quarter
      Price
      Bin
    
  
  
    
      0
      Sydney
      Q1
      700.000000
      High
    
    
      1
      Rest of NSW
      Q1
      370.000000
      Low
    
    
      2
      Melbourne
      Q1
      525.000000
      Medium
    
    
      3
      Rest of Vic.
      Q1
      294.666667
      Very Low
    
    
      4
      Brisbane
      Q1
      460.000000
      Low

Equal-depth Partitioining



In [28]:

    
pd.qcut(df['Price'],5).head() # Note the difference from the Equal-width Partitioning case









    Out[28]:





0      (555.1, 835]
1      (336.4, 395]
2    (440.5, 555.1]
3      [240, 336.4]
4    (440.5, 555.1]
Name: Price, dtype: category
Categories (5, object): [[240, 336.4] < (336.4, 395] < (395, 440.5] < (440.5, 555.1] < (555.1, 835]]



In [29]:

    
# Let's check the depth of each bin
df['Bin'] = pd.qcut(df['Price'],5,labels=["Very Low","Low","Medium","High","Very High"])
df.groupby('Bin').size()









    Out[29]:





Bin
Very Low     12
Low          12
Medium       12
High         12
Very High    12
dtype: int64

Smoothing

Smoothing by Bin Means



In [30]:

    
df.head()









    Out[30]:






  
    
      
      City
      Quarter
      Price
      Bin
    
  
  
    
      0
      Sydney
      Q1
      700.000000
      Very High
    
    
      1
      Rest of NSW
      Q1
      370.000000
      Low
    
    
      2
      Melbourne
      Q1
      525.000000
      High
    
    
      3
      Rest of Vic.
      Q1
      294.666667
      Very Low
    
    
      4
      Brisbane
      Q1
      460.000000
      High



In [31]:

    
df['Price-Smoothing-mean'] = df.groupby('Bin')['Price'].transform('mean')
df.head()









    Out[31]:






  
    
      
      City
      Quarter
      Price
      Bin
      Price-Smoothing-mean
    
  
  
    
      0
      Sydney
      Q1
      700.000000
      Very High
      634.763889
    
    
      1
      Rest of NSW
      Q1
      370.000000
      Low
      372.416667
    
    
      2
      Melbourne
      Q1
      525.000000
      High
      519.961111
    
    
      3
      Rest of Vic.
      Q1
      294.666667
      Very Low
      270.805556
    
    
      4
      Brisbane
      Q1
      460.000000
      High
      519.961111

Smoothing by Bin Max



In [32]:

    
df['Price-Smoothing-max'] = df.groupby('Bin')['Price'].transform('max')
df.head()









    Out[32]:






  
    
      
      City
      Quarter
      Price
      Bin
      Price-Smoothing-mean
      Price-Smoothing-max
    
  
  
    
      0
      Sydney
      Q1
      700.000000
      Very High
      634.763889
      835.0
    
    
      1
      Rest of NSW
      Q1
      370.000000
      Low
      372.416667
      387.5
    
    
      2
      Melbourne
      Q1
      525.000000
      High
      519.961111
      555.0
    
    
      3
      Rest of Vic.
      Q1
      294.666667
      Very Low
      270.805556
      302.0
    
    
      4
      Brisbane
      Q1
      460.000000
      High
      519.961111
      555.0

Normalization



In [33]:

    
df = pd.read_csv('./asset/Median Price of Established House.txt', sep='\t')
df.head()









    Out[33]:






  
    
      
      City
      Q1
      Q2
      Q3
      Q4
    
  
  
    
      0
      Sydney
      700.0
      750.0
      750.0
      835.0
    
    
      1
      Rest of NSW
      370.0
      375.0
      373.0
      385.0
    
    
      2
      Melbourne
      525.0
      555.5
      540.0
      580.0
    
    
      3
      Rest of Vic.
      300.0
      295.0
      287.0
      302.0
    
    
      4
      Brisbane
      460.0
      474.9
      471.0
      485.0

Standard Scaler



In [34]:

    
from sklearn import preprocessing

min_max_scaler = preprocessing.StandardScaler()
x_scaled = min_max_scaler.fit_transform(df[df.columns[1:5]]) # we need to remove the first column


df_standard = pd.DataFrame(x_scaled)
df_standard.insert(0, 'City', df.City)
df_standard









    Out[34]:






  
    
      
      City
      0
      1
      2
      3
    
  
  
    
      0
      Sydney
      2.150188
      2.360452
      2.379522
      2.611997
    
    
      1
      Rest of NSW
      -0.544010
      -0.477782
      -0.506402
      -0.488501
    
    
      2
      Melbourne
      0.721446
      0.888355
      0.771978
      0.855048
    
    
      3
      Rest of Vic.
      -1.115507
      -1.083272
      -1.164729
      -1.060370
    
    
      4
      Brisbane
      0.190771
      0.278324
      0.243785
      0.200499
    
    
      5
      Rest of Qld.
      -0.299083
      -0.250723
      -0.299718
      -0.309361
    
    
      6
      Adelaide
      -0.176620
      -0.212880
      -0.155804
      -0.195676
    
    
      7
      Rest of SA
      -1.401255
      -1.340605
      -1.409688
      -1.246400
    
    
      8
      Perth
      0.925552
      0.771041
      0.848528
      0.682798
    
    
      9
      Rest of WA
      -0.401136
      -0.462645
      -0.452817
      -0.485056
    
    
      10
      Hobart
      -0.568503
      -0.704841
      -0.682466
      -0.626300
    
    
      11
      Rest of Tas.
      -1.523719
      -1.499546
      -1.333139
      -1.349750
    
    
      12
      Darwin
      1.252122
      0.986747
      0.963353
      0.872273
    
    
      13
      Rest of NT
      -0.176620
      -0.099351
      -0.127481
      -0.247351
    
    
      14
      Canberra
      0.966373
      0.846727
      0.925078
      0.786148

Robust Scaler



In [35]:

    
from sklearn import preprocessing

min_max_scaler = preprocessing.RobustScaler()
x_scaled = min_max_scaler.fit_transform(df[df.columns[1:5]]) # we need to remove the first column

df_robust = pd.DataFrame(x_scaled)
df_robust.insert(0, 'City', df.City)
df_robust









    Out[35]:






  
    
      
      City
      0
      1
      2
      3
    
  
  
    
      0
      Sydney
      1.686391
      1.837838
      1.804905
      2.213333
    
    
      1
      Rest of NSW
      -0.266272
      -0.189189
      -0.249591
      -0.186667
    
    
      2
      Melbourne
      0.650888
      0.786486
      0.660490
      0.853333
    
    
      3
      Rest of Vic.
      -0.680473
      -0.621622
      -0.718256
      -0.629333
    
    
      4
      Brisbane
      0.266272
      0.350811
      0.284469
      0.346667
    
    
      5
      Rest of Qld.
      -0.088757
      -0.027027
      -0.102452
      -0.048000
    
    
      6
      Adelaide
      0.000000
      0.000000
      0.000000
      0.040000
    
    
      7
      Rest of SA
      -0.887574
      -0.805405
      -0.892643
      -0.773333
    
    
      8
      Perth
      0.798817
      0.702703
      0.714986
      0.720000
    
    
      9
      Rest of WA
      -0.162722
      -0.178378
      -0.211444
      -0.184000
    
    
      10
      Hobart
      -0.284024
      -0.351351
      -0.374932
      -0.293333
    
    
      11
      Rest of Tas.
      -0.976331
      -0.918919
      -0.838147
      -0.853333
    
    
      12
      Darwin
      1.035503
      0.856757
      0.796730
      0.866667
    
    
      13
      Rest of NT
      0.000000
      0.081081
      0.020163
      0.000000
    
    
      14
      Canberra
      0.828402
      0.756757
      0.769482
      0.800000

Histograms



In [36]:

    
df = pd.read_csv('./asset/Median Price of Established House.txt', sep='\t')
df.head()









    Out[36]:






  
    
      
      City
      Q1
      Q2
      Q3
      Q4
    
  
  
    
      0
      Sydney
      700.0
      750.0
      750.0
      835.0
    
    
      1
      Rest of NSW
      370.0
      375.0
      373.0
      385.0
    
    
      2
      Melbourne
      525.0
      555.5
      540.0
      580.0
    
    
      3
      Rest of Vic.
      300.0
      295.0
      287.0
      302.0
    
    
      4
      Brisbane
      460.0
      474.9
      471.0
      485.0



In [37]:

    
# use bins=x to control the number of bins
df.hist(column=['Q1','Q3'],bins=6,alpha=0.5,figsize=(16, 6))









    Out[37]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11473b358>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11479b668>]], dtype=object)

Scatter



In [38]:

    
df.plot.scatter(x='Q1', y='Q3');

Scatter matrix provide a better way to discover the relationships in data



In [39]:

    
from pandas.tools.plotting import scatter_matrix

scatter_matrix(df, alpha=0.9, figsize=(12, 12), diagonal='hist') # set the diagonal figures to be histograms









    Out[39]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x116e70f98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x117611e10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x117662828>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11769e3c8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1176e8400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x118028ba8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1180772e8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1180af9e8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1180fcc18>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x118145978>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11818be80>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1181d4fd0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x118216278>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11825fd30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11829c518>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1182e8c88>]], dtype=object)



In [ ]:

	City	Quarter	Price
0	Sydney	Q1	700.0
1	Rest of NSW	Q1	370.0
2	Melbourne	Q1	525.0
3	Rest of Vic.	Q1	NaN
4	Brisbane	Q1	460.0

	City	Quarter	Price
3	Rest of Vic.	Q1	NaN
13	Rest of NT	Q1	NaN
21	Adelaide	Q2	NaN
27	Darwin	Q2	NaN
40	Hobart	Q3	NaN
49	Brisbane	Q4	NaN

	City	Quarter	Price
3	Rest of Vic.	Q1	294.666667
13	Rest of NT	Q1	422.500000
21	Adelaide	Q2	420.433333
27	Darwin	Q2	579.166667
40	Hobart	Q3	359.000000
49	Brisbane	Q4	468.633333

	City	Quarter	Price	Bin
0	Sydney	Q1	700.000000	High
1	Rest of NSW	Q1	370.000000	Low
2	Melbourne	Q1	525.000000	Medium
3	Rest of Vic.	Q1	294.666667	Very Low
4	Brisbane	Q1	460.000000	Low

	City	Quarter	Price	Bin
0	Sydney	Q1	700.000000	Very High
1	Rest of NSW	Q1	370.000000	Low
2	Melbourne	Q1	525.000000	High
3	Rest of Vic.	Q1	294.666667	Very Low
4	Brisbane	Q1	460.000000	High

	City	Q1	Q2	Q3	Q4
0	Sydney	700.0	750.0	750.0	835.0
1	Rest of NSW	370.0	375.0	373.0	385.0
2	Melbourne	525.0	555.5	540.0	580.0
3	Rest of Vic.	300.0	295.0	287.0	302.0
4	Brisbane	460.0	474.9	471.0	485.0

	City	0	1	2	3
0	Sydney	2.150188	2.360452	2.379522	2.611997
1	Rest of NSW	-0.544010	-0.477782	-0.506402	-0.488501
2	Melbourne	0.721446	0.888355	0.771978	0.855048
3	Rest of Vic.	-1.115507	-1.083272	-1.164729	-1.060370
4	Brisbane	0.190771	0.278324	0.243785	0.200499
5	Rest of Qld.	-0.299083	-0.250723	-0.299718	-0.309361
6	Adelaide	-0.176620	-0.212880	-0.155804	-0.195676
7	Rest of SA	-1.401255	-1.340605	-1.409688	-1.246400
8	Perth	0.925552	0.771041	0.848528	0.682798
9	Rest of WA	-0.401136	-0.462645	-0.452817	-0.485056
10	Hobart	-0.568503	-0.704841	-0.682466	-0.626300
11	Rest of Tas.	-1.523719	-1.499546	-1.333139	-1.349750
12	Darwin	1.252122	0.986747	0.963353	0.872273
13	Rest of NT	-0.176620	-0.099351	-0.127481	-0.247351
14	Canberra	0.966373	0.846727	0.925078	0.786148

	City	0	1	2	3
0	Sydney	1.686391	1.837838	1.804905	2.213333
1	Rest of NSW	-0.266272	-0.189189	-0.249591	-0.186667
2	Melbourne	0.650888	0.786486	0.660490	0.853333
3	Rest of Vic.	-0.680473	-0.621622	-0.718256	-0.629333
4	Brisbane	0.266272	0.350811	0.284469	0.346667
5	Rest of Qld.	-0.088757	-0.027027	-0.102452	-0.048000
6	Adelaide	0.000000	0.000000	0.000000	0.040000
7	Rest of SA	-0.887574	-0.805405	-0.892643	-0.773333
8	Perth	0.798817	0.702703	0.714986	0.720000
9	Rest of WA	-0.162722	-0.178378	-0.211444	-0.184000
10	Hobart	-0.284024	-0.351351	-0.374932	-0.293333
11	Rest of Tas.	-0.976331	-0.918919	-0.838147	-0.853333
12	Darwin	1.035503	0.856757	0.796730	0.866667
13	Rest of NT	0.000000	0.081081	0.020163	0.000000
14	Canberra	0.828402	0.756757	0.769482	0.800000