Data preprocessing

Overview

Before we get deeply involved in this project, we need to get our data into a useable form. In this notebook we walk through that process for one file, and we will then use that logic to create a module that will go through the rest of the files.


In [1]:
import pandas as pd
import numpy as np

Import data

First we need to import our data from a CSV.


In [72]:
state_data_filepath = '.\\Data\\State\\DaysOnZillow_Public_State.csv'

state_days_on_zillow = pd.read_csv(state_data_filepath)
state_days_on_zillow.head()


Out[72]:
RegionName StateName RegionType CBSA Title SizeRank 2010-01 2010-02 2010-03 2010-04 2010-05 ... 2016-05 2016-06 2016-07 2016-08 2016-09 2016-10 2016-11 2016-12 2017-01 2017-02
0 United States NaN Country NaN 1 144.0 145.750 146.25 139.750 125.75 ... 81.500 79.50 80.50 83.00 86.00 89.00 92.500 97.750 103.000 105.0
1 Alaska AK State NaN 45 130.0 143.000 139.50 117.250 99.50 ... 77.000 79.00 82.25 87.25 95.00 101.50 109.500 115.125 118.500 127.0
2 Alabama AL State NaN 23 165.5 164.125 169.25 173.375 159.25 ... 102.625 97.75 100.00 103.75 105.75 104.75 108.875 118.500 123.875 124.0
3 Arkansas AR State NaN 32 133.0 135.500 141.00 144.000 138.25 ... 97.000 94.50 94.25 97.50 100.25 101.75 109.500 121.250 127.250 136.0
4 Arizona AZ State NaN 16 115.0 120.500 122.00 117.500 110.00 ... 75.000 74.75 74.50 75.00 75.00 73.25 72.500 76.000 80.500 81.0

5 rows × 91 columns

Limit columns

From outside analysis, I've found that the 2011-01 through 2016-01 periods are the best populated with data, so we limit ourselves to that period for now.

Below we drop all the other, unnecessary columns, with the exception of RegionName, which will become an index.


In [73]:
state_days_on_zillow__limited = state_days_on_zillow[[col for col in state_days_on_zillow.axes[1] if ((col >= '2011-01' and col <= '2016-01') or (col == 'RegionName'))]]
state_days_on_zillow__limited = state_days_on_zillow__limited.set_index(['RegionName'])

state_days_on_zillow__limited.head()


Out[73]:
2011-01 2011-02 2011-03 2011-04 2011-05 2011-06 2011-07 2011-08 2011-09 2011-10 ... 2015-04 2015-05 2015-06 2015-07 2015-08 2015-09 2015-10 2015-11 2015-12 2016-01
RegionName
United States 128.250 130.750 129.50 124.500 117.50 113.25 113.250 115.75 120.00 124.000 ... 94.00 86.00 84.75 86.250 89.00 92.250 96.25 100.500 105.500 111.750
Alaska 134.875 140.125 122.75 103.875 99.75 96.25 88.875 89.50 93.25 98.625 ... 67.75 68.50 70.50 71.875 73.00 76.125 82.00 91.125 101.375 101.625
Alabama 152.875 150.000 141.75 129.500 120.75 122.75 132.000 139.25 140.00 138.375 ... 123.00 111.50 108.75 109.500 114.25 123.500 130.00 131.500 133.750 138.250
Arkansas 141.625 138.500 128.25 117.000 114.25 116.25 119.250 122.50 127.75 132.875 ... 133.25 118.50 112.00 112.250 110.00 106.750 108.25 116.000 125.375 130.625
Arizona 108.750 110.375 106.75 100.500 96.50 94.25 92.000 90.50 90.25 91.500 ... 77.50 77.25 77.25 76.250 75.50 75.750 76.25 77.500 81.500 87.250

5 rows × 61 columns

Flatten the array

Right now we have an array for a single metric. We don't really want this - we want one row per month per state, so we can join other metrics on by those indicies.

Below we test an approach to turning this array into a flatter form, which will allow us to join with other metrics.


In [98]:
test = pd.DataFrame(state_days_on_zillow__limited[['2011-01']], index=[state_days_on_zillow__limited.axes[0]])
test['Period'] = pd.Series({state:'2011-01' for state in state_days_on_zillow__limited.axes[0]})
test = test.set_index('Period', append=True).rename_axis({'2011-01':'var_name'}, axis='columns')

test2 = pd.DataFrame(state_days_on_zillow__limited[['2011-02']], index=[state_days_on_zillow__limited.axes[0]])
test2['Period'] = pd.Series({state:'2011-02' for state in state_days_on_zillow__limited.axes[0]})
test2 = test2.set_index('Period', append=True).rename_axis({'2011-02':'var_name'}, axis='columns')

test.append(test2)


Out[98]:
var_name
RegionName Period
United States 2011-01 128.250
Alaska 2011-01 134.875
Alabama 2011-01 152.875
Arkansas 2011-01 141.625
Arizona 2011-01 108.750
California 2011-01 106.250
Colorado 2011-01 120.000
Connecticut 2011-01 143.750
District of Columbia 2011-01 109.375
Delaware 2011-01 142.625
Florida 2011-01 126.250
Georgia 2011-01 144.500
Hawaii 2011-01 103.625
Iowa 2011-01 123.250
Idaho 2011-01 101.375
Illinois 2011-01 154.750
Indiana 2011-01 135.500
Kansas 2011-01 133.000
Kentucky 2011-01 135.625
Louisiana 2011-01 116.250
Massachusetts 2011-01 138.750
Maryland 2011-01 149.500
Michigan 2011-01 122.000
Minnesota 2011-01 139.750
Missouri 2011-01 125.750
Mississippi 2011-01 148.625
Montana 2011-01 126.375
North Carolina 2011-01 148.250
North Dakota 2011-01 122.625
Nebraska 2011-01 110.750
... ... ...
Massachusetts 2011-02 151.250
Maryland 2011-02 155.250
Michigan 2011-02 121.500
Minnesota 2011-02 142.250
Missouri 2011-02 132.250
Mississippi 2011-02 154.000
Montana 2011-02 126.125
North Carolina 2011-02 148.250
North Dakota 2011-02 124.250
Nebraska 2011-02 120.500
New Hampshire 2011-02 141.750
New Jersey 2011-02 185.500
New Mexico 2011-02 133.750
Nevada 2011-02 111.000
New York 2011-02 181.500
Ohio 2011-02 129.250
Oklahoma 2011-02 131.000
Oregon 2011-02 139.875
Pennsylvania 2011-02 151.000
Rhode Island 2011-02 164.750
South Carolina 2011-02 169.000
Tennessee 2011-02 145.750
Texas 2011-02 124.250
Utah 2011-02 107.625
Virginia 2011-02 127.750
Vermont 2011-02 142.875
Washington 2011-02 135.250
Wisconsin 2011-02 154.000
West Virginia 2011-02 157.500
Wyoming 2011-02 120.250

100 rows × 1 columns

Complete process for all columns

It worked!

Now we want to turn that into a process that goes through all columns in this dataframe.


In [108]:
state_days_on_zillow_flat = pd.DataFrame(state_days_on_zillow__limited[['2011-01']], index=[state_days_on_zillow__limited.axes[0]])
state_days_on_zillow_flat['Period'] = pd.Series({state:'2011-01' for state in state_days_on_zillow__limited.axes[0]})
state_days_on_zillow_flat = state_days_on_zillow_flat.set_index('Period', append=True).rename_axis({'2011-01':'days_on_zillow'}, axis='columns')

for col_name in state_days_on_zillow__limited.axes[1][1:]:
    temp = pd.DataFrame(state_days_on_zillow__limited[[col_name]], index=[state_days_on_zillow__limited.axes[0]])
    temp['Period'] = pd.Series({state:col_name for state in state_days_on_zillow__limited.axes[0]})
    temp = temp.set_index('Period', append=True).rename_axis({col_name:'days_on_zillow'}, axis='columns')
    
    state_days_on_zillow_flat = state_days_on_zillow_flat.append(temp)
    
state_days_on_zillow_flat


Out[108]:
days_on_zillow
RegionName Period
United States 2011-01 128.250000
Alaska 2011-01 134.875000
Alabama 2011-01 152.875000
Arkansas 2011-01 141.625000
Arizona 2011-01 108.750000
California 2011-01 106.250000
Colorado 2011-01 120.000000
Connecticut 2011-01 143.750000
District of Columbia 2011-01 109.375000
Delaware 2011-01 142.625000
Florida 2011-01 126.250000
Georgia 2011-01 144.500000
Hawaii 2011-01 103.625000
Iowa 2011-01 123.250000
Idaho 2011-01 101.375000
Illinois 2011-01 154.750000
Indiana 2011-01 135.500000
Kansas 2011-01 133.000000
Kentucky 2011-01 135.625000
Louisiana 2011-01 116.250000
Massachusetts 2011-01 138.750000
Maryland 2011-01 149.500000
Michigan 2011-01 122.000000
Minnesota 2011-01 139.750000
Missouri 2011-01 125.750000
Mississippi 2011-01 148.625000
Montana 2011-01 126.375000
North Carolina 2011-01 148.250000
North Dakota 2011-01 122.625000
Nebraska 2011-01 110.750000
... ... ...
Massachusetts 2016-01 110.750000
Maryland 2016-01 134.250000
Michigan 2016-01 121.500000
Minnesota 2016-01 112.500000
Missouri 2016-01 105.375000
Mississippi 2016-01 122.333333
Montana 2016-01 103.500000
North Carolina 2016-01 119.250000
North Dakota 2016-01 93.250000
Nebraska 2016-01 88.250000
New Hampshire 2016-01 125.750000
New Jersey 2016-01 157.250000
New Mexico 2016-01 120.500000
Nevada 2016-01 93.250000
New York 2016-01 172.500000
Ohio 2016-01 113.875000
Oklahoma 2016-01 108.500000
Oregon 2016-01 91.000000
Pennsylvania 2016-01 127.000000
Rhode Island 2016-01 125.750000
South Carolina 2016-01 128.625000
Tennessee 2016-01 118.250000
Texas 2016-01 82.250000
Utah 2016-01 93.250000
Virginia 2016-01 119.250000
Vermont 2016-01 170.375000
Washington 2016-01 89.250000
Wisconsin 2016-01 127.750000
West Virginia 2016-01 149.250000
Wyoming 2016-01 94.250000

3050 rows × 1 columns

Export processed data

Time to save our work!


In [110]:
export_path = '.\\Data\\Processed Data\\State - Days on Zillow.csv'
state_days_on_zillow_flat.to_csv(export_path)