Zillow Local Data Interaction Tutorial

If you've been provided with a ZTrax dataset, this notebook will walk you through some useful tools contained in this repository that let you turn the otherwise hard-to-use text files into dataframe objects ready for data science!

Quick Look at ZTrax and Motivations

The ZTrax data-set from Zillow is split into two smaller sub-sets: ZAsmt and ZTrans.

ZAsmt

These files appear to contain assessments of property value made from characteristics of and around each property. While this is an interesting data-set, the team has determined that, at least for now, it's not worth exploring.

ZTrans

These files appear to contain completed transactions between buyers and sellers, along with valuable data about the lending and borrowing of capital that financed those transactions. The group decided this would be the more valuable place to start, so most of the interaction in this tutorial focuses on this sub-set.

Important Usage Note: This data is covered by a data-use agreement with Zillow. While we can explore the data and its layout, any datasets added to the data folder in this repository are excluded from version control by default; any additional datasets should be excluded as well by appending them to the .gitignore file.

Below we import the libraries we'll be using. Note that utils is a utility package included in this repository, so it comes with your local copy of the repo. If you want to use it across all of your Python projects, you will need to place it somewhere on your Python path.


In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import operator
import sys
import os
# Make sure the repository root is on the path before importing our contained utilities.
sys.path.insert(0, '../')
from utils import zillow_helpers as zh
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

I've moved some sub-sets of data from ZTrax into our data/ folder. You can specify other paths if your data lives elsewhere on your computer.


In [4]:
datadir1 = 'data/46/ZTrans/'
datadir2 = 'data/50/ZTrans/'

Exploring Files in the Notebook

Let's explore the files in a ZTrans sub-dataset:


In [14]:
file_list = os.listdir(datadir1)
print(file_list)


['BKManagedSpecific.txt', 'BorrowerMailAddress.txt', 'BorrowerName.txt', 'BorrowerNameDescriptionCode.txt', 'BuyerMailAddress.txt', 'BuyerName.txt', 'BuyerNameDescriptionCode.txt', 'ForeclosureNameAddress.txt', 'ForeclosureNODNOSDoc.txt', 'ForeclosureOriginalLoan.txt', 'HawaiiBorrowerNotes.txt', 'HawaiiGranteeNotes.txt', 'Legacy.txt', 'Main.txt', 'Modification.txt', 'PropertyInfo.txt', 'RegionSpecific.txt', 'SellerMailAddress.txt', 'SellerName.txt', 'SellerNameDescriptionCode.txt']

We don't need to leave the notebook to explore any one of these .txt datasets! Python's built-in file handling is enough, and a short loop lets us read just a few lines to see what the data looks like:


In [16]:
# Open the raw text file for reading
file = open(datadir1+'Main.txt', 'r')

# Let's print and observe just the first few lines
for lines in range(3):
    line = file.readline()
    print(line)


315852373|46009|SD|BON HOMME|D|P|2009-01-06||96|0001|||||||WRDE||||| |||499.0000|NO|.0000|.0000||.0000||||||| ||||| ||||| ||| ||||||| || |.0000| |||| ||||||.000||||||||||||||||||||||||||| ||| ||||||| |||||||||2012-06-15||FKY|| |NM|| |2411791749|| |1702496|1|BKF|1651814435

315852374|46009|SD|BON HOMME|D|P|2009-01-07||96|0002|||||||QCDE||||| |||.0000|NO|.0000|.0000||.0000||||||| ||||| ||||| ||| ||||||| || |.0000| |||| ||||||.000||||||||||||||||||||||||||| ||| ||||||| |||||||||2012-06-15||FKY|| |NM|| |2411791750|| |1702496|2|BKF|153175723

315852375|46009|SD|BON HOMME|D|P|2009-01-09||96|0003|||||||WRDE||||| |||.0000|NO|.0000|.0000||.0000||||||| ||||| ||||| ||| ||||||| || |.0000| |||| ||||||.000||||||||||||||||||||||||||| ||| ||||||| |||||||||2012-06-15||FKY|| |NM|| |2411791751|| |1702496|3|BKF|-657077398

Creating Useful Data Objects

While you can easily create a dataframe from any one of these text files using

df = pd.read_csv('path/to/file.txt', sep="|", index_col=False, header=None, low_memory=False)

it should be noted that these text files do not have headers! Luckily, we have ordered lists of headers available to us in a compiled layout spreadsheet, which we can use to create small .txt or .csv files containing a line-separated list of headers!

The header files used below have been hand-pulled from a layout spreadsheet, but an iterative process is possible as well (a rough sketch follows).
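For reference, here is a rough sketch of what that iterative approach might look like. The layout file name, sheet name, and column names used below ('Layout.xlsx', 'ZTrans', 'TableName', 'FieldName') are assumptions about the layout spreadsheet and may need to be adjusted to match your copy.

# Sketch only: pull the ordered header names for one table out of the layout spreadsheet
# and write them to a line-separated header file. The file, sheet, and column names here
# are assumptions and may differ in your copy of the layout.
layout = pd.read_excel('Layout.xlsx', sheet_name='ZTrans')
main_headers = layout.loc[layout['TableName'].str.contains('Main'), 'FieldName']

with open('column_helper/main.csv', 'w') as f:
    f.write('\n'.join(main_headers))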


In [17]:
column_headers = open('column_helper/main.csv', 'r')

# Let's observe a few of the column headers
for lines in range(3):
    line = column_headers.readline()
    print(line)


TransId

FIPS

State

Here is where the utility functions from this repository come in handy! The function demonstrated below combines the headers from the header list file with the data file itself to produce a dataframe that can actually be explored.

Keep in mind, this dataframe creation can take a LONG time!
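For context, here is a minimal sketch of what a helper like this might do, assuming the column file holds one header name per line. The actual implementation lives in utils/zillow_helpers.py and may differ.

# Rough sketch of a helper like txt_to_column_df; the real one in utils/zillow_helpers.py may differ.
def txt_to_column_df_sketch(data_path, header_path):
    # Read the line-separated header file into an ordered list of column names.
    with open(header_path, 'r') as f:
        columns = [line.strip() for line in f if line.strip()]
    # Read the pipe-delimited data file and attach the column names.
    return pd.read_csv(data_path, sep='|', names=columns,
                       index_col=False, low_memory=False)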


In [20]:
# We create our dataframe with the headers attached.
dfMain = zh.txt_to_column_df('data/46/ZTrans/Main.txt', 'column_helper/main.csv')

In [21]:
# We observe the first few rows of the dataframe to verify
dfMain.head(3)


Out[21]:
TransId FIPS State County DataClassStndCode RecordTypeStndCode RecordingDate RecordingDocumentNumber RecordingBookNumber RecordingPageNumber ... MatchStndCode REOStndCode UpdateOwnershipFlag LoadID StatusInd TransactionTypeStndCode BatchID BKFSPID ZVendorStndCode SourceChkSum
0 315852373 46009 SD BON HOMME D P 2009-01-06 NaN 96 0001 ... NM NaN 2411791749 NaN 1702496 1 BKF 1.651814e+09
1 315852374 46009 SD BON HOMME D P 2009-01-07 NaN 96 0002 ... NM NaN 2411791750 NaN 1702496 2 BKF 1.531757e+08
2 315852375 46009 SD BON HOMME D P 2009-01-09 NaN 96 0003 ... NM NaN 2411791751 NaN 1702496 3 BKF -6.570774e+08

3 rows × 131 columns

Merging and Concatenating

Now, of course, our data will likely only be useful when we observe it across various characteristics of these data-sets. While it is possible to do good data science while keeping dataframes separate and iterating through them, for simplicity we will go over merging and concatenating these dataframe objects.

Merging

Merging can be imagined as combining horizontally, i.e. appending additional columns of information. For this you need a primary key, which for the ZTrans data-sets is the TransId column, found in every dataset .txt file.


In [22]:
# We create a second dataframe to combine with our previous dataframe.
dfPropInfo = zh.txt_to_column_df('data/46/ZTrans/PropertyInfo.txt', 'column_helper/propertyinfo.csv')

In [23]:
# Again, we observe the first few rows to verify
dfPropInfo.head(3)


Out[23]:
TransId AssessorParcelNumber APNIndicatorStndCode TaxIDNumber TaxIDIndicatorStndCode UnformattedAssessorParcelNumber AlternateParcelNumber HawaiiCondoCPRCode PropertyHouseNumber PropertyHouseNumberExt ... PropertyAddressMatchType PropertyAddressDPV PropertyGeocodeQualityCode PropertyAddressQualityCode FIPS LoadID ImportParcelID BKFSPID AssessmentRecordMatchFlag BatchID
0 315997382 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 46061 2565142377 NaN 2 0 1872223
1 315997383 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 46061 2565142378 NaN 3 0 1872223
2 315997384 103-57-022-004-000-02 NaN NaN NaN 1035702200400002 NaN NaN NaN NaN ... NaN NaN NaN NaN 46061 2565142119 119206404.0 1 1 1872223

3 rows × 68 columns

After this merge, we should have a new table with 198 columns (TransId carries across both data-sets!)
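A quick arithmetic check before running the merge: an outer merge on a single key keeps every column from both frames (overlapping names get _x/_y suffixes), so only the key itself is not duplicated.

# Expected width of the merged frame: 131 + 68 - 1 = 198 (TransId is counted only once).
print(dfMain.shape[1] + dfPropInfo.shape[1] - 1)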


In [24]:
# Merge the two tables on their shared primary key, TransId.
dfMerge = dfMain.merge(dfPropInfo, on='TransId', how='outer')

In [25]:
# Again, we print the first few lines to verify
dfMerge.head(3)


Out[25]:
TransId FIPS_x State County DataClassStndCode RecordTypeStndCode RecordingDate RecordingDocumentNumber RecordingBookNumber RecordingPageNumber ... PropertyAddressMatchType PropertyAddressDPV PropertyGeocodeQualityCode PropertyAddressQualityCode FIPS_y LoadID_y ImportParcelID BKFSPID_y AssessmentRecordMatchFlag BatchID_y
0 315852373 46009 SD BON HOMME D P 2009-01-06 NaN 96 0001 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 1.0 0.0 1702496.0
1 315852374 46009 SD BON HOMME D P 2009-01-07 NaN 96 0002 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 2.0 0.0 1702496.0
2 315852375 46009 SD BON HOMME D P 2009-01-09 NaN 96 0003 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 3.0 0.0 1702496.0

3 rows × 198 columns

It has 198 columns, so we know it worked pretty well!

Concatenating

Concatenating can be imagined as combining vertically, i.e. adding rows from other dataframes with the same schema. Luckily, the function for it comes straight from pandas and is super easy to use!


In [26]:
# We create a new "Main" dataframe with the headers attached from another data sub-set.
dfMain2 = zh.txt_to_column_df('data/50/ZTrans/Main.txt', 'column_helper/main.csv')

In [36]:
# Let's see the shapes of the two dataframes to figure out how many rows the concatenation should have
print(dfMain.shape)
print(dfMain2.shape)
print("Sum of rows should be " + str(dfMain.shape[0]+dfMain2.shape[0]))


(434222, 131)
(789800, 131)
Sum of rows should be 1224022

In [37]:
# Let's concatenate and count the rows!
dfMainConcat = pd.concat([dfMain, dfMain2])
print(dfMainConcat.shape[0])


1224022

The concatenation above worked as expected. Great!
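One detail worth noting: by default pd.concat keeps the row labels from each input frame, so the combined frame ends up with duplicate index values (two rows labeled 0, two labeled 1, and so on). If a clean 0..N-1 index is preferable, ignore_index=True takes care of it:

# Same concatenation, but rebuild a fresh 0..N-1 index instead of keeping the originals.
dfMainConcat = pd.concat([dfMain, dfMain2], ignore_index=True)
print(dfMainConcat.index.is_unique)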

Through a combination of applying these functions iteratively, concatenating dataframes that have already been merged, or even merging dataframes that have already been concatenated, we can really create useful data-sets for some basic local analysis!
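As a rough sketch of putting it all together (assuming every data/<code>/ZTrans/ directory follows the same layout used above), the steps could be chained across sub-sets like this:

# Sketch: build a merged Main + PropertyInfo frame for each sub-set, then stack them.
# Assumes each directory follows the data/<code>/ZTrans/ layout used earlier.
merged_frames = []
for code in ['46', '50']:
    base = 'data/{}/ZTrans/'.format(code)
    df_main = zh.txt_to_column_df(base + 'Main.txt', 'column_helper/main.csv')
    df_prop = zh.txt_to_column_df(base + 'PropertyInfo.txt', 'column_helper/propertyinfo.csv')
    merged_frames.append(df_main.merge(df_prop, on='TransId', how='outer'))

dfAll = pd.concat(merged_frames, ignore_index=True)
print(dfAll.shape)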

Data Object Serialization

While we've been working with dataframes in this tutorial, it is important to note that dataframes are in-memory objects that only exist for the lifetime of the Python process (i.e. as long as this notebook is running). If you'd rather not repeat these steps every time to re-create a dataframe, it is very useful to serialize it into a binary file that can be de-serialized back into the same object later. In Python, this is referred to as "pickling".

Luckily, various tools from the pandas library make this incredibly easy to use and remember!


In [38]:
# Let's pickle our merged dataframe and give it a useful, descriptive name!
dfMerge.to_pickle('main&propinfo46.pickle')

Keep in mind, .pickle is not an official file extension; any uncommon extension will work. It's just convenient here because .pickle has already been added to the .gitignore for working with these data-sets!

We can save valuable time spent making dataframes and just re-open from previously saved pickles:


In [39]:
dfPickle = pd.read_pickle('main&propinfo46.pickle')

In [40]:
dfPickle.head(3)


Out[40]:
TransId FIPS_x State County DataClassStndCode RecordTypeStndCode RecordingDate RecordingDocumentNumber RecordingBookNumber RecordingPageNumber ... PropertyAddressMatchType PropertyAddressDPV PropertyGeocodeQualityCode PropertyAddressQualityCode FIPS_y LoadID_y ImportParcelID BKFSPID_y AssessmentRecordMatchFlag BatchID_y
0 315852373 46009 SD BON HOMME D P 2009-01-06 NaN 96 0001 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 1.0 0.0 1702496.0
1 315852374 46009 SD BON HOMME D P 2009-01-07 NaN 96 0002 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 2.0 0.0 1702496.0
2 315852375 46009 SD BON HOMME D P 2009-01-09 NaN 96 0003 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 3.0 0.0 1702496.0

3 rows × 198 columns

More Data Work

As you can probably tell, the pandas library is very powerful! While a lot of work has gone into this tutorial, feel free to explore further by checking out the documentation at http://pandas.pydata.org/