Zillow Local Data Interaction Tutorial

If you've been provided with a ZTrax dataset, this notebook will walk you through some useful tools contained in this repository that let you turn the otherwise hard-to-use text files into dataframe objects ready for data science!

Quick Look at ZTrax and Motivations

The ZTrax data-set from Zillow is split into two smaller sub-sets: ZAsmt and ZTrans.

ZAsmt

These files appear to contain assessments of property value made from characteristics of and around each property. While this is an interesting data-set, the team has determined that, at least for now, it's not worth exploring.

ZTrans

These files appear to contain completed transactions between buyers and sellers, along with valuable data about the lending and borrowing of capital that financed those transactions. The group decided this would be the more valuable place to start, so most of the interaction in this tutorial focuses on this sub-set.

Important Usage Note: This data is covered by a data-use agreement with Zillow. While we can explore the data and its layout, any datasets added to the data folder in this repository are excluded from version control by default; any additional datasets should be excluded as well by appending them to the .gitignore file.

Below we import the libraries we'll be using. Note that utils is a utility package included in this repository, so it comes with your local copy of the repo. If you want to use it across all of your Python projects, you will need to place it somewhere on your Python path.


In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import operator
import sys
import os
# Make sure the repository root is on the path before importing our contained utilities.
sys.path.insert(0, '../')
from utils import zillow_helpers as zh
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

I've moved some sub-sets of data from ZTrax into our data/ folder. You can specify other paths if your data lives elsewhere on your computer.


In [4]:
datadir1 = 'data/46/ZTrans/'
datadir2 = 'data/50/ZTrans/'

Exploring Files in the Notebook

Let's explore the files in a ZTrans sub-dataset:


In [14]:
file_list = os.listdir(datadir1)
print(file_list)


['BKManagedSpecific.txt', 'BorrowerMailAddress.txt', 'BorrowerName.txt', 'BorrowerNameDescriptionCode.txt', 'BuyerMailAddress.txt', 'BuyerName.txt', 'BuyerNameDescriptionCode.txt', 'ForeclosureNameAddress.txt', 'ForeclosureNODNOSDoc.txt', 'ForeclosureOriginalLoan.txt', 'HawaiiBorrowerNotes.txt', 'HawaiiGranteeNotes.txt', 'Legacy.txt', 'Main.txt', 'Modification.txt', 'PropertyInfo.txt', 'RegionSpecific.txt', 'SellerMailAddress.txt', 'SellerName.txt', 'SellerNameDescriptionCode.txt']

We don't need to leave the notebook to explore any one of these .txt datasets! Python's built-in file handling is enough, and a short loop lets us read just a few lines to see what the data looks like:


In [16]:
# Open the raw text file for reading
file = open(datadir1+'Main.txt', 'r')

# Let's print and observe just the first few lines
for lines in range(3):
    line = file.readline()
    print(line)


315852373|46009|SD|BON HOMME|D|P|2009-01-06||96|0001|||||||WRDE||||| |||499.0000|NO|.0000|.0000||.0000||||||| ||||| ||||| ||| ||||||| || |.0000| |||| ||||||.000||||||||||||||||||||||||||| ||| ||||||| |||||||||2012-06-15||FKY|| |NM|| |2411791749|| |1702496|1|BKF|1651814435

315852374|46009|SD|BON HOMME|D|P|2009-01-07||96|0002|||||||QCDE||||| |||.0000|NO|.0000|.0000||.0000||||||| ||||| ||||| ||| ||||||| || |.0000| |||| ||||||.000||||||||||||||||||||||||||| ||| ||||||| |||||||||2012-06-15||FKY|| |NM|| |2411791750|| |1702496|2|BKF|153175723

315852375|46009|SD|BON HOMME|D|P|2009-01-09||96|0003|||||||WRDE||||| |||.0000|NO|.0000|.0000||.0000||||||| ||||| ||||| ||| ||||||| || |.0000| |||| ||||||.000||||||||||||||||||||||||||| ||| ||||||| |||||||||2012-06-15||FKY|| |NM|| |2411791751|| |1702496|3|BKF|-657077398

Creating Useful Data Objects

While you can easily create a dataframe from any one of these text files using

df = pd.read_csv('path/to/file.txt', sep="|", index_col=False, header=None, low_memory=False)

it should be noted that these text files do not have headers! Luckily, we have ordered lists of headers available to us in a compiled layout spreadsheet, which we can use to create small .txt or .csv files containing a line-separated list of headers!

The header files used below have been hand-pulled from a layout spreadsheet, but an iterative process is possible as well (a rough sketch follows).
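For reference, here is a rough sketch of what that iterative approach might look like. The layout file name, sheet name, and column names used below ('Layout.xlsx', 'ZTrans', 'TableName', 'FieldName') are assumptions about the layout spreadsheet and may need to be adjusted to match your copy.

# Sketch only: pull the ordered header names for one table out of the layout spreadsheet
# and write them to a line-separated header file. The file, sheet, and column names here
# are assumptions and may differ in your copy of the layout.
layout = pd.read_excel('Layout.xlsx', sheet_name='ZTrans')
main_headers = layout.loc[layout['TableName'].str.contains('Main'), 'FieldName']

with open('column_helper/main.csv', 'w') as f:
    f.write('\n'.join(main_headers))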


In [17]:
column_headers = open('column_helper/main.csv', 'r')

# Let's observe a few of the column headers
for lines in range(3):
    line = column_headers.readline()
    print(line)


TransId

FIPS

State

Here is where the utility functions from this repository come in handy! The function demonstrated below combines the headers from the header list file with the data file itself to produce a dataframe that can actually be explored.

Keep in mind, this dataframe creation can take a LONG time!
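For context, here is a minimal sketch of what a helper like this might do, assuming the column file holds one header name per line. The actual implementation lives in utils/zillow_helpers.py and may differ.

# Rough sketch of a helper like txt_to_column_df; the real one in utils/zillow_helpers.py may differ.
def txt_to_column_df_sketch(data_path, header_path):
    # Read the line-separated header file into an ordered list of column names.
    with open(header_path, 'r') as f:
        columns = [line.strip() for line in f if line.strip()]
    # Read the pipe-delimited data file and attach the column names.
    return pd.read_csv(data_path, sep='|', names=columns,
                       index_col=False, low_memory=False)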


In [20]:
# We create our dataframe with the headers attached.
dfMain = zh.txt_to_column_df('data/46/ZTrans/Main.txt', 'column_helper/main.csv')

In [21]:
# We observe the first few rows of the dataframe to verify
dfMain.head(3)


Out[21]:
TransId FIPS State County DataClassStndCode RecordTypeStndCode RecordingDate RecordingDocumentNumber RecordingBookNumber RecordingPageNumber ... MatchStndCode REOStndCode UpdateOwnershipFlag LoadID StatusInd TransactionTypeStndCode BatchID BKFSPID ZVendorStndCode SourceChkSum
0 315852373 46009 SD BON HOMME D P 2009-01-06 NaN 96 0001 ... NM NaN 2411791749 NaN 1702496 1 BKF 1.651814e+09
1 315852374 46009 SD BON HOMME D P 2009-01-07 NaN 96 0002 ... NM NaN 2411791750 NaN 1702496 2 BKF 1.531757e+08
2 315852375 46009 SD BON HOMME D P 2009-01-09 NaN 96 0003 ... NM NaN 2411791751 NaN 1702496 3 BKF -6.570774e+08

3 rows × 131 columns

Merging and Concatenating

Now, of course, our data will likely only be useful when we observe it across various characteristics of these data-sets. While it is possible to do good data science while keeping dataframes separate and iterating through them, for simplicity we will go over merging and concatenating these dataframe objects.

Merging

Merging can be imagined as combining horizontally, i.e. appending additional columns of information. For this you need a primary key, which for the ZTrans data-sets is the TransId column, found in every dataset .txt file.


In [22]:
# We create a second dataframe to combine with our previous dataframe.
dfPropInfo = zh.txt_to_column_df('data/46/ZTrans/PropertyInfo.txt', 'column_helper/propertyinfo.csv')

In [23]:
# Again, we observe the first few rows to verify
dfPropInfo.head(3)


Out[23]:
TransId AssessorParcelNumber APNIndicatorStndCode TaxIDNumber TaxIDIndicatorStndCode UnformattedAssessorParcelNumber AlternateParcelNumber HawaiiCondoCPRCode PropertyHouseNumber PropertyHouseNumberExt ... PropertyAddressMatchType PropertyAddressDPV PropertyGeocodeQualityCode PropertyAddressQualityCode FIPS LoadID ImportParcelID BKFSPID AssessmentRecordMatchFlag BatchID
0 315997382 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 46061 2565142377 NaN 2 0 1872223
1 315997383 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 46061 2565142378 NaN 3 0 1872223
2 315997384 103-57-022-004-000-02 NaN NaN NaN 1035702200400002 NaN NaN NaN NaN ... NaN NaN NaN NaN 46061 2565142119 119206404.0 1 1 1872223

3 rows × 68 columns

After this merge, we should have a new table with 198 columns (TransId carries across both data-sets!)
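A quick arithmetic check before running the merge: an outer merge on a single key keeps every column from both frames (overlapping names get _x/_y suffixes), so only the key itself is not duplicated.

# Expected width of the merged frame: 131 + 68 - 1 = 198 (TransId is counted only once).
print(dfMain.shape[1] + dfPropInfo.shape[1] - 1)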


In [24]:
# Merge the two tables on their shared primary key, TransId.
dfMerge = dfMain.merge(dfPropInfo, on='TransId', how='outer')

In [25]:
# Again, we print the first few lines to verify
dfMerge.head(3)


Out[25]:
TransId FIPS_x State County DataClassStndCode RecordTypeStndCode RecordingDate RecordingDocumentNumber RecordingBookNumber RecordingPageNumber ... PropertyAddressMatchType PropertyAddressDPV PropertyGeocodeQualityCode PropertyAddressQualityCode FIPS_y LoadID_y ImportParcelID BKFSPID_y AssessmentRecordMatchFlag BatchID_y
0 315852373 46009 SD BON HOMME D P 2009-01-06 NaN 96 0001 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 1.0 0.0 1702496.0
1 315852374 46009 SD BON HOMME D P 2009-01-07 NaN 96 0002 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 2.0 0.0 1702496.0
2 315852375 46009 SD BON HOMME D P 2009-01-09 NaN 96 0003 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 3.0 0.0 1702496.0

3 rows × 198 columns

It has 198 columns, so we know it worked pretty well!

Concatenating

Concatenating can be imagined as combining vertically, i.e. adding rows from other dataframes with the same schema. Luckily, the function for it comes straight from pandas and is super easy to use!


In [26]:
# We create a new "Main" dataframe with the headers attached from another data sub-set.
dfMain2 = zh.txt_to_column_df('data/50/ZTrans/Main.txt', 'column_helper/main.csv')

In [36]:
# Let's see the shapes of the two dataframes to figure out how many rows the concatenation should have
print(dfMain.shape)
print(dfMain2.shape)
print("Sum of rows should be " + str(dfMain.shape[0]+dfMain2.shape[0]))


(434222, 131)
(789800, 131)
Sum of rows should be 1224022

In [37]:
# Let's concatenate and count the rows!
dfMainConcat = pd.concat([dfMain, dfMain2])
print(dfMainConcat.shape[0])


1224022

The concatenation above worked as expected. Great!
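One detail worth noting: by default pd.concat keeps the row labels from each input frame, so the combined frame ends up with duplicate index values (two rows labeled 0, two labeled 1, and so on). If a clean 0..N-1 index is preferable, ignore_index=True takes care of it:

# Same concatenation, but rebuild a fresh 0..N-1 index instead of keeping the originals.
dfMainConcat = pd.concat([dfMain, dfMain2], ignore_index=True)
print(dfMainConcat.index.is_unique)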

Through a combination of applying these functions iteratively, concatenating dataframes that have already been merged, or even merging dataframes that have already been concatenated, we can really create useful data-sets for some basic local analysis!
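As a rough sketch of putting it all together (assuming every data/<code>/ZTrans/ directory follows the same layout used above), the steps could be chained across sub-sets like this:

# Sketch: build a merged Main + PropertyInfo frame for each sub-set, then stack them.
# Assumes each directory follows the data/<code>/ZTrans/ layout used earlier.
merged_frames = []
for code in ['46', '50']:
    base = 'data/{}/ZTrans/'.format(code)
    df_main = zh.txt_to_column_df(base + 'Main.txt', 'column_helper/main.csv')
    df_prop = zh.txt_to_column_df(base + 'PropertyInfo.txt', 'column_helper/propertyinfo.csv')
    merged_frames.append(df_main.merge(df_prop, on='TransId', how='outer'))

dfAll = pd.concat(merged_frames, ignore_index=True)
print(dfAll.shape)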

Data Object Serialization

While we've been working with dataframes in this tutorial, it is important to note that dataframes are in-memory objects that only exist for the lifetime of the Python process (i.e. as long as this notebook is running). If you'd rather not repeat these steps every time to re-create a dataframe, it is very useful to serialize it into a binary file that can be de-serialized back into the same object later. In Python, this is referred to as "pickling".

Luckily, various tools from the pandas library make this incredibly easy to use and remember!


In [38]:
# Let's pickle our merged dataframe and give it a useful, descriptive name!
dfMerge.to_pickle('main&propinfo46.pickle')

Keep in mind, .pickle is not an official file extension; any uncommon extension will work. It's just convenient here because .pickle has already been added to the .gitignore for working with these data-sets!

We can save valuable time spent making dataframes and just re-open from previously saved pickles:


In [39]:
dfPickle = pd.read_pickle('main&propinfo46.pickle')

In [40]:
dfPickle.head(3)


Out[40]:
TransId FIPS_x State County DataClassStndCode RecordTypeStndCode RecordingDate RecordingDocumentNumber RecordingBookNumber RecordingPageNumber ... PropertyAddressMatchType PropertyAddressDPV PropertyGeocodeQualityCode PropertyAddressQualityCode FIPS_y LoadID_y ImportParcelID BKFSPID_y AssessmentRecordMatchFlag BatchID_y
0 315852373 46009 SD BON HOMME D P 2009-01-06 NaN 96 0001 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 1.0 0.0 1702496.0
1 315852374 46009 SD BON HOMME D P 2009-01-07 NaN 96 0002 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 2.0 0.0 1702496.0
2 315852375 46009 SD BON HOMME D P 2009-01-09 NaN 96 0003 ... NaN NaN NaN NaN 46009.0 2.411792e+09 NaN 3.0 0.0 1702496.0

3 rows × 198 columns

More Data Work

As you can probably tell, the pandas library is very powerful! While a lot of work has gone into this tutorial, feel free to explore further by checking out the documentation at http://pandas.pydata.org/