If you've been provided with a ZTrax dataset, this notebook walks you through some useful tools contained in this repository that turn the raw text files you've been given into dataframe objects you can actually use for data science!
The ZTrax dataset from Zillow is split into two smaller subsets: ZAsmt and ZTrans.
ZAsmt: these files appear to contain assessments of value made using characteristics of the surrounding properties. While this is an interesting dataset, the team has determined that, at least for now, it is not worth exploring.
ZTrans: these files appear to represent actual transactions between buyers and sellers, along with valuable data about the lending and borrowing of capital used to make those transactions. The group decided this would be valuable to explore first, so most of this tutorial focuses on this subset.
Important usage note: this data is protected under an agreement with Zillow. While we can explore the data and its layout here, any datasets added to the data folder in this repository are excluded from version control by default; any additional datasets kept elsewhere should be excluded by appending their paths to the .gitignore file.
Below we import the libraries we'll be using. Please note that utils refers to the utilities included in this repository and must be downloaded to your computer along with it. You will need to move them into your Python installation's packages directory if you wish to use them across all of your Python projects.
In [3]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import pandas as pd
import operator
import sys
import os

# Make the repository root importable before loading the bundled utilities.
sys.path.insert(0, '../')
from utils import zillow_helpers as zh
I've moved some subsets of data from ZTrax into our data/ folder. You can specify other paths if your data lives elsewhere on your computer.
In [4]:
datadir1 = 'data/46/ZTrans/'
datadir2 = 'data/50/ZTrans/'
In [14]:
file_list = os.listdir(datadir1)
print(file_list)
We don't need to leave the workspace to explore any one of these .txt datasets! Python's built-in file handling is enough, and we can iterate over just a few lines to observe what this data looks like:
In [16]:
# These files are large, so open one and read only the first few lines
with open(datadir1 + 'Main.txt', 'r') as file:
    # Let's print and observe just the first few lines
    for lines in range(3):
        line = file.readline()
        print(line)
While you can easily create a dataframe from any one of these text files using
df = pd.read_csv('path/to/file.txt', sep="!", index_col=False, header=None, low_memory=False)
it should be noted that these text files do not have headers! Luckily, we have ordered lists of headers available in a compiled layout spreadsheet, which we can use to create small .txt or .csv files containing a line-separated list of headers.
The files indicated below have been hand-pulled from the layout spreadsheet, but an iterative process is possible as well, as sketched below.
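As an illustration, the header files could also be generated programmatically from the layout spreadsheet. The file name, sheet name, column names, and table label below (Layout.xlsx, 'ZTrans', 'TableName', 'FieldName', 'utMain') are assumptions about how your copy of the layout is organized, so adjust them to match it:

# Hypothetical sketch: pull the ordered column names for the Main table out of
# the layout spreadsheet and write them as a line-separated header file.
# The file, sheet, column, and table names are assumptions -- check them
# against the layout spreadsheet that shipped with your ZTrax extract.
layout = pd.read_excel('Layout.xlsx', sheet_name='ZTrans')
main_headers = layout.loc[layout['TableName'] == 'utMain', 'FieldName']
main_headers.to_csv('column_helper/main.csv', index=False, header=False)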
In [17]:
# Let's observe a few of the column headers
with open('column_helper/main.csv', 'r') as column_headers:
    for lines in range(3):
        line = column_headers.readline()
        print(line)
This is where some of the utility functions from this repository come in handy! The function demonstrated below combines the headers from the header-list file with the data file itself to build a dataframe that can actually be explored.
Keep in mind, this dataframe creation can take a LONG time!
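The real implementation lives in utils/zillow_helpers.py and may differ in detail, but a minimal sketch of the idea, assuming a one-column header file and the same separator as the pd.read_csv example above, looks something like this:

# Minimal sketch of the idea behind zh.txt_to_column_df (not the actual
# implementation): read the line-separated header file, then attach those
# names to the header-less data file.
def txt_to_column_df_sketch(data_path, header_path, sep='!'):
    with open(header_path, 'r') as f:
        columns = [line.strip() for line in f if line.strip()]
    return pd.read_csv(data_path, sep=sep, index_col=False, header=None,
                       names=columns, low_memory=False)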
In [20]:
# We create our dataframe with the headers attached.
dfMain = zh.txt_to_column_df('data/46/ZTrans/Main.txt', 'column_helper/main.csv')
In [21]:
# We observe the first few rows of the dataframe to verify
dfMain.head(3)
Out[21]:
Now of course our data will likely only be useful when we observe it across the various characteristics of these datasets. While it is possible to do good data science while keeping dataframes separate and just iterating through them, for simplicity we will go over merging and concatenating these dataframe objects.
Merging can be imagined as combining horizontally, i.e. appending additional columns of information. For this you need a primary key, which for the ZTrans datasets happens to be the TransId column, found in every dataset .txt file.
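To make the idea concrete, here is a tiny toy example with made-up TransId values and columns (not real ZTrax data) showing how an outer merge lines rows up on the shared key:

# Toy dataframes sharing a TransId key (values invented for illustration).
left = pd.DataFrame({'TransId': [1, 2, 3], 'SalePrice': [200000, 355000, 410000]})
right = pd.DataFrame({'TransId': [2, 3, 4], 'Zip': ['05401', '05602', '05753']})

# An outer merge keeps every TransId from both sides, filling the gaps with NaN.
print(left.merge(right, on='TransId', how='outer'))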
In [22]:
# We create a second dataframe to combine with our previous dataframe.
dfPropInfo = zh.txt_to_column_df('data/46/ZTrans/PropertyInfo.txt', 'column_helper/propertyinfo.csv')
In [23]:
# Again, we observe the first few rows to verify
dfPropInfo.head(3)
Out[23]:
After this merge, we should have a new table with 198 columns (TransId carries across both datasets!).
In [24]:
# It will be important to recognize the primary key ids in both tables.
dfMerge = dfMain.merge(dfPropInfo, left_on='TransId', right_on='TransId', how='outer')
In [25]:
# Again, we print the first few lines to verify
dfMerge.head(3)
Out[25]:
The merged dataframe has 198 columns, so we know the merge worked as expected!
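If you would rather confirm the shape programmatically than eyeball the output above, a quick check does the trick:

# Sanity check: the second element of shape is the column count (should be 198).
print(dfMerge.shape)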
In [26]:
# We create a new "Main" dataframe with the headers attached from another data sub-set.
dfMain2 = zh.txt_to_column_df('data/50/ZTrans/Main.txt', 'column_helper/main.csv')
In [36]:
# Let's see the shapes of the two dataframes to figure out how many rows the concatenated result should have
print(dfMain.shape)
print(dfMain2.shape)
print("Sum of rows should be " + str(dfMain.shape[0]+dfMain2.shape[0]))
In [37]:
# Let's concatenate and count the rows!
# ignore_index=True gives the stacked dataframe a clean 0..n-1 index.
dfMainConcat = pd.concat([dfMain, dfMain2], ignore_index=True)
print(dfMainConcat.shape[0])
The concatenation above seems to have worked. Great!
By applying these functions iteratively, concatenating dataframes that have already been merged, or merging dataframes that have already been concatenated, we can build genuinely useful datasets for some basic local analysis, as sketched below.
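For example, a short loop over the sub-set directories could build the merged Main + PropertyInfo frame for each and then stack the results. This is only a sketch following the paths used earlier in this notebook, not a finished pipeline; adjust the directory list and file names to match your own data folder:

# Sketch: build a merged Main + PropertyInfo dataframe for each sub-set
# directory, then concatenate them into a single dataframe.
subsets = ['data/46/ZTrans/', 'data/50/ZTrans/']
merged_frames = []
for subset in subsets:
    dfM = zh.txt_to_column_df(subset + 'Main.txt', 'column_helper/main.csv')
    dfP = zh.txt_to_column_df(subset + 'PropertyInfo.txt', 'column_helper/propertyinfo.csv')
    merged_frames.append(dfM.merge(dfP, on='TransId', how='outer'))
dfAll = pd.concat(merged_frames, ignore_index=True)
print(dfAll.shape)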
While we've been working with dataframes in this tutorial, it is important to note that dataframes are in-memory objects that only exist for the lifetime of the Python process (i.e., as long as this notebook is running). If you'd rather not repeat these steps every single time to re-create a dataframe, it is very useful to serialize it into a binary file that can be de-serialized back into the same object later. In Python, this is referred to as "pickling".
Luckily, various tools from the pandas library make this incredibly easy to use and remember!
In [38]:
# Lets pickle our merged data-frame and give it a useful/descriptive name!
dfMerge.to_pickle('main&propinfo46.pickle')
Keep in mind, .pickle is not an official file extension; any uncommon extension will work. .pickle is just convenient because it has already been added to this repository's .gitignore for the purposes of working with these datasets!
We can save the valuable time spent building dataframes by simply re-opening previously saved pickles:
In [39]:
dfPickle = pd.read_pickle('main&propinfo46.pickle')
In [40]:
dfPickle.head(3)
Out[40]:
As you can probably tell, the pandas library is very powerful! While a lot of work has gone into this tutorial, please feel free to expand even further by checking out more of its modules at http://pandas.pydata.org/