In this three-part series I'm going to look at how we can use data science to measure the financial performance of real-estate (property) assets. I'll be introducing concepts from high finance (alpha $\alpha$ in part 2. and beta $\beta$ in part 3.), but adapting them for everyday use. In part 1. I clean and prepare the Land Registry transaction data and explore the distribution of returns (pre-tax profits) we can expect from the UK housing market.
I download the 'Prices Paid' data as a .CSV file from http://landregistry.data.gov.uk/. The full dataset is available there, but it's 3+ GB, so for convenience I limit my sample to properties bought and sold in 2016 (a sketch of how you might pre-filter the full file yourself follows the column list below). Below is the Land Registry's description of what is included; each element on the list corresponds to a column in the data file.
Transaction unique identifier - A reference number which is generated automatically recording each published sale. The number is unique and will change each time a sale is recorded.
Price - Sale price stated on the transfer deed.
Date of Transfer - Date when the sale was completed, as stated on the transfer deed.
Postcode - This is the postcode used at the time of the original transaction. Note that postcodes can be reallocated and these changes are not reflected in the Price Paid Dataset.
Property Type - D = Detached, S = Semi-Detached, T = Terraced, F = Flat/Maisonette, O = Other. Note that they do not separately identify bungalows. End-of-terrace properties are included in the Terraced category. ‘Other’ is only valid where the transaction relates to a property type that is not covered by existing values.
Old/New - Indicates the age of the property and applies to all price paid transactions, residential and non-residential.
Duration - Relates to the tenure: freehold (F) or leasehold (L).
PAON - Primary Addressable Object Name (typically the house number or name). If there is a sub-building, for example the building is divided into flats, see the Secondary Addressable Object Name (SAON).
SAON - Secondary Addressable Object Name. If there is a sub-building, for example the building is divided into flats, there will be a SAON.
Street
Locality
Town/City
District
County
PPD Category Type - Indicates the type of Price Paid transaction. Note that category B does not separately identify the transaction types stated. Land Registry has been collecting information on Category A transactions from January 1995. Category B transactions were identified from October 2013.
Record Status - Monthly file only. Indicates additions, changes and deletions to the records.(see guide below).
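The file I load below ('pp-2016-part1.csv') has already been cut down to 2016. For anyone starting from the full multi-year download instead, here is a minimal sketch of how it could be pre-filtered; the file name 'pp-complete.csv', the chunk size, and the assumption that the raw file has no header row are my own choices rather than anything prescribed by the Land Registry.
# rough sketch only -- not part of the notebook cells below
import pandas as pd
chunks = []
for chunk in pd.read_csv('pp-complete.csv', header=None, chunksize=500000):
    dates = pd.to_datetime(chunk[2])            # column 2 is 'Date of Transfer'
    chunks.append(chunk[dates.dt.year == 2016]) # keep 2016 transactions only
pd.concat(chunks).to_csv('pp-2016-part1.csv', index=False, header=False)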
In [1]:
import pandas as pd # pandas dataframe lib
import numpy as np # numpy for masks and timedelta arithmetic later on
import datetime as dt # datetime lib for handling dates
import matplotlib.pyplot as plt # matplotlib for plots and graphs
import seaborn as sns # seaborn for nicer looking plots
sns.set()
df = pd.read_csv('pp-2016-part1.csv', header=None) # read the .csv file (the raw data has no header row)
cols = ['transaction_id','sale_price','sale_date','postcode','type','new_build','lease','PAON','SAON','street',
        'locality','town','district','county','trans_type','record_type']
df.columns = cols # rename the columns to be more readable
There are quite a few columns that we are not going to need for this analysis, so we can use .drop to get rid of them straight away.
In [2]:
# drop unused cols
df = df.drop(['town','county','trans_type','record_type','locality',
'new_build','lease','district','street','transaction_id'], axis=1)
df.shape # number of rows and cols
Out[2]:
Each of the ~400,000 entries relates to a specific transaction, and not a specific property. So I create a way to identify each unique property based on the house number (['PAON'] and ['SAON']) and ['postcode']. I then filter the transactions to keep those rows with duplicate identities so that all the properties remaining must have been bought and sold at least once within the period under study.
In [3]:
identity = ['postcode','PAON','SAON'] # unique identifier information
df = df[df[identity].duplicated(keep=False)] # only keep transactions with duplicate identities
df.shape # down to ~5000 properties
Out[3]:
I noticed that some transactions are duplicated despite having different ['transaction_id'] values! So I filter these out using the property ID and the sale date.
In [4]:
# filter duplicate transactions based on the unique identifier information and ['sale_date']
df = df[~df[['postcode','PAON','SAON','sale_date']].duplicated(keep=False)]
df.shape # down to 3616 properties
Out[4]:
Finally, I give each property a unique ID and sort transactions by date.
In [5]:
# add unique property id
df2 = df[identity].drop_duplicates() # get unique values
df2 = df2.reset_index(drop = True).reset_index() # reset index to create a col named index
df2=df2.rename(columns = {'index':'property_id'}) # rename index
df = pd.merge(df,df2,on = identity,how = 'left') # append ID col with merge
cols = ['property_id','PAON','SAON','postcode','type','sale_price','sale_date'] # re-arrange cols
df = df[cols]
df['sale_date'] = pd.DatetimeIndex(df['sale_date']).normalize()
df = df.sort_values(['property_id','sale_date'], ascending=True) # sort cols on new ID + end date
df = df.reset_index(drop = True) # reset index
In the interest of privacy I drop all remaining identifying information.
In [6]:
df = df.drop(['PAON','SAON','postcode'], axis=1)
We are left with a simple and clean dataframe to start doing the calculations.
In [7]:
df.head()
Out[7]:
In [8]:
df_unique = df.groupby(df['property_id'])
df_unique # grouped transactions for each unique property
Out[8]:
I first find the purchase date and purchase price of each property. Naturally, the purchase date/price is just the sale date/price of the previous transaction.
In [9]:
df['purchase_date'] = df_unique['sale_date'].shift(1)
df['purchase_price'] = df_unique['sale_price'].shift(1)
The 'returns' are just the difference between the purchase price and the sale price. Given that the data is sorted chronologically, we can find this value using .diff on the sale prices.
In [10]:
df['returns'] = df_unique['sale_price'].diff()
I also calculate the time difference (delta $\Delta$) between purchase and sale, measured as a fraction of a year.
In [11]:
delta = df_unique['sale_date'].diff()
df['delta'] = delta / np.timedelta64(1, 'Y') # timedelta to % year
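This part of the series doesn't use ['delta'] any further, but as an illustration of why it is expressed in years: one could use it to annualise each property's gain. The geometric formula below is my own sketch, not a calculation from the analysis itself.
# sketch only: geometric annualised return per property (NaN where there is no prior purchase)
annualised = (df['sale_price'] / df['purchase_price']) ** (1.0 / df['delta']) - 1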
We can't measure returns on the very first sale of each property, so I delete the first transaction in each group (hat tip to unutbu @ stack exchange for fast mask method).
In [12]:
def mask_first(x):
    # mark every row in a group except the first (0 for the first row, 1 otherwise)
    result = np.ones_like(x)
    result[0] = 0
    return result
mask = df.groupby(['property_id'])['property_id'].transform(mask_first).astype(bool)
df = df.loc[mask]
Finally, I do some spring cleaning to sort out the columns and datatypes.
In [15]:
cols = ['property_id','type','purchase_date','sale_date','delta','purchase_price','sale_price','returns'] # re-arrange cols
df = df[cols]
df[['property_id','purchase_price','sale_price','returns']] = df[['property_id','purchase_price','sale_price','returns']].astype('int64') # cast IDs, prices and returns to whole pounds
df = df.reset_index(drop = True) # reset index
df.head()
Out[15]:
In [20]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df['returns'].describe()
Out[20]:
Ouch, that's a loss of £44 million at the low end! On the plus side, someone made off with £23 million in profit last year.
The mean return is negative here, but I'm guessing the data is being skewed by these multi-million pound property deals.
To get a more accurate picture, I've plotted the distribution of returns with outliers removed.
In [49]:
def reject_outliers(data, m=0.1):
    # keep only values within m standard deviations of the mean
    return data[abs(data - np.mean(data)) < m * np.std(data)]
reject_outliers(df['returns']).hist(bins=40)
plt.ylabel('n')
plt.xlabel('returns')
plt.show()
losses = (df['returns']<0).sum()
N = df.shape[0]
print("Made losses in 2016 = {:.0%}".format((losses/N)))
Still, we can say that roughly a fifth of the people buying and selling properties in 2016 made a loss doing so. But house prices always go up, right?
In [50]:
df = df[df['type'] != 'O'] # there are too few samples of 'other' type properties to warrant inclusion
reject_outliers(df['returns']).hist(by=df['type'],bins=40,alpha=0.75)
plt.show()
grouped = df.groupby('type')
for index, group in grouped:
    losses = (group.returns < 0).sum()
    N = group.returns.shape[0]
    print("For type:",index,"- {:.0%} made losses in 2016".format((losses/N)))
Interestingly, Detached and Semi-Detached properties performed the worst. I was actually expecting flats to do worse.
Returns are not the only way of measuring performance. In the next part of this series I will look at a more sophisticated measure called alpha ($\alpha$), which shows the relative performance of an asset compared to a market benchmark.
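As a rough preview, and using my own notation ($R_{asset}$ for the property's return and $R_{bench}$ for the benchmark's return over the same period), the simplest form of alpha is just the excess return over the benchmark: $\alpha = R_{asset} - R_{bench}$. Part 2. will look at how to calculate this in practice.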