In this notebook, let us try to explore the data given for the Zillow Prize competition. Before we dive deep into the data, let us learn a little more about the competition.
Zillow:
Zillow is an online real estate database company founded in 2006 - Wikipedia
Zestimate:
“Zestimates” are estimated home values based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. The median margin of error has been continually improved, from 14% at the onset to 5% today.
Objective:
Building a model to improve the Zestimate residual error.
The competition is in two stages. This public stage runs until January 2018 and has $50,000 in prizes. Please make sure to read the Prize details and the Competition overview, since they are quite different for this one.
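For reference, the target "logerror" is defined on the competition page as the difference between the log of the Zestimate and the log of the actual sale price, and submissions are scored by the mean absolute error on this quantity. A minimal sketch of the metric (the prices below are made up; the train file only gives us the precomputed logerror):
In [ ]:
import numpy as np

def logerror(zestimate, sale_price):
    # logerror = log(Zestimate) - log(SalePrice), the quantity we are asked to predict
    return np.log(zestimate) - np.log(sale_price)

def score(predicted_logerror, actual_logerror):
    # submissions are scored by the mean absolute error between predicted and actual logerror
    return np.mean(np.abs(predicted_logerror - actual_logerror))

# toy example with made-up Zestimates and sale prices
actual = logerror(np.array([310000.0, 495000.0]), np.array([300000.0, 500000.0]))
score(np.zeros_like(actual), actual)  # score of a constant-zero baseline prediction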
Let us first import the necessary modules.
In [2]:
import numpy as np # linear algebra, the main (only real?) option for handling vectors/matrices
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting library
import seaborn as sns # plotting, specialized for statistical data (distributions, etc) and colorful visualization
color = sns.color_palette()
# for displaying plots inside the notebook
%matplotlib inline
Let us list the files present in the input folder.
In [3]:
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
In [4]:
# read the data from a CSV file, and put it in a Pandas "dataframe" (~ a table, but with lots of fancy features)
train_df = pd.read_csv("../input/train_2016_v2.csv", parse_dates=["transactiondate"])
train_df.shape
Out[4]:
In [8]:
train_df.head(10)
Out[8]:
Logerror:
The target variable for this competition is the "logerror" field, so let us do some analysis on this field first.
In [9]:
plt.figure(figsize=(12,9))
# notice that we sort the y-values = the log error
plt.scatter(range(train_df.shape[0]), np.sort(train_df.logerror.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('logerror', fontsize=12)
plt.show()
Looks like most data points (=houses) have a log error between -1 and +1, with several outliers on both ends. Next common step: plot a histogram to see how many points are actually in each bucket.
In [10]:
plt.figure(figsize=(12,8))
sns.distplot(train_df.logerror.values, bins=50, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()
Outliers make it hard to see the distribution for the majority of the data. We are going to remove them (below the 1st and above the 99th percentile; this is an arbitrary choice and should always be re-examined).
In [11]:
ulimit = np.percentile(train_df.logerror.values, 99)
llimit = np.percentile(train_df.logerror.values, 1)
train_df_without_outliers = train_df[(train_df['logerror']<ulimit) & (train_df['logerror']>llimit)].copy()
train_df_outliers = train_df[(train_df['logerror']>=ulimit) | (train_df['logerror']<=llimit)].copy()
plt.figure(figsize=(12,8))
sns.distplot(train_df_without_outliers.logerror.values, bins=100, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()
In [12]:
plt.figure(figsize=(12,8))
sns.distplot(train_df_outliers.logerror.values, bins=100, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()
Transaction Date:
Now let us explore the date field. Let us first check the number of transactions in each month.
In [13]:
train_df['transaction_month'] = train_df['transactiondate'].dt.month
count_event_per_month = train_df['transaction_month'].value_counts()
plt.figure(figsize=(12,6))
sns.barplot(x=count_event_per_month.index, y=count_event_per_month.values, alpha=0.8, color=color[3])
plt.xticks(rotation='vertical')
plt.xlabel('Month of transaction', fontsize=12)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.show()
As we can also see on the data page, the train data has all the transactions before October 15, 2016, plus some of the transactions after October 15, 2016.
That is why the bars for the last three months are shorter.
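A quick way to sanity-check that statement is to count how many transactions fall on either side of the October 15, 2016 cutoff (a small sketch using the cutoff date mentioned on the data page):
In [ ]:
# number of transactions before and after the October 15, 2016 cutoff
cutoff = pd.Timestamp('2016-10-15')
print((train_df['transactiondate'] <= cutoff).sum(), "transactions on or before the cutoff")
print((train_df['transactiondate'] > cutoff).sum(), "transactions after the cutoff")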
Parcel Id:
In [14]:
# We want to count how many times each parcel appears in the train dataset
grouped = train_df.groupby('parcelid')
counted = grouped['parcelid'].agg(['count'])
counted['count'].value_counts()
Out[14]:
So most of the parcel ids appear only once in the dataset.
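Out of curiosity, we can also peek at the parcels that do appear more than once, i.e. properties sold multiple times in 2016 (a small sketch reusing the counted dataframe from above):
In [ ]:
# parcel ids that appear more than once in the train data
repeated_ids = counted[counted['count'] > 1].index
train_df[train_df['parcelid'].isin(repeated_ids)].sort_values('parcelid').head(10)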
In [15]:
prop_df = pd.read_csv("../input/properties_2016.csv")
prop_df.shape
Out[15]:
So, almost 3M properties are listed, with 58 features.
In [16]:
prop_df.head()
Out[16]:
There are many NaN values in the dataset, so let us first do some exploration of those.
In [17]:
# convert each cell value to True/False (isnull?), then sum over the columns (=count the "True"'s for each column)
missing_df = prop_df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
# only keep the rows that have at least 1 missing value
missing_df = missing_df[missing_df['missing_count']>0]
missing_df = missing_df.sort_values(by='missing_count')
missing_df.tail(10)
Out[17]:
In [18]:
# Create an index vector
ind = np.arange(missing_df.shape[0])
fig, ax = plt.subplots(figsize=(12,18))
# doing a horizontal bar plot, with value shown being the missing value count from previous cell
rects = ax.barh(ind, missing_df.missing_count.values, color='blue')
ax.set_yticks(ind)
ax.set_yticklabels(missing_df.column_name.values, rotation='horizontal')
ax.set_xlabel("Count of missing values")
ax.set_title("Number of missing values in each column")
plt.show()
Some features have a lot of missing values (near 100%). We'll have to decide what to do with those: (1) remove them entirely, or (2) fill the missing values with the mean or something else.
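As an illustration of both options, here is a rough sketch (not what we do below; the 99% threshold is an arbitrary choice and the median is only one possible fill value):
In [ ]:
# Option 1: drop columns whose missing ratio exceeds some threshold (here 99%)
missing_ratio = prop_df.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio > 0.99].index
prop_df_reduced = prop_df.drop(cols_to_drop, axis=1)

# Option 2: fill the remaining numeric columns with their median
numeric_cols = prop_df_reduced.select_dtypes(include=[np.number]).columns
prop_df_filled = prop_df_reduced.copy()
prop_df_filled[numeric_cols] = prop_df_filled[numeric_cols].fillna(prop_df_filled[numeric_cols].median())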
In [19]:
plt.figure(figsize=(12,12))
# The jointplot method is used for doing bivariate analysis (it has a lot of cool options!)
sns.jointplot(x=prop_df.latitude.values, y=prop_df.longitude.values, size=10)
plt.ylabel('Longitude', fontsize=12)
plt.xlabel('Latitude', fontsize=12)
plt.show()
From the data page: we are provided with a full list of real estate properties in three counties (Los Angeles, Orange and Ventura, California) for 2016.
We have 90,811 rows in the train file but about 2,985,217 rows in the properties file. So let us merge the two files and then carry out our analysis. This will add all the property info to the transaction data (= train data).
In [20]:
train_df_without_outliers = pd.merge(train_df_without_outliers, prop_df, on='parcelid', how='left')
train_df_without_outliers.head()
Out[20]:
In [21]:
pd.options.display.max_rows = 65 #for readability
# Get the Series of pairs (column names, its data type)
dtype_df = train_df_without_outliers.dtypes.reset_index()
# Rename correctly (often it's messed up by the reset_index)
dtype_df.columns = ["Feature name", "dtype"]
dtype_df
Out[21]:
Almost all are float variables with few object (categorical) variables. Let us get the count.
In [22]:
# Like before, we want to count the number of occurrences of certain values (here, the dtype values)
dtype_count = dtype_df.groupby('dtype').aggregate('count').reset_index()
dtype_count.columns = ['dtype','number of occurrences']
dtype_count.head()
Out[22]:
In [24]:
# convert each cell value to True/False (isnull?), then sum over the columns (=count the "True"'s for each column)
missing_df = train_df_without_outliers.isnull().sum(axis=0).reset_index()
missing_df.columns = ['feature_name', 'missing_count']
# also get the ratio of null values for each feature (relative to the number of rows in the merged data)
missing_df['missing_ratio'] = missing_df['missing_count'] / train_df_without_outliers.shape[0]
missing_df[missing_df['missing_ratio']>0.9]
Out[24]:
Ten columns have missing values 96% of the time! Let's look into these missing values:
In [25]:
missing_df['nb_unique_values'] = missing_df.apply(lambda x: len(train_df_without_outliers[x['feature_name']].unique()), axis=1)
missing_df['dtype'] = missing_df.apply(lambda x: train_df_without_outliers.dtypes[x['feature_name']], axis=1)
missing_df = missing_df.sort_values('missing_ratio', ascending=False)
missing_df.head(20)
Out[25]:
In [26]:
missing_df[missing_df['feature_name'] == 'fireplaceflag']
Out[26]:
In [27]:
train_df_without_outliers.fireplaceflag.unique()
Out[27]:
In [28]:
# Let us just impute the missing values with median values to compute correlation coefficients #
median_values = train_df_without_outliers.median(axis=0, numeric_only=True)
train_df_null_to_median = train_df_without_outliers.copy()
train_df_null_to_median.fillna(median_values, inplace=True)
In [29]:
pearson = train_df_null_to_median.corr(method='pearson')
pearson
Out[29]:
In [30]:
# Generate a mask for the upper triangle
mask = np.zeros_like(pearson, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 12))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(pearson, mask=mask, cmap=cmap, vmax=.3,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
Out[30]:
In [31]:
# Now let us look at the correlation coefficient of each of these variables #
x_cols = [col for col in train_df_null_to_median.columns if col not in ['logerror'] if train_df_null_to_median[col].dtype=='float64']
labels = []
values = []
for col in x_cols:
    labels.append(col)
    values.append(np.corrcoef(train_df_null_to_median[col].values, train_df_null_to_median.logerror.values)[0,1])
corr_df = pd.DataFrame({'col_labels':labels, 'corr_values':values})
corr_df = corr_df.sort_values(by='corr_values')
In [32]:
ind = np.arange(len(labels))
width = 0.9
fig, ax = plt.subplots(figsize=(12,40))
rects = ax.barh(ind, np.array(corr_df.corr_values.values), color='r')
ax.set_yticks(ind)
ax.set_yticklabels(corr_df.col_labels.values, rotation='horizontal')
ax.set_xlabel("Correlation coefficient")
ax.set_title("Correlation coefficient of the variables")
#autolabel(rects)
plt.show()
The correlations of the target variable with the given set of variables are low overall.
There are a few variables at the top of this graph without any correlation values. I guess they have only one unique value (zero variance), so the correlation coefficient is undefined. Let us confirm the same.
In [33]:
corr_zero_cols = ['assessmentyear', 'storytypeid', 'pooltypeid2', 'pooltypeid7', 'pooltypeid10', 'poolcnt', 'decktypeid', 'buildingclasstypeid']
for col in corr_zero_cols:
    print(col, len(train_df_null_to_median[col].unique()))
Let us take the variables with high correlation values and then do some analysis on them.
In [34]:
corr_df_sel = corr_df[(corr_df['corr_values']>0.02) | (corr_df['corr_values'] < -0.01)]
corr_df_sel
Out[34]:
In [35]:
cols_to_use = corr_df_sel.col_labels.tolist()
temp_df = train_df_null_to_median[cols_to_use]
corrmat = temp_df.corr(method='spearman')
f, ax = plt.subplots(figsize=(8, 8))
# Draw the heatmap using seaborn
sns.heatmap(corrmat, vmax=1., square=True)
plt.title("Important variables correlation map", fontsize=15)
plt.show()
The important variables themselves are very highly correlated! Let us now look at each of them.
Note: for better visibility, we'll use the outlier-free data (train_df_null_to_median, which was built from train_df_without_outliers), i.e. the dataset where we removed outliers with respect to the target, logerror.
Finished SquareFeet 12:
Let us see how the finished square feet 12 variable varies with the log error.
In [37]:
col = "finishedsquarefeet12"
plt.figure(figsize=(12,12))
sns.jointplot(x=train_df_null_to_median.finishedsquarefeet12.values, y=train_df_null_to_median.logerror.values, size=10, color=color[4])
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Finished Square Feet 12', fontsize=12)
plt.title("Finished square feet 12 Vs Log error", fontsize=15)
plt.show()
It seems the range of logerror narrows as the finished square feet 12 variable increases. Perhaps larger houses are easier to predict?
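One way to check that hypothesis is to bin the houses by size and look at the spread of logerror within each bin; a shrinking standard deviation would support it (a rough sketch, with arbitrary bin edges):
In [ ]:
# standard deviation of logerror per size bucket
size_bins = pd.cut(train_df_null_to_median['finishedsquarefeet12'],
                   bins=[0, 1000, 2000, 3000, 5000, 10000, np.inf])
train_df_null_to_median.groupby(size_bins)['logerror'].agg(['count', 'std'])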
Calculated finished square feet:
In [38]:
col = "calculatedfinishedsquarefeet"
plt.figure(figsize=(12,12))
sns.jointplot(x=train_df_null_to_median.calculatedfinishedsquarefeet.values, y=train_df_null_to_median.logerror.values, size=10, color=color[5])
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Calculated finished square feet', fontsize=12)
plt.title("Calculated finished square feet Vs Log error", fontsize=15)
plt.show()
Here as well, the distribution is very similar to the previous one. No wonder the correlation between the two variables is also high.
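We can confirm that with a direct check of the correlation between the two square-footage variables (a one-line sketch):
In [ ]:
# correlation between the two square-footage variables themselves
train_df_null_to_median['finishedsquarefeet12'].corr(train_df_null_to_median['calculatedfinishedsquarefeet'])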
Bathroom Count:
In [39]:
plt.figure(figsize=(12,8))
sns.countplot(x="bathroomcnt", data=train_df_null_to_median)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Bathroom', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of Bathroom count", fontsize=15)
plt.show()
In [40]:
plt.figure(figsize=(12,8))
sns.boxplot(x="bathroomcnt", y="logerror", data=train_df_null_to_median)
plt.ylabel('Log error', fontsize=12)
plt.xlabel('Bathroom Count', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("How log error changes with bathroom count?", fontsize=15)
plt.show()
Bedroom count:
In [41]:
plt.figure(figsize=(12,8))
sns.countplot(x="bedroomcnt", data=train_df_null_to_median)
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Bedroom Count', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of Bedroom count", fontsize=15)
plt.show()
In [42]:
plt.figure(figsize=(12,8))
sns.violinplot(x='bedroomcnt', y='logerror', data=train_df_null_to_median)
plt.xlabel('Bedroom count', fontsize=12)
plt.ylabel('Log Error', fontsize=12)
plt.show()
In [43]:
col = "taxamount"
plt.figure(figsize=(12,12))
sns.jointplot(x=train_df_null_to_median['taxamount'].values, y=train_df_null_to_median['logerror'].values, size=10, color='g')
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Tax Amount', fontsize=12)
plt.title("Tax Amount Vs Log error", fontsize=15)
plt.show()
YearBuilt:
Let us explore how the error varies with the yearbuilt variable.
In [44]:
from ggplot import *
ggplot(aes(x='yearbuilt', y='logerror'), data=train_df_null_to_median) + \
geom_point(color='steelblue', size=1) + \
stat_smooth()
Out[44]:
There is a minor upward trend in logerror with respect to the year built.
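To look at that trend numerically, one can average the logerror per decade of construction (a sketch; binning by decade is an arbitrary choice, and it relies on the median fill above having removed the NaNs in yearbuilt):
In [ ]:
# mean logerror per decade of construction
decade = (train_df_null_to_median['yearbuilt'] // 10 * 10).astype(int)
train_df_null_to_median.groupby(decade)['logerror'].mean()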
Now let us see how the logerror varies with respect to latitude and longitude.
In [45]:
ggplot(aes(x='latitude', y='longitude', color='logerror'), data=train_df_null_to_median) + \
geom_point() + \
scale_color_gradient(low = 'red', high = 'blue')
Out[45]:
There are no visible pockets as such with respect to latitude or longitude, at least to the naked eye.
Let us take the variables with highest positive correlation and highest negative correlation to see if we can see some visible patterns.
In [46]:
ggplot(aes(x='finishedsquarefeet12', y='taxamount', color='logerror'), data=train_df_null_to_median) + \
geom_point(alpha=0.7) + \
scale_color_gradient(low = 'pink', high = 'blue')
Out[46]:
In [47]:
train_y = train_df_null_to_median['logerror'].values
cat_cols = ["hashottuborspa", "propertycountylandusecode", "propertyzoningdesc", "fireplaceflag", "taxdelinquencyflag"]
train_df_blind = train_df_null_to_median.copy()
# train_df_blind = train_df_blind.drop(['parcelid', 'logerror', 'transactiondate', 'transaction_month']+cat_cols, axis=1)
train_df_blind = train_df_blind.drop(['parcelid', 'logerror', 'transactiondate']+cat_cols, axis=1)
feat_names = train_df_blind.columns.values
from sklearn import ensemble
model = ensemble.ExtraTreesRegressor(n_estimators=25, max_depth=30, max_features=0.3, n_jobs=-1, random_state=0)
model.fit(train_df_blind, train_y)
## plot the importances ##
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)
indices = np.argsort(importances)[::-1][:20]
plt.figure(figsize=(12,12))
plt.title("Feature importances")
plt.bar(range(len(indices)), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical')
plt.xlim([-1, len(indices)])
plt.show()
It seems "tax amount" is the most important variable, followed by "structure tax value dollar count" and "land tax value dollar count".
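For a numeric view of the same result, we can put the importances into a sorted table (a small sketch using the arrays already computed above):
In [ ]:
# top 10 features by importance, with the standard deviation across the trees
imp_df = pd.DataFrame({'feature': feat_names, 'importance': importances, 'std': std})
imp_df.sort_values('importance', ascending=False).head(10)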
In [ ]: