In this notebook, let us explore the data provided for the Zillow Prize competition. Before we dive into the data, here is a little background on the competition.
Zillow:
Zillow is an online real estate database company founded in 2006 (Wikipedia).
Zestimate:
“Zestimates” are estimated home values based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. By continually improving the median margin of error (from 14% at the onset to 5% today), Zillow has become one of the largest, most trusted marketplaces for real estate information in the U.S.
Objective:
Build a model that improves on the Zestimate's residual error; concretely, we predict the logerror between the Zestimate and the actual sale price (illustrated below).
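The competition defines the target as logerror = log(Zestimate) − log(SalePrice). A tiny illustration of the sign convention, with made-up numbers:
In [ ]:
import math
# Hypothetical Zestimate of 310,000 against an actual sale price of 300,000.
logerror = math.log(310000) - math.log(300000)
print(round(logerror, 4))  # positive -> the Zestimate overshot the sale price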
The competition is in two stages. The public stage runs until Jan 2018 and carries $50,000 in prizes. Please make sure to read the prize details and competition overview, since the setup is quite different from a typical competition.
Let us first import the necessary modules.
In [11]:
import numpy as np
import pandas as pd
import xgboost as xgb
import gc
import matplotlib.pyplot as plt # plotting library
import seaborn as sns # plotting, specialized for statistical data (distributions, etc.) and colorful visualization
color = sns.color_palette()
# for displaying plots inside the notebook
%matplotlib inline
In [12]:
print('Loading data ...')
train = pd.read_csv('../input/train_2016_v2.csv')
prop = pd.read_csv('../input/properties_2016.csv')
sample = pd.read_csv('../input/sample_submission.csv')
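As a quick sanity check on what was just loaded (a minimal sketch; exact row counts depend on the data release): the properties file covers every parcel, while the training file holds only the 2016 transactions.
In [ ]:
# Cheap shape check of the three frames loaded above.
print(train.shape, prop.shape, sample.shape)
print(train.head(2))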
In [13]:
print('Casting to float32')
# downcast float64 columns to float32 to halve their memory footprint
for c, dtype in zip(prop.columns, prop.dtypes):
    if dtype == np.float64:
        prop[c] = prop[c].astype(np.float32)
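To see what the downcast buys us, we can check the frame's memory footprint; a minimal sketch (exact numbers depend on the data):
In [ ]:
# Memory footprint after the downcast; deep=True also counts object columns.
mb = prop.memory_usage(deep=True).sum() / 1024 ** 2
print('properties frame: ~%.0f MB' % mb)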
In [14]:
print('Creating training set ...')
df_train = train.merge(prop, how='left', on='parcelid')
# drop the identifier, the target, the date, and two high-cardinality text columns
x_train = df_train.drop(['parcelid', 'logerror', 'transactiondate', 'propertyzoningdesc', 'propertycountylandusecode'], axis=1)
y_train = df_train['logerror'].values
print(x_train.shape, y_train.shape)
train_columns = x_train.columns
# convert the remaining object-typed flag columns to booleans
for c in x_train.dtypes[x_train.dtypes == object].index.values:
    x_train[c] = (x_train[c] == True)
del df_train; gc.collect()
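One subtlety in the cell above: (x_train[c] == True) keeps only values that are literally the Python boolean True, so string-coded flags such as taxdelinquencyflag (which uses 'Y' in this dataset) collapse to False. A sketch of an alternative that treats any non-missing entry as "flag set" (an assumption about how to read these columns, not the notebook's method):
In [ ]:
# Alternative flag handling (sketch): any non-null entry counts as True.
# Harmless to run here: after the cell above, no object columns remain.
for c in x_train.select_dtypes(include=['object']).columns:
    x_train[c] = x_train[c].notnull()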
In [15]:
# simple positional split: first 80,000 rows for training, the rest for validation
split = 80000
x_train, y_train, x_valid, y_valid = x_train[:split], y_train[:split], x_train[split:], y_train[split:]
print('Building DMatrix...')
d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)
del x_train, x_valid; gc.collect()
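The split above is purely positional. If a random validation sample is preferred, here is a sketch using scikit-learn (assuming sklearn is available; it would replace the positional split and must run before the del above, since x_train is freed once the DMatrix is built):
In [ ]:
# Sketch only: a random 90/10 split in place of the positional one above.
from sklearn.model_selection import train_test_split
x_tr, x_va, y_tr, y_va = train_test_split(x_train, y_train, test_size=0.1, random_state=42)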
In [16]:
print('Training ...')
params = {}
params['eta'] = 0.02                # learning rate
params['objective'] = 'reg:linear'  # squared-error regression ('reg:squarederror' in newer XGBoost)
params['eval_metric'] = 'mae'       # the competition is scored on mean absolute error
params['max_depth'] = 4             # shallow trees to limit overfitting
params['silent'] = 1                # suppress per-round logging ('verbosity' in newer XGBoost)
watchlist = [(d_train, 'train'), (d_valid, 'valid')]
clf = xgb.train(params, d_train, 10000, watchlist, early_stopping_rounds=100, verbose_eval=10)
del d_train, d_valid
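Since early stopping is enabled, the returned booster should carry the best validation round and score (these fields are set by xgb.train when early stopping triggers):
In [ ]:
# Best round found on the validation set.
print('best iteration:', clf.best_iteration)
print('best validation MAE:', clf.best_score)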
In [17]:
# plot the most important features
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(clf, height=0.8, ax=ax)
plt.show()
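The numbers behind the plot are also available directly from the booster; a minimal sketch listing the top ten features by split count:
In [ ]:
# Feature -> number of times it was used to split, as in the plot above.
scores = clf.get_score(importance_type='weight')
for name, count in sorted(scores.items(), key=lambda kv: -kv[1])[:10]:
    print(name, count)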
In [18]:
print('Building test set ...')
sample['parcelid'] = sample['ParcelId']  # align the join key name with the properties frame
df_test = sample.merge(prop, on='parcelid', how='left')
# del prop; gc.collect()
x_test = df_test.loc[:, train_columns]
# same object-to-boolean conversion as on the training side
for c in x_test.dtypes[x_test.dtypes == object].index.values:
    x_test.loc[:, c] = (x_test.loc[:, c] == True)
# del df_test, sample; gc.collect()
d_test = xgb.DMatrix(x_test)
del x_test; gc.collect()
In [19]:
print('Predicting on test ...')
p_test = clf.predict(d_test)
del d_test; gc.collect()
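Since logerror is centred near zero, the predictions should be too; a quick distributional sanity check before writing the file:
In [ ]:
# Summary statistics of the raw predictions.
print(pd.Series(p_test).describe())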
In [20]:
sub = pd.read_csv('../input/sample_submission.csv')  # fresh copy of the submission template
for c in sub.columns[sub.columns != 'ParcelId']:
    sub[c] = p_test  # write the same prediction into every month column
print('Writing csv ...')
sub.to_csv('../output/xgb_starter.csv', index=False, float_format='%.4f')
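Finally, a quick check that the written file has the expected layout: a ParcelId column plus one prediction column per evaluation month.
In [ ]:
# Re-read a few rows of the file just written (same path as above).
check = pd.read_csv('../output/xgb_starter.csv', nrows=3)
print(check.columns.tolist())
print(check.head())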