In this notebook, let us explore the data given for the Zillow Prize competition. Before we dive deep into the data, let us learn a little more about the competition itself.

Zillow:

Zillow is an online real estate database company founded in 2006 - Wikipedia

Zestimate:

“Zestimates” are estimated home values based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. Zillow has continually improved the median margin of error, from 14% at the onset to about 5% today.

Objective:

Building a model to improve the Zestimate residual error.
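
Per the competition description, the "logerror" target provided in the training data is logerror = log(Zestimate) - log(SalePrice), and submissions are scored on the mean absolute error of the predicted logerror.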

The competition is in two stages. This public competition will run until January 2018 and has $50,000 in prizes. Please make sure to read the Prize details and Competition overview, since the structure is quite different in this one.

Let us first import the necessary modules.


In [2]:
import numpy as np # linear algebra, the main (only real?) option for handling vectors/matrices
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting library
import seaborn as sns # plotting, specialized for statistical data (distributions, etc.) and colorful visualization
color = sns.color_palette()

# for displaying plots inside the notebook
%matplotlib inline

Let us list the files present in the input folder.


In [3]:
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))


properties_2016.csv
sample_submission.csv
train_2016_v2.csv
zillow_data_dictionary.xlsx

Explore training data


In [4]:
# read the data from a CSV file, and put it in a Pandas "dataframe" (~ a table, but with lots of fancy features)
train_df = pd.read_csv("../input/train_2016_v2.csv", parse_dates=["transactiondate"])
train_df.shape


Out[4]:
(90275, 3)

In [8]:
train_df.head(10)


Out[8]:
parcelid logerror transactiondate
0 11016594 0.0276 2016-01-01
1 14366692 -0.1684 2016-01-01
2 12098116 -0.0040 2016-01-01
3 12643413 0.0218 2016-01-02
4 14432541 -0.0050 2016-01-02
5 11509835 -0.2705 2016-01-02
6 12286022 0.0440 2016-01-02
7 17177301 0.1638 2016-01-02
8 14739064 -0.0030 2016-01-02
9 14677559 0.0843 2016-01-03

Logerror:

The target variable for this competition is the "logerror" field. So let us do some analysis on this field first.


In [9]:
plt.figure(figsize=(12,9))
# notice that we sort the y-values = the log error
plt.scatter(range(train_df.shape[0]), np.sort(train_df.logerror.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('logerror', fontsize=12)
plt.show()


Looks like most data points (=houses) have a log error between -1 and +1, with several outliers on both ends. Next common step: plot a histogram to see how many points are actually in each bucket.


In [10]:
plt.figure(figsize=(12,8))
sns.distplot(train_df.logerror.values, bins=50, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()


Outliers are making it hard to see the distribution for the majority of the data, so let us remove them (top and bottom percentiles; this is an arbitrary choice that should always be re-examined).


In [11]:
ulimit = np.percentile(train_df.logerror.values, 99)
llimit = np.percentile(train_df.logerror.values, 1)
train_df_without_outliers = train_df[(train_df['logerror']<ulimit) & (train_df['logerror']>llimit)].copy()
train_df_outliers = train_df[(train_df['logerror']>=ulimit) | (train_df['logerror']<=llimit)].copy()

plt.figure(figsize=(12,8))
sns.distplot(train_df_without_outliers.logerror.values, bins=100, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()



In [12]:
plt.figure(figsize=(12,8))
sns.distplot(train_df_outliers.logerror.values, bins=100, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()


Transaction Date:

Now let us explore the date field. Let us first check the number of transactions in each month.


In [13]:
train_df['transaction_month'] = train_df['transactiondate'].dt.month

count_event_per_month = train_df['transaction_month'].value_counts()
plt.figure(figsize=(12,6))
sns.barplot(count_event_per_month.index, count_event_per_month.values, alpha=0.8, color=color[3])
plt.xticks(rotation='vertical')
plt.xlabel('Month of transaction', fontsize=12)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.show()


As noted on the data page, the train data contains all the transactions before October 15, 2016, plus some of the transactions after October 15, 2016.

That is why the bars for the last three months are shorter.

Parcel Id:


In [14]:
# We want to count how many times each parcel appears in the train dataset
grouped = train_df.groupby('parcelid')
counted = grouped['parcelid'].agg(['count'])
counted['count'].value_counts()


Out[14]:
1    90026
2      123
3        1
Name: count, dtype: int64

So most parcel ids appear only once in the dataset.
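
As a quick sanity check (just a small sketch reusing the frames defined above), we can pull out the handful of parcels that sold more than once in 2016 and look at their transactions:


In [ ]:
# Parcels with more than one 2016 transaction (repeat sales)
repeat_parcels = counted[counted['count'] > 1].index
train_df[train_df['parcelid'].isin(repeat_parcels)].sort_values(['parcelid', 'transactiondate']).head(10)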

Look at the properties data:

Now let us explore the properties_2016 file.


In [15]:
prop_df = pd.read_csv("../input/properties_2016.csv")
prop_df.shape


/Users/adrienctx/anaconda/envs/py3env/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (22,32,34,49,55) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[15]:
(2985217, 58)

So, almost 3M properties are listed, with 58 features.
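
As an aside, the DtypeWarning above just means pandas found mixed types in a few columns while reading the file in chunks. If you want to silence it, one option (a sketch; not needed for the rest of this notebook) is to let pandas scan the whole file before inferring dtypes:


In [ ]:
# Re-read with low_memory=False so dtypes are inferred from the full column
# (uses more memory; alternatively, pass an explicit dtype= mapping for the offending columns)
prop_df = pd.read_csv("../input/properties_2016.csv", low_memory=False)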


In [16]:
prop_df.head()


Out[16]:
parcelid airconditioningtypeid architecturalstyletypeid basementsqft bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid calculatedbathnbr decktypeid ... numberofstories fireplaceflag structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear landtaxvaluedollarcnt taxamount taxdelinquencyflag taxdelinquencyyear censustractandblock
0 10754147 NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN ... NaN NaN NaN 9.0 2015.0 9.0 NaN NaN NaN NaN
1 10759547 NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN ... NaN NaN NaN 27516.0 2015.0 27516.0 NaN NaN NaN NaN
2 10843547 NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN ... NaN NaN 650756.0 1413387.0 2015.0 762631.0 20800.37 NaN NaN NaN
3 10859147 NaN NaN NaN 0.0 0.0 3.0 7.0 NaN NaN ... 1.0 NaN 571346.0 1156834.0 2015.0 585488.0 14557.57 NaN NaN NaN
4 10879947 NaN NaN NaN 0.0 0.0 4.0 NaN NaN NaN ... NaN NaN 193796.0 433491.0 2015.0 239695.0 5725.17 NaN NaN NaN

5 rows × 58 columns

There are many NaN values in the dataset, so let us first explore the missing values.


In [17]:
# convert each cell value to True/False (isnull?), then sum over the columns (=count the "True"'s for each column)
missing_df = prop_df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
# only keep the rows that have at least 1 missing value
missing_df = missing_df[missing_df['missing_count']>0]
missing_df = missing_df.sort_values(by='missing_count')
missing_df.tail(10)


Out[17]:
column_name missing_count
16 finishedsquarefeet6 2963216
9 decktypeid 2968121
6 buildingclasstypeid 2972588
13 finishedsquarefeet13 2977545
43 typeconstructiontypeid 2978470
2 architecturalstyletypeid 2979156
49 fireplaceflag 2980054
46 yardbuildingsqft26 2982570
3 basementsqft 2983589
41 storytypeid 2983593

In [18]:
# Create an index vector
ind = np.arange(missing_df.shape[0])
fig, ax = plt.subplots(figsize=(12,18))
# doing a horizontal bar plot, with value shown being the missing value count from previous cell
rects = ax.barh(ind, missing_df.missing_count.values, color='blue')
ax.set_yticks(ind)
ax.set_yticklabels(missing_df.column_name.values, rotation='horizontal')
ax.set_xlabel("Count of missing values")
ax.set_title("Number of missing values in each column")
plt.show()


Some features are missing nearly 100% of their values. We'll have to decide what to do with those: (1) remove them entirely, or (2) fill the missing values with the mean or something else.
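
As a rough sketch of what those two options might look like in code (the 99% threshold and the mean strategy are arbitrary choices on my part, not anything prescribed by the data):


In [ ]:
# Option 1: drop columns that are missing more than 99% of the time (arbitrary threshold)
almost_empty_cols = missing_df[missing_df['missing_count'] > 0.99 * prop_df.shape[0]]['column_name'].tolist()
prop_df_reduced = prop_df.drop(almost_empty_cols, axis=1)

# Option 2: fill the remaining numeric columns with their mean (the median is often more robust)
numeric_cols = prop_df_reduced.select_dtypes(include=[np.number]).columns
prop_df_filled = prop_df_reduced.copy()
prop_df_filled[numeric_cols] = prop_df_filled[numeric_cols].fillna(prop_df_filled[numeric_cols].mean())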

Bivariate analysis

Let us explore the latitude and longitude variable to begin with.


In [19]:
plt.figure(figsize=(12,12))
# The jointplot method is used for doing bivariate analysis (it has a lot of cool options!)
sns.jointplot(x=prop_df.latitude.values, y=prop_df.longitude.values, size=10)
plt.ylabel('Longitude', fontsize=12)
plt.xlabel('Latitude', fontsize=12)
plt.show()


<matplotlib.figure.Figure at 0x129373f98>

From the data page, we are provided with a full list of real estate properties in three California counties (Los Angeles, Orange and Ventura) for 2016.

We have about 90,275 rows in the train file but about 2,985,217 rows in the properties file. So let us merge the two files and then carry out our analysis. This will add all the property info to the transaction data (= train data).


In [20]:
train_df_without_outliers = pd.merge(train_df_without_outliers, prop_df, on='parcelid', how='left')
train_df_without_outliers.head()


Out[20]:
parcelid logerror transactiondate airconditioningtypeid architecturalstyletypeid basementsqft bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid ... numberofstories fireplaceflag structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear landtaxvaluedollarcnt taxamount taxdelinquencyflag taxdelinquencyyear censustractandblock
0 11016594 0.0276 2016-01-01 1.0 NaN NaN 2.0 3.0 NaN 4.0 ... NaN NaN 122754.0 360170.0 2015.0 237416.0 6735.88 NaN NaN 6.037107e+13
1 14366692 -0.1684 2016-01-01 NaN NaN NaN 3.5 4.0 NaN NaN ... NaN NaN 346458.0 585529.0 2015.0 239071.0 10153.02 NaN NaN NaN
2 12098116 -0.0040 2016-01-01 1.0 NaN NaN 3.0 2.0 NaN 4.0 ... NaN NaN 61994.0 119906.0 2015.0 57912.0 11484.48 NaN NaN 6.037464e+13
3 12643413 0.0218 2016-01-02 1.0 NaN NaN 2.0 2.0 NaN 4.0 ... NaN NaN 171518.0 244880.0 2015.0 73362.0 3048.74 NaN NaN 6.037296e+13
4 14432541 -0.0050 2016-01-02 NaN NaN NaN 2.5 4.0 NaN NaN ... 2.0 NaN 169574.0 434551.0 2015.0 264977.0 5488.96 NaN NaN 6.059042e+13

5 rows × 60 columns

Quickly check the data types

Now let us check the dtypes of the different variables.


In [21]:
pd.options.display.max_rows = 65 #for readability

# Get the Series of pairs (column names, its data type)
dtype_df = train_df_without_outliers.dtypes.reset_index()
# Rename the columns (reset_index leaves generic names)
dtype_df.columns = ["Feature name", "dtype"]
dtype_df


Out[21]:
Feature name dtype
0 parcelid int64
1 logerror float64
2 transactiondate datetime64[ns]
3 airconditioningtypeid float64
4 architecturalstyletypeid float64
5 basementsqft float64
6 bathroomcnt float64
7 bedroomcnt float64
8 buildingclasstypeid float64
9 buildingqualitytypeid float64
10 calculatedbathnbr float64
11 decktypeid float64
12 finishedfloor1squarefeet float64
13 calculatedfinishedsquarefeet float64
14 finishedsquarefeet12 float64
15 finishedsquarefeet13 float64
16 finishedsquarefeet15 float64
17 finishedsquarefeet50 float64
18 finishedsquarefeet6 float64
19 fips float64
20 fireplacecnt float64
21 fullbathcnt float64
22 garagecarcnt float64
23 garagetotalsqft float64
24 hashottuborspa object
25 heatingorsystemtypeid float64
26 latitude float64
27 longitude float64
28 lotsizesquarefeet float64
29 poolcnt float64
30 poolsizesum float64
31 pooltypeid10 float64
32 pooltypeid2 float64
33 pooltypeid7 float64
34 propertycountylandusecode object
35 propertylandusetypeid float64
36 propertyzoningdesc object
37 rawcensustractandblock float64
38 regionidcity float64
39 regionidcounty float64
40 regionidneighborhood float64
41 regionidzip float64
42 roomcnt float64
43 storytypeid float64
44 threequarterbathnbr float64
45 typeconstructiontypeid float64
46 unitcnt float64
47 yardbuildingsqft17 float64
48 yardbuildingsqft26 float64
49 yearbuilt float64
50 numberofstories float64
51 fireplaceflag object
52 structuretaxvaluedollarcnt float64
53 taxvaluedollarcnt float64
54 assessmentyear float64
55 landtaxvaluedollarcnt float64
56 taxamount float64
57 taxdelinquencyflag object
58 taxdelinquencyyear float64
59 censustractandblock float64

Almost all are float variables, with a few object (categorical) variables. Let us get the counts.


In [22]:
# Like before, we want to count the number of occurrences of certain values (here, the dtype values)
dtype_count = dtype_df.groupby('dtype').aggregate('count').reset_index()
dtype_count.columns = ['dtype','number of occurrences']
dtype_count.head()


Out[22]:
dtype number of occurrences
0 int64 1
1 float64 53
2 datetime64[ns] 1
3 object 5

Looking into missing values

Now let us check the number of Nulls in this new merged dataset.


In [24]:
# convert each cell value to True/False (isnull?), then sum over the columns (=count the "True"'s for each column)
missing_df = train_df_without_outliers.isnull().sum(axis=0).reset_index()
missing_df.columns = ['feature_name', 'missing_count']
# also get the ratio of null values for each feature (relative to the number of transactions in the original train set)
missing_df['missing_ratio'] = missing_df['missing_count'] / train_df.shape[0]
missing_df[missing_df['missing_ratio']>0.9]


Out[24]:
feature_name missing_count missing_ratio
4 architecturalstyletypeid 88206 0.977081
5 basementsqft 88424 0.979496
8 buildingclasstypeid 88453 0.979817
11 decktypeid 87818 0.972783
12 finishedfloor1squarefeet 81700 0.905012
15 finishedsquarefeet13 88435 0.979618
16 finishedsquarefeet15 85121 0.942908
17 finishedsquarefeet50 81700 0.905012
18 finishedsquarefeet6 88075 0.975630
24 hashottuborspa 86138 0.954173
30 poolsizesum 87506 0.969327
31 pooltypeid10 87320 0.967267
32 pooltypeid2 87283 0.966857
43 storytypeid 88424 0.979496
45 typeconstructiontypeid 88168 0.976660
47 yardbuildingsqft17 85855 0.951038
48 yardbuildingsqft26 88374 0.978942
51 fireplaceflag 88245 0.977513
57 taxdelinquencyflag 86765 0.961119
58 taxdelinquencyyear 86765 0.961119

Twenty columns are missing more than 90% of the time! Let's look into these missing values:

  • are they just accidents? ("We forgot to type in the square meters for the kitchen") -> replace by mean
  • are they hidden binary? (1.0 = there is a fireplace, Null = there is no fireplace) -> replace by 0
  • are they something else? (building type is either 4.0 or 7.0... Null is what??) -> :(

In [25]:
missing_df['nb_unique_values'] = missing_df.apply(lambda x: len(train_df_without_outliers[x['feature_name']].unique()), axis=1)
missing_df['dtype'] = missing_df.apply(lambda x: train_df_without_outliers.dtypes[x['feature_name']], axis=1)
missing_df = missing_df.sort_values('missing_ratio', ascending=False)
missing_df.head(20)


Out[25]:
feature_name missing_count missing_ratio nb_unique_values dtype
8 buildingclasstypeid 88453 0.979817 2 float64
15 finishedsquarefeet13 88435 0.979618 12 float64
43 storytypeid 88424 0.979496 2 float64
5 basementsqft 88424 0.979496 38 float64
48 yardbuildingsqft26 88374 0.978942 72 float64
51 fireplaceflag 88245 0.977513 2 object
4 architecturalstyletypeid 88206 0.977081 7 float64
45 typeconstructiontypeid 88168 0.976660 4 float64
18 finishedsquarefeet6 88075 0.975630 337 float64
11 decktypeid 87818 0.972783 2 float64
30 poolsizesum 87506 0.969327 273 float64
31 pooltypeid10 87320 0.967267 2 float64
32 pooltypeid2 87283 0.966857 2 float64
58 taxdelinquencyyear 86765 0.961119 12 float64
57 taxdelinquencyflag 86765 0.961119 2 object
24 hashottuborspa 86138 0.954173 2 object
47 yardbuildingsqft17 85855 0.951038 564 float64
16 finishedsquarefeet15 85121 0.942908 1831 float64
12 finishedfloor1squarefeet 81700 0.905012 1874 float64
17 finishedsquarefeet50 81700 0.905012 1886 float64

In [26]:
missing_df[missing_df['feature_name'] == 'fireplaceflag']


Out[26]:
feature_name missing_count missing_ratio nb_unique_values dtype
51 fireplaceflag 88245 0.977513 2 object

In [27]:
train_df_without_outliers.fireplaceflag.unique()


Out[27]:
array([nan, True], dtype=object)

Question: can we come up with an automated way of dealing with the missing data? Test your logic on the above variables.

For example, does it work on both poolsizesum and fireplaceflag? And buildingclasstypeid?
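
One possible (and certainly imperfect) automated heuristic, sketched below with a made-up helper: if a mostly-missing column has a single observed value, treat it as a hidden binary and encode presence/absence as 1/0; if it is numeric with many values, fall back to the median; otherwise leave it alone. Whether that is the right call for buildingclasstypeid is debatable, which is exactly the point of the exercise.


In [ ]:
def fill_missing(df, feature_names):
    """Naive automated imputation sketch: hidden binaries -> 0/1, numeric -> median."""
    df = df.copy()
    for col in feature_names:
        non_null_values = df[col].dropna().unique()
        if len(non_null_values) == 1:
            # Only one observed value: probably a flag where NaN means "absent"
            df[col] = df[col].notnull().astype(int)
        elif df[col].dtype != object:
            df[col] = df[col].fillna(df[col].median())
        # else: categorical with several values and many NaNs -> leave for manual inspection
    return df

filled = fill_missing(train_df_without_outliers, ['poolsizesum', 'fireplaceflag', 'buildingclasstypeid'])
filled[['poolsizesum', 'fireplaceflag', 'buildingclasstypeid']].describe()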

Correlation between variables:


In [28]:
# Let us impute the missing values with the median so that we can compute correlation coefficients #
median_values = train_df_without_outliers.median(axis=0)
train_df_null_to_median = train_df_without_outliers.copy()
train_df_null_to_median.fillna(median_values, inplace=True)
train_df_null_to_median


Out[28]:
parcelid logerror transactiondate airconditioningtypeid architecturalstyletypeid basementsqft bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid ... numberofstories fireplaceflag structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear landtaxvaluedollarcnt taxamount taxdelinquencyflag taxdelinquencyyear censustractandblock
0 11016594 0.0276 2016-01-01 1.0 7.0 608.0 2.0 3.0 4.0 4.0 ... 1.0 1 122754.0 360170.0 2015.0 237416.0 6735.88 NaN 14.0 6.037107e+13
1 14366692 -0.1684 2016-01-01 1.0 7.0 608.0 3.5 4.0 4.0 7.0 ... 1.0 1 346458.0 585529.0 2015.0 239071.0 10153.02 NaN 14.0 6.037620e+13
2 12098116 -0.0040 2016-01-01 1.0 7.0 608.0 3.0 2.0 4.0 4.0 ... 1.0 1 61994.0 119906.0 2015.0 57912.0 11484.48 NaN 14.0 6.037464e+13
3 12643413 0.0218 2016-01-02 1.0 7.0 608.0 2.0 2.0 4.0 4.0 ... 1.0 1 171518.0 244880.0 2015.0 73362.0 3048.74 NaN 14.0 6.037296e+13
4 14432541 -0.0050 2016-01-02 1.0 7.0 608.0 2.5 4.0 4.0 7.0 ... 2.0 1 169574.0 434551.0 2015.0 264977.0 5488.96 NaN 14.0 6.059042e+13
5 11509835 -0.2705 2016-01-02 1.0 7.0 608.0 4.0 4.0 4.0 1.0 ... 1.0 1 880650.0 2447951.0 2015.0 1567301.0 27126.57 NaN 14.0 6.037621e+13
6 12286022 0.0440 2016-01-02 1.0 7.0 608.0 1.0 2.0 4.0 7.0 ... 1.0 1 64549.0 111521.0 2015.0 46972.0 2304.97 NaN 14.0 6.037542e+13
7 17177301 0.1638 2016-01-02 1.0 7.0 608.0 2.5 3.0 4.0 7.0 ... 2.0 1 107000.0 306000.0 2015.0 199000.0 3745.50 NaN 14.0 6.111003e+13
8 14739064 -0.0030 2016-01-02 1.0 7.0 608.0 1.0 2.0 4.0 7.0 ... 1.0 1 66834.0 210064.0 2015.0 143230.0 2172.88 NaN 14.0 6.059042e+13
9 14677559 0.0843 2016-01-03 1.0 7.0 608.0 2.0 2.0 4.0 7.0 ... 1.0 1 109977.0 190960.0 2015.0 80983.0 1940.26 NaN 14.0 6.059063e+13
10 10854446 0.3825 2016-01-03 1.0 7.0 608.0 2.0 2.0 4.0 7.0 ... 1.0 1 55396.0 105954.0 2015.0 50558.0 1443.69 NaN 14.0 6.037141e+13
11 11672170 -0.0161 2016-01-03 1.0 7.0 608.0 4.0 5.0 4.0 1.0 ... 1.0 1 559040.0 1090127.0 2015.0 531087.0 13428.94 NaN 14.0 6.037263e+13
12 10949761 0.0344 2016-01-03 1.0 7.0 608.0 2.0 3.0 4.0 7.0 ... 1.0 1 29551.0 70119.0 2015.0 40568.0 832.35 NaN 14.0 6.037311e+13
13 14665244 0.0816 2016-01-03 1.0 7.0 608.0 2.0 2.0 4.0 7.0 ... 1.0 1 398747.0 601000.0 2015.0 202253.0 6876.70 NaN 14.0 6.059063e+13
14 14148540 -0.0450 2016-01-03 1.0 7.0 608.0 2.0 2.0 4.0 7.0 ... 1.0 1 78565.0 254817.0 2015.0 176252.0 3192.28 NaN 14.0 6.059002e+13
15 14057104 0.0020 2016-01-03 5.0 7.0 608.0 2.0 3.0 4.0 7.0 ... 1.0 1 89850.0 345023.0 2015.0 255173.0 4036.32 NaN 14.0 6.059099e+13
16 17296797 0.0440 2016-01-03 1.0 7.0 608.0 2.5 3.0 4.0 7.0 ... 2.0 1 168000.0 480000.0 2015.0 312000.0 6383.74 NaN 14.0 6.111007e+13
17 12524288 -0.0419 2016-01-03 1.0 7.0 608.0 1.0 1.0 4.0 7.0 ... 1.0 1 56233.0 70316.0 2015.0 14083.0 913.17 NaN 14.0 6.037575e+13
18 14091502 -0.0020 2016-01-03 1.0 7.0 608.0 2.0 3.0 4.0 7.0 ... 1.0 1 75119.0 253138.0 2015.0 178019.0 3329.14 NaN 14.0 6.059100e+13
19 10815909 0.0100 2016-01-03 1.0 7.0 608.0 3.0 3.0 4.0 4.0 ... 1.0 1 108985.0 416279.0 2015.0 307294.0 5194.00 NaN 14.0 6.037140e+13
20 12453853 -0.0387 2016-01-03 1.0 7.0 608.0 1.0 2.0 4.0 7.0 ... 1.0 1 122952.0 260249.0 2015.0 137297.0 3211.31 NaN 14.0 6.037570e+13
21 14253461 0.0218 2016-01-03 13.0 7.0 608.0 2.0 2.0 4.0 7.0 ... 1.0 1 109293.0 258960.0 2015.0 149667.0 3090.72 NaN 14.0 6.059075e+13
22 14001034 0.0149 2016-01-03 1.0 7.0 608.0 2.0 4.0 4.0 7.0 ... 1.0 1 92991.0 212077.0 2015.0 119086.0 3020.24 NaN 14.0 6.059110e+13
23 11458863 0.1151 2016-01-03 1.0 7.0 608.0 4.0 3.0 4.0 4.0 ... 1.0 1 319449.0 1194279.0 2015.0 874830.0 14747.98 NaN 14.0 6.037277e+13
24 13957980 -0.1696 2016-01-03 1.0 7.0 608.0 2.5 2.0 4.0 7.0 ... 1.0 1 157171.0 335790.0 2015.0 178619.0 4269.90 NaN 14.0 6.059089e+13
25 12752047 -0.0440 2016-01-03 1.0 7.0 608.0 4.0 3.0 4.0 7.0 ... 1.0 1 154186.0 239456.0 2015.0 85270.0 3249.37 NaN 14.0 6.037504e+13
26 11188425 -0.0747 2016-01-03 1.0 7.0 608.0 2.0 2.0 4.0 4.0 ... 1.0 1 80600.0 174000.0 2015.0 93400.0 3895.76 NaN 14.0 6.037910e+13
27 14646051 0.0080 2016-01-03 1.0 7.0 608.0 2.5 2.0 4.0 7.0 ... 2.0 True 145291.0 825000.0 2015.0 679709.0 9553.56 NaN 14.0 6.059042e+13
28 10805063 0.0010 2016-01-03 1.0 7.0 608.0 2.0 2.0 4.0 4.0 ... 1.0 1 135218.0 250245.0 2015.0 115027.0 3117.19 NaN 14.0 6.037140e+13
29 12524725 -0.0534 2016-01-03 1.0 7.0 608.0 3.0 2.0 4.0 4.0 ... 1.0 1 117504.0 215097.0 2015.0 97593.0 2648.96 NaN 14.0 6.037577e+13
30 11104966 -0.0398 2016-01-03 1.0 7.0 608.0 2.0 4.0 4.0 7.0 ... 1.0 1 176891.0 335093.0 2015.0 158202.0 4499.03 NaN 14.0 6.037920e+13
31 11624877 -0.0294 2016-01-03 1.0 7.0 608.0 2.0 2.0 4.0 1.0 ... 1.0 1 202000.0 673000.0 2015.0 471000.0 8048.18 NaN 14.0 6.037269e+13
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
88433 17182690 0.0030 2016-12-30 1.0 7.0 608.0 2.5 4.0 4.0 7.0 ... 2.0 1 138475.0 276950.0 2015.0 138475.0 3530.94 NaN 14.0 6.111004e+13
88434 17182881 -0.0111 2016-12-30 1.0 7.0 608.0 2.0 3.0 4.0 7.0 ... 1.0 1 90938.0 113965.0 2015.0 23027.0 1394.72 NaN 14.0 6.111005e+13
88435 11143993 -0.0294 2016-12-30 1.0 7.0 608.0 3.0 3.0 4.0 4.0 ... 1.0 1 294677.0 399596.0 2015.0 104919.0 5692.71 NaN 14.0 6.037920e+13
88436 10757551 0.0953 2016-12-30 1.0 7.0 608.0 3.0 3.0 4.0 4.0 ... 1.0 1 200424.0 324821.0 2015.0 124397.0 6427.03 NaN 14.0 6.037137e+13
88437 12558267 0.0139 2016-12-30 1.0 7.0 608.0 2.0 2.0 4.0 7.0 ... 1.0 1 315000.0 464000.0 2015.0 149000.0 6286.13 NaN 14.0 6.037576e+13
88438 11710734 0.2095 2016-12-30 1.0 7.0 608.0 2.0 2.0 4.0 4.0 ... 1.0 1 35465.0 69303.0 2015.0 33838.0 964.14 NaN 14.0 6.037234e+13
88439 14347592 0.0080 2016-12-30 13.0 7.0 608.0 2.5 4.0 4.0 7.0 ... 2.0 1 233044.0 615410.0 2015.0 382366.0 6379.72 NaN 14.0 6.059052e+13
88440 11341061 -0.1312 2016-12-30 1.0 7.0 608.0 1.0 1.0 4.0 7.0 ... 1.0 1 13234.0 20756.0 2015.0 7522.0 1295.30 NaN 14.0 6.037920e+13
88441 14027878 0.0421 2016-12-30 1.0 7.0 608.0 2.0 3.0 4.0 7.0 ... 1.0 1 99182.0 257206.0 2015.0 158024.0 3148.18 NaN 14.0 6.059100e+13
88442 12756771 0.0658 2016-12-30 1.0 7.0 608.0 1.0 3.0 4.0 7.0 ... 1.0 1 65728.0 307167.0 2015.0 241439.0 4038.70 NaN 14.0 6.037553e+13
88443 12758333 0.0030 2016-12-30 1.0 7.0 608.0 3.0 3.0 4.0 7.0 ... 1.0 1 134709.0 203256.0 2015.0 68547.0 2761.24 NaN 14.0 6.037553e+13
88444 11295458 -0.0294 2016-12-30 1.0 7.0 608.0 2.0 2.0 4.0 7.0 ... 1.0 1 40163.0 50203.0 2015.0 10040.0 1263.39 Y 15.0 6.037901e+13
88445 12757889 -0.0111 2016-12-30 1.0 7.0 608.0 2.0 3.0 4.0 7.0 ... 1.0 1 147672.0 344882.0 2015.0 197210.0 4493.66 NaN 14.0 6.037553e+13
88446 11308315 0.0070 2016-12-30 1.0 7.0 608.0 3.0 5.0 4.0 4.0 ... 1.0 1 248378.0 331525.0 2015.0 83147.0 6461.79 NaN 14.0 6.037901e+13
88447 14353215 0.0218 2016-12-30 1.0 7.0 608.0 2.5 3.0 4.0 7.0 ... 2.0 1 194714.0 713656.0 2015.0 518942.0 7380.64 NaN 14.0 6.059052e+13
88448 10753416 0.0526 2016-12-30 1.0 7.0 608.0 2.0 2.0 4.0 4.0 ... 1.0 1 221809.0 295743.0 2015.0 73934.0 4330.78 NaN 14.0 6.037800e+13
88449 11756902 0.0889 2016-12-30 1.0 7.0 608.0 3.0 3.0 4.0 1.0 ... 1.0 1 259644.0 504864.0 2015.0 245220.0 6286.52 NaN 14.0 6.037213e+13
88450 11994047 0.0178 2016-12-30 1.0 7.0 608.0 2.0 1.0 4.0 1.0 ... 1.0 1 140145.0 448790.0 2015.0 308645.0 5459.11 NaN 14.0 6.037192e+13
88451 11440746 0.0237 2016-12-30 1.0 7.0 608.0 2.0 2.0 4.0 4.0 ... 1.0 1 58100.0 220600.0 2015.0 162500.0 3596.07 NaN 14.0 6.037604e+13
88452 11703478 0.0431 2016-12-30 1.0 7.0 608.0 1.0 3.0 4.0 7.0 ... 1.0 1 17520.0 39934.0 2015.0 22414.0 627.91 NaN 14.0 6.037232e+13
88453 10965821 -0.2231 2016-12-30 1.0 7.0 608.0 2.0 3.0 4.0 7.0 ... 1.0 1 140984.0 297456.0 2015.0 156472.0 3738.38 NaN 14.0 6.037107e+13
88454 12067066 0.1398 2016-12-30 1.0 7.0 608.0 2.0 2.0 4.0 7.0 ... 1.0 1 117807.0 361072.0 2015.0 243265.0 4044.68 NaN 14.0 6.037302e+13
88455 11683263 0.0478 2016-12-30 1.0 7.0 608.0 3.0 3.0 4.0 10.0 ... 1.0 1 532796.0 1331992.0 2015.0 799196.0 18297.87 NaN 14.0 6.037801e+13
88456 12265636 -0.0030 2016-12-30 1.0 7.0 608.0 2.0 4.0 4.0 7.0 ... 1.0 1 19554.0 32581.0 2015.0 13027.0 6137.29 NaN 14.0 6.037543e+13
88457 17109680 0.0010 2016-12-30 1.0 7.0 608.0 2.0 4.0 4.0 7.0 ... 1.0 1 180034.0 300054.0 2015.0 120020.0 3172.68 NaN 14.0 6.111002e+13
88458 12268527 -0.0151 2016-12-30 1.0 7.0 608.0 3.0 3.0 4.0 4.0 ... 1.0 1 192373.0 315288.0 2015.0 122915.0 5853.81 NaN 14.0 6.037541e+13
88459 12920746 0.0344 2016-12-30 1.0 7.0 608.0 2.0 3.0 4.0 4.0 ... 1.0 1 141700.0 405000.0 2015.0 263300.0 5311.03 NaN 14.0 6.037404e+13
88460 10774160 -0.0356 2016-12-30 1.0 7.0 608.0 1.0 1.0 4.0 4.0 ... 1.0 1 43800.0 191000.0 2015.0 147200.0 2495.24 NaN 14.0 6.037132e+13
88461 12046695 0.0070 2016-12-30 1.0 7.0 608.0 3.0 3.0 4.0 4.0 ... 1.0 1 117893.0 161111.0 2015.0 43218.0 1886.54 NaN 14.0 6.037301e+13
88462 12995401 -0.2679 2016-12-30 1.0 7.0 608.0 2.0 4.0 4.0 7.0 ... 1.0 1 22008.0 38096.0 2015.0 16088.0 1925.70 Y 14.0 6.037433e+13
88463 11402105 0.0602 2016-12-30 1.0 7.0 608.0 2.0 2.0 4.0 4.0 ... 1.0 1 132991.0 165869.0 2015.0 32878.0 2285.57 NaN 14.0 6.037601e+13
88464 12566293 0.4207 2016-12-30 1.0 7.0 608.0 1.0 3.0 4.0 7.0 ... 1.0 1 66258.0 163037.0 2015.0 96779.0 2560.96 NaN 14.0 6.037544e+13

88465 rows × 60 columns


In [29]:
pearson = train_df_null_to_median.corr(method='pearson')
pearson


Out[29]:
parcelid logerror airconditioningtypeid architecturalstyletypeid basementsqft bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid calculatedbathnbr ... yardbuildingsqft26 yearbuilt numberofstories structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear landtaxvaluedollarcnt taxamount taxdelinquencyyear censustractandblock
parcelid 1.000000 0.010306 0.075627 0.002342 0.007355 -0.013529 0.006506 NaN 0.271732 -0.004001 ... 0.021418 0.165355 0.306954 0.008655 0.004977 NaN 0.002429 -0.030407 0.010734 0.657123
logerror 0.010306 1.000000 0.005117 0.000083 0.004130 0.048349 0.037918 NaN 0.000864 0.047638 ... -0.001189 0.018686 0.020035 0.028913 0.022985 NaN 0.016100 0.005202 0.002756 0.014806
airconditioningtypeid 0.075627 0.005117 1.000000 -0.006796 -0.000651 0.032308 0.048640 NaN 0.083650 0.030833 ... -0.001966 0.035764 0.263692 0.000824 0.011736 NaN 0.015603 -0.005293 0.005000 0.073453
architecturalstyletypeid 0.002342 0.000083 -0.006796 1.000000 -0.000020 -0.001913 -0.000197 NaN 0.002942 -0.002092 ... -0.000059 0.001084 0.000222 -0.002584 -0.000685 NaN 0.000401 -0.001210 0.000131 0.001981
basementsqft 0.007355 0.004130 -0.000651 -0.000020 1.000000 0.012620 0.010197 NaN 0.002359 0.012687 ... 0.022482 0.012176 0.018040 0.016152 0.007355 NaN 0.001699 0.006699 0.000122 0.012949
bathroomcnt -0.013529 0.048349 0.032308 -0.001913 0.012620 1.000000 0.643803 NaN -0.191037 0.975320 ... 0.014554 0.362175 0.085337 0.589363 0.479236 NaN 0.351501 0.464978 0.012423 0.018739
bedroomcnt 0.006506 0.037918 0.048640 -0.000197 0.010197 0.643803 1.000000 NaN -0.024728 0.599753 ... 0.006775 0.039132 0.097508 0.309309 0.245820 NaN 0.175001 0.243980 0.004057 0.036643
buildingclasstypeid NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
buildingqualitytypeid 0.271732 0.000864 0.083650 0.002942 0.002359 -0.191037 -0.024728 NaN 1.000000 -0.184092 ... 0.007129 -0.197237 0.174389 -0.051107 -0.006138 NaN 0.018900 -0.023971 0.001629 0.314131
calculatedbathnbr -0.004001 0.047638 0.030833 -0.002092 0.012687 0.975320 0.599753 NaN -0.184092 1.000000 ... 0.014716 0.370866 0.095658 0.609152 0.494461 NaN 0.364419 0.482777 0.012646 0.029236
decktypeid NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
finishedfloor1squarefeet 0.073875 -0.000147 -0.006585 -0.000199 0.018092 0.111041 0.111429 NaN 0.023872 0.115466 ... 0.105574 0.027194 -0.079529 0.118855 0.088529 NaN 0.060112 0.070022 0.000932 0.130579
calculatedfinishedsquarefeet 0.016971 0.060022 0.065548 -0.002841 0.008548 0.761497 0.623556 NaN -0.082207 0.792365 ... 0.022349 0.212809 0.140274 0.712990 0.597229 NaN 0.452218 0.584410 0.010023 0.059554
finishedsquarefeet12 0.030043 0.060014 0.072142 -0.002769 0.009066 0.730567 0.559068 NaN -0.100720 0.750860 ... 0.023621 0.257795 0.145132 0.730538 0.610269 NaN 0.461049 0.593448 0.009782 0.080473
finishedsquarefeet13 -0.009106 -0.003197 0.000847 0.000026 0.000024 0.014100 0.015881 NaN -0.003069 0.001767 ... 0.000072 -0.004563 0.001762 0.003014 0.002465 NaN 0.001923 0.003007 -0.000159 -0.016847
finishedsquarefeet15 -0.020371 0.016603 -0.007467 -0.000225 -0.000210 0.211337 0.280318 NaN 0.006634 0.221210 ... -0.000864 -0.031896 -0.015137 0.049510 0.038815 NaN 0.027686 0.046354 0.004081 -0.028086
finishedsquarefeet50 0.074084 -0.000593 -0.006630 -0.000200 0.002926 0.109571 0.108696 NaN 0.024034 0.113874 ... 0.126939 0.025404 -0.076683 0.117105 0.087742 NaN 0.059941 0.069463 0.000974 0.131487
finishedsquarefeet6 0.006273 -0.007285 0.013155 -0.000062 -0.000058 -0.031734 -0.049235 NaN 0.007491 0.016176 ... -0.000176 0.016783 0.073003 0.018480 0.020688 NaN 0.018805 0.019256 0.001362 0.006600
fips 0.664260 0.012589 0.073706 0.001986 0.012837 0.010796 0.026370 NaN 0.318031 0.028184 ... 0.038795 0.239064 0.454038 0.038831 0.014635 NaN 0.000244 -0.030361 0.012665 0.992589
fireplacecnt 0.148244 0.010012 0.057457 -0.000548 0.036677 0.128699 0.097753 NaN 0.065749 0.132213 ... 0.078650 0.078577 0.228591 0.124807 0.099932 NaN 0.072703 0.077095 0.002877 0.250243
fullbathcnt -0.062001 0.045152 0.008176 -0.003245 0.009646 0.957935 0.593412 NaN -0.225982 0.983718 ... 0.012673 0.323498 0.028312 0.589761 0.477130 NaN 0.350695 0.471292 0.010865 -0.042388
garagecarcnt -0.136684 0.007248 0.031093 -0.005935 0.005646 0.223673 0.198057 NaN -0.099614 0.212026 ... 0.042092 0.005171 -0.027278 0.228314 0.170434 NaN 0.116250 0.166340 -0.006559 -0.154781
garagetotalsqft -0.168511 0.008866 0.097101 -0.007124 0.008978 0.195709 0.274165 NaN -0.103655 0.195364 ... 0.030862 -0.123114 0.009429 0.207103 0.168999 NaN 0.125457 0.163496 -0.005970 -0.175891
heatingorsystemtypeid -0.043480 -0.011515 -0.041376 0.004774 -0.001670 -0.256787 -0.154644 NaN 0.172919 -0.274687 ... -0.005048 -0.269123 -0.032489 -0.160576 -0.113876 NaN -0.073760 -0.106763 -0.007710 -0.112841
latitude -0.239493 -0.012802 -0.141956 -0.003456 0.001591 0.024267 0.067632 NaN -0.286062 0.026190 ... 0.015159 -0.006765 -0.107666 -0.012526 -0.115390 NaN -0.152164 -0.079743 -0.011551 -0.071390
longitude -0.018027 0.008149 0.160944 0.003802 -0.007197 -0.018423 -0.005694 NaN 0.153444 -0.026493 ... -0.027885 0.067580 -0.067203 -0.060552 -0.023400 NaN -0.001283 -0.033856 0.008161 -0.323041
lotsizesquarefeet -0.083030 0.004241 -0.024682 -0.000674 -0.001072 0.002550 -0.094320 NaN -0.120155 0.005959 ... 0.007268 0.088472 -0.051386 0.008461 -0.020316 NaN -0.032717 -0.020008 -0.000471 -0.086731
poolcnt NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
poolsizesum 0.020859 -0.002435 -0.001892 -0.000057 -0.004621 0.035157 0.023097 NaN 0.006859 0.036538 ... 0.038712 0.009663 0.011266 0.043237 0.037688 NaN 0.029456 0.030696 0.001551 0.037655
pooltypeid10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
pooltypeid2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
pooltypeid7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
propertylandusetypeid 0.042919 -0.003440 -0.025456 0.002358 -0.001090 -0.056311 -0.286755 NaN -0.164675 -0.089658 ... -0.003506 0.361040 -0.017556 0.006020 -0.030273 NaN -0.044331 -0.044047 0.010742 0.051648
rawcensustractandblock 0.663642 0.012465 0.072366 0.001949 0.012873 0.011036 0.027026 NaN 0.316052 0.028466 ... 0.038903 0.240387 0.453407 0.038966 0.013931 NaN -0.000791 -0.030787 0.012534 0.992639
regionidcity -0.016629 0.002564 -0.020346 0.000259 0.000832 -0.038019 -0.022861 NaN 0.015869 -0.040235 ... -0.000981 -0.051527 -0.021970 -0.031519 -0.021051 NaN -0.012709 -0.026761 -0.004738 -0.030860
regionidcounty -0.513389 -0.014949 -0.232266 -0.006506 -0.002440 -0.023213 -0.004709 NaN -0.395769 -0.035456 ... -0.007375 -0.287956 -0.369262 -0.017827 -0.056316 NaN -0.068190 -0.009444 -0.022892 -0.601329
regionidneighborhood 0.032411 -0.003426 -0.012375 0.003036 0.003137 -0.020849 -0.040006 NaN 0.001001 -0.020515 ... -0.002514 -0.076973 -0.030962 0.010381 0.034589 NaN 0.042130 0.040476 -0.000501 -0.023396
regionidzip 0.045135 0.000047 0.014786 0.000373 -0.000571 0.011423 0.015107 NaN 0.021891 0.013043 ... 0.001495 0.043221 0.029402 0.000270 0.005614 NaN 0.007647 0.000446 0.004578 0.055464
roomcnt 0.483323 0.014990 0.306533 0.006619 0.014565 0.043741 0.166969 NaN 0.290073 0.033048 ... 0.029155 0.074103 0.608843 -0.029449 0.011953 NaN 0.031367 -0.032766 0.014851 0.702615
storytypeid NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
threequarterbathnbr 0.015537 0.009109 0.004868 -0.000139 -0.000129 0.094490 0.036256 NaN 0.016664 0.096893 ... -0.000391 0.039228 0.000279 0.193139 0.165791 NaN 0.128072 0.150136 0.000861 0.014652
typeconstructiontypeid 0.000721 -0.001653 -0.000203 -0.000006 -0.000006 0.001639 0.000739 NaN 0.000736 0.001660 ... -0.000017 0.000673 0.006537 0.006477 0.004340 NaN 0.002606 0.003557 0.000038 0.000645
unitcnt -0.029509 -0.001005 -0.015069 -0.000488 -0.000455 0.093369 0.142269 NaN 0.050205 0.117603 ... -0.000881 -0.104817 -0.011000 -0.003039 -0.000436 NaN 0.000906 0.006892 -0.000324 -0.044912
yardbuildingsqft17 0.069054 -0.007415 -0.006172 -0.000186 0.045089 0.044693 0.051440 NaN 0.022373 0.045485 ... 0.040605 0.016214 0.049529 0.056218 0.038684 NaN 0.023986 0.028279 0.000448 0.122845
yardbuildingsqft26 0.021418 -0.001189 -0.001966 -0.000059 0.022482 0.014554 0.006775 NaN 0.007129 0.014716 ... 1.000000 0.008645 0.009733 0.016063 0.008231 NaN 0.002946 0.005395 -0.006135 0.039133
yearbuilt 0.165355 0.018686 0.035764 0.001084 0.012176 0.362175 0.039132 NaN -0.197237 0.370866 ... 0.008645 1.000000 0.117645 0.303638 0.143627 NaN 0.039483 0.118438 0.017129 0.238722
numberofstories 0.306954 0.020035 0.263692 0.000222 0.018040 0.085337 0.097508 NaN 0.174389 0.095658 ... 0.009733 0.117645 1.000000 0.020515 0.036995 NaN 0.040079 0.007874 0.008188 0.456431
structuretaxvaluedollarcnt 0.008655 0.028913 0.000824 -0.002584 0.016152 0.589363 0.309309 NaN -0.051107 0.609152 ... 0.016063 0.303638 0.020515 1.000000 0.819461 NaN 0.609111 0.786681 0.013515 0.039201
taxvaluedollarcnt 0.004977 0.022985 0.011736 -0.000685 0.007355 0.479236 0.245820 NaN -0.006138 0.494461 ... 0.008231 0.143627 0.036995 0.819461 1.000000 NaN 0.953520 0.946936 0.013946 0.015158
assessmentyear NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
landtaxvaluedollarcnt 0.002429 0.016100 0.015603 0.000401 0.001699 0.351501 0.175001 NaN 0.018900 0.364419 ... 0.002946 0.039483 0.040079 0.609111 0.953520 NaN 1.000000 0.897819 0.012108 -0.000003
taxamount -0.030407 0.005202 -0.005293 -0.001210 0.006699 0.464978 0.243980 NaN -0.023971 0.482777 ... 0.005395 0.118438 0.007874 0.786681 0.946936 NaN 0.897819 1.000000 0.013306 -0.030038
taxdelinquencyyear 0.010734 0.002756 0.005000 0.000131 0.000122 0.012423 0.004057 NaN 0.001629 0.012646 ... -0.006135 0.017129 0.008188 0.013515 0.013946 NaN 0.012108 0.013306 1.000000 0.012483
censustractandblock 0.657123 0.014806 0.073453 0.001981 0.012949 0.018739 0.036643 NaN 0.314131 0.029236 ... 0.039133 0.238722 0.456431 0.039201 0.015158 NaN -0.000003 -0.030038 0.012483 1.000000

54 rows × 54 columns


In [30]:
# Generate a mask for the upper triangle
mask = np.zeros_like(pearson, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 12))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(pearson, mask=mask, cmap=cmap, vmax=.3,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1129aa2b0>

In [31]:
# Now let us look at the correlation coefficient of each of these variables #
x_cols = [col for col in train_df_null_to_median.columns if col not in ['logerror'] if train_df_null_to_median[col].dtype=='float64']

labels = []
values = []
for col in x_cols:
    labels.append(col)
    values.append(np.corrcoef(train_df_null_to_median[col].values, train_df_null_to_median.logerror.values)[0,1])
corr_df = pd.DataFrame({'col_labels':labels, 'corr_values':values})
corr_df = corr_df.sort_values(by='corr_values')


/Users/adrienctx/anaconda/envs/py3env/lib/python3.6/site-packages/numpy/lib/function_base.py:3162: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
/Users/adrienctx/anaconda/envs/py3env/lib/python3.6/site-packages/numpy/lib/function_base.py:3163: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[None, :]

In [32]:
ind = np.arange(len(labels))
width = 0.9
fig, ax = plt.subplots(figsize=(12,40))
rects = ax.barh(ind, np.array(corr_df.corr_values.values), color='r')
ax.set_yticks(ind)
ax.set_yticklabels(corr_df.col_labels.values, rotation='horizontal')
ax.set_xlabel("Correlation coefficient")
ax.set_title("Correlation coefficient of the variables")
#autolabel(rects)
plt.show()


The correlation of the target variable with the given set of variables is low overall.

There are a few variables at the top of this graph without any correlation values. They probably have only one unique value (zero variance) and hence no correlation value. Let us confirm this.


In [33]:
corr_zero_cols = ['assessmentyear', 'storytypeid', 'pooltypeid2', 'pooltypeid7', 'pooltypeid10', 'poolcnt', 'decktypeid', 'buildingclasstypeid']
for col in corr_zero_cols:
    print(col, len(train_df_null_to_median[col].unique()))


assessmentyear 1
storytypeid 1
pooltypeid2 1
pooltypeid7 1
pooltypeid10 1
poolcnt 1
decktypeid 1
buildingclasstypeid 1

Let us take the variables with high correlation values and then do some analysis on them.


In [34]:
corr_df_sel = corr_df[(corr_df['corr_values']>0.02) | (corr_df['corr_values'] < -0.01)]
corr_df_sel


Out[34]:
col_labels corr_values
33 regionidcounty -0.014949
22 latitude -0.012802
21 heatingorsystemtypeid -0.011515
44 numberofstories 0.020035
46 taxvaluedollarcnt 0.022985
45 structuretaxvaluedollarcnt 0.028913
4 bedroomcnt 0.037918
18 fullbathcnt 0.045152
7 calculatedbathnbr 0.047638
3 bathroomcnt 0.048349
11 finishedsquarefeet12 0.060014
10 calculatedfinishedsquarefeet 0.060022

In [35]:
cols_to_use = corr_df_sel.col_labels.tolist()

temp_df = train_df_null_to_median[cols_to_use]
corrmat = temp_df.corr(method='spearman')
f, ax = plt.subplots(figsize=(8, 8))

# Draw the heatmap using seaborn
sns.heatmap(corrmat, vmax=1., square=True)
plt.title("Important variables correlation map", fontsize=15)
plt.show()


Checking relationship between important variables and prediction target

The important variables are themselves highly correlated! Let us now look at each of them.

Note: for better visibility, we'll keep using train_df_null_to_median, i.e. the dataset where we removed outliers with respect to the target (logerror) and imputed missing values with the median.

Finished SquareFeet 12:

Let us see how finishedsquarefeet12 varies with the log error.


In [37]:
col = "finishedsquarefeet12"

plt.figure(figsize=(12,12))
sns.jointplot(x=train_df_null_to_median.finishedsquarefeet12.values, y=train_df_null_to_median.logerror.values, size=10, color=color[4])
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Finished Square Feet 12', fontsize=12)
plt.title("Finished square feet 12 Vs Log error", fontsize=15)
plt.show()


<matplotlib.figure.Figure at 0x112ef7d68>

It seems the range of logerror narrows as finishedsquarefeet12 increases. Perhaps larger houses are easier to predict?
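
Part of that narrowing is simply that there are far fewer large houses. A quick (rough) check, sketched below, is to bin the square footage into deciles and compare the mean absolute logerror per bin:


In [ ]:
# Does the absolute logerror really shrink for larger homes, or are there just fewer points there?
abs_err = train_df_null_to_median['logerror'].abs()
# duplicates='drop' handles the spike of identical median-imputed values
sqft_bins = pd.qcut(train_df_null_to_median['finishedsquarefeet12'], 10, duplicates='drop')
abs_err.groupby(sqft_bins).agg(['mean', 'count'])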

Calculated finished square feet:


In [38]:
col = "calculatedfinishedsquarefeet"

plt.figure(figsize=(12,12))
sns.jointplot(x=train_df_null_to_median.calculatedfinishedsquarefeet.values, y=train_df_null_to_median.logerror.values, size=10, color=color[5])
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Calculated finished square feet', fontsize=12)
plt.title("Calculated finished square feet Vs Log error", fontsize=15)
plt.show()


<matplotlib.figure.Figure at 0x154c046a0>

Here as well, the distribution is very similar to the previous one. No wonder the correlation between the two variables is also high.

Bathroom Count:


In [39]:
plt.figure(figsize=(12,8))
sns.countplot(x="bathroomcnt", data=train_df_null_to_median)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Bathroom', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of Bathroom count", fontsize=15)
plt.show()



In [40]:
plt.figure(figsize=(12,8))
sns.boxplot(x="bathroomcnt", y="logerror", data=train_df_null_to_median)
plt.ylabel('Log error', fontsize=12)
plt.xlabel('Bathroom Count', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("How log error changes with bathroom count?", fontsize=15)
plt.show()


Bedroom count:


In [41]:
plt.figure(figsize=(12,8))
sns.countplot(x="bedroomcnt", data=train_df_null_to_median)
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Bedroom Count', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of Bedroom count", fontsize=15)
plt.show()



In [42]:
plt.figure(figsize=(12,8))
sns.violinplot(x='bedroomcnt', y='logerror', data=train_df_null_to_median)
plt.xlabel('Bedroom count', fontsize=12)
plt.ylabel('Log Error', fontsize=12)
plt.show()



In [43]:
col = "taxamount"

plt.figure(figsize=(12,12))
sns.jointplot(x=train_df_null_to_median['taxamount'].values, y=train_df_null_to_median['logerror'].values, size=10, color='g')
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Tax Amount', fontsize=12)
plt.title("Tax Amount Vs Log error", fontsize=15)
plt.show()


<matplotlib.figure.Figure at 0x112663be0>

YearBuilt:

Let us explore how the error varies with the yearbuilt variable.


In [44]:
from ggplot import *
ggplot(aes(x='yearbuilt', y='logerror'), data=train_df_null_to_median) + \
    geom_point(color='steelblue', size=1) + \
    stat_smooth()


/Users/adrienctx/anaconda/envs/py3env/lib/python3.6/site-packages/ggplot/utils.py:81: FutureWarning: pandas.tslib is deprecated and will be removed in a future version.
You can access Timestamp as pandas.Timestamp
  pd.tslib.Timestamp,
/Users/adrienctx/anaconda/envs/py3env/lib/python3.6/site-packages/ggplot/stats/smoothers.py:4: FutureWarning: The pandas.lib module is deprecated and will be removed in a future version. These are private functions and can be accessed from pandas._libs.lib instead
  from pandas.lib import Timestamp
/Users/adrienctx/anaconda/envs/py3env/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
Out[44]:
<ggplot: (-9223372036566973429)>

There is a minor upward trend in logerror with respect to the year built.
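
A quick numeric complement to the smoothed line above (just a sketch): average the logerror by decade of construction.


In [ ]:
# Mean logerror and counts by decade built, to quantify the mild trend in the smoothed plot
decade_built = (train_df_null_to_median['yearbuilt'] // 10 * 10).astype(int)
train_df_null_to_median.groupby(decade_built)['logerror'].agg(['mean', 'count'])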

Now let us see how the logerror varies with respect to latitude and longitude.


In [45]:
ggplot(aes(x='latitude', y='longitude', color='logerror'), data=train_df_null_to_median) + \
    geom_point() + \
    scale_color_gradient(low = 'red', high = 'blue')


Out[45]:
<ggplot: (288324584)>

There are no visible pockets as such with respect to latitude or longitude, at least to the naked eye.

Let us take the variables with highest positive correlation and highest negative correlation to see if we can see some visible patterns.


In [46]:
ggplot(aes(x='finishedsquarefeet12', y='taxamount', color='logerror'), data=train_df_null_to_median) + \
    geom_point(alpha=0.7) + \
    scale_color_gradient(low = 'pink', high = 'blue')


Out[46]:
<ggplot: (296947313)>

Finding important variables with a simple model

We got a sense of the important variables from the univariate analysis, but that looks at each variable on a standalone basis and assumes linearity. Now let us build a non-linear model, an Extra Trees regressor, to get the important variables.


In [47]:
train_y = train_df_null_to_median['logerror'].values
cat_cols = ["hashottuborspa", "propertycountylandusecode", "propertyzoningdesc", "fireplaceflag", "taxdelinquencyflag"]
train_df_blind = train_df_null_to_median.copy()
# train_df_blind = train_df_blind.drop(['parcelid', 'logerror', 'transactiondate', 'transaction_month']+cat_cols, axis=1)
train_df_blind = train_df_blind.drop(['parcelid', 'logerror', 'transactiondate']+cat_cols, axis=1)
feat_names = train_df_blind.columns.values

from sklearn import ensemble
model = ensemble.ExtraTreesRegressor(n_estimators=25, max_depth=30, max_features=0.3, n_jobs=-1, random_state=0)
model.fit(train_df_blind, train_y)

## plot the importances ##
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)
indices = np.argsort(importances)[::-1][:20]

plt.figure(figsize=(12,12))
plt.title("Feature importances")
plt.bar(range(len(indices)), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical')
plt.xlim([-1, len(indices)])
plt.show()


Seems "tax amount" is the most importanct variable followed by "structure tax value dollar count" and "land tax value dollor count"


In [ ]: