In this notebook, let us try to explore the data given for the Zillow Prize competition. Before we dive deep into the data, let us learn a little more about the competition.
Zillow:
Zillow is an online real estate database company founded in 2006 - Wikipedia
Zestimate:
“Zestimates” are estimated home values based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. The median margin of error has been continually improved, from 14% at the onset to 5% today.
Objective:
Building a model to improve the Zestimate residual error.
The competition is in two stages. This public stage runs until January 2018 and has $50,000 in prizes. Please make sure to read the Prize details and the Competition overview, since they are quite different for this one.
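For reference, the target "logerror" is defined on the competition page as the difference between the log of the Zestimate and the log of the actual sale price, and submissions are scored by the mean absolute error on this quantity. A minimal sketch of the metric (the prices below are made up; the train file only gives us the precomputed logerror):
In [ ]:
import numpy as np

def logerror(zestimate, sale_price):
    # logerror = log(Zestimate) - log(SalePrice), the quantity we are asked to predict
    return np.log(zestimate) - np.log(sale_price)

def score(predicted_logerror, actual_logerror):
    # submissions are scored by the mean absolute error between predicted and actual logerror
    return np.mean(np.abs(predicted_logerror - actual_logerror))

# toy example with made-up Zestimates and sale prices
actual = logerror(np.array([310000.0, 495000.0]), np.array([300000.0, 500000.0]))
score(np.zeros_like(actual), actual)  # score of a constant-zero baseline prediction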
Let us first import the necessary modules.
In [2]:
import numpy as np # linear algebra, the main (only real?) option for handling vectors/matrices
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting library
import seaborn as sns # plotting, specialized for statistical data (distributions, etc) and colorful visualization
color = sns.color_palette()
# for displaying plots inside the notebook
%matplotlib inline
Let us list the files present in the input folder.
In [3]:
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
In [4]:
# read the data from a CSV file, and put it in a Pandas "dataframe" (~ a table, but with lots of fancy features)
train_df = pd.read_csv("../input/train_2016_v2.csv", parse_dates=["transactiondate"])
train_df.shape
Out[4]:
In [8]:
train_df.head(10)
Out[8]:
Logerror:
The target variable for this competition is the "logerror" field, so let us do some analysis on this field first.
In [9]:
plt.figure(figsize=(12,9))
# notice that we sort the y-values = the log error
plt.scatter(range(train_df.shape[0]), np.sort(train_df.logerror.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('logerror', fontsize=12)
plt.show()
Looks like most data points (=houses) have a log error between -1 and +1, with several outliers on both ends. Next common step: plot a histogram to see how many points are actually in each bucket.
In [10]:
plt.figure(figsize=(12,8))
sns.distplot(train_df.logerror.values, bins=50, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()
Outliers make it hard to see the distribution for the majority of the data. We are going to remove them (below the 1st and above the 99th percentile; this is an arbitrary choice and should always be re-examined).
In [11]:
ulimit = np.percentile(train_df.logerror.values, 99)
llimit = np.percentile(train_df.logerror.values, 1)
train_df_without_outliers = train_df[(train_df['logerror']<ulimit) & (train_df['logerror']>llimit)].copy()
train_df_outliers = train_df[(train_df['logerror']>=ulimit) | (train_df['logerror']<=llimit)].copy()
plt.figure(figsize=(12,8))
sns.distplot(train_df_without_outliers.logerror.values, bins=100, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()
In [12]:
plt.figure(figsize=(12,8))
sns.distplot(train_df_outliers.logerror.values, bins=100, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()
Transaction Date:
Now let us explore the date field. Let us first check the number of transactions in each month.
In [13]:
train_df['transaction_month'] = train_df['transactiondate'].dt.month
count_event_per_month = train_df['transaction_month'].value_counts()
plt.figure(figsize=(12,6))
sns.barplot(x=count_event_per_month.index, y=count_event_per_month.values, alpha=0.8, color=color[3])
plt.xticks(rotation='vertical')
plt.xlabel('Month of transaction', fontsize=12)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.show()
As we can also see on the data page, the train data has all the transactions before October 15, 2016, plus some of the transactions after October 15, 2016.
That is why the bars for the last three months are shorter.
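A quick way to sanity-check that statement is to count how many transactions fall on either side of the October 15, 2016 cutoff (a small sketch using the cutoff date mentioned on the data page):
In [ ]:
# number of transactions before and after the October 15, 2016 cutoff
cutoff = pd.Timestamp('2016-10-15')
print((train_df['transactiondate'] <= cutoff).sum(), "transactions on or before the cutoff")
print((train_df['transactiondate'] > cutoff).sum(), "transactions after the cutoff")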
Parcel Id:
In [14]:
# We want to count how many times each parcel appears in the train dataset
grouped = train_df.groupby('parcelid')
counted = grouped['parcelid'].agg(['count'])
counted['count'].value_counts()
Out[14]:
So most of the parcel ids appear only once in the dataset.
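Out of curiosity, we can also peek at the parcels that do appear more than once, i.e. properties sold multiple times in 2016 (a small sketch reusing the counted dataframe from above):
In [ ]:
# parcel ids that appear more than once in the train data
repeated_ids = counted[counted['count'] > 1].index
train_df[train_df['parcelid'].isin(repeated_ids)].sort_values('parcelid').head(10)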
In [15]:
prop_df = pd.read_csv("../input/properties_2016.csv")
prop_df.shape
Out[15]:
So, almost 3M properties are listed, with 58 features.
In [16]:
prop_df.head()
Out[16]:
There are many NaN values in the dataset, so let us first do some exploration of those.
In [17]:
# convert each cell value to True/False (isnull?), then sum over the columns (=count the "True"'s for each column)
missing_df = prop_df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
# only keep the rows that have at least 1 missing value
missing_df = missing_df[missing_df['missing_count']>0]
missing_df = missing_df.sort_values(by='missing_count')
missing_df.tail(10)
Out[17]:
In [18]:
# Create an index vector
ind = np.arange(missing_df.shape[0])
fig, ax = plt.subplots(figsize=(12,18))
# doing a horizontal bar plot, with value shown being the missing value count from previous cell
rects = ax.barh(ind, missing_df.missing_count.values, color='blue')
ax.set_yticks(ind)
ax.set_yticklabels(missing_df.column_name.values, rotation='horizontal')
ax.set_xlabel("Count of missing values")
ax.set_title("Number of missing values in each column")
plt.show()
Some features have a lot of missing values (near 100%). We'll have to decide what to do with those: (1) remove them entirely, or (2) fill the missing values with the mean or something else.
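As an illustration of both options, here is a rough sketch (not what we do below; the 99% threshold is an arbitrary choice and the median is only one possible fill value):
In [ ]:
# Option 1: drop columns whose missing ratio exceeds some threshold (here 99%)
missing_ratio = prop_df.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio > 0.99].index
prop_df_reduced = prop_df.drop(cols_to_drop, axis=1)

# Option 2: fill the remaining numeric columns with their median
numeric_cols = prop_df_reduced.select_dtypes(include=[np.number]).columns
prop_df_filled = prop_df_reduced.copy()
prop_df_filled[numeric_cols] = prop_df_filled[numeric_cols].fillna(prop_df_filled[numeric_cols].median())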
In [19]:
plt.figure(figsize=(12,12))
# The jointplot method is used for doing bivariate analysis (it has a lot of cool options!)
sns.jointplot(x=prop_df.latitude.values, y=prop_df.longitude.values, size=10)
plt.ylabel('Longitude', fontsize=12)
plt.xlabel('Latitude', fontsize=12)
plt.show()
From the data page: we are provided with a full list of real estate properties in three counties (Los Angeles, Orange and Ventura, California) for 2016.
We have 90,811 rows in the train file but about 2,985,217 rows in the properties file. So let us merge the two files and then carry out our analysis. This will add all the property info to the transaction data (= train data).
In [20]:
train_df_without_outliers = pd.merge(train_df_without_outliers, prop_df, on='parcelid', how='left')
train_df_without_outliers.head()
Out[20]:
In [21]:
pd.options.display.max_rows = 65 #for readability
# Get the Series of pairs (column names, its data type)
dtype_df = train_df_without_outliers.dtypes.reset_index()
# Rename correctly (often it's messed up by the reset_index)
dtype_df.columns = ["Feature name", "dtype"]
dtype_df
Out[21]:
Almost all are float variables with few object (categorical) variables. Let us get the count.
In [22]:
# Like before, we want to count the number of occurrences of certain values (here, the dtype values)
dtype_count = dtype_df.groupby('dtype').aggregate('count').reset_index()
dtype_count.columns = ['dtype','number of occurrences']
dtype_count.head()
Out[22]:
In [24]:
# convert each cell value to True/False (isnull?), then sum over the columns (=count the "True"'s for each column)
missing_df = train_df_without_outliers.isnull().sum(axis=0).reset_index()
missing_df.columns = ['feature_name', 'missing_count']
# also get the ratio of null values for each feature (relative to the number of rows in the merged data)
missing_df['missing_ratio'] = missing_df['missing_count'] / train_df_without_outliers.shape[0]
missing_df[missing_df['missing_ratio']>0.9]
Out[24]:
Ten columns have missing values 96% of the time! Let's look into these missing values:
In [25]:
missing_df['nb_unique_values'] = missing_df.apply(lambda x: len(train_df_without_outliers[x['feature_name']].unique()), axis=1)
missing_df['dtype'] = missing_df.apply(lambda x: train_df_without_outliers.dtypes[x['feature_name']], axis=1)
missing_df = missing_df.sort_values('missing_ratio', ascending=False)
missing_df.head(20)
Out[25]:
In [26]:
missing_df[missing_df['feature_name'] == 'fireplaceflag']
Out[26]:
In [27]:
train_df_without_outliers.fireplaceflag.unique()
Out[27]:
In [28]:
# Let us just impute the missing values with median values to compute correlation coefficients #
median_values = train_df_without_outliers.median(axis=0, numeric_only=True)
train_df_null_to_median = train_df_without_outliers.copy()
train_df_null_to_median.fillna(median_values, inplace=True)
In [29]:
pearson = train_df_null_to_median.corr(method='pearson')
pearson
Out[29]:
In [30]:
# Generate a mask for the upper triangle
mask = np.zeros_like(pearson, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 12))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(pearson, mask=mask, cmap=cmap, vmax=.3,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
Out[30]:
In [31]:
# Now let us look at the correlation coefficient of each of these variables #
x_cols = [col for col in train_df_null_to_median.columns if col not in ['logerror'] if train_df_null_to_median[col].dtype=='float64']
labels = []
values = []
for col in x_cols:
    labels.append(col)
    values.append(np.corrcoef(train_df_null_to_median[col].values, train_df_null_to_median.logerror.values)[0,1])
corr_df = pd.DataFrame({'col_labels':labels, 'corr_values':values})
corr_df = corr_df.sort_values(by='corr_values')
In [32]:
ind = np.arange(len(labels))
width = 0.9
fig, ax = plt.subplots(figsize=(12,40))
rects = ax.barh(ind, np.array(corr_df.corr_values.values), color='r')
ax.set_yticks(ind)
ax.set_yticklabels(corr_df.col_labels.values, rotation='horizontal')
ax.set_xlabel("Correlation coefficient")
ax.set_title("Correlation coefficient of the variables")
#autolabel(rects)
plt.show()
The correlations of the target variable with the given set of variables are low overall.
There are a few variables at the top of this graph without any correlation values. I guess they have only one unique value (zero variance), so the correlation coefficient is undefined. Let us confirm the same.
In [33]:
corr_zero_cols = ['assessmentyear', 'storytypeid', 'pooltypeid2', 'pooltypeid7', 'pooltypeid10', 'poolcnt', 'decktypeid', 'buildingclasstypeid']
for col in corr_zero_cols:
    print(col, len(train_df_null_to_median[col].unique()))
Let us take the variables with high correlation values and then do some analysis on them.
In [34]:
corr_df_sel = corr_df[(corr_df['corr_values']>0.02) | (corr_df['corr_values'] < -0.01)]
corr_df_sel
Out[34]:
In [35]:
cols_to_use = corr_df_sel.col_labels.tolist()
temp_df = train_df_null_to_median[cols_to_use]
corrmat = temp_df.corr(method='spearman')
f, ax = plt.subplots(figsize=(8, 8))
# Draw the heatmap using seaborn
sns.heatmap(corrmat, vmax=1., square=True)
plt.title("Important variables correlation map", fontsize=15)
plt.show()
The important variables themselves are very highly correlated! Let us now look at each of them.
Note: for better visibility, we'll use the outlier-free data (train_df_null_to_median, which was built from train_df_without_outliers), i.e. the dataset where we removed outliers with respect to the target, logerror.
Finished SquareFeet 12:
Let us see how the finished square feet 12 variable varies with the log error.
In [37]:
col = "finishedsquarefeet12"
plt.figure(figsize=(12,12))
sns.jointplot(x=train_df_null_to_median.finishedsquarefeet12.values, y=train_df_null_to_median.logerror.values, size=10, color=color[4])
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Finished Square Feet 12', fontsize=12)
plt.title("Finished square feet 12 Vs Log error", fontsize=15)
plt.show()
It seems the range of logerror narrows as the finished square feet 12 variable increases. Perhaps larger houses are easier to predict?
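One way to check that hypothesis is to bin the houses by size and look at the spread of logerror within each bin; a shrinking standard deviation would support it (a rough sketch, with arbitrary bin edges):
In [ ]:
# standard deviation of logerror per size bucket
size_bins = pd.cut(train_df_null_to_median['finishedsquarefeet12'],
                   bins=[0, 1000, 2000, 3000, 5000, 10000, np.inf])
train_df_null_to_median.groupby(size_bins)['logerror'].agg(['count', 'std'])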
Calculated finished square feet:
In [38]:
col = "calculatedfinishedsquarefeet"
plt.figure(figsize=(12,12))
sns.jointplot(x=train_df_null_to_median.calculatedfinishedsquarefeet.values, y=train_df_null_to_median.logerror.values, size=10, color=color[5])
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Calculated finished square feet', fontsize=12)
plt.title("Calculated finished square feet Vs Log error", fontsize=15)
plt.show()
Here as well, the distribution is very similar to the previous one. No wonder the correlation between the two variables is also high.
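We can confirm that with a direct check of the correlation between the two square-footage variables (a one-line sketch):
In [ ]:
# correlation between the two square-footage variables themselves
train_df_null_to_median['finishedsquarefeet12'].corr(train_df_null_to_median['calculatedfinishedsquarefeet'])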
Bathroom Count:
In [39]:
plt.figure(figsize=(12,8))
sns.countplot(x="bathroomcnt", data=train_df_null_to_median)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Bathroom', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of Bathroom count", fontsize=15)
plt.show()
In [40]:
plt.figure(figsize=(12,8))
sns.boxplot(x="bathroomcnt", y="logerror", data=train_df_null_to_median)
plt.ylabel('Log error', fontsize=12)
plt.xlabel('Bathroom Count', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("How log error changes with bathroom count?", fontsize=15)
plt.show()
Bedroom count:
In [41]:
plt.figure(figsize=(12,8))
sns.countplot(x="bedroomcnt", data=train_df_null_to_median)
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Bedroom Count', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of Bedroom count", fontsize=15)
plt.show()
In [42]:
plt.figure(figsize=(12,8))
sns.violinplot(x='bedroomcnt', y='logerror', data=train_df_null_to_median)
plt.xlabel('Bedroom count', fontsize=12)
plt.ylabel('Log Error', fontsize=12)
plt.show()
In [43]:
col = "taxamount"
plt.figure(figsize=(12,12))
sns.jointplot(x=train_df_null_to_median['taxamount'].values, y=train_df_null_to_median['logerror'].values, size=10, color='g')
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Tax Amount', fontsize=12)
plt.title("Tax Amount Vs Log error", fontsize=15)
plt.show()
YearBuilt:
Let us explore how the error varies with the yearbuilt variable.
In [44]:
from ggplot import *
ggplot(aes(x='yearbuilt', y='logerror'), data=train_df_null_to_median) + \
geom_point(color='steelblue', size=1) + \
stat_smooth()
Out[44]:
There is a minor upward trend in logerror with respect to the year built.
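To look at that trend numerically, one can average the logerror per decade of construction (a sketch; binning by decade is an arbitrary choice, and it relies on the median fill above having removed the NaNs in yearbuilt):
In [ ]:
# mean logerror per decade of construction
decade = (train_df_null_to_median['yearbuilt'] // 10 * 10).astype(int)
train_df_null_to_median.groupby(decade)['logerror'].mean()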
Now let us see how the logerror varies with respect to latitude and longitude.
In [45]:
ggplot(aes(x='latitude', y='longitude', color='logerror'), data=train_df_null_to_median) + \
geom_point() + \
scale_color_gradient(low = 'red', high = 'blue')
Out[45]:
There are no visible pockets as such with respect to latitude or longitude, at least to the naked eye.
Let us take the variables with highest positive correlation and highest negative correlation to see if we can see some visible patterns.
In [46]:
ggplot(aes(x='finishedsquarefeet12', y='taxamount', color='logerror'), data=train_df_null_to_median) + \
geom_point(alpha=0.7) + \
scale_color_gradient(low = 'pink', high = 'blue')
Out[46]:
In [47]:
train_y = train_df_null_to_median['logerror'].values
cat_cols = ["hashottuborspa", "propertycountylandusecode", "propertyzoningdesc", "fireplaceflag", "taxdelinquencyflag"]
train_df_blind = train_df_null_to_median.copy()
# train_df_blind = train_df_blind.drop(['parcelid', 'logerror', 'transactiondate', 'transaction_month']+cat_cols, axis=1)
train_df_blind = train_df_blind.drop(['parcelid', 'logerror', 'transactiondate']+cat_cols, axis=1)
feat_names = train_df_blind.columns.values
from sklearn import ensemble
model = ensemble.ExtraTreesRegressor(n_estimators=25, max_depth=30, max_features=0.3, n_jobs=-1, random_state=0)
model.fit(train_df_blind, train_y)
## plot the importances ##
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)
indices = np.argsort(importances)[::-1][:20]
plt.figure(figsize=(12,12))
plt.title("Feature importances")
plt.bar(range(len(indices)), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical')
plt.xlim([-1, len(indices)])
plt.show()
It seems "tax amount" is the most important variable, followed by "structure tax value dollar count" and "land tax value dollar count".
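For a numeric view of the same result, we can put the importances into a sorted table (a small sketch using the arrays already computed above):
In [ ]:
# top 10 features by importance, with the standard deviation across the trees
imp_df = pd.DataFrame({'feature': feat_names, 'importance': importances, 'std': std})
imp_df.sort_values('importance', ascending=False).head(10)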
In [ ]: