California house price prediction

Load the data and explore it



In [4]:

    
import pandas as pd
housing = pd.read_csv(r"E:\GIS_Data\file_formats\CSV\housing.csv")
housing.head()









    Out[4]:







  
    
      
      longitude
      latitude
      housing_median_age
      total_rooms
      total_bedrooms
      population
      households
      median_income
      median_house_value
      ocean_proximity
    
  
  
    
      0
      -122.23
      37.88
      41.0
      880.0
      129.0
      322.0
      126.0
      8.3252
      452600.0
      NEAR BAY
    
    
      1
      -122.22
      37.86
      21.0
      7099.0
      1106.0
      2401.0
      1138.0
      8.3014
      358500.0
      NEAR BAY
    
    
      2
      -122.24
      37.85
      52.0
      1467.0
      190.0
      496.0
      177.0
      7.2574
      352100.0
      NEAR BAY
    
    
      3
      -122.25
      37.85
      52.0
      1274.0
      235.0
      558.0
      219.0
      5.6431
      341300.0
      NEAR BAY
    
    
      4
      -122.25
      37.85
      52.0
      1627.0
      280.0
      565.0
      259.0
      3.8462
      342200.0
      NEAR BAY



In [5]:

    
housing.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB



In [6]:

    
# find unique values in ocean proximity column
housing.ocean_proximity.value_counts()









    Out[6]:





<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64



In [7]:

    
#describe all numerical rows - basic stats
housing.describe()









    Out[7]:







  
    
      
      longitude
      latitude
      housing_median_age
      total_rooms
      total_bedrooms
      population
      households
      median_income
      median_house_value
    
  
  
    
      count
      20640.000000
      20640.000000
      20640.000000
      20640.000000
      20433.000000
      20640.000000
      20640.000000
      20640.000000
      20640.000000
    
    
      mean
      -119.569704
      35.631861
      28.639486
      2635.763081
      537.870553
      1425.476744
      499.539680
      3.870671
      206855.816909
    
    
      std
      2.003532
      2.135952
      12.585558
      2181.615252
      421.385070
      1132.462122
      382.329753
      1.899822
      115395.615874
    
    
      min
      -124.350000
      32.540000
      1.000000
      2.000000
      1.000000
      3.000000
      1.000000
      0.499900
      14999.000000
    
    
      25%
      -121.800000
      33.930000
      18.000000
      1447.750000
      296.000000
      787.000000
      280.000000
      2.563400
      119600.000000
    
    
      50%
      -118.490000
      34.260000
      29.000000
      2127.000000
      435.000000
      1166.000000
      409.000000
      3.534800
      179700.000000
    
    
      75%
      -118.010000
      37.710000
      37.000000
      3148.000000
      647.000000
      1725.000000
      605.000000
      4.743250
      264725.000000
    
    
      max
      -114.310000
      41.950000
      52.000000
      39320.000000
      6445.000000
      35682.000000
      6082.000000
      15.000100
      500001.000000



In [9]:

    
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))









    Out[9]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000004717640F28>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000047179E4F98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000004717D702E8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000004716961588>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000047169927F0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000004716992358>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000004717B7BA58>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000004717BEE0F0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000004717C4A470>]], dtype=object)

Create a test set



In [10]:

    
from sklearn.model_selection import train_test_split



In [11]:

    
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(train_set.shape)
print(test_set.shape)









    



(16512, 10)
(4128, 10)

Do stratified sampling

Assuming median_income is an important predictor, we need to categorize it. It is important to build categories such that there are a sufficient number of data points in each strata, else the stratum's importance is biased. To make sure, we need not too many strata (like it is now with median income) and strata are relatively wide.



In [14]:

    
# scale the median income down by dividing it by 1.5 and rounding up those which are greater than 5 to 5.0
import numpy as np
housing['income_cat'] = np.ceil(housing['median_income'] / 1.5) #up round to integers

#replace those with values > 5 with 5.0, values < 5 remain as is
housing['income_cat'].where(housing['income_cat'] < 5, 5.0, inplace=True)



In [15]:

    
housing['income_cat'].hist()









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x471aab0a58>



In [16]:

    
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Now remove the income_cat field used for this sampling. We will learn on the median_income data instead



In [27]:

    
for _temp in (strat_test_set, strat_train_set):
    _temp.drop("income_cat", axis=1, inplace=True)



In [38]:

    
# Write the train and test data to disk
strat_test_set.to_csv('./housing_strat_test.csv')
strat_train_set.to_csv('./housing_strat_train.csv')

Exploratory data analysis



In [37]:

    
strat_train_set.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4, s=strat_train_set['population']/100,
                    label='population', figsize=(10,7), color=strat_train_set['median_house_value'], 
                     cmap=plt.get_cmap('jet'), colorbar=True)
plt.legend()









    



C:\Anaconda3\envs\ml\lib\site-packages\pandas\plotting\_core.py:196: UserWarning: 'color' and 'colormap' cannot be used simultaneously. Using 'color'
  warnings.warn("'color' and 'colormap' cannot be used "






    Out[37]:





<matplotlib.legend.Legend at 0x471dd17128>

Do pairwise plot to understand how each feature is correlated to each other



In [50]:

    
import seaborn as sns
sns.pairplot(data=strat_train_set[['median_house_value','median_income','total_rooms','housing_median_age']])









    Out[50]:





<seaborn.axisgrid.PairGrid at 0x471df3cef0>

Focussing on relationship between income and house value



In [51]:

    
strat_train_set.plot(kind='scatter', x='median_income', y='median_house_value', alpha=0.1)









    Out[51]:





<matplotlib.axes._subplots.AxesSubplot at 0x471b8f5cf8>

Creating new features that are meaningful and also useful in prediction

Create the number of rooms per household, bedrooms per household, ratio of bedrooms to the rooms, number of people per household. We do this on the whole dataset, then collect the train and test datasets.



In [52]:

    
housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
housing['bedrooms_per_household'] = housing['total_bedrooms'] / housing['households']
housing['bedrooms_per_rooms'] = housing['total_bedrooms'] / housing['total_rooms']



In [54]:

    
corr_matrix = housing.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)









    Out[54]:





median_house_value        1.000000
median_income             0.688075
income_cat                0.643892
rooms_per_household       0.151948
total_rooms               0.134153
housing_median_age        0.105623
households                0.065843
total_bedrooms            0.049686
population               -0.024650
longitude                -0.045967
bedrooms_per_household   -0.046739
latitude                 -0.144160
bedrooms_per_rooms       -0.255880
Name: median_house_value, dtype: float64



In [59]:

    
housing.plot(kind='scatter', x='bedrooms_per_household',y='median_house_value', alpha=0.5)









    Out[59]:





<matplotlib.axes._subplots.AxesSubplot at 0x471f178c88>

Prepare data for ML



In [60]:

    
#create a copy without the house value column
housing = strat_train_set.drop('median_house_value', axis=1)
#create a copy of house value column into a new series, which will be the labeled data
housing_labels = strat_test_set['median_house_value'].copy()

Fill missing values using `Imputer` - using median values



In [61]:

    
from sklearn.preprocessing import Imputer
housing_imputer = Imputer(strategy='median')



In [62]:

    
#drop text columns let Imputer learn
housing_numeric = housing.drop('ocean_proximity', axis=1)
housing_imputer.fit(housing_numeric)
housing_imputer.statistics_









    Out[62]:





array([ -118.51  ,    34.26  ,    29.    ,  2119.5   ,   433.    ,
        1164.    ,   408.    ,     3.5409])



In [64]:

    
_x = housing_imputer.transform(housing_numeric)
housing_filled= pd.DataFrame(_x, columns=housing_numeric.columns)



In [66]:

    
housing_filled['ocean_proximity'] = housing['ocean_proximity']
housing_filled.head()









    Out[66]:







  
    
      
      longitude
      latitude
      housing_median_age
      total_rooms
      total_bedrooms
      population
      households
      median_income
      ocean_proximity
    
  
  
    
      0
      -121.89
      37.29
      38.0
      1568.0
      351.0
      710.0
      339.0
      2.7042
      NEAR BAY
    
    
      1
      -121.93
      37.05
      14.0
      679.0
      108.0
      306.0
      113.0
      6.4214
      NEAR BAY
    
    
      2
      -117.20
      32.77
      31.0
      1952.0
      471.0
      936.0
      462.0
      2.8621
      NEAR BAY
    
    
      3
      -119.61
      36.31
      25.0
      1847.0
      371.0
      1460.0
      353.0
      1.8839
      NEAR BAY
    
    
      4
      -118.59
      34.23
      17.0
      6592.0
      1525.0
      4459.0
      1463.0
      3.0347
      NEAR BAY

Transform categorical features to numeric - OneHotEncoder

We can use LabelEncoder of scipy to enumerate the ocean_proximity category. However ML algorithm should not associate higher values as desireable or should not think values 1,2 are closer than values 1,4. So we transform the categories into new columns of booleans using OneHotEncoder.

We can do both enumeration then to booleans using LabelBinarizer



In [ ]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
count	20640.000000	20640.000000	20640.000000	20640.000000	20433.000000	20640.000000	20640.000000	20640.000000	20640.000000
mean	-119.569704	35.631861	28.639486	2635.763081	537.870553	1425.476744	499.539680	3.870671	206855.816909
std	2.003532	2.135952	12.585558	2181.615252	421.385070	1132.462122	382.329753	1.899822	115395.615874
min	-124.350000	32.540000	1.000000	2.000000	1.000000	3.000000	1.000000	0.499900	14999.000000
25%	-121.800000	33.930000	18.000000	1447.750000	296.000000	787.000000	280.000000	2.563400	119600.000000
50%	-118.490000	34.260000	29.000000	2127.000000	435.000000	1166.000000	409.000000	3.534800	179700.000000
75%	-118.010000	37.710000	37.000000	3148.000000	647.000000	1725.000000	605.000000	4.743250	264725.000000
max	-114.310000	41.950000	52.000000	39320.000000	6445.000000	35682.000000	6082.000000	15.000100	500001.000000

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	ocean_proximity
0	-121.89	37.29	38.0	1568.0	351.0	710.0	339.0	2.7042	NEAR BAY
1	-121.93	37.05	14.0	679.0	108.0	306.0	113.0	6.4214	NEAR BAY
2	-117.20	32.77	31.0	1952.0	471.0	936.0	462.0	2.8621	NEAR BAY
3	-119.61	36.31	25.0	1847.0	371.0	1460.0	353.0	1.8839	NEAR BAY
4	-118.59	34.23	17.0	6592.0	1525.0	4459.0	1463.0	3.0347	NEAR BAY