Regression Exercise

California Housing Data

This data set contains information about all the block groups in California from the 1990 Census. In this sample, a block group includes 1425.5 individuals on average, living in a geographically compact area.

The task is to approximate the median house value of each block from the values of the rest of the variables.

It has been obtained from the LIACC repository. The original page where the data set can be found is:

The Features:

  • housingMedianAge: continuous.
  • totalRooms: continuous.
  • totalBedrooms: continuous.
  • population: continuous.
  • households: continuous.
  • medianIncome: continuous.
  • medianHouseValue: continuous.

The Data

Import the cal_housing_clean.csv file with pandas. Separate it into a training set (70%) and a testing set (30%).

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data/cal_housing_clean.csv')

In [3]:
df.head()

   housingMedianAge  totalRooms  totalBedrooms  population  households  medianIncome  medianHouseValue
0              41.0       880.0          129.0       322.0       126.0        8.3252          452600.0
1              21.0      7099.0         1106.0      2401.0      1138.0        8.3014          358500.0
2              52.0      1467.0          190.0       496.0       177.0        7.2574          352100.0
3              52.0      1274.0          235.0       558.0       219.0        5.6431          341300.0
4              52.0      1627.0          280.0       565.0       259.0        3.8462          342200.0

In [4]:
df.describe().transpose()

                    count           mean            std        min          25%          50%           75%          max
housingMedianAge  20640.0      28.639486      12.585558      1.0000      18.0000      29.0000      37.00000      52.0000
totalRooms        20640.0    2635.763081    2181.615252      2.0000    1447.7500    2127.0000    3148.00000   39320.0000
totalBedrooms     20640.0     537.898014     421.247906      1.0000     295.0000     435.0000     647.00000    6445.0000
population        20640.0    1425.476744    1132.462122      3.0000     787.0000    1166.0000    1725.00000   35682.0000
households        20640.0     499.539680     382.329753      1.0000     280.0000     409.0000     605.00000    6082.0000
medianIncome      20640.0       3.870671       1.899822      0.4999       2.5634       3.5348       4.74325      15.0001
medianHouseValue  20640.0  206855.816909  115395.615874  14999.0000  119600.0000  179700.0000  264725.00000  500001.0000

In [5]:
y = df['medianHouseValue']
x = df.drop('medianHouseValue', axis = 1)

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, 
                                                    test_size = 0.3, 
                                                    random_state = 7)

In [8]:
X_train.head()

       housingMedianAge  totalRooms  totalBedrooms  population  households  medianIncome
7630               18.0      5923.0         1409.0      3887.0      1322.0        3.4712
3762               42.0      1713.0          416.0      1349.0       427.0        3.2596
2852               41.0      2417.0          435.0       973.0       406.0        3.0568
11759              12.0      3605.0          576.0      1556.0       549.0        4.9000
18062              29.0      2718.0          365.0       982.0       339.0        7.9234

In [9]:
Y_train.head()

7630     194400.0
3762     191800.0
2852      85600.0
11759    203700.0
18062    500001.0
Name: medianHouseValue, dtype: float64
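
To make the 70/30 split concrete, here is a minimal pure-Python sketch of a shuffled index split. It is a simplified stand-in for illustration, not scikit-learn's implementation: the helper name shuffled_split is hypothetical, and it will not select the same rows as train_test_split with random_state = 7. It only shows how 20640 rows divide into the two sets.

```python
import math
import random


def shuffled_split(n_rows, test_size=0.3, seed=7):
    """Return (train_idx, test_idx) for a shuffled split of range(n_rows).

    Hypothetical helper for illustration only; it does not reproduce
    the row selection made by sklearn's train_test_split.
    """
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle repeatable between runs.
    random.Random(seed).shuffle(indices)
    # The test set takes the first n_test shuffled indices.
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffled_split(20640)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters for the same reason random_state does in the cell above: the split stays identical between runs, so results remain comparable.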

Scale the Feature Data

Use sklearn.preprocessing to create a MinMaxScaler for the feature data. Fit this scaler only to the training data, then use it to transform both X_train and X_test. Finally, use the scaled arrays along with pd.DataFrame to re-create two DataFrames of scaled data.

In [10]:
from sklearn.preprocessing import MinMaxScaler

In [11]:
scaler = MinMaxScaler()

In [12]:
scaler.fit(X_train)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [13]:
# Keeping Pandas DataFrame format after re-scaling
X_train = pd.DataFrame(data = scaler.transform(X_train), 
                       columns = X_train.columns,
                       index = X_train.index)

In [14]:
X_test = pd.DataFrame(data = scaler.transform(X_test), 
                      columns = X_test.columns,
                      index = X_test.index)

In [15]:
X_train.head()

       housingMedianAge  totalRooms  totalBedrooms  population  households  medianIncome
7630           0.333333    0.150506       0.218377    0.108860    0.217105      0.204914
3762           0.803922    0.043420       0.064256    0.037725    0.069901      0.190322
2852           0.784314    0.061327       0.067205    0.027187    0.066447      0.176335
11759          0.215686    0.091545       0.089089    0.043527    0.089967      0.303451
18062          0.549020    0.068983       0.056340    0.027439    0.055428      0.511958
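
The scaled housingMedianAge values can be checked by hand. The sketch below applies the min-max formula (value - min) / (max - min) in plain Python. The helper name min_max_scale is hypothetical, and the range of 1.0 to 52.0 is an assumption: it is the range reported for the full data set, and it is consistent with the scaled values printed for the training rows.

```python
def min_max_scale(value, train_min, train_max):
    """Scale value into [0, 1] using statistics taken from the training set."""
    return (value - train_min) / (train_max - train_min)


# Assumed training-set range for housingMedianAge (not read from the scaler).
AGE_MIN, AGE_MAX = 1.0, 52.0

# Unscaled housingMedianAge values from the first five training rows.
for age in [18.0, 42.0, 41.0, 12.0, 29.0]:
    print(age, round(min_max_scale(age, AGE_MIN, AGE_MAX), 6))
```

Because the minimum and maximum come from the training data only, a test row outside that range would scale to a value below 0 or above 1. That is expected, and it keeps information about the test set from leaking into preprocessing.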

Create Feature Columns

Create the necessary tf.feature_column objects for the estimator. They should all be treated as continuous numeric_columns.

In [16]:
df.columns

Index(['housingMedianAge', 'totalRooms', 'totalBedrooms', 'population',
       'households', 'medianIncome', 'medianHouseValue'],

In [17]:
import tensorflow as tf

In [18]:
age = tf.feature_column.numeric_column('housingMedianAge')
rooms = tf.feature_column.numeric_column('totalRooms')
bedrooms = tf.feature_column.numeric_column('totalBedrooms')
pop = tf.feature_column.numeric_column('population')
households = tf.feature_column.numeric_column('households')
income = tf.feature_column.numeric_column('medianIncome')

In [19]:
feature_columns = [age, rooms, bedrooms, pop, households, income]

Create the input function for the estimator object. Experiment with batch_size and num_epochs.

In [20]:
input_feature_func = tf.estimator.inputs.pandas_input_fn(x = X_train,
                                                         y = Y_train, 
                                                         batch_size = 10,
                                                         num_epochs = 1000,
                                                         shuffle = True)
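
When experimenting with batch_size and num_epochs, it helps to know roughly how many batches the input function can supply. The arithmetic below is a rough sketch: it assumes 14448 training rows (70% of 20640), and the exact batching at epoch boundaries inside pandas_input_fn may differ slightly. The variable names are illustrative.

```python
import math

n_train_rows = 14448  # assumed: 70% of the 20640 rows in the data set
batch_size = 10
num_epochs = 1000

# Batches needed for one full pass over the training rows.
steps_per_epoch = math.ceil(n_train_rows / batch_size)

# Approximate upper bound on the batches available before the input is exhausted.
max_steps = math.ceil(n_train_rows * num_epochs / batch_size)

print(steps_per_epoch, max_steps)
```

A larger batch_size lowers the number of steps per pass, while num_epochs caps how long training can run before the input function runs out of data.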

Create the estimator model. Use a DNNRegressor. Play around with the hidden units!

In [21]:
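dnn_model = tf.estimator.DNNRegressor(hidden_units = [6, 5, 5], feature_columns = feature_columns)

In [22]:
# steps = 5000 matches the model.ckpt-5000 checkpoint restored during prediction
dnn_model.train(input_fn = input_feature_func, steps = 5000)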

<tensorflow.python.estimator.canned.dnn.DNNRegressor at 0x1b1fbc80a90>

Create a prediction input function, then use the .predict method of your estimator model to create a list of predictions on your test data.

In [23]:
prediction_input_func = tf.estimator.inputs.pandas_input_fn(x = X_test,
                                                            batch_size = 10,
                                                            num_epochs = 1,
                                                            shuffle = False)

In [24]:
prediction_generator = dnn_model.predict(prediction_input_func)

In [25]:
precitions = list(prediction_generator)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ARCYFE~1\AppData\Local\Temp\tmpiljbseon\model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

Calculate the RMSE. You should be able to get an RMSE of around 100,000 (remember that this is in the same units as the label). Do this manually or use sklearn.metrics.

In [26]:
final_pred = []

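for pred in precitions:
    # Each prediction is a dict; the value sits under the 'predictions' key
    final_pred.append(pred['predictions'])
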
In [27]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [28]:
# RMSE = sqrt(MSE) = MSE ** 0.5
mean_squared_error(Y_test, final_pred) ** 0.5
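
For the manual route, RMSE is the square root of the mean of the squared errors. Below is a minimal pure-Python sketch. The function name rmse and the sample values are hypothetical and are not taken from this model's predictions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical house values, used only to exercise the function.
actual = [200000.0, 150000.0, 300000.0]
predicted = [210000.0, 140000.0, 330000.0]
print(rmse(actual, predicted))
```

Since squaring weights large errors more heavily, one badly mispredicted block can dominate the result, which is worth keeping in mind when reading an RMSE on the scale of 100,000.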


Great Job!