Regression Exercise

California Housing Data

This data set contains information on all the block groups in California from the 1990 Census. In this sample, a block group includes on average 1,425.5 individuals living in a geographically compact area.

The task is to approximate the median house value of each block group from the values of the other variables.

It has been obtained from the LIACC repository. The original page where the data set can be found is: http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html.

The Features:

  • housingMedianAge: continuous.
  • totalRooms: continuous.
  • totalBedrooms: continuous.
  • population: continuous.
  • households: continuous.
  • medianIncome: continuous.
  • medianHouseValue: continuous (this is the label to predict).

The Data

Import the cal_housing_clean.csv file with pandas. Separate it into a training set (70%) and a testing set (30%).


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data/cal_housing_clean.csv')

In [3]:
df.head()


Out[3]:
   housingMedianAge  totalRooms  totalBedrooms  population  households  medianIncome  medianHouseValue
0              41.0       880.0          129.0       322.0       126.0        8.3252          452600.0
1              21.0      7099.0         1106.0      2401.0      1138.0        8.3014          358500.0
2              52.0      1467.0          190.0       496.0       177.0        7.2574          352100.0
3              52.0      1274.0          235.0       558.0       219.0        5.6431          341300.0
4              52.0      1627.0          280.0       565.0       259.0        3.8462          342200.0

In [4]:
df.describe().T


Out[4]:
                    count           mean            std         min          25%          50%           75%          max
housingMedianAge  20640.0      28.639486      12.585558      1.0000      18.0000      29.0000      37.00000      52.0000
totalRooms        20640.0    2635.763081    2181.615252      2.0000    1447.7500    2127.0000    3148.00000   39320.0000
totalBedrooms     20640.0     537.898014     421.247906      1.0000     295.0000     435.0000     647.00000    6445.0000
population        20640.0    1425.476744    1132.462122      3.0000     787.0000    1166.0000    1725.00000   35682.0000
households        20640.0     499.539680     382.329753      1.0000     280.0000     409.0000     605.00000    6082.0000
medianIncome      20640.0       3.870671       1.899822      0.4999       2.5634       3.5348       4.74325      15.0001
medianHouseValue  20640.0  206855.816909  115395.615874  14999.0000  119600.0000  179700.0000  264725.00000  500001.0000

In [5]:
y = df['medianHouseValue']
x = df.drop('medianHouseValue', axis = 1)

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, 
                                                    test_size = 0.3, 
                                                    random_state = 7)

In [8]:
X_train.head()


Out[8]:
       housingMedianAge  totalRooms  totalBedrooms  population  households  medianIncome
7630               18.0      5923.0         1409.0      3887.0      1322.0        3.4712
3762               42.0      1713.0          416.0      1349.0       427.0        3.2596
2852               41.0      2417.0          435.0       973.0       406.0        3.0568
11759              12.0      3605.0          576.0      1556.0       549.0        4.9000
18062              29.0      2718.0          365.0       982.0       339.0        7.9234

In [9]:
Y_train.head()


Out[9]:
7630     194400.0
3762     191800.0
2852      85600.0
11759    203700.0
18062    500001.0
Name: medianHouseValue, dtype: float64

Scale the Feature Data

Use sklearn preprocessing to create a MinMaxScaler for the feature data. Fit this scaler only to the training data. Then use it to transform X_test and X_train. Then use the scaled X_test and X_train along with pd.DataFrame to re-create two DataFrames of scaled data.


In [10]:
from sklearn.preprocessing import MinMaxScaler

In [11]:
scaler = MinMaxScaler()

In [12]:
scaler.fit(X_train)


Out[12]:
MinMaxScaler(copy=True, feature_range=(0, 1))

In [13]:
# Keeping Pandas DataFrame format after re-scaling
X_train = pd.DataFrame(data = scaler.transform(X_train), 
                       columns = X_train.columns,
                       index = X_train.index)

In [14]:
X_test = pd.DataFrame(data = scaler.transform(X_test), 
                      columns = X_test.columns,
                      index = X_test.index)

In [15]:
X_train.head()


Out[15]:
       housingMedianAge  totalRooms  totalBedrooms  population  households  medianIncome
7630           0.333333    0.150506       0.218377    0.108860    0.217105      0.204914
3762           0.803922    0.043420       0.064256    0.037725    0.069901      0.190322
2852           0.784314    0.061327       0.067205    0.027187    0.066447      0.176335
11759          0.215686    0.091545       0.089089    0.043527    0.089967      0.303451
18062          0.549020    0.068983       0.056340    0.027439    0.055428      0.511958
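
Because the scaler was fit on the training data only (so no test-set statistics leak into training), every training column should now span exactly [0, 1], while test values can fall slightly outside that range. A quick optional sanity check, as a sketch:

In [ ]:
# After fitting MinMaxScaler on the training set only, each training
# column spans [0, 1]; test values outside the training min/max will
# map slightly outside that interval.
print(X_train.min().min(), X_train.max().max())   # expect 0.0 and 1.0
print(X_test.min().min(), X_test.max().max())     # may be < 0 or > 1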

Create Feature Columns

Create the necessary tf.feature_column objects for the estimator. They should all be treated as continuous numeric_columns.


In [16]:
df.columns


Out[16]:
Index(['housingMedianAge', 'totalRooms', 'totalBedrooms', 'population',
       'households', 'medianIncome', 'medianHouseValue'],
      dtype='object')

In [17]:
import tensorflow as tf

In [18]:
age = tf.feature_column.numeric_column('housingMedianAge')
rooms = tf.feature_column.numeric_column('totalRooms')
bedrooms = tf.feature_column.numeric_column('totalBedrooms')
pop = tf.feature_column.numeric_column('population')
households = tf.feature_column.numeric_column('households')
income = tf.feature_column.numeric_column('medianIncome')

In [19]:
feature_columns = [age, rooms, bedrooms, pop, households, income]
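
Since every feature is numeric, the same list can also be built programmatically from the column names; a compact alternative sketch, equivalent to the two cells above:

In [ ]:
# One numeric_column per feature, built straight from the DataFrame columns
feature_columns = [tf.feature_column.numeric_column(col) for col in X_train.columns]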

Create the input function for the estimator object. (Play around with batch_size and num_epochs.)


In [20]:
input_feature_func = tf.estimator.inputs.pandas_input_fn(x = X_train,
                                                         y = Y_train, 
                                                         batch_size = 10,
                                                         num_epochs = 1000,
                                                         shuffle = True)
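
Note that num_epochs only caps how many passes the input function can make over the data; the steps argument passed to .train() below is what actually limits training. A rough back-of-the-envelope check (the exact row count depends on the random split):

In [ ]:
# ~70% of 20,640 rows is ~14,448; with batch_size = 10 that is ~1,444
# batches per epoch, so num_epochs = 1000 can supply far more batches
# than the 5,000 steps requested below -- steps is the binding limit.
len(X_train) // 10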

Create the estimator model. Use a DNNRegressor. Play around with the hidden units!


In [21]:
dnn_model = tf.estimator.DNNRegressor(hidden_units = [6, 5, 5], feature_columns = feature_columns)


INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\ARCYFE~1\AppData\Local\Temp\tmpiljbseon
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\ARCYFE~1\\AppData\\Local\\Temp\\tmpiljbseon', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001B1FB18F208>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

Train the model for ~1,000 steps. (Later, come back to this, train it for more steps, and check for improvement; the run below already uses 5,000 steps.)

In [22]:
dnn_model.train(input_fn = input_feature_func, steps = 5000)


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpiljbseon\model.ckpt.
INFO:tensorflow:loss = 207973600000.0, step = 0
[... loss and global_step/sec logged every 100 steps through step 4800, truncated ...]
INFO:tensorflow:loss = 157088510000.0, step = 4900 (0.649 sec)
INFO:tensorflow:Saving checkpoints for 5000 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpiljbseon\model.ckpt.
INFO:tensorflow:Loss for final step: 171178840000.0.
Out[22]:
<tensorflow.python.estimator.canned.dnn.DNNRegressor at 0x1b1fbc80a90>
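
Before moving on to predictions, the estimator can also report loss on held-out data directly via .evaluate(). A sketch, assuming the same TF 1.x estimator API used in this notebook (the exact metric keys returned, e.g. 'average_loss', may vary by TensorFlow version):

In [ ]:
# Evaluate on the test set with a single, unshuffled pass
eval_input_func = tf.estimator.inputs.pandas_input_fn(x = X_test,
                                                      y = Y_test,
                                                      batch_size = 10,
                                                      num_epochs = 1,
                                                      shuffle = False)
dnn_model.evaluate(input_fn = eval_input_func)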

Create a prediction input function and then use the .predict method of your estimator model to create a list of predictions on your test data.


In [23]:
prediction_input_func = tf.estimator.inputs.pandas_input_fn(x = X_test,
                                                            batch_size = 10,
                                                            num_epochs = 1,
                                                            shuffle = False)

In [24]:
prediction_generator = dnn_model.predict(prediction_input_func)

In [25]:
predictions = list(prediction_generator)


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ARCYFE~1\AppData\Local\Temp\tmpiljbseon\model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

Calculate the RMSE. You should be able to get around 100,000 RMSE (remember that this is in the same units as the label). Do this manually or use sklearn.metrics.


In [26]:
final_pred = []

# Each element of `predictions` is a dict; the point estimate is stored
# under the 'predictions' key as a length-1 array.
for pred in predictions:
    final_pred.append(pred['predictions'])

In [27]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [28]:
# RMSE = sqrt(MSE) = MSE ** 0.5
mean_squared_error(Y_test, final_pred) ** 0.5


Out[28]:
103363.22596276859
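
As a cross-check, the "manual" route mentioned above gives the same number; a minimal sketch with NumPy:

In [ ]:
import numpy as np

# RMSE = sqrt(mean((y - y_hat)^2)); should match the sklearn value above
np.sqrt(np.mean((Y_test.values - np.ravel(final_pred)) ** 2))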

Great Job!