Regression Exercise

California Housing Data

This data set contains information on all the block groups in California from the 1990 Census. In this sample, a block group includes on average 1,425.5 individuals living in a geographically compact area.

The task is to approximate the median house value of each block group from the values of the other variables.

It has been obtained from the LIACC repository. The original page where the data set can be found is: http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html.

The Features:

  • housingMedianAge: continuous.
  • totalRooms: continuous.
  • totalBedrooms: continuous.
  • population: continuous.
  • households: continuous.
  • medianIncome: continuous.
  • medianHouseValue: continuous (this is the label to predict).

The Data

Import the cal_housing_clean.csv file with pandas. Separate it into a training set (70%) and a testing set (30%).


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data/cal_housing_clean.csv')

In [3]:
df.head()


Out[3]:
   housingMedianAge  totalRooms  totalBedrooms  population  households  medianIncome  medianHouseValue
0              41.0       880.0          129.0       322.0       126.0        8.3252          452600.0
1              21.0      7099.0         1106.0      2401.0      1138.0        8.3014          358500.0
2              52.0      1467.0          190.0       496.0       177.0        7.2574          352100.0
3              52.0      1274.0          235.0       558.0       219.0        5.6431          341300.0
4              52.0      1627.0          280.0       565.0       259.0        3.8462          342200.0

In [4]:
df.describe().T


Out[4]:
                    count           mean            std         min          25%          50%           75%          max
housingMedianAge  20640.0      28.639486      12.585558      1.0000      18.0000      29.0000      37.00000      52.0000
totalRooms        20640.0    2635.763081    2181.615252      2.0000    1447.7500    2127.0000    3148.00000   39320.0000
totalBedrooms     20640.0     537.898014     421.247906      1.0000     295.0000     435.0000     647.00000    6445.0000
population        20640.0    1425.476744    1132.462122      3.0000     787.0000    1166.0000    1725.00000   35682.0000
households        20640.0     499.539680     382.329753      1.0000     280.0000     409.0000     605.00000    6082.0000
medianIncome      20640.0       3.870671       1.899822      0.4999       2.5634       3.5348       4.74325      15.0001
medianHouseValue  20640.0  206855.816909  115395.615874  14999.0000  119600.0000  179700.0000  264725.00000  500001.0000

In [5]:
y = df['medianHouseValue']
x = df.drop('medianHouseValue', axis = 1)

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, 
                                                    test_size = 0.3, 
                                                    random_state = 7)

In [8]:
X_train.head()


Out[8]:
       housingMedianAge  totalRooms  totalBedrooms  population  households  medianIncome
7630               18.0      5923.0         1409.0      3887.0      1322.0        3.4712
3762               42.0      1713.0          416.0      1349.0       427.0        3.2596
2852               41.0      2417.0          435.0       973.0       406.0        3.0568
11759              12.0      3605.0          576.0      1556.0       549.0        4.9000
18062              29.0      2718.0          365.0       982.0       339.0        7.9234

In [9]:
Y_train.head()


Out[9]:
7630     194400.0
3762     191800.0
2852      85600.0
11759    203700.0
18062    500001.0
Name: medianHouseValue, dtype: float64

Scale the Feature Data

Use sklearn preprocessing to create a MinMaxScaler for the feature data. Fit this scaler only to the training data. Then use it to transform X_test and X_train. Then use the scaled X_test and X_train along with pd.DataFrame to re-create two DataFrames of scaled data.


In [10]:
from sklearn.preprocessing import MinMaxScaler

In [11]:
scaler = MinMaxScaler()

In [12]:
scaler.fit(X_train)


Out[12]:
MinMaxScaler(copy=True, feature_range=(0, 1))

In [13]:
# Keeping Pandas DataFrame format after re-scaling
X_train = pd.DataFrame(data = scaler.transform(X_train), 
                       columns = X_train.columns,
                       index = X_train.index)

In [14]:
X_test = pd.DataFrame(data = scaler.transform(X_test), 
                      columns = X_test.columns,
                      index = X_test.index)

In [15]:
X_train.head()


Out[15]:
       housingMedianAge  totalRooms  totalBedrooms  population  households  medianIncome
7630           0.333333    0.150506       0.218377    0.108860    0.217105      0.204914
3762           0.803922    0.043420       0.064256    0.037725    0.069901      0.190322
2852           0.784314    0.061327       0.067205    0.027187    0.066447      0.176335
11759          0.215686    0.091545       0.089089    0.043527    0.089967      0.303451
18062          0.549020    0.068983       0.056340    0.027439    0.055428      0.511958
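
Because the scaler was fit on the training data only (so no test-set statistics leak into training), every training column should now span exactly [0, 1], while test values can fall slightly outside that range. A quick optional sanity check, as a sketch:

In [ ]:
# After fitting MinMaxScaler on the training set only, each training
# column spans [0, 1]; test values outside the training min/max will
# map slightly outside that interval.
print(X_train.min().min(), X_train.max().max())   # expect 0.0 and 1.0
print(X_test.min().min(), X_test.max().max())     # may be < 0 or > 1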

Create Feature Columns

Create the necessary tf.feature_column objects for the estimator. They should all be treated as continuous numeric_columns.


In [16]:
df.columns


Out[16]:
Index(['housingMedianAge', 'totalRooms', 'totalBedrooms', 'population',
       'households', 'medianIncome', 'medianHouseValue'],
      dtype='object')

In [17]:
import tensorflow as tf

In [18]:
age = tf.feature_column.numeric_column('housingMedianAge')
rooms = tf.feature_column.numeric_column('totalRooms')
bedrooms = tf.feature_column.numeric_column('totalBedrooms')
pop = tf.feature_column.numeric_column('population')
households = tf.feature_column.numeric_column('households')
income = tf.feature_column.numeric_column('medianIncome')

In [19]:
feature_columns = [age, rooms, bedrooms, pop, households, income]
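
Since every feature is numeric, the same list can also be built programmatically from the column names; a compact alternative sketch, equivalent to the two cells above:

In [ ]:
# One numeric_column per feature, built straight from the DataFrame columns
feature_columns = [tf.feature_column.numeric_column(col) for col in X_train.columns]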

Create the input function for the estimator object. (Play around with batch_size and num_epochs.)


In [20]:
input_feature_func = tf.estimator.inputs.pandas_input_fn(x = X_train,
                                                         y = Y_train, 
                                                         batch_size = 10,
                                                         num_epochs = 1000,
                                                         shuffle = True)
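
Note that num_epochs only caps how many passes the input function can make over the data; the steps argument passed to .train() below is what actually limits training. A rough back-of-the-envelope check (the exact row count depends on the random split):

In [ ]:
# ~70% of 20,640 rows is ~14,448; with batch_size = 10 that is ~1,444
# batches per epoch, so num_epochs = 1000 can supply far more batches
# than the 5,000 steps requested below -- steps is the binding limit.
len(X_train) // 10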

Create the estimator model. Use a DNNRegressor. Play around with the hidden units!


In [21]:
dnn_model = tf.estimator.DNNRegressor(hidden_units = [6, 5, 5], feature_columns = feature_columns)


INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\ARCYFE~1\AppData\Local\Temp\tmpiljbseon
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\ARCYFE~1\\AppData\\Local\\Temp\\tmpiljbseon', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001B1FB18F208>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

Train the model for ~1,000 steps. (Later, come back to this, train it for more steps, and check for improvement; the run below already uses 5,000 steps.)

In [22]:
dnn_model.train(input_fn = input_feature_func, steps = 5000)


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpiljbseon\model.ckpt.
INFO:tensorflow:loss = 207973600000.0, step = 0
[... loss and global_step/sec logged every 100 steps through step 4800, truncated ...]
INFO:tensorflow:loss = 157088510000.0, step = 4900 (0.649 sec)
INFO:tensorflow:Saving checkpoints for 5000 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpiljbseon\model.ckpt.
INFO:tensorflow:Loss for final step: 171178840000.0.
Out[22]:
<tensorflow.python.estimator.canned.dnn.DNNRegressor at 0x1b1fbc80a90>
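
Before moving on to predictions, the estimator can also report loss on held-out data directly via .evaluate(). A sketch, assuming the same TF 1.x estimator API used in this notebook (the exact metric keys returned, e.g. 'average_loss', may vary by TensorFlow version):

In [ ]:
# Evaluate on the test set with a single, unshuffled pass
eval_input_func = tf.estimator.inputs.pandas_input_fn(x = X_test,
                                                      y = Y_test,
                                                      batch_size = 10,
                                                      num_epochs = 1,
                                                      shuffle = False)
dnn_model.evaluate(input_fn = eval_input_func)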

Create a prediction input function and then use the .predict method of your estimator model to create a list of predictions on your test data.


In [23]:
prediction_input_func = tf.estimator.inputs.pandas_input_fn(x = X_test,
                                                            batch_size = 10,
                                                            num_epochs = 1,
                                                            shuffle = False)

In [24]:
prediction_generator = dnn_model.predict(prediction_input_func)

In [25]:
predictions = list(prediction_generator)


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ARCYFE~1\AppData\Local\Temp\tmpiljbseon\model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

Calculate the RMSE. You should be able to get around 100,000 RMSE (remember that this is in the same units as the label). Do this manually or use sklearn.metrics.


In [26]:
final_pred = []

# Each element of `predictions` is a dict; the point estimate is stored
# under the 'predictions' key as a length-1 array.
for pred in predictions:
    final_pred.append(pred['predictions'])

In [27]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [28]:
# RMSE = sqrt(MSE) = MSE ** 0.5
mean_squared_error(Y_test, final_pred) ** 0.5


Out[28]:
103363.22596276859
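
As a cross-check, the "manual" route mentioned above gives the same number; a minimal sketch with NumPy:

In [ ]:
import numpy as np

# RMSE = sqrt(mean((y - y_hat)^2)); should match the sklearn value above
np.sqrt(np.mean((Y_test.values - np.ravel(final_pred)) ** 2))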

Great Job!