Learning Objectives:
- Improve the accuracy of a model by adding new features with the appropriate representation
The data is based on 1990 census data from California. It is at the city-block level, so features such as total_rooms and population reflect the total number of rooms in that block and the total number of people who live in that block, respectively.
In [ ]:
# Give the jupyter user ownership of the course repo so the notebook can write to it.
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst
In [ ]:
# Ensure the right version of TensorFlow is installed.
!pip freeze | grep tensorflow==2.1
In [1]:
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf
print(tf.__version__)
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
Next, we'll load our data set.
In [2]:
df = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")
In [3]:
df.head()
Out[3]:
(table: first five rows of the dataset)
In [4]:
df.describe()
Out[4]:
(table: summary statistics for each column)
Now, split the data into two parts -- training and evaluation.
In [5]:
np.random.seed(seed=1)  # makes results reproducible
msk = np.random.rand(len(df)) < 0.8
traindf = df[msk]
evaldf = df[~msk]
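A quick sanity check (my addition, not part of the original lab) that the random mask routed roughly 80% of the rows to the training set:
In [ ]:
# Verify the ~80/20 train/eval split.
print(len(traindf), len(evaldf))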
In this exercise, we'll be trying to predict median_house_value. It will be our label (sometimes also called a target).
We'll modify feature_cols and the input function to represent the features we want to use.
We divide total_rooms by households to get avg_rooms_per_house, which we expect to correlate positively with median_house_value.
We also divide population by total_rooms to get avg_persons_per_room, which we expect to correlate negatively with median_house_value.
In [6]:
def add_more_features(df):
  df['avg_rooms_per_house'] = df['total_rooms'] / df['households']  # expect positive correlation
  df['avg_persons_per_room'] = df['population'] / df['total_rooms']  # expect negative correlation
  return df
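To eyeball the engineered columns before wiring them into the model, here is a small check of my own (not part of the original lab):
In [ ]:
# Inspect the distributions of the two new ratio features.
add_more_features(df)[['avg_rooms_per_house', 'avg_persons_per_room']].describe()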
In [7]:
# Create pandas input function
def make_input_fn(df, num_epochs):
  return tf.compat.v1.estimator.inputs.pandas_input_fn(
    x = add_more_features(df),
    y = df['median_house_value'] / 100000,  # will talk about why later in the course
    batch_size = 128,
    num_epochs = num_epochs,
    shuffle = True,
    queue_capacity = 1000,
    num_threads = 1
  )
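The division by 100,000 rescales the label from dollar amounts to roughly the 0-5 range, which tends to be numerically friendlier for gradient descent. A quick look, as a sketch of my own (not part of the original lab):
In [ ]:
# The scaled label lands in a small numeric range.
(df['median_house_value'] / 100000).describe()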
In [8]:
# Define your feature columns
def create_feature_cols():
  return [
    tf.feature_column.numeric_column('housing_median_age'),
    tf.feature_column.bucketized_column(tf.feature_column.numeric_column('latitude'),
                                        boundaries = np.arange(32.0, 42, 1).tolist()),
    tf.feature_column.numeric_column('avg_rooms_per_house'),
    tf.feature_column.numeric_column('avg_persons_per_room'),
    tf.feature_column.numeric_column('median_income')
  ]
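To see what the bucketized column does, here is a minimal sketch (my addition, not part of the original lab): it one-hot encodes each latitude into one of the eleven bins defined by the ten one-degree boundaries between 32 and 41.
In [ ]:
# Sketch: bucketization turns a continuous latitude into a one-hot vector over one-degree bins.
lat_col = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('latitude'),
    boundaries = np.arange(32.0, 42, 1).tolist())
demo_layer = tf.keras.layers.DenseFeatures([lat_col])
print(demo_layer({'latitude': tf.constant([[33.5], [37.8]])}))  # buckets 2 and 6 light up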
In [9]:
# Create estimator train and evaluate function
def train_and_evaluate(output_dir, num_train_steps):
  estimator = tf.compat.v1.estimator.LinearRegressor(model_dir = output_dir, feature_columns = create_feature_cols())
  train_spec = tf.estimator.TrainSpec(input_fn = make_input_fn(traindf, None),
                                      max_steps = num_train_steps)
  eval_spec = tf.estimator.EvalSpec(input_fn = make_input_fn(evaldf, 1),
                                    steps = None,  # evaluate on the full eval set
                                    start_delay_secs = 1,  # start evaluating after N seconds
                                    throttle_secs = 5)  # evaluate at most every N seconds
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
In [10]:
OUTDIR = './trained_model'
In [ ]:
# Run the model
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
tf.compat.v1.summary.FileWriterCache.clear()
train_and_evaluate(OUTDIR, 2000)
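Once training finishes, the checkpoint in OUTDIR can be reloaded to pull out a final evaluation metric. The lab itself ends with train_and_evaluate; this follow-up is a sketch of my own. Note that because the label was scaled by 1/100,000, the RMSE is in units of $100k:
In [ ]:
# Sketch: reload the trained model from its checkpoint and report RMSE on the eval set.
trained = tf.compat.v1.estimator.LinearRegressor(model_dir = OUTDIR,
                                                 feature_columns = create_feature_cols())
metrics = trained.evaluate(input_fn = make_input_fn(evaldf, 1))
print('RMSE on eval set = {:.4f} (in units of $100k)'.format(math.sqrt(metrics['average_loss'])))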