Learning Objectives:
The data is based on 1990 census data from California. It is aggregated at the city block level, so features such as total_rooms and population reflect the total number of rooms in a block and the total number of people who live on that block, respectively.
In [ ]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst
In [ ]:
# Ensure the right version of TensorFlow is installed.
!pip freeze | grep tensorflow==2.1
In [ ]:
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf
print(tf.__version__)
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
Next, we'll load our data set.
In [ ]:
df = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")
In [ ]:
df.head()
In [ ]:
df.describe()
Now, split the data into two parts -- training and evaluation.
In [ ]:
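np.random.seed(seed=1)  # makes the result reproducible
msk = np.random.rand(len(df)) < 0.8  # boolean mask: True for roughly 80% of rows
traindf = df[msk]  # rows selected by the mask go to training
evaldf = df[~msk]  # the remaining rows go to evaluation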
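The random-mask split above can be checked on a small stand-in frame. This is a minimal sketch: the toy DataFrame and its `value` column are made up for illustration and are not part of the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing DataFrame (illustration only).
toy_df = pd.DataFrame({"value": np.arange(1000)})

np.random.seed(seed=1)  # same seed, so the mask is identical on every run
msk = np.random.rand(len(toy_df)) < 0.8  # True for roughly 80% of rows

toy_train = toy_df[msk]
toy_eval = toy_df[~msk]

# Every row lands in exactly one of the two parts.
print(len(toy_train), len(toy_eval), len(toy_train) + len(toy_eval))
```

Because the mask is drawn at random, the split is only approximately 80/20. The fixed seed is what keeps it the same between runs.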
In this exercise, we'll be trying to predict median_house_value. It will be our label (sometimes also called a target).
Modify the feature columns (create_feature_cols) and the input function so that they represent the features you want to use.
Hint: Some of the features in the dataframe aren't directly correlated with median_house_value (e.g. total_rooms), but can you think of a column to divide it by so that we would expect the result to be correlated with median_house_value?
In [ ]:
def add_more_features(df):
    # TODO: Add more features to the dataframe
    return df
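One way to read the hint: a block-level total says little on its own, but dividing it by the size of the block gives a per-household or per-person figure. The sketch below is one possible approach, not the lab's reference solution. The helper name `add_ratio_features`, the new column names `avg_rooms_per_house` and `avg_persons_per_room`, and the toy values are all assumptions chosen for illustration, and the code assumes the frame has `total_rooms`, `households`, and `population` columns.

```python
import pandas as pd

def add_ratio_features(df):
    """Return a copy of df with two per-block ratio columns added."""
    out = df.copy()  # avoid mutating the caller's DataFrame
    out["avg_rooms_per_house"] = out["total_rooms"] / out["households"]
    out["avg_persons_per_room"] = out["population"] / out["total_rooms"]
    return out

# Toy rows with made-up numbers, for illustration only.
toy = pd.DataFrame({
    "total_rooms": [1000.0, 5000.0],
    "households": [200.0, 1000.0],
    "population": [500.0, 4000.0],
})
print(add_ratio_features(toy))
```

Working on a copy keeps the original frame unchanged, which matters here because the input function also reads the label column from the same frame.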
In [ ]:
# Create pandas input function
def make_input_fn(df, num_epochs):
    return tf.compat.v1.estimator.inputs.pandas_input_fn(
        x=add_more_features(df),
        y=df['median_house_value'] / 100000,  # will talk about why later in the course
        batch_size=128,
        num_epochs=num_epochs,
        shuffle=True,
        queue_capacity=1000,
        num_threads=1
    )
In [1]:
# Define your feature columns
def create_feature_cols():
    return [
        tf.feature_column.numeric_column('housing_median_age'),
        # TODO: Define additional feature columns
        # Hint: Are there any features that would benefit from bucketizing?
    ]
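On the bucketizing hint: some numeric features may not relate to the label in a straight-line way, so it can help to split their range into intervals and treat each interval as its own category. The sketch below illustrates the idea with NumPy only; it is not the TensorFlow API. The latitude-like sample values and the boundaries are made up. In the lab, one option would be `tf.feature_column.bucketized_column`, which takes a numeric column and a sorted list of boundaries.

```python
import numpy as np

# Made-up latitude-like values and boundaries, for illustration only.
values = np.array([32.5, 34.1, 37.8, 41.9])
boundaries = [33.0, 35.0, 38.0, 40.0]

# Bucket index i means boundaries[i-1] <= value < boundaries[i].
# Index 0 is "below the first boundary"; len(boundaries) is
# "at or above the last boundary".
bucket_ids = np.digitize(values, boundaries)

# One-hot encode: one column per bucket, so a linear model can learn a
# separate weight for each interval.
one_hot = np.eye(len(boundaries) + 1)[bucket_ids]

print(bucket_ids)
print(one_hot)
```

With four boundaries there are five buckets, so each value becomes a five-element vector with a single 1.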
In [ ]:
# Create estimator train and evaluate function
def train_and_evaluate(output_dir, num_train_steps):
    # TODO: Create tf.estimator.LinearRegressor, train_spec, eval_spec, and train_and_evaluate using your feature columns
    raise NotImplementedError("Complete the TODO above before running the model")
In [1]:
OUTDIR = './trained_model'
In [ ]:
# Run the model
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
tf.compat.v1.summary.FileWriterCache.clear()
train_and_evaluate(OUTDIR, 2000)