In [ ]:
# Ensure that TensorFlow 1.13.1 is installed.
!pip3 freeze | grep tensorflow==1.13.1 || pip3 install tensorflow==1.13.1

In [2]:
import tensorflow as tf

tf.enable_eager_execution()
tf.logging.set_verbosity(tf.logging.ERROR)

Intro

The tf.feature_column package provides several options for encoding categorical data. This mini-lab gives you an opportunity to explore and understand these options.


In [3]:
# Toy Features Dictionary

features = {"sq_footage": [1000, 2000, 3000, 4000, 5000],
            "house_type": ["house", "house", "apt", "apt", "townhouse"]}

Feature Column Definition

We have one continuous feature and one categorical feature.

Note that the category 'townhouse' is outside of our vocabulary list (OOV for short).


In [4]:
feat_cols = [
    tf.feature_column.numeric_column('sq_footage'),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'house_type',['house','apt']
        ))
]

Inspect Transformed Data

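This is what the input to your model looks like after the features are transformed by the feature column specification. Note that input_layer orders the columns alphabetically by name, so the two one-hot house_type columns come first and sq_footage comes last.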


In [5]:
tf.feature_column.input_layer(features,feat_cols)


Out[5]:
<tf.Tensor: id=51, shape=(5, 3), dtype=float32, numpy=
array([[1.e+00, 0.e+00, 1.e+03],
       [1.e+00, 0.e+00, 2.e+03],
       [0.e+00, 1.e+00, 3.e+03],
       [0.e+00, 1.e+00, 4.e+03],
       [0.e+00, 0.e+00, 5.e+03]], dtype=float32)>
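
To make the output above concrete, here is a minimal plain-Python sketch of the same transformation. It is an illustration only, not TensorFlow's implementation; the helper names one_hot and transform are hypothetical and exist only for this example.

```python
# Plain-Python illustration (not TensorFlow code) of the indicator + numeric encoding.
features = {
    "sq_footage": [1000, 2000, 3000, 4000, 5000],
    "house_type": ["house", "house", "apt", "apt", "townhouse"],
}
vocabulary = ["house", "apt"]


def one_hot(value, vocabulary):
    """Return a one-hot list; values outside the vocabulary give all zeros."""
    return [1.0 if value == word else 0.0 for word in vocabulary]


def transform(features, vocabulary):
    """Build one row per example: one-hot house_type first, then sq_footage."""
    rows = []
    for house_type, sq_footage in zip(features["house_type"], features["sq_footage"]):
        rows.append(one_hot(house_type, vocabulary) + [float(sq_footage)])
    return rows


rows = transform(features, vocabulary)
for row in rows:
    print(row)
```

The last row has zeros in both house_type positions because 'townhouse' matches no vocabulary entry.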

Exercise 1

What is the current encoding behavior for the OOV value? Currently it is ignored: the one-hot portion of the feature vector is set to all zeros (see the last row of the output above), while sq_footage is passed through unchanged.

Modify the feature column to have OOV values default to the 'house' category.


In [7]:
feat_cols = [
    tf.feature_column.numeric_column('sq_footage'),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'house_type',['house','apt'], default_value=0
        ))
]

tf.feature_column.input_layer(features,feat_cols)


Out[7]:
<tf.Tensor: id=115, shape=(5, 3), dtype=float32, numpy=
array([[1.e+00, 0.e+00, 1.e+03],
       [1.e+00, 0.e+00, 2.e+03],
       [0.e+00, 1.e+00, 3.e+03],
       [0.e+00, 1.e+00, 4.e+03],
       [1.e+00, 0.e+00, 5.e+03]], dtype=float32)>
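
The effect of default_value can be sketched in plain Python. In categorical_column_with_vocabulary_list the parameter defaults to -1, an index that matches no one-hot position, which is why the OOV row was all zeros before. The helpers below (lookup_with_default, one_hot_from_index) are hypothetical names for this sketch and are not part of TensorFlow.

```python
# Plain-Python illustration (not TensorFlow code) of default_value for OOV inputs.
vocabulary = ["house", "apt"]


def lookup_with_default(value, vocabulary, default_value=-1):
    """Return the vocabulary index of value, or default_value if it is missing."""
    return vocabulary.index(value) if value in vocabulary else default_value


def one_hot_from_index(index, depth):
    """One-hot encode index; an index outside [0, depth) yields all zeros."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


for house_type in ["house", "apt", "townhouse"]:
    # default_value=-1: the OOV value is ignored (all zeros).
    ignored = one_hot_from_index(lookup_with_default(house_type, vocabulary), 2)
    # default_value=0: the OOV value is treated as 'house'.
    as_house = one_hot_from_index(
        lookup_with_default(house_type, vocabulary, default_value=0), 2)
    print(house_type, ignored, as_house)
```

One consequence of this choice is that the model can no longer distinguish a genuine 'house' from an unknown category.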

Exercise 2

Now modify the feature column to have OOV values be assigned to a separate 'catch-all' category.


In [8]:
feat_cols = [
    tf.feature_column.numeric_column('sq_footage'),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'house_type',['house','apt'], num_oov_buckets=1
        ))
]

tf.feature_column.input_layer(features,feat_cols)


Out[8]:
<tf.Tensor: id=172, shape=(5, 4), dtype=float32, numpy=
array([[1.e+00, 0.e+00, 0.e+00, 1.e+03],
       [1.e+00, 0.e+00, 0.e+00, 2.e+03],
       [0.e+00, 1.e+00, 0.e+00, 3.e+03],
       [0.e+00, 1.e+00, 0.e+00, 4.e+03],
       [0.e+00, 0.e+00, 1.e+00, 5.e+03]], dtype=float32)>
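
A rough plain-Python sketch of the catch-all idea: in-vocabulary values keep their index, and every other value is hashed into one of num_oov_buckets extra slots placed after the vocabulary. The function name lookup_with_oov_buckets, the extra value 'condo', and the use of zlib.crc32 as the hash are assumptions made for this illustration; TensorFlow uses its own hash function.

```python
import zlib

# Plain-Python illustration (not TensorFlow code) of num_oov_buckets.
vocabulary = ["house", "apt"]


def lookup_with_oov_buckets(value, vocabulary, num_oov_buckets=1):
    """Map in-vocabulary values to their index and all others to an OOV bucket."""
    if value in vocabulary:
        return vocabulary.index(value)
    # zlib.crc32 is a stand-in hash chosen for this sketch.
    return len(vocabulary) + (zlib.crc32(value.encode("utf-8")) % num_oov_buckets)


depth = len(vocabulary) + 1  # two vocabulary slots plus one catch-all slot
for house_type in ["house", "apt", "townhouse", "condo"]:
    index = lookup_with_oov_buckets(house_type, vocabulary)
    print(house_type, [1.0 if i == index else 0.0 for i in range(depth)])
```

With a single OOV bucket, every unknown value lands in the same slot, so the model learns one shared weight for all unseen categories.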

Exercise 3

Assume we didn't have a vocabulary list available. Modify the feature column to one-hot encode house type based on a hash function.

What is the minimum hash bucket size to ensure no collisions? For the three categories in this toy data, 5 is the minimum. With a hash bucket size of 2, all categories collide. With a size of 3 or 4, 'house' and 'townhouse' collide.


In [14]:
feat_cols = [
    tf.feature_column.numeric_column('sq_footage'),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_hash_bucket(
            'house_type', hash_bucket_size=5
        ))
]

tf.feature_column.input_layer(features,feat_cols)


Out[14]:
<tf.Tensor: id=382, shape=(5, 6), dtype=float32, numpy=
array([[1.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 1.e+03],
       [1.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 2.e+03],
       [0.e+00, 0.e+00, 0.e+00, 0.e+00, 1.e+00, 3.e+03],
       [0.e+00, 0.e+00, 0.e+00, 0.e+00, 1.e+00, 4.e+03],
       [0.e+00, 0.e+00, 1.e+00, 0.e+00, 0.e+00, 5.e+03]], dtype=float32)>
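
The bucket-size experiment above can be mimicked in plain Python. This sketch uses MD5 as a stand-in hash, which is an assumption made purely for illustration; TensorFlow uses a different hash, so the bucket indices and the sizes at which collisions occur will generally not match the notebook's results. The helper names hash_bucket and has_collision are hypothetical.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket index in [0, hash_bucket_size)."""
    # MD5 is a stand-in for this sketch; it is not the hash TensorFlow uses.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def has_collision(values, hash_bucket_size):
    """Return True if any two distinct values share a bucket."""
    buckets = [hash_bucket(v, hash_bucket_size) for v in set(values)]
    return len(set(buckets)) < len(buckets)


categories = ["house", "apt", "townhouse"]
for size in range(2, 8):
    buckets = {c: hash_bucket(c, size) for c in categories}
    status = "collision" if has_collision(categories, size) else "no collision"
    print(size, buckets, status)
```

Whatever the hash, three categories cannot fit into two buckets without at least one collision; beyond that, whether a given size is collision-free depends on the hash function and the specific strings.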