Classification Exercise - Solutions

We'll be working with some California Census Data and using various features of an individual to predict which income class they belong to (>50k or <=50k).

Here is some information about the data:

Column Name     Type         Description
age             Continuous   The age of the individual.
workclass       Categorical  The type of employer the individual has (government, military, private, etc.).
fnlwgt          Continuous   The number of people the census takers believe that observation represents (sample weight). This variable will not be used.
education       Categorical  The highest level of education achieved for that individual.
education_num   Continuous   The highest level of education in numerical form.
marital_status  Categorical  Marital status of the individual.
occupation      Categorical  The occupation of the individual.
relationship    Categorical  Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race            Categorical  White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
gender          Categorical  Female, Male.
capital_gain    Continuous   Capital gains recorded.
capital_loss    Continuous   Capital losses recorded.
hours_per_week  Continuous   Hours worked per week.
native_country  Categorical  Country of origin of the individual.
income          Categorical  ">50K" or "<=50K", meaning whether the person makes more than $50,000 annually.

Follow the Directions in Bold. If you get stuck, check out the solutions lecture.

THE DATA

Read in the census_data.csv data with pandas


In [1]:
import pandas as pd

In [2]:
census = pd.read_csv("./data/census_data.csv")

In [3]:
census.head()


Out[3]:
age workclass education education_num marital_status occupation relationship race gender capital_gain capital_loss hours_per_week native_country income_bracket
0 39 State-gov Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

TensorFlow won't be able to understand strings as labels, so you'll need to use the pandas .apply() method to apply a custom function that converts them to 0s and 1s. This might be hard if you aren't very familiar with pandas, so feel free to take a peek at the solutions for this part.

Convert the Label column to 0s and 1s instead of strings.


In [4]:
census['income_bracket'].unique()


Out[4]:
array([' <=50K', ' >50K'], dtype=object)

In [5]:
def label_fix(label):
    if label==' <=50K':
        return 0
    else:
        return 1

In [6]:
# Apply the function to every value in the income_bracket column
census['income_bracket'] = census['income_bracket'].apply(label_fix)

In [7]:
# Alternative: a lambda that returns the same 0/1 labels as label_fix
# lambda label: int(label != ' <=50K')

# census['income_bracket'].apply(lambda label: int(label != ' <=50K'))
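The label strings in this file carry a leading space (note ' <=50K' in Out[4]), so an exact-match comparison is easy to get wrong. As a minimal sketch, the hypothetical helper label_to_int below (not part of the original solution) strips whitespace before comparing and raises an error on unexpected values instead of silently mapping them to 1. It is plain Python, so it could be passed to .apply() the same way as label_fix.

```python
def label_to_int(label):
    """Map an income label to 0 (<=50K) or 1 (>50K), ignoring surrounding spaces."""
    cleaned = label.strip()
    if cleaned == '<=50K':
        return 0
    if cleaned == '>50K':
        return 1
    raise ValueError(f"Unexpected income label: {label!r}")

# Example labels in the same form as the output of Out[4]
raw_labels = [' <=50K', ' >50K', ' <=50K']
converted = [label_to_int(label) for label in raw_labels]
print(converted)  # [0, 1, 0]
```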

Perform a Train Test Split on the Data


In [8]:
from sklearn.model_selection import train_test_split

In [9]:
x_data = census.drop('income_bracket', axis=1)
y_labels = census['income_bracket']
X_train, X_test, y_train, y_test = train_test_split(
    x_data, y_labels, test_size=0.3, random_state=101)

In [10]:
x_data.head()


Out[10]:
age workclass education education_num marital_status occupation relationship race gender capital_gain capital_loss hours_per_week native_country
0 39 State-gov Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States
1 50 Self-emp-not-inc Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States
2 38 Private HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States
3 53 Private 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States
4 28 Private Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba

In [11]:
y_labels.head()


Out[11]:
0    0
1    0
2    0
3    0
4    0
Name: income_bracket, dtype: int64
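With test_size = 0.3, roughly 30% of the rows are held out for testing. As a rough sketch of the arithmetic only (it does not reproduce the exact rows scikit-learn selects), the snippet below shuffles row indices with a seeded generator and cuts them into train and test parts, rounding the test share up. The row count of 32,561 is an assumption, chosen because it is consistent with the 9,769 test rows shown in the classification report later on. split_indices is a hypothetical helper.

```python
import math
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_rows)  # test share rounded up
    return indices[n_test:], indices[:n_test]

# n_rows=32561 is an assumed row count, not read from the CSV
train_idx, test_idx = split_indices(n_rows=32561, test_size=0.3, seed=101)
print(len(train_idx), len(test_idx))  # 22792 9769
```

Fixing the seed plays the same role as random_state: the same rows land in the test set on every run, which keeps results comparable.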

Create the Feature Columns for tf.estimator

Take note of categorical vs continuous values!


In [12]:
census.columns


Out[12]:
Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income_bracket'],
      dtype='object')

Import TensorFlow


In [13]:
import tensorflow as tf

Create the tf.feature_columns for the categorical values. Use vocabulary lists or just use hash buckets.


In [14]:
gender = tf.feature_column.categorical_column_with_vocabulary_list("gender", ["Female", "Male"])
occupation = tf.feature_column.categorical_column_with_hash_bucket("occupation", hash_bucket_size=1000)
marital_status = tf.feature_column.categorical_column_with_hash_bucket("marital_status", hash_bucket_size=1000)
relationship = tf.feature_column.categorical_column_with_hash_bucket("relationship", hash_bucket_size=1000)
education = tf.feature_column.categorical_column_with_hash_bucket("education", hash_bucket_size=1000)
workclass = tf.feature_column.categorical_column_with_hash_bucket("workclass", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket("native_country", hash_bucket_size=1000)
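The two kinds of categorical column differ in how they turn a string into an integer ID. The following is a plain-Python sketch of the idea, not TensorFlow's implementation. vocabulary_id and hash_bucket_id are hypothetical helpers, and MD5 stands in for TensorFlow's internal hash, so the bucket numbers will not match the ones TensorFlow assigns.

```python
import hashlib

def vocabulary_id(value, vocabulary, default=-1):
    """Return the position of value in a fixed vocabulary, or default if absent."""
    return vocabulary.index(value) if value in vocabulary else default

def hash_bucket_id(value, hash_bucket_size):
    """Map any string to a bucket in [0, hash_bucket_size) via a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()  # stand-in hash
    return int(digest, 16) % hash_bucket_size

print(vocabulary_id("Male", ["Female", "Male"]))     # 1
print(vocabulary_id("Unknown", ["Female", "Male"]))  # -1
print(hash_bucket_id("Exec-managerial", 1000))       # an ID between 0 and 999
```

A vocabulary list needs every category spelled out up front, exactly as it appears in the data. A hash bucket needs no list, but two categories can collide in the same bucket, which is one reason to pick a bucket size well above the number of distinct values.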

Create the continuous feature_columns for the continuous values using numeric_column


In [15]:
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

Put all these variables into a single list with the variable name feat_cols


In [16]:
feat_cols = [gender, occupation, marital_status, relationship, education, workclass, native_country,
            age, education_num, capital_gain, capital_loss, hours_per_week]

Create Input Function

The batch_size is up to you, but do make sure to shuffle!


In [17]:
input_func = tf.estimator.inputs.pandas_input_fn(x = X_train,
                                                 y = y_train,
                                                 batch_size = 100,
                                                 num_epochs = None,
                                                 shuffle = True)
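Conceptually, an input function with these settings hands the model shuffled batches of 100 rows and, because num_epochs is None, keeps cycling through the data for as long as training asks for more. The generator below is a simplified sketch of that behaviour in plain Python. batch_indices is a hypothetical helper, and details such as how the real input function treats a partial final batch may differ.

```python
import itertools
import random

def batch_indices(n_rows, batch_size, num_epochs=None, shuffle=True, seed=0):
    """Yield lists of row indices, one list per batch."""
    rng = random.Random(seed)
    # num_epochs=None means "repeat forever"
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        order = list(range(n_rows))
        if shuffle:
            rng.shuffle(order)  # new row order on every pass
        for start in range(0, n_rows, batch_size):
            yield order[start:start + batch_size]

# With num_epochs=None the generator never runs out, so take only a few batches.
first_batches = list(itertools.islice(batch_indices(250, 100), 5))
print([len(batch) for batch in first_batches])  # [100, 100, 50, 100, 100]
```

This is why training is bounded by steps rather than by the data: with an endless stream of batches, steps = 5000 is what stops the loop.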

Create your model with tf.estimator

Create a LinearClassifier. (If you want to use a DNNClassifier, keep in mind you'll need to create embedded columns out of the categorical features that use strings; check out the previous lecture on this for more info.)


In [18]:
model = tf.estimator.LinearClassifier(feature_columns = feat_cols)


INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\ARCYFE~1\AppData\Local\Temp\tmpjm5hetiu
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\ARCYFE~1\\AppData\\Local\\Temp\\tmpjm5hetiu', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001683B8D53C8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

Train your model on the data, for at least 5000 steps.


In [19]:
model.train(input_fn = input_func,
            steps = 5000)


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpjm5hetiu\model.ckpt.
INFO:tensorflow:loss = 69.31472, step = 0
INFO:tensorflow:global_step/sec: 29.578
INFO:tensorflow:loss = 151.76192, step = 100 (3.376 sec)
INFO:tensorflow:global_step/sec: 45.3458
INFO:tensorflow:loss = 90.81382, step = 200 (2.206 sec)
INFO:tensorflow:global_step/sec: 48.6777
INFO:tensorflow:loss = 482.4537, step = 300 (2.054 sec)
INFO:tensorflow:global_step/sec: 45.8398
INFO:tensorflow:loss = 11.708162, step = 400 (2.184 sec)
INFO:tensorflow:global_step/sec: 45.9213
INFO:tensorflow:loss = 41.846344, step = 500 (2.178 sec)
INFO:tensorflow:global_step/sec: 33.7949
INFO:tensorflow:loss = 117.881996, step = 600 (2.957 sec)
INFO:tensorflow:global_step/sec: 32.5946
INFO:tensorflow:loss = 224.32774, step = 700 (3.070 sec)
INFO:tensorflow:global_step/sec: 41.4558
INFO:tensorflow:loss = 433.41843, step = 800 (2.415 sec)
INFO:tensorflow:global_step/sec: 45.7607
INFO:tensorflow:loss = 107.56831, step = 900 (2.180 sec)
INFO:tensorflow:global_step/sec: 45.8024
INFO:tensorflow:loss = 110.971, step = 1000 (2.183 sec)
INFO:tensorflow:global_step/sec: 43.4827
INFO:tensorflow:loss = 89.823654, step = 1100 (2.303 sec)
INFO:tensorflow:global_step/sec: 44.8985
INFO:tensorflow:loss = 94.041374, step = 1200 (2.224 sec)
INFO:tensorflow:global_step/sec: 45.5055
INFO:tensorflow:loss = 45.310684, step = 1300 (2.199 sec)
INFO:tensorflow:global_step/sec: 45.6137
INFO:tensorflow:loss = 32.59154, step = 1400 (2.191 sec)
INFO:tensorflow:global_step/sec: 46.1744
INFO:tensorflow:loss = 113.06632, step = 1500 (2.165 sec)
INFO:tensorflow:global_step/sec: 42.2972
INFO:tensorflow:loss = 501.66144, step = 1600 (2.364 sec)
INFO:tensorflow:global_step/sec: 47.2076
INFO:tensorflow:loss = 58.499954, step = 1700 (2.119 sec)
INFO:tensorflow:global_step/sec: 48.2551
INFO:tensorflow:loss = 586.60266, step = 1800 (2.072 sec)
INFO:tensorflow:global_step/sec: 45.5523
INFO:tensorflow:loss = 33.003986, step = 1900 (2.198 sec)
INFO:tensorflow:global_step/sec: 43.9898
INFO:tensorflow:loss = 54.28188, step = 2000 (2.273 sec)
INFO:tensorflow:global_step/sec: 41.4558
INFO:tensorflow:loss = 45.152153, step = 2100 (2.410 sec)
INFO:tensorflow:global_step/sec: 42.7557
INFO:tensorflow:loss = 47.01676, step = 2200 (2.340 sec)
INFO:tensorflow:global_step/sec: 44.4196
INFO:tensorflow:loss = 789.3104, step = 2300 (2.249 sec)
INFO:tensorflow:global_step/sec: 45.4076
INFO:tensorflow:loss = 56.68877, step = 2400 (2.202 sec)
INFO:tensorflow:global_step/sec: 44.688
INFO:tensorflow:loss = 50.10916, step = 2500 (2.238 sec)
INFO:tensorflow:global_step/sec: 43.1688
INFO:tensorflow:loss = 84.51349, step = 2600 (2.318 sec)
INFO:tensorflow:global_step/sec: 47.4764
INFO:tensorflow:loss = 39.5674, step = 2700 (2.105 sec)
INFO:tensorflow:global_step/sec: 47.0743
INFO:tensorflow:loss = 299.21976, step = 2800 (2.123 sec)
INFO:tensorflow:global_step/sec: 49.8419
INFO:tensorflow:loss = 34.14011, step = 2900 (2.005 sec)
INFO:tensorflow:global_step/sec: 49.4478
INFO:tensorflow:loss = 214.40697, step = 3000 (2.026 sec)
INFO:tensorflow:global_step/sec: 45.573
INFO:tensorflow:loss = 155.11523, step = 3100 (2.191 sec)
INFO:tensorflow:global_step/sec: 49.2773
INFO:tensorflow:loss = 33.89815, step = 3200 (2.032 sec)
INFO:tensorflow:global_step/sec: 45.4901
INFO:tensorflow:loss = 31.836535, step = 3300 (2.195 sec)
INFO:tensorflow:global_step/sec: 45.2228
INFO:tensorflow:loss = 34.90007, step = 3400 (2.212 sec)
INFO:tensorflow:global_step/sec: 44.6774
INFO:tensorflow:loss = 88.288025, step = 3500 (2.240 sec)
INFO:tensorflow:global_step/sec: 43.7589
INFO:tensorflow:loss = 49.503735, step = 3600 (2.281 sec)
INFO:tensorflow:global_step/sec: 46.0981
INFO:tensorflow:loss = 60.929817, step = 3700 (2.170 sec)
INFO:tensorflow:global_step/sec: 46.0768
INFO:tensorflow:loss = 90.945564, step = 3800 (2.170 sec)
INFO:tensorflow:global_step/sec: 44.3211
INFO:tensorflow:loss = 44.933937, step = 3900 (2.255 sec)
INFO:tensorflow:global_step/sec: 45.3663
INFO:tensorflow:loss = 299.89456, step = 4000 (2.205 sec)
INFO:tensorflow:global_step/sec: 47.3968
INFO:tensorflow:loss = 32.051506, step = 4100 (2.113 sec)
INFO:tensorflow:global_step/sec: 47.7618
INFO:tensorflow:loss = 35.97357, step = 4200 (2.091 sec)
INFO:tensorflow:global_step/sec: 46.1199
INFO:tensorflow:loss = 270.05746, step = 4300 (2.172 sec)
INFO:tensorflow:global_step/sec: 33.4308
INFO:tensorflow:loss = 97.3794, step = 4400 (2.987 sec)
INFO:tensorflow:global_step/sec: 50.5973
INFO:tensorflow:loss = 33.93995, step = 4500 (1.979 sec)
INFO:tensorflow:global_step/sec: 46.2944
INFO:tensorflow:loss = 31.607256, step = 4600 (2.163 sec)
INFO:tensorflow:global_step/sec: 39.5915
INFO:tensorflow:loss = 33.775787, step = 4700 (2.519 sec)
INFO:tensorflow:global_step/sec: 47.7239
INFO:tensorflow:loss = 30.90573, step = 4800 (2.103 sec)
INFO:tensorflow:global_step/sec: 43.4359
INFO:tensorflow:loss = 32.88506, step = 4900 (2.295 sec)
INFO:tensorflow:Saving checkpoints for 5000 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpjm5hetiu\model.ckpt.
INFO:tensorflow:Loss for final step: 34.121613.
Out[19]:
<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x1683c379cf8>
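The first logged loss, 69.31472, has a simple explanation if two assumptions hold: the linear model starts with all weights at zero, so every example gets a predicted probability of 0.5, and the reported loss is the cross-entropy summed (not averaged) over the 100 examples in a batch. Under those assumptions the starting loss is 100 times ln 2:

```python
import math

batch_size = 100          # matches the batch_size passed to the input function
p = 0.5                   # assumed initial prediction for every example
per_example_loss = -math.log(p)            # cross-entropy of a 50/50 guess
initial_loss = batch_size * per_example_loss
print(round(initial_loss, 5))  # 69.31472
```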

Evaluation

Create a prediction input function. Remember to supply only the X_test data and keep shuffle=False.


In [20]:
pred_fn = tf.estimator.inputs.pandas_input_fn(x = X_test,
                                              batch_size = len(X_test),
                                              shuffle = False)

Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then transform into a list with list().


In [21]:
predictions = list(model.predict(input_fn = pred_fn))


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ARCYFE~1\AppData\Local\Temp\tmpjm5hetiu\model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

Each item in your list will look like this:


In [22]:
predictions[0]


Out[22]:
{'class_ids': array([0], dtype=int64),
 'classes': array([b'0'], dtype=object),
 'logistic': array([0.23900884], dtype=float32),
 'logits': array([-1.1581213], dtype=float32),
 'probabilities': array([0.76099116, 0.23900878], dtype=float32)}
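The fields in this dictionary are related in the way one would expect for a binary logistic model: logistic is the sigmoid of logits, probabilities holds [1 - p, p], and class_ids is the index of the larger probability. A quick check in plain Python using the logit from Out[22] (small differences in the last digits come from float32 rounding):

```python
import math

logit = -1.1581213                            # 'logits' value from Out[22]
p_positive = 1.0 / (1.0 + math.exp(-logit))   # sigmoid, compare with 'logistic'
probabilities = [1.0 - p_positive, p_positive]
class_id = probabilities.index(max(probabilities))

print(round(p_positive, 4))  # 0.239
print(class_id)              # 0
```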

Create a list of only the class_ids key values from the prediction list of dictionaries. These are the predictions you will use to compare against the real y_test values.


In [23]:
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])

In [24]:
final_preds[:10]


Out[24]:
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
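The same extraction can be written as a list comprehension. The sketch below uses a small made-up stand-in for the predictions list (plain lists instead of NumPy arrays, with illustrative values only) so that it runs on its own:

```python
# Made-up stand-in for the list returned by model.predict();
# in the notebook the values are NumPy arrays.
sample_predictions = [
    {'class_ids': [0], 'probabilities': [0.76, 0.24]},
    {'class_ids': [1], 'probabilities': [0.31, 0.69]},
    {'class_ids': [0], 'probabilities': [0.88, 0.12]},
]

# One predicted class per example, taken from the single-element class_ids entry
sample_class_ids = [pred['class_ids'][0] for pred in sample_predictions]
print(sample_class_ids)  # [0, 1, 0]
```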

Import classification_report from sklearn.metrics and then see if you can figure out how to use it to easily get a full report of your model's performance on the test data.


In [25]:
from sklearn.metrics import classification_report

In [26]:
print(classification_report(y_test, final_preds))


             precision    recall  f1-score   support

          0       0.87      0.91      0.89      7436
          1       0.68      0.58      0.63      2333

avg / total       0.83      0.83      0.83      9769

Metrics in binary classification

\begin{equation} \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \end{equation}
\begin{equation} \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \end{equation}
\begin{equation} \text{F1 score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \end{equation}
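To make these definitions concrete, here is a small sketch that counts true positives, false positives and false negatives directly. binary_metrics is a hypothetical helper and the toy labels are made up. The last lines recompute the F1 score for class 1 from the rounded precision and recall in the report above, so agreement to two decimals is expected but not guaranteed in general.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels: 3 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]
print(binary_metrics(y_true_toy, y_pred_toy))  # (0.75, 0.75, 0.75)

# F1 for class 1 from the rounded report values (precision 0.68, recall 0.58)
f1_class_1 = 2 * 0.68 * 0.58 / (0.68 + 0.58)
print(round(f1_class_1, 2))  # 0.63
```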

Great Job!