We'll be working with some California Census Data, trying to use various features of an individual to predict which income class they belong to (>50K or <=50K).
Here is some information about the data:
Column Name | Type | Description |
---|---|---|
age | Continuous | The age of the individual |
workclass | Categorical | The type of employer the individual has (government, military, private, etc.). |
fnlwgt | Continuous | The number of people the census takers believe that observation represents (sample weight). This variable will not be used. |
education | Categorical | The highest level of education achieved for that individual. |
education_num | Continuous | The highest level of education in numerical form. |
marital_status | Categorical | Marital status of the individual. |
occupation | Categorical | The occupation of the individual. |
relationship | Categorical | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. |
race | Categorical | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. |
gender | Categorical | Female, Male. |
capital_gain | Continuous | Capital gains recorded. |
capital_loss | Continuous | Capital Losses recorded. |
hours_per_week | Continuous | Hours worked per week. |
native_country | Categorical | Country of origin of the individual. |
income | Categorical | ">50K" or "<=50K", meaning whether the person makes more than \$50,000 annually. |
Read in the census_data.csv data with pandas
In [1]:
import pandas as pd
In [2]:
census = pd.read_csv("./data/census_data.csv")
In [3]:
census.head()
Out[3]:
TensorFlow won't be able to understand strings as labels, so you'll need to use pandas' .apply() method to apply a custom function that converts them to 0s and 1s. This might be tricky if you aren't very familiar with pandas, so feel free to take a peek at the solutions for this part.
Convert the Label column to 0s and 1s instead of strings.
In [4]:
census['income_bracket'].unique()
Out[4]:
In [5]:
def label_fix(label):
    if label == ' <=50K':
        return 0
    else:
        return 1
In [6]:
# Applying the function to every value of the income_bracket column
census['income_bracket'] = census['income_bracket'].apply(label_fix)
In [7]:
# Alternative (note: compare against ' >50K' so that >50K maps to 1,
# matching label_fix above; the labels carry a leading space)
# census['income_bracket'].apply(lambda label: int(label == ' >50K'))
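For a column-level alternative, pandas can also do this conversion without .apply() at all: a vectorized comparison followed by a cast. The sketch below uses a small hypothetical frame standing in for the census data, with the same leading-space labels this dataset uses.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the census data
df = pd.DataFrame({'income_bracket': [' <=50K', ' >50K', ' <=50K']})

# Vectorized equivalent of .apply(label_fix): compare against ' >50K'
# (leading space included) and cast the boolean Series to int
df['income_bracket'] = (df['income_bracket'] == ' >50K').astype(int)
```

The vectorized form is typically faster than .apply() on large frames, since the comparison runs in C rather than calling a Python function per row.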
In [8]:
from sklearn.model_selection import train_test_split
In [9]:
x_data = census.drop('income_bracket', axis = 1)
y_labels = census['income_bracket']
X_train, X_test, y_train, y_test = train_test_split(x_data, y_labels, test_size = 0.3,random_state = 101)
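As a quick sanity check on what train_test_split does here: test_size=0.3 holds out 30% of the rows, and fixing random_state makes the split reproducible. A minimal sketch with a toy frame (hypothetical column names, standing in for the census data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the census data
df = pd.DataFrame({'feature': range(10), 'income_bracket': [0, 1] * 5})

X = df.drop('income_bracket', axis=1)
y = df['income_bracket']

# 30% of the 10 rows (3) go to the test set; random_state pins the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=101)
```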
In [10]:
x_data.head()
Out[10]:
In [11]:
y_labels.head()
Out[11]:
In [12]:
census.columns
Out[12]:
Import TensorFlow
In [13]:
import tensorflow as tf
Create the tf.feature_columns for the categorical values. Use vocabulary lists or just use hash buckets.
In [14]:
gender = tf.feature_column.categorical_column_with_vocabulary_list("gender", ["Female", "Male"])
occupation = tf.feature_column.categorical_column_with_hash_bucket("occupation", hash_bucket_size=1000)
marital_status = tf.feature_column.categorical_column_with_hash_bucket("marital_status", hash_bucket_size=1000)
relationship = tf.feature_column.categorical_column_with_hash_bucket("relationship", hash_bucket_size=1000)
education = tf.feature_column.categorical_column_with_hash_bucket("education", hash_bucket_size=1000)
workclass = tf.feature_column.categorical_column_with_hash_bucket("workclass", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket("native_country", hash_bucket_size=1000)
Create the feature columns for the continuous values using numeric_column
In [15]:
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")
Put all these variables into a single list with the variable name feat_cols
In [16]:
feat_cols = [gender, occupation, marital_status, relationship, education, workclass, native_country,
age, education_num, capital_gain, capital_loss, hours_per_week]
In [17]:
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train,
                                                 y=y_train,
                                                 batch_size=100,
                                                 num_epochs=None,
                                                 shuffle=True)
In [18]:
model = tf.estimator.LinearClassifier(feature_columns = feat_cols)
Train your model on the data, for at least 5000 steps.
In [19]:
model.train(input_fn=input_func, steps=5000)
Out[19]:
In [20]:
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test,
                                              batch_size=len(X_test),
                                              shuffle=False)
Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then convert to a list with list().
In [21]:
predictions = list(model.predict(input_fn = pred_fn))
Each item in your list will look like this:
In [22]:
predictions[0]
Out[22]:
Create a list containing only the class_ids values from the list of prediction dictionaries; these are the predictions you will compare against the real y_test values.
In [23]:
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])
In [24]:
final_preds[:10]
Out[24]:
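The extraction loop above can also be written as a one-line list comprehension. The sketch below uses a hypothetical predictions list shaped like the Estimator output (each item a dict holding a class_ids array):

```python
# Hypothetical predictions shaped like the Estimator output
predictions = [{'class_ids': [0]}, {'class_ids': [1]}, {'class_ids': [0]}]

# One-line equivalent of the for-loop: pull the first class_ids entry
# out of each prediction dict
final_preds = [pred['class_ids'][0] for pred in predictions]
```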
Import classification_report from sklearn.metrics and then see if you can figure out how to use it to easily get a full report of your model's performance on the test data.
In [25]:
from sklearn.metrics import classification_report
In [26]:
print(classification_report(y_test, final_preds))
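Beyond classification_report, sklearn.metrics offers confusion_matrix and accuracy_score, which complement the per-class precision/recall view. A minimal sketch with toy labels standing in for y_test and final_preds:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels standing in for y_test / final_preds
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# Fraction of predictions that match the true labels
print(accuracy_score(y_true, y_pred))
```

The confusion matrix is often worth checking here, because the census labels are imbalanced (far more <=50K than >50K), so a high overall accuracy can hide poor recall on the >50K class.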