We'll be working with some California Census Data, trying to use various features of an individual to predict which income class they belong to (>50K or <=50K).
Here is some information about the data:
Column Name | Type | Description |
---|---|---|
age | Continuous | The age of the individual |
workclass | Categorical | The type of employer the individual has (government, military, private, etc.). |
fnlwgt | Continuous | The number of people the census takers believe that observation represents (sample weight). This variable will not be used. |
education | Categorical | The highest level of education achieved for that individual. |
education_num | Continuous | The highest level of education in numerical form. |
marital_status | Categorical | Marital status of the individual. |
occupation | Categorical | The occupation of the individual. |
relationship | Categorical | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. |
race | Categorical | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. |
gender | Categorical | Female, Male. |
capital_gain | Continuous | Capital gains recorded. |
capital_loss | Continuous | Capital Losses recorded. |
hours_per_week | Continuous | Hours worked per week. |
native_country | Categorical | Country of origin of the individual. |
income | Categorical | ">50K" or "<=50K", meaning whether the person makes more than \$50,000 annually. |
Read in the census_data.csv data with pandas
In [1]:
import pandas as pd
In [2]:
census = pd.read_csv("./data/census_data.csv")
In [3]:
census.head()
Out[3]:
TensorFlow won't be able to understand strings as labels, so you'll need to use pandas' .apply() method to apply a custom function that converts them to 0s and 1s. This might be tricky if you aren't very familiar with pandas, so feel free to take a peek at the solutions for this part.
Convert the Label column to 0s and 1s instead of strings.
In [4]:
census['income_bracket'].unique()
Out[4]:
In [5]:
def label_fix(label):
    if label == ' <=50K':
        return 0
    else:
        return 1
In [6]:
# Applying the function to every value of the income_bracket column
census['income_bracket'] = census['income_bracket'].apply(label_fix)
In [7]:
# Alternative (note: compare against ' >50K' so that >50K maps to 1,
# matching label_fix above; the labels carry a leading space)
# census['income_bracket'].apply(lambda label: int(label == ' >50K'))
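For a column-level alternative, pandas can also do this conversion without .apply() at all: a vectorized comparison followed by a cast. The sketch below uses a small hypothetical frame standing in for the census data, with the same leading-space labels this dataset uses.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the census data
df = pd.DataFrame({'income_bracket': [' <=50K', ' >50K', ' <=50K']})

# Vectorized equivalent of .apply(label_fix): compare against ' >50K'
# (leading space included) and cast the boolean Series to int
df['income_bracket'] = (df['income_bracket'] == ' >50K').astype(int)
```

The vectorized form is typically faster than .apply() on large frames, since the comparison runs in C rather than calling a Python function per row.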
In [8]:
from sklearn.model_selection import train_test_split
In [9]:
x_data = census.drop('income_bracket', axis = 1)
y_labels = census['income_bracket']
X_train, X_test, y_train, y_test = train_test_split(x_data, y_labels, test_size = 0.3,random_state = 101)
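As a quick sanity check on what train_test_split does here: test_size=0.3 holds out 30% of the rows, and fixing random_state makes the split reproducible. A minimal sketch with a toy frame (hypothetical column names, standing in for the census data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the census data
df = pd.DataFrame({'feature': range(10), 'income_bracket': [0, 1] * 5})

X = df.drop('income_bracket', axis=1)
y = df['income_bracket']

# 30% of the 10 rows (3) go to the test set; random_state pins the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=101)
```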
In [10]:
x_data.head()
Out[10]:
In [11]:
y_labels.head()
Out[11]:
In [12]:
census.columns
Out[12]:
Import TensorFlow
In [13]:
import tensorflow as tf
Create the tf.feature_columns for the categorical values. Use vocabulary lists or just use hash buckets.
In [14]:
gender = tf.feature_column.categorical_column_with_vocabulary_list("gender", ["Female", "Male"])
occupation = tf.feature_column.categorical_column_with_hash_bucket("occupation", hash_bucket_size=1000)
marital_status = tf.feature_column.categorical_column_with_hash_bucket("marital_status", hash_bucket_size=1000)
relationship = tf.feature_column.categorical_column_with_hash_bucket("relationship", hash_bucket_size=1000)
education = tf.feature_column.categorical_column_with_hash_bucket("education", hash_bucket_size=1000)
workclass = tf.feature_column.categorical_column_with_hash_bucket("workclass", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket("native_country", hash_bucket_size=1000)
Create the feature columns for the continuous values using numeric_column
In [15]:
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")
Put all these variables into a single list with the variable name feat_cols
In [16]:
feat_cols = [gender, occupation, marital_status, relationship, education, workclass, native_country,
age, education_num, capital_gain, capital_loss, hours_per_week]
In [17]:
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train,
                                                 y=y_train,
                                                 batch_size=100,
                                                 num_epochs=None,
                                                 shuffle=True)
In [18]:
model = tf.estimator.LinearClassifier(feature_columns = feat_cols)
Train your model on the data, for at least 5000 steps.
In [19]:
model.train(input_fn=input_func, steps=5000)
Out[19]:
In [20]:
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test,
                                              batch_size=len(X_test),
                                              shuffle=False)
Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then convert to a list with list().
In [21]:
predictions = list(model.predict(input_fn = pred_fn))
Each item in your list will look like this:
In [22]:
predictions[0]
Out[22]:
Create a list containing only the class_ids values from the list of prediction dictionaries; these are the predictions you will compare against the real y_test values.
In [23]:
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])
In [24]:
final_preds[:10]
Out[24]:
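The extraction loop above can also be written as a one-line list comprehension. The sketch below uses a hypothetical predictions list shaped like the Estimator output (each item a dict holding a class_ids array):

```python
# Hypothetical predictions shaped like the Estimator output
predictions = [{'class_ids': [0]}, {'class_ids': [1]}, {'class_ids': [0]}]

# One-line equivalent of the for-loop: pull the first class_ids entry
# out of each prediction dict
final_preds = [pred['class_ids'][0] for pred in predictions]
```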
Import classification_report from sklearn.metrics and then see if you can figure out how to use it to easily get a full report of your model's performance on the test data.
In [25]:
from sklearn.metrics import classification_report
In [26]:
print(classification_report(y_test, final_preds))
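Beyond classification_report, sklearn.metrics offers confusion_matrix and accuracy_score, which complement the per-class precision/recall view. A minimal sketch with toy labels standing in for y_test and final_preds:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels standing in for y_test / final_preds
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# Fraction of predictions that match the true labels
print(accuracy_score(y_true, y_pred))
```

The confusion matrix is often worth checking here, because the census labels are imbalanced (far more <=50K than >50K), so a high overall accuracy can hide poor recall on the >50K class.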