We'll be working with some California Census Data, trying to use various features of an individual to predict which income class they belong to (>50K or <=50K).
Here is some information about the data:
| Column Name | Type | Description | 
|---|---|---|
| age | Continuous | The age of the individual | 
| workclass | Categorical | The type of employer the individual has (government, military, private, etc.). | 
| fnlwgt | Continuous | The number of people the census takers believe that observation represents (sample weight). This variable will not be used. | 
| education | Categorical | The highest level of education achieved for that individual. | 
| education_num | Continuous | The highest level of education in numerical form. | 
| marital_status | Categorical | Marital status of the individual. | 
| occupation | Categorical | The occupation of the individual. | 
| relationship | Categorical | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. | 
| race | Categorical | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. | 
| gender | Categorical | Female, Male. | 
| capital_gain | Continuous | Capital gains recorded. | 
| capital_loss | Continuous | Capital Losses recorded. | 
| hours_per_week | Continuous | Hours worked per week. | 
| native_country | Categorical | Country of origin of the individual. | 
| income | Categorical | ">50K" or "<=50K", meaning whether the person makes more than \$50,000 annually. | 
Read in the census_data.csv data with pandas
In [1]:
    
import pandas as pd
    
In [2]:
    
census = pd.read_csv("./data/census_data.csv")
    
In [3]:
    
census.head()
    
    Out[3]:
TensorFlow won't be able to understand strings as labels, so you'll need to use pandas' .apply() method to apply a custom function that converts them to 0s and 1s. This might be hard if you aren't very familiar with pandas, so feel free to take a peek at the solutions for this part.
Convert the income_bracket label column to 0s and 1s instead of strings.
In [4]:
    
census['income_bracket'].unique()
    
    Out[4]:
In [5]:
    
# Note: the raw label strings contain a leading space (' <=50K' / ' >50K')
def label_fix(label):
    if label == ' <=50K':
        return 0
    else:
        return 1
    
In [6]:
    
# Apply the function to every value of the income_bracket column
census['income_bracket'] = census['income_bracket'].apply(label_fix)
    
In [7]:
    
# Alternative one-liner (the mapping must match label_fix: 1 for ' >50K', 0 otherwise)
# census['income_bracket'] = census['income_bracket'].apply(lambda label: int(label == ' >50K'))
    
In [8]:
    
from sklearn.model_selection import train_test_split
    
In [9]:
    
x_data = census.drop('income_bracket', axis = 1)
y_labels = census['income_bracket']
X_train, X_test, y_train, y_test = train_test_split(x_data, y_labels, test_size = 0.3,random_state = 101)
    
In [10]:
    
x_data.head()
    
    Out[10]:
In [11]:
    
y_labels.head()
    
    Out[11]:
In [12]:
    
census.columns
    
    Out[12]:
Import TensorFlow
In [13]:
    
import tensorflow as tf
    
Create the tf.feature_column entries for the categorical values. Use vocabulary lists or just use hash buckets.
In [14]:
    
gender = tf.feature_column.categorical_column_with_vocabulary_list("gender", ["Female", "Male"])
occupation = tf.feature_column.categorical_column_with_hash_bucket("occupation", hash_bucket_size=1000)
marital_status = tf.feature_column.categorical_column_with_hash_bucket("marital_status", hash_bucket_size=1000)
relationship = tf.feature_column.categorical_column_with_hash_bucket("relationship", hash_bucket_size=1000)
education = tf.feature_column.categorical_column_with_hash_bucket("education", hash_bucket_size=1000)
workclass = tf.feature_column.categorical_column_with_hash_bucket("workclass", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket("native_country", hash_bucket_size=1000)
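If you'd rather use a vocabulary list than a hash bucket for one of the smaller categorical columns, a minimal sketch is below (occupation_vocab is just an illustrative name; it takes the column's unique string values, including any leading whitespace, as-is):

# Alternative sketch: derive the vocabulary directly from the data
occupation_vocab = tf.feature_column.categorical_column_with_vocabulary_list(
    "occupation", census['occupation'].unique().tolist())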
    
Create the feature_columns for the continuous values using numeric_column
In [15]:
    
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")
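Optionally, a continuous column can also be bucketized so the linear model learns a separate weight per range; a minimal sketch with arbitrary boundaries (if you used it, you'd add age_buckets to feat_cols as well):

# Optional sketch: bucketize age into ranges (boundaries chosen arbitrarily for illustration)
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[20, 30, 40, 50, 60, 70])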
    
Put all these variables into a single list with the variable name feat_cols
In [16]:
    
feat_cols = [gender, occupation, marital_status, relationship, education, workclass, native_country,
            age, education_num, capital_gain, capital_loss, hours_per_week]
    
In [17]:
    
# Training input function: shuffled batches of 100 rows, repeating indefinitely (num_epochs=None)
input_func = tf.estimator.inputs.pandas_input_fn(x = X_train,
                                                 y = y_train,
                                                 batch_size = 100,
                                                 num_epochs = None,
                                                 shuffle = True)
    
In [18]:
    
# Linear classifier built on the feature columns defined above
model = tf.estimator.LinearClassifier(feature_columns = feat_cols)
    
    
Train your model on the data for at least 5000 steps.
In [19]:
    
model.train(input_fn = input_func,
            steps = 5000)
    
    
    Out[19]:
In [20]:
    
# Prediction input function: the entire test set in one batch, no labels, no shuffling
pred_fn = tf.estimator.inputs.pandas_input_fn(x = X_test,
                                              batch_size = len(X_test),
                                              shuffle = False)
    
Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then transform into a list with list().
In [21]:
    
predictions = list(model.predict(input_fn = pred_fn))
    
    
Each item in your list will look like this:
In [22]:
    
predictions[0]
    
    Out[22]:
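For a binary LinearClassifier, each item is a dictionary of prediction arrays; it typically includes keys such as the following (the values are placeholders here, not real output):

# Approximate structure of one prediction dictionary (placeholder values):
# {'logits': array([...], dtype=float32),
#  'logistic': array([...], dtype=float32),
#  'probabilities': array([..., ...], dtype=float32),
#  'class_ids': array([...]),
#  'classes': array([...], dtype=object)}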
Create a list containing only the class_ids key values from the list of prediction dictionaries. These are the predictions you will use to compare against the real y_test values.
In [23]:
    
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])
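Equivalently, the same list can be built with a comprehension:

# Same result as the loop above
# final_preds = [pred['class_ids'][0] for pred in predictions]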
    
In [24]:
    
final_preds[:10]
    
    Out[24]:
Import classification_report from sklearn.metrics and then see if you can figure out how to use it to easily get a full report of your model's performance on the test data.
In [25]:
    
from sklearn.metrics import classification_report
    
In [26]:
    
print(classification_report(y_test, final_preds))
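Beyond the classification report, the estimator itself can report metrics such as accuracy and AUC; a minimal sketch using an evaluation input function:

# Sketch: built-in evaluation on the test set
eval_fn = tf.estimator.inputs.pandas_input_fn(x = X_test,
                                              y = y_test,
                                              batch_size = len(X_test),
                                              shuffle = False)
results = model.evaluate(input_fn = eval_fn)
print(results)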