Predicting Earnings from Census Data with a Decision Tree

taken from The Analytics Edge

The Task

The United States government periodically collects demographic information by conducting a census.

In this problem, we are going to use census information about an individual to predict how much a person earns -- in particular, whether the person earns more than $50,000 per year. This data comes from the UCI Machine Learning Repository.

The file census.csv contains 1994 census data for 31,978 individuals in the United States.

The dataset includes the following 13 variables:

  • age = the age of the individual in years
  • workclass = the classification of the individual's working status (does the person work for the federal government, work for the local government, work without pay, and so on)
  • education = the level of education of the individual (e.g., 5th-6th grade, high school graduate, PhD, so on)
  • maritalstatus = the marital status of the individual
  • occupation = the type of work the individual does (e.g., administrative/clerical work, farming/fishing, sales and so on)
  • relationship = relationship of individual to his/her household
  • race = the individual's race
  • sex = the individual's sex
  • capitalgain = the capital gains of the individual in 1994 (from selling an asset such as a stock or bond for more than the original purchase price)
  • capitalloss = the capital losses of the individual in 1994 (from selling an asset such as a stock or bond for less than the original purchase price)
  • hoursperweek = the number of hours the individual works per week
  • nativecountry = the native country of the individual
  • over50k = whether or not the individual earned more than $50,000 in 1994

Predict whether an individual's earnings are above $50,000 (the variable "over50k") using all of the other variables as independent variables.


In [1]:
import pandas as pd
import numpy as np

Exercise 1

  1. Read the dataset census-2.csv.
  2. Find out the name and the type of each column.

In [2]:
# TODO
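A possible sketch for this step. Since census-2.csv is not bundled here, a tiny hypothetical CSV with the same column style stands in for the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for census-2.csv (same column style, two rows only).
csv_text = """age,workclass,over50k
39,State-gov, <=50K
50,Self-emp-not-inc, >50K
"""

census = pd.read_csv(io.StringIO(csv_text))
# With the real file you would instead call:
# census = pd.read_csv("census-2.csv")

# Column names and the dtype of each column:
print(census.dtypes)
```

`DataFrame.dtypes` lists every column together with its pandas dtype; `census.columns` alone would give just the names.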

Exercise 2

sklearn classifiers can only work with numeric values. Therefore we first have to convert all non-numeric values to numeric ones.

  1. Copy the dataframe.
  2. In the copy: convert the target column over50k to a boolean.
  3. In the copy: convert the non-numeric independent variables (aka features, aka predictors) via sklearn's LabelEncoder.

See http://pbpython.com/categorical-encoding.html for how to use sklearn's LabelEncoder and for further alternatives for converting non-numeric values to numeric ones.


In [14]:
# TODO convert over50k to boolean
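One way this conversion could look. The label strings (" <=50K" / " >50K", with a leading space) are an assumption based on how the UCI census data is usually formatted; a tiny stand-in frame replaces the real one:

```python
import pandas as pd

# Hypothetical stand-in for the census frame; the UCI labels typically
# carry a leading space: " <=50K" / " >50K".
df = pd.DataFrame({"over50k": [" <=50K", " >50K", " <=50K"]})

census_num = df.copy()                                  # work on a copy, as asked
census_num["over50k"] = (
    census_num["over50k"].str.strip() == ">50K"         # True iff earnings > $50k
)
print(census_num["over50k"].tolist())
```

Stripping the whitespace first makes the comparison robust to the leading space in the raw labels.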

In [6]:
from sklearn.preprocessing import LabelEncoder

# TODO
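A sketch of the LabelEncoder step, looping over the object-typed (string) columns of a hypothetical stand-in frame and leaving numeric columns untouched:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the non-numeric feature columns of the census frame.
df = pd.DataFrame({
    "workclass": ["State-gov", "Private", "Private"],
    "sex": ["Male", "Female", "Male"],
    "age": [39, 50, 38],              # already numeric, left as-is
})

encoded = df.copy()
for col in encoded.select_dtypes(include="object").columns:
    # LabelEncoder maps the sorted unique strings to 0, 1, 2, ...
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(encoded)
```

Note that LabelEncoder assigns codes in sorted order of the class labels, so here "Female" becomes 0 and "Male" becomes 1.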

Exercise 3

Separate the target variable over50k from the independent variables (all others): over50k -> y, all others -> X.


In [13]:
# TODO (hint: use drop(columns,axis=1))
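Following the hint, a minimal sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical stand-in frame with two features and the target column.
df = pd.DataFrame({"age": [39, 50], "hoursperweek": [40, 13], "over50k": [False, True]})

y = df["over50k"]                  # target
X = df.drop("over50k", axis=1)     # everything else becomes the feature matrix
print(list(X.columns))
```

`drop` returns a new frame, so `df` itself stays unchanged.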

Exercise 4

Then, split the data randomly into a training set and a testing set, setting the random_state to 2000 before creating the split. Split the data so that the training set contains 60% of the observations, while the testing set contains 40% of the observations.


In [62]:
from sklearn.model_selection import train_test_split

In [7]:
# TODO
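The split described above can be sketched as follows, on hypothetical stand-in data (with the real data, pass your own X and y):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, so the 60/40 split is easy to verify.
X = pd.DataFrame({"age": range(10)})
y = pd.Series([0, 1] * 5)

# 60% training / 40% testing, reproducible via random_state=2000.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=2000
)
print(len(X_train), len(X_test))
```

`train_size=0.6` (equivalently `test_size=0.4`) fixes the proportions; `random_state=2000` makes the random split reproducible.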

Exercise 5

Let us now build a classification tree to predict "over50k". Use the training set to build the model, with all of the other variables as independent variables. Use max_depth=3 and the default values for all other parameters.


In [8]:
from sklearn.tree import DecisionTreeClassifier

# TODO
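A sketch of fitting the tree. The training data below is a hypothetical stand-in; with the census data you would pass your X_train and y_train instead:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in training data: 100 rows, 3 features,
# label determined by the first feature.
rng = np.random.RandomState(0)
X_train = rng.rand(100, 3)
y_train = X_train[:, 0] > 0.5

clf = DecisionTreeClassifier(max_depth=3)   # default parameters otherwise
clf.fit(X_train, y_train)
print(clf.get_depth())
```

`max_depth=3` caps the tree at three levels of splits, which keeps it shallow enough to plot and interpret in the next exercise.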

Exercise 6

Plot the decision tree using plotting_utilities.plot_decision_tree

  • Which is the most important feature? (root of the tree)
  • Which is the next most important feature? (2nd level)

In [66]:
from plotting_utilities import plot_decision_tree, plot_feature_importances
import matplotlib.pyplot as plt

%matplotlib inline

In [9]:
# TODO
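Since the course-specific plotting_utilities module is not available here, this sketch substitutes sklearn's built-in plot_tree, which draws essentially the same picture; the data is again a hypothetical stand-in. The root node of the plot answers the first question directly:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; drop this inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Hypothetical stand-in data: the label depends only on the second feature,
# so the root split should land on it.
rng = np.random.RandomState(0)
X = rng.rand(80, 2)
y = X[:, 1] > 0.5

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(clf, feature_names=["f0", "f1"], filled=True, ax=ax)

# The feature tested at the root is the most important one:
root_feature = clf.tree_.feature[0]
print("root splits on feature index", root_feature)
```

In the plot, the box at the top is the root split (most important feature) and the two boxes below it form the 2nd level.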

Exercise 7

Plot the top 5 most important features with plotting_utilities.plot_feature_importances.

Are these features also the most important in the Decision Tree?


In [10]:
# TODO
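Without the course's plotting helper, the same top-5 ranking can be read off the fitted tree's `feature_importances_` attribute. Everything below (feature names, data, signal) is a hypothetical stand-in for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data with census-like column names; the label is
# driven mostly by "capitalgain", so it should rank first.
rng = np.random.RandomState(0)
feature_names = ["age", "workclass", "education", "capitalgain", "hoursperweek", "sex"]
X = pd.DataFrame(rng.rand(200, 6), columns=feature_names)
y = X["capitalgain"] + 0.3 * X["age"] > 0.7

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Rank the impurity-based importances and keep the top 5:
top5 = pd.Series(clf.feature_importances_, index=feature_names).nlargest(5)
print(top5)
```

The importances sum to 1; a feature the tree never splits on gets importance 0, so in a depth-3 tree most features score exactly zero.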

Exercise 8

  • Predict for the test data and
  • compare with the actual outcome:
    • print the confusion matrix for the test data and
    • calculate the accuracy
      • for the training data
      • for the test data

In [11]:
# TODO predict

In [12]:
from sklearn.metrics import confusion_matrix

# TODO
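A self-contained sketch of the whole evaluation step on hypothetical stand-in data (with the census data, reuse the split and the classifier fitted above instead of refitting):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical stand-in data, separable on the first feature.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] > 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=2000
)
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))   # rows: actual class, columns: predicted class
print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, y_pred))
```

Comparing the two accuracies shows how much the tree overfits: the training accuracy is usually the higher of the two.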