Predicting Earnings from Census Data with Random Forests

The Task

The United States government periodically collects demographic information by conducting a census.

In this problem, we are going to use census information about an individual to predict how much a person earns -- in particular, whether the person earns more than $50,000 per year. This data comes from the UCI Machine Learning Repository.

The file census.csv contains 1994 census data for 31,978 individuals in the United States.

The dataset includes the following 13 variables:

age = the age of the individual in years
workclass = the classification of the individual's working status (does the person work for the federal government, work for the local government, work without pay, and so on)
education = the level of education of the individual (e.g., 5th-6th grade, high school graduate, PhD, and so on)
maritalstatus = the marital status of the individual
occupation = the type of work the individual does (e.g., administrative/clerical work, farming/fishing, sales, and so on)
relationship = the relationship of the individual to his/her household
race = the individual's race
sex = the individual's sex
capitalgain = the capital gains of the individual in 1994 (from selling an asset such as a stock or bond for more than the original purchase price)
capitalloss = the capital losses of the individual in 1994 (from selling an asset such as a stock or bond for less than the original purchase price)
hoursperweek = the number of hours the individual works per week
nativecountry = the native country of the individual
over50k = whether or not the individual earned more than $50,000 in 1994

Predict whether an individual's earnings are above $50,000 (the variable "over50k") using all of the other variables as independent variables.



In [1]:

    
import pandas as pd
import numpy as np

Exercise 1

Read the dataset census-2.csv.
Find out the name and the data type of each column.
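As a minimal sketch of the inspection step, the example below reads a tiny in-memory CSV rather than the real file. The rows and the label strings are made up for illustration; only the column names are taken from the variable list above.

```python
import io

import pandas as pd

# Toy CSV standing in for the real file (values are invented for illustration).
toy_csv = io.StringIO(
    "age,workclass,hoursperweek,over50k\n"
    "39,State-gov,40,<=50K\n"
    "50,Private,13,>50K\n"
)
toy = pd.read_csv(toy_csv)

print(toy.columns.tolist())  # the column names
print(toy.dtypes)            # one data type per column
```

Numeric columns are read as integer or float types, while text columns are read with a generic text dtype (`object` in many pandas versions).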



In [2]:

    
# TODO

Exercise 2

sklearn classifiers can only work with numeric values. Therefore we first have to convert all non-numeric values to numeric values.

Convert the target column over50k to a boolean.
Convert the non-numeric independent variables (aka features, aka predictors) via pd.get_dummies().
- Check the number of columns before and after applying pd.get_dummies().
- How does `pd.get_dummies()` work?

See http://pbpython.com/categorical-encoding.html for alternative ways to convert non-numeric values to numeric values.
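The effect of both conversions can be seen on a small toy frame. This is only a sketch: the label strings "<=50K" and ">50K" are an assumption about how the target is coded, so check the actual values in the file before reusing the comparison.

```python
import pandas as pd

# Toy data; the over50k label strings are assumed, not taken from the real file.
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": ["Male", "Female", "Male"],
    "over50k": ["<=50K", ">50K", "<=50K"],
})

# Target to boolean: True where the (whitespace-stripped) label is the positive class.
toy["over50k"] = toy["over50k"].str.strip() == ">50K"

# One-hot encode the text columns; numeric and boolean columns pass through unchanged.
encoded = pd.get_dummies(toy)

print(toy.shape[1], "->", encoded.shape[1])  # column count before and after
print(encoded.columns.tolist())
```

Each distinct value of a text column becomes its own indicator column named `<column>_<value>`, which is why the column count grows.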



In [3]:

    
# TODO convert over50k to boolean



In [4]:

    
# TODO convert independent variables

Exercise 3

Separate target variable over50k from the independent variables (all others): over50k -> y, all others -> X
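A minimal sketch of the separation on a toy frame (the values are invented; only the column names come from the variable list):

```python
import pandas as pd

# Toy frame with two features and the boolean target.
toy = pd.DataFrame({
    "age": [39, 50],
    "hoursperweek": [40, 13],
    "over50k": [False, True],
})

y = toy["over50k"]               # target as a Series
X = toy.drop("over50k", axis=1)  # everything except the target

print(X.columns.tolist())
```

`drop` returns a new DataFrame, so the original frame still contains the target column afterwards.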



In [5]:

    
# TODO (hint: use drop(columns,axis=1))

Exercise 4

Then, split the data randomly into a training set and a testing set, setting the random_state to 2000 before creating the split. Split the data so that the training set contains 60% of the observations, while the testing set contains 40% of the observations.
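The split described above might look like the following on toy data. The ten-row frame is purely illustrative; `test_size=0.4` and `random_state=2000` mirror the numbers given in the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy observations with one feature and an alternating boolean target.
X = pd.DataFrame({"feature": np.arange(10)})
y = pd.Series(np.arange(10) % 2 == 0)

# 40% of the rows go to the test set; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=2000
)

print(len(X_train), len(X_test))
```

Passing `X` and `y` together keeps the rows aligned, so each feature row stays paired with its own label in both subsets.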



In [6]:

    
from sklearn.model_selection import train_test_split



In [7]:

    
# TODO

Exercise 5

Let us now build a random forest classifier to predict "over50k". Use the training set to build the model, and all of the other variables as independent variables. Use the default parameters.
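As a sketch of the fitting step, the example below trains a forest on synthetic data from `make_classification` instead of the census data. The `random_state=0` argument is an addition for reproducibility only; all other parameters are left at their defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# Default parameters, except for a fixed seed so repeated runs agree.
forest = RandomForestClassifier(random_state=0)
forest.fit(X_toy, y_toy)

print(len(forest.estimators_))  # number of trees in the fitted forest
```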



In [8]:

    
from sklearn.ensemble import RandomForestClassifier

# TODO

Exercise 6

Which are the most important features? Plot the top 5 with plotting_utilities.plot_feature_importances.
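Independently of the plotting helper, whose signature is not shown here, the ranking itself can be read from the fitted model's `feature_importances_` attribute. The sketch below uses synthetic data and hypothetical column names `f0` to `f5`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names f0..f5.
X_arr, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
forest = RandomForestClassifier(random_state=0).fit(X_toy, y_toy)

# Pair each importance with its column name and keep the five largest.
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
top5 = importances.sort_values(ascending=False).head(5)

print(top5)
```

The importances sum to 1 across all features, so each value can be read as a share of the total.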



In [9]:

    
from plotting_utilities import plot_feature_importances
import matplotlib.pyplot as plt

%matplotlib inline

# TODO

Exercise 7

Predict for the test data and compare with the actual outcome:
- Print the confusion matrix for the test data.
- Calculate the accuracy
  - for the training data
  - for the test data
  - How good is the accuracy in comparison to the Decision Tree?
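The evaluation steps can be sketched end to end on synthetic data. The numbers this prints say nothing about the census task; the point is only how the confusion matrix and the two accuracies are obtained. `accuracy_score` is one way to compute the accuracy and is an addition not imported in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded census data.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=2000
)
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Rows of the matrix are actual classes, columns are predicted classes.
y_pred = forest.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

train_acc = accuracy_score(y_train, forest.predict(X_train))
test_acc = accuracy_score(y_test, y_pred)

print(cm)
print(train_acc, test_acc)
```

The test accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test observations. A training accuracy well above the test accuracy is a sign of overfitting.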



In [11]:

    
# TODO predict



In [12]:

    
from sklearn.metrics import confusion_matrix
# TODO