Taken from The Analytics Edge.
The United States government periodically collects demographic information by conducting a census.
In this problem, we are going to use census information about an individual to predict how much a person earns -- in particular, whether the person earns more than $50,000 per year. This data comes from the UCI Machine Learning Repository.
The file census.csv contains 1994 census data for 31,978 individuals in the United States.
The dataset includes the following 13 variables:
Predict whether an individual's earnings are above $50,000 (the variable "over50k") using all of the other variables as independent variables.
In [1]:
import pandas as pd
import numpy as np
In [2]:
# TODO
scikit-learn classifiers can only work with numeric values. Therefore, we first have to convert all non-numeric values to numeric values.
Convert the target variable over50k to a boolean.
Convert the independent variables with pd.get_dummies().
How does pd.get_dummies() work?
See http://pbpython.com/categorical-encoding.html for further alternatives for converting non-numeric values to numeric values.
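As a rough illustration of the two conversion steps, the sketch below runs them on a tiny made-up table rather than on census.csv. The column names mimic the dataset, but the rows and the exact label strings (" <=50K", " >50K") are assumptions, so check the real values before reusing the comparison.

```python
import pandas as pd

# Hypothetical toy rows; the values are made up and are not taken from census.csv.
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": ["Male", "Female", "Female"],
    "over50k": [" <=50K", " >50K", " <=50K"],
})

# Target: True where the label reads ">50K". The label strings are an
# assumption; str.strip() guards against stray leading whitespace.
y = toy["over50k"].str.strip() == ">50K"

# Features: get_dummies keeps numeric columns unchanged and replaces each
# string column with one indicator column per distinct value.
X = pd.get_dummies(toy.drop("over50k", axis=1))

print(list(X.columns))  # ['age', 'sex_Female', 'sex_Male']
print(y.tolist())       # [False, True, False]
```

Each category becomes its own column, so a column with many distinct values (such as a country column) can widen the table considerably.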
In [3]:
# TODO convert over50k to boolean
In [4]:
# TODO convert independent variables
In [5]:
# TODO (hint: use drop(columns,axis=1))
In [6]:
from sklearn.model_selection import train_test_split
In [7]:
# TODO
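For orientation, here is a minimal sketch of how train_test_split is typically called, using a small synthetic array instead of the census features. The test_size, random_state, and stratify settings are illustrative choices, not values prescribed by the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 numeric features, balanced boolean target.
X = np.arange(20).reshape(10, 2)
y = np.array([False] * 5 + [True] * 5)

# Hold out 20% of the rows for testing. stratify=y keeps the class ratio
# similar in both parts; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Stratifying can matter for a target like over50k if the two classes are not equally frequent.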
In [8]:
from sklearn.ensemble import RandomForestClassifier
# TODO
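The following sketch shows the general fit pattern for a RandomForestClassifier and where the feature importances live afterwards. It uses random synthetic data in which only the first column determines the label. The feature names and hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: three random features, label depends only on the first one.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] > 0

# n_estimators and random_state are illustrative settings, not tuned values.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# feature_importances_ holds one score per input column.
for name, score in zip(["f0", "f1", "f2"], model.feature_importances_):
    print(name, round(float(score), 3))
```

In this toy setup the first feature should receive by far the largest score, since it is the only one related to the label.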
In [9]:
from plotting_utilities import plot_feature_importances
import matplotlib.pyplot as plt
%matplotlib inline
# TODO
In [11]:
# TODO predict
In [12]:
from sklearn.metrics import confusion_matrix
# TODO
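To help read the result, here is a small sketch of confusion_matrix on hand-written labels. These are hypothetical values, not predictions from the census model. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, ordered False then True.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for five individuals.
y_true = [False, False, True, True, True]
y_pred = [False, True, True, True, False]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]
#  [1 2]]

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()
print(accuracy)  # 0.6
```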