In this notebook, we'll use the pandas library to explore the Adult dataset. If you need to install Python and the necessary packages, such as pandas and IPython Notebook, one quick option is to download and install the Anaconda distribution.
In [1]:
# pandas is a powerful library for data manipulation
import pandas as pd
# matplotlib is used to plot the histograms
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# the dataset file doesn't have a header row, so we supply the column names manually
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship',
         'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
# load the adult dataset from the url
adult = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                    header=None, names=names)
# remove empty rows
adult.dropna(how='all', inplace=True)
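As a side note on the loading step, here is a minimal offline sketch of how `header=None` and `names=` behave. It uses a two-line, made-up sample in a comma-plus-space layout; the sample rows and the short `cols` list are illustrative assumptions, not data from the dataset. `skipinitialspace=True` is an optional `read_csv` argument that drops the space after each delimiter. The notebook does not use it, which is why string values need a `strip()` later on.

```python
import io
import pandas as pd

# Made-up sample in a "value, value, value" layout (not real rows from the dataset)
sample = "25, Private, Bachelors\n40, State-gov, HS-grad\n"
cols = ['age', 'workclass', 'education']  # hypothetical subset of column names

# header=None tells pandas the first line is data; names= supplies the column labels
raw = pd.read_csv(io.StringIO(sample), header=None, names=cols)

# With skipinitialspace=True, the space after each comma is dropped while parsing
clean = pd.read_csv(io.StringIO(sample), header=None, names=cols, skipinitialspace=True)

print(repr(raw['education'][0]))    # string keeps its leading space
print(repr(clean['education'][0]))  # leading space removed
```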
Let's see what the first five rows look like.
In [3]:
adult.head(5)
Out[3]:
Suppose we want to measure how dissimilar two values of the education attribute are. This is an ordinal attribute, so one way to do this is to map each value to an integer rank starting from 0; the dissimilarity between two values is then the absolute difference between their ranks. In fact, a numeric encoding is already provided in the education-num column, but let's see whether we can build one ourselves.
In [4]:
# define a list that ranks the 16 types of education level from lowest to highest
education_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad',
'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']
# map each value of the `education` attribute to its corresponding integer
adult['education'] = adult['education'].map(lambda x: education_order.index(x.strip()))
As we can see from the updated table below, each education level has now been mapped to an integer.
In [5]:
adult.head(5)
Out[5]:
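With the levels mapped to integer ranks, the dissimilarity between two education levels is the absolute difference of their ranks. The following is a minimal sketch of that idea; the helper `education_dissimilarity` and the lookup dict `education_rank` are hypothetical names introduced here for illustration, and the dict is just one alternative to calling `list.index` for every row.

```python
# Same ranking as in the notebook, from lowest to highest
education_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad',
                   'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']

# A dict gives direct lookups instead of scanning the list with index() each time
education_rank = {level: rank for rank, level in enumerate(education_order)}

def education_dissimilarity(a, b):
    """Absolute difference in rank between two education levels (hypothetical helper)."""
    return abs(education_rank[a.strip()] - education_rank[b.strip()])

print(education_dissimilarity('HS-grad', 'Bachelors'))  # 4
print(education_dissimilarity(' Masters', 'Masters'))   # 0
```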
It's easy to compute the mean and variance of the ages in the first five rows. Try to compute these by hand and compare your results with the answers below.
In [6]:
# mean of first 5 ages
adult['age'][:5].mean()
Out[6]:
In [7]:
# variance of first 5 ages (this is the unbiased estimate with the (n-1) in the denominator)
adult['age'][:5].var()
Out[7]:
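For the manual calculation, the sketch below spells out the two formulas on a small list of made-up ages (an assumption for illustration, not rows from the dataset). The sample variance divides by n - 1, which matches the default behaviour of pandas' `var()`; the standard-library `statistics` module is used only as a cross-check.

```python
import statistics

ages = [30, 40, 50, 20, 60]  # made-up values, not taken from the dataset
n = len(ages)

# mean: sum of the values divided by their count
mean = sum(ages) / n

# unbiased sample variance: squared deviations from the mean, divided by (n - 1)
sample_var = sum((a - mean) ** 2 for a in ages) / (n - 1)

print(mean)        # 40.0
print(sample_var)  # 250.0

# cross-check against the standard library
print(statistics.mean(ages), statistics.variance(ages))
```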
We can also divide the table into two groups, one consisting of people with an income of at most \$50k and the other of people with an income above \$50k.
In [8]:
age_groups = adult.groupby('income')['age']
Looking at the mean of each group, we can see that people in the higher-income group tend to be older on average.
In [9]:
age_groups.mean()
Out[9]:
From the variances, it looks like the ages in the lower-income group are more spread out.
In [10]:
age_groups.var()
Out[10]:
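To make the grouping step concrete, here is a small sketch on a made-up table (the six rows in `toy` are assumptions for illustration, not dataset values). It computes the per-group count, mean, and variance in one call with `agg`, and then checks one group's mean by filtering the rows directly, which is what `groupby` does for each group.

```python
import pandas as pd

# Toy table with made-up ages; the income labels mimic the two classes in the dataset
toy = pd.DataFrame({
    'income': ['<=50K', '<=50K', '<=50K', '>50K', '>50K', '>50K'],
    'age': [20, 30, 40, 40, 45, 50],
})

# count, mean and (n-1) variance of age within each income group
summary = toy.groupby('income')['age'].agg(['count', 'mean', 'var'])
print(summary)

# the same mean for one group, obtained by filtering the rows by hand
manual_mean = toy.loc[toy['income'] == '>50K', 'age'].mean()
print(manual_mean)
```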
It might be easier to see these effects if we plot the histogram of the ages from each income group.
In [11]:
axes = adult.hist('age', by='income', sharey=True, figsize=(13, 5))