Getting to Know Your Data

In this notebook, we'll use the pandas library to get to know the adult dataset. If you need to install Python and the necessary packages like pandas and IPython Notebook, one quick option is to download and install the Anaconda distribution.

Loading in the Data


In [1]:
# pandas is a powerful library for data manipulation
import pandas as pd

# matplotlib is used to plot the histograms
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# the dataset doesn't actually have a header row, so we'll have to supply the column names manually
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'martial-status', 'occupation', 'relationship',
          'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

# load the adult dataset from the URL (pd.read_csv is the public entry point for the CSV parser)
adult = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                    header=None, names=names)

# remove empty rows
adult.dropna(how='all', inplace=True)

Let's see what the first five rows look like.


In [3]:
adult.head(5)


Out[3]:
age workclass fnlwgt education education-num martial-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

Similarity Between Ordinal Attributes

Suppose we want to measure how similar two values of the education attribute are. This is an ordinal attribute, so one way to do this is to map each value to an integer rank starting from 0. The dissimilarity between two values is then the absolute difference between their ranks. In fact, a similar encoding is already provided in the education-num column, which follows the same ordering but starts from 1. But let's see whether we can do this ourselves.


In [4]:
# define a list that ranks the 16 types of education level from lowest to highest
education_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad',
                   'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']

# map each value of the `education` attribute to its corresponding integer
adult['education'] = adult['education'].map(lambda x: education_order.index(x.strip()))

As we can see from the updated table below, each education level has now been mapped to an integer.


In [5]:
adult.head(5)


Out[5]:
age workclass fnlwgt education education-num martial-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
0 39 State-gov 77516 12 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 12 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 8 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 6 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 12 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
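With the levels ranked, the dissimilarity described above can be sketched in plain Python. This is only an illustration: the helper name `education_dissimilarity` and its `normalize` option are assumptions introduced here, not part of the notebook or of pandas.

```python
# Ranking of education levels from lowest to highest, copied from the cell above.
education_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad',
                   'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']

def education_dissimilarity(a, b, normalize=False):
    """Absolute rank difference between two education levels (hypothetical helper)."""
    # strip() guards against the leading whitespace present in the raw values
    rank_a = education_order.index(a.strip())
    rank_b = education_order.index(b.strip())
    diff = abs(rank_a - rank_b)
    if normalize:
        # divide by the largest possible gap so the result lies in [0, 1]
        return diff / (len(education_order) - 1)
    return diff

# Bachelors has rank 12 and HS-grad has rank 8, matching the table above
print(education_dissimilarity('Bachelors', 'HS-grad'))
print(education_dissimilarity('Preschool', 'Doctorate', normalize=True))
```

Normalizing by the largest possible gap is one common choice when ordinal attributes need to be combined with attributes on other scales; the raw difference is enough if education is compared on its own.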

Mean and Variance of Age

It's easy to compute the mean and variance of the ages in the first five rows. Try to compute these manually and compare with the answers below.


In [6]:
# mean of first 5 ages
adult['age'][:5].mean()


Out[6]:
41.600000000000001

In [7]:
# variance of first 5 ages (this is the unbiased estimate with the (n-1) in the denominator)
adult['age'][:5].var()


Out[7]:
101.30000000000018
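For comparison, the manual computation might look like the following sketch. It uses only built-in Python, with the five ages typed in by hand from the table above, and it divides by n - 1 to match the unbiased estimate that `var()` returns.

```python
# ages from the first five rows of the table
ages = [39, 50, 38, 53, 28]
n = len(ages)

# mean: sum of the values divided by the number of values
mean = sum(ages) / n

# unbiased sample variance: sum of squared deviations divided by (n - 1)
squared_deviations = [(a - mean) ** 2 for a in ages]
variance = sum(squared_deviations) / (n - 1)

print(mean, variance)
```

Up to floating-point rounding, the two values should agree with Out[6] and Out[7].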

Relationship Between Age and Income

We can also divide the table into two groups: people with an income of at most \$50K and people with an income above \$50K.


In [8]:
age_groups = adult.groupby('income')['age']

Looking at the mean of each group, we can see that people in the higher-income group tend to be older on average.


In [9]:
age_groups.mean()


Out[9]:
income
 <=50K    36.783738
 >50K     44.249841
Name: age, dtype: float64

From the variances, it looks like the ages in the lower-income group are more spread out.


In [10]:
age_groups.var()


Out[10]:
income
 <=50K    196.562881
 >50K     110.649944
Name: age, dtype: float64
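Variances are in squared years, so the spread can be easier to interpret as a standard deviation. A minimal sketch, with the two variances copied by hand from the output above (the dictionary and variable names are chosen here for illustration):

```python
import math

# group variances copied from the output above (units: years squared)
age_variance = {'<=50K': 196.562881, '>50K': 110.649944}

# the standard deviation is the square root of the variance (units: years)
age_std = {group: math.sqrt(var) for group, var in age_variance.items()}

for group, std in age_std.items():
    print(group, round(std, 2))
```

Within the notebook itself, calling `age_groups.std()` on the grouped ages should give the same quantities directly.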

It might be easier to see these effects if we plot the histogram of the ages from each income group.


In [11]:
axes = adult.hist('age', by='income', sharey=True, figsize=(13, 5))