Getting to Know Your Data

In this notebook, we'll use the pandas library to get to know the adult dataset. If you need to install Python and the necessary packages like pandas and IPython Notebook, one quick option is to download and install the Anaconda distribution.

Loading in the Data



In [1]:

    
# pandas is a powerful library for data manipulation
import pandas as pd

# matplotlib is used to plot the histograms
import matplotlib.pyplot as plt
%matplotlib inline



In [2]:

    
# it turns out that the dataset doesn't actually have a header, so we'll have to insert it manually
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'martial-status', 'occupation', 'relationship',
          'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

# load in the adult dataset from url
adult = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                    header=None, names=names)

# remove empty rows
adult.dropna(how='all', inplace=True)
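One detail worth knowing before going further: the string values in this table carry a leading space (it shows up in the group labels later on, and it is why the education mapping calls `strip()`). The sketch below uses only the standard `csv` module on a single line written in the style of the first row, assuming the raw file separates fields with a comma followed by a space. `pd.read_csv` accepts a `skipinitialspace` argument with the same meaning, but the notebook does not use it, so the cells that follow are unaffected.

```python
import csv
import io

# One line in the style of the raw file (assumed ", " separated), taken from row 0.
raw = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n")

# Default parsing keeps the space after each comma as part of the value.
plain = next(csv.reader(io.StringIO(raw)))

# skipinitialspace=True drops whitespace that immediately follows the delimiter.
clean = next(csv.reader(io.StringIO(raw), skipinitialspace=True))

print(repr(plain[1]))
print(repr(clean[1]))
```

With the default settings the second field keeps its leading space, while the second reader returns it without one.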

Let's see what the first five rows look like.



In [3]:

    
adult.head(5)









    Out[3]:






  
    
      
   age         workclass  fnlwgt  education  education-num      martial-status         occupation   relationship   race     sex  capital-gain  capital-loss  hours-per-week  native-country  income
0   39         State-gov   77516  Bachelors             13       Never-married       Adm-clerical  Not-in-family  White    Male          2174             0              40   United-States   <=50K
1   50  Self-emp-not-inc   83311  Bachelors             13  Married-civ-spouse    Exec-managerial        Husband  White    Male             0             0              13   United-States   <=50K
2   38           Private  215646    HS-grad              9            Divorced  Handlers-cleaners  Not-in-family  White    Male             0             0              40   United-States   <=50K
3   53           Private  234721       11th              7  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male             0             0              40   United-States   <=50K
4   28           Private  338409  Bachelors             13  Married-civ-spouse     Prof-specialty           Wife  Black  Female             0             0              40            Cuba   <=50K

Similarity Between Values of Ordinal Attributes

Suppose we want to compute some sort of similarity between different values of the education attribute. This is an ordinal attribute, so one way to do this is to map each value to an integer starting from 0, in rank order. The dissimilarity between two values is then just the absolute difference between their integers. In fact, this has already been done for us in the education-num column. But let's see whether we can do this ourselves.



In [4]:

    
# define a list that ranks the 16 types of education level from lowest to highest
education_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad',
                   'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']

# map each value of the `education` attribute to its corresponding integer
adult['education'] = adult['education'].map(lambda x: education_order.index(x.strip()))

As we can see from the updated table below, each education level has now been mapped to an integer. In these rows the new value is one less than education-num, since our ranking starts from 0.



In [5]:

    
adult.head(5)









    Out[5]:






  
    
      
   age         workclass  fnlwgt  education  education-num      martial-status         occupation   relationship   race     sex  capital-gain  capital-loss  hours-per-week  native-country  income
0   39         State-gov   77516         12             13       Never-married       Adm-clerical  Not-in-family  White    Male          2174             0              40   United-States   <=50K
1   50  Self-emp-not-inc   83311         12             13  Married-civ-spouse    Exec-managerial        Husband  White    Male             0             0              13   United-States   <=50K
2   38           Private  215646          8              9            Divorced  Handlers-cleaners  Not-in-family  White    Male             0             0              40   United-States   <=50K
3   53           Private  234721          6              7  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male             0             0              40   United-States   <=50K
4   28           Private  338409         12             13  Married-civ-spouse     Prof-specialty           Wife  Black  Female             0             0              40            Cuba   <=50K
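With the levels ranked, the dissimilarity between two education values can be sketched in a few lines of plain Python. The helper name `education_dissimilarity` and its `normalize` option are illustrative assumptions, not part of the notebook; dividing by the largest possible gap is one common convention for putting ordinal dissimilarities on a 0 to 1 scale.

```python
# Ranking copied from the notebook, lowest to highest.
education_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th',
                   '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-voc',
                   'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']


def education_dissimilarity(a, b, normalize=False):
    """Absolute rank difference between two education levels (hypothetical helper)."""
    diff = abs(education_order.index(a) - education_order.index(b))
    if normalize:
        # Scale by the largest possible gap so the result lies in [0, 1].
        return diff / (len(education_order) - 1)
    return diff


print(education_dissimilarity('Bachelors', 'HS-grad'))                    # 4
print(education_dissimilarity('Preschool', 'Doctorate', normalize=True))  # 1.0
```

Treating every step between adjacent levels as equally large is itself a modelling choice; the ranking only guarantees the order.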

Mean and Variance of Age

It's easy to compute the mean and variance of the age in the first five rows. Try to compute these manually and compare with the answers below.



In [6]:

    
# mean of first 5 ages
adult['age'][:5].mean()









    Out[6]:





41.600000000000001



In [7]:

    
# variance of first 5 ages (this is the unbiased estimate with the (n-1) in the denominator)
adult['age'][:5].var()









    Out[7]:





101.30000000000018
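For anyone checking the manual calculation, here is a minimal sketch that spells out each step in plain Python, using the five ages from the first rows of the table. The variable names are illustrative; the standard library's `statistics.variance` is included only as a cross-check, since it also divides by n - 1.

```python
import statistics

ages = [39, 50, 38, 53, 28]  # ages from the first five rows

n = len(ages)
mean = sum(ages) / n  # 208 / 5

# Squared deviations from the mean; they sum to roughly 405.2.
squared_devs = [(a - mean) ** 2 for a in ages]

sample_var = sum(squared_devs) / (n - 1)  # unbiased estimate, as in pandas' default
population_var = sum(squared_devs) / n    # biased estimate, for comparison

print(mean, sample_var, population_var)
print(statistics.variance(ages))  # should agree with sample_var up to rounding
```

The sample variance comes out at about 101.3, matching the pandas result, while dividing by n instead gives about 81.04.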

Relationship Between Age and Income

We can also divide the table into two groups, one consisting of people with income of at most \$50K and the other with income above \$50K.



In [8]:

    
age_groups = adult.groupby('income')['age']

Looking at the mean of each group, we can see that the wealthier group tends to be older.



In [9]:

    
age_groups.mean()









    Out[9]:





income
 <=50K    36.783738
 >50K     44.249841
Name: age, dtype: float64

Finally, from the variances, it looks like the ages in the less wealthy group are more spread out.



In [10]:

    
age_groups.var()









    Out[10]:





income
 <=50K    196.562881
 >50K     110.649944
Name: age, dtype: float64

It might be easier to see these effects if we plot the histogram of the ages from each income group.



In [11]:

    
axes = adult.hist('age', by='income', sharey=True, figsize=(13, 5))
