In this blog post I will show you how to slice and dice the Adult Data Set from the UCI Machine Learning Repository, which contains income data for about 32,000 people. We will look at the data and build a machine learning model that tries to predict whether a person makes more than $50K a year, given features like education, gender and marital status.
Let's first import some libraries that we are going to need for our analysis.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import ceil
import statsmodels as sm
import sklearn as skl
import sklearn.preprocessing as preprocessing
import sklearn.linear_model as linear_model
import sklearn.model_selection as model_selection
import sklearn.metrics as metrics
import sklearn.tree as tree
import seaborn as sns
First we need to read the data from the file, which contains comma separated columns. With the command below we will read the data, skipping any spaces before/after the commas, and treat the value '?' as a missing data point.
In [2]:
original_data = pd.read_csv(
    "adult.data.txt",
    names=[
        "Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Marital Status",
        "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
        "Hours per week", "Country", "Target"],
    sep=r'\s*,\s*',
    engine='python',
    na_values="?")
original_data.tail()
Out[2]:
Let's plot the distribution of each feature, so that we have a better understanding of what we have in our data. We plot the number of values for each categorical feature and a histogram of the values for each continuous feature.
In [3]:
fig = plt.figure(figsize=(20, 15))
cols = 5
rows = ceil(float(original_data.shape[1]) / cols)
for i, column in enumerate(original_data.columns):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.set_title(column)
    if original_data.dtypes[column] == object:
        original_data[column].value_counts().plot(kind="bar", ax=ax)
    else:
        original_data[column].hist(ax=ax)
    plt.xticks(rotation="vertical")
plt.subplots_adjust(hspace=0.7, wspace=0.2)
As you can see from the plots above, our data is heavily concentrated in the USA and consists mostly of white men. This is worth noticing, as it may impact the conclusions we draw later.
In [4]:
(original_data["Country"].value_counts() / original_data.shape[0]).head()
Out[4]:
Indeed! 89% of the samples are people from the US; Mexico comes next with less than 2%.
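We can run the same quick check for Sex and Race to confirm what we saw in the plots (a small sketch reusing the columns we already have):
# Share of each Sex and Race value in the data set
print(original_data["Sex"].value_counts() / original_data.shape[0])
print(original_data["Race"].value_counts() / original_data.shape[0])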
Now, let's explore something else: the correlation between the different features. Generally it is not a good idea to have many highly correlated features, as they carry redundant information and can make some models unstable or hard to interpret. To compute the correlations we first need to encode the categorical features as numbers, which can be done with the LabelEncoder from scikit-learn.
In [5]:
# Encode the categorical features as numbers
def number_encode_features(df):
    result = df.copy()
    encoders = {}
    for column in result.columns:
        if result.dtypes[column] == object:
            encoders[column] = preprocessing.LabelEncoder()
            # Cast to str so that NaN becomes its own category instead of breaking the encoder
            result[column] = encoders[column].fit_transform(result[column].astype(str))
    return result, encoders
# Calculate the correlation and plot it
encoded_data, _ = number_encode_features(original_data)
sns.heatmap(encoded_data.corr(), square=True)
plt.show()
We see there is a high correlation between Education and Education-Num. Let's look at these columns.
In [6]:
original_data[["Education", "Education-Num"]].head(15)
Out[6]:
As you can see, these two columns actually represent the same feature, encoded once as strings and once as numbers. We don't need the string representation, so we can just delete that column. Note that it is a much better option to delete the Education column, since Education-Num has the important property that its values are ordered: the higher the number, the higher the person's education. This is valuable information a machine learning algorithm can use.
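As a quick sanity check before dropping the column, we can verify that the mapping really is one-to-one: every Education value should correspond to exactly one Education-Num.
# Each Education value should map to exactly one Education-Num
print(original_data.groupby("Education")["Education-Num"].nunique())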
In [7]:
del original_data["Education"]
So it seems that the data is mostly OK, with the exception of Sex and Relationship, which seem to be negatively correlated. Let's explore that a bit.
In [8]:
original_data[["Sex", "Relationship"]].head(15)
Out[8]:
Yes, the data looks correlated because, for example, the values Male and Husband go together, as do Female and Wife. There is no easy way to tackle this problem, so let's carry on.
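One quick way to see the overlap directly is a cross-tabulation of the two columns:
# How often does each Sex/Relationship combination occur?
print(pd.crosstab(original_data["Sex"], original_data["Relationship"]))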
Now that we have explored our data, let's try to build a classifier which predicts a person's income from the features in our dataset.
First we need to encode the features as numbers, as the classifiers cannot work with string features. As we saw a moment ago, this can be done easily with the function we defined earlier. Let's encode the data and show the histograms of the values again.
In [9]:
encoded_data, encoders = number_encode_features(original_data)
fig = plt.figure(figsize=(20, 15))
cols = 5
rows = ceil(float(encoded_data.shape[1]) / cols)
for i, column in enumerate(encoded_data.columns):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.set_title(column)
    encoded_data[column].hist(ax=ax)
    plt.xticks(rotation="vertical")
plt.subplots_adjust(hspace=0.7, wspace=0.2)
As you can see, our data is now properly encoded and seems to make sense. Before we build a classifier, let's split the data into a train and a test set; this is a common approach to avoid overfitting. If we train and test a classifier on the same data we will always get great-looking results, but most probably an overfitted model. If instead we test the classifier on data it has never seen, we can be more confident that it will perform well when run on new data.
Most machine learning algorithms also work best when the features are scaled to mean 0 and variance 1; this is called "removing the mean and scaling to unit variance" and can easily be done with the StandardScaler from scikit-learn. Let's scale the features and look at them again.
In [10]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    encoded_data.drop("Target", axis=1), encoded_data["Target"], train_size=0.70)
scaler = preprocessing.StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train.astype("float64")), columns=X_train.columns)
X_test = scaler.transform(X_test.astype("float64"))
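As promised, let's take a quick look at the scaled features: after the transform each training column should have a mean of approximately 0 and a standard deviation of approximately 1.
# After scaling, every feature should have mean ~0 and std ~1
print(X_train.mean().round(2))
print(X_train.std().round(2))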
In [11]:
cls = linear_model.LogisticRegression()
cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plt.figure(figsize=(12, 12))
plt.subplot(2, 1, 1)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=encoders["Target"].classes_, yticklabels=encoders["Target"].classes_)
plt.ylabel("Real value")
plt.xlabel("Predicted value")
print("F1 score: %f" % metrics.f1_score(y_test, y_pred))
coefs = pd.Series(cls.coef_[0], index=X_train.columns)
coefs = coefs.sort_values()
plt.subplot(2, 1, 2)
coefs.plot(kind="bar")
plt.show()
As you can see, we managed to achieve an F1 score of 0.55. The features that seem to contribute most positively to an income of more than $50K are Capital Gain, Education-Num and Sex, while the features that contribute most negatively are Marital Status and Relationship. There is a problem here, though. A feature like Marital Status has values ranging from 0 to 6, and the model treats that order as meaningful, even though in practice there is no particular order in that feature (unlike Education-Num, for which the higher the number, the better the education). We can fix this using binary features.
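To see how arbitrary this order really is, we can print the classes the LabelEncoder learned for Marital Status; the numeric codes simply follow the alphabetical order of the category names:
# The codes 0..6 just reflect alphabetical order of the class names
print(encoders["Marital Status"].classes_)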
As a last step, we can try to improve our classifier using binary attributes. Our current encoding has the drawback that it imposes an arbitrary order on our classes. For example, we encode Relationship with a number between 0 and 5, and the logistic regression interprets these values as a continuous variable and plugs them into its optimization function. This gives different classes different weights in our model, which is not correct: each class should a priori be weighted equally relative to the rest. In order to fix this we can use dummy variables.
In [12]:
binary_data = pd.get_dummies(original_data)
# Let's fix the Target as it will be converted to dummy vars too
binary_data["Target"] = binary_data["Target_>50K"]
del binary_data["Target_<=50K"]
del binary_data["Target_>50K"]
plt.subplots(figsize=(20,20))
sns.heatmap(binary_data.corr(), square=True)
plt.show()
Now we have a bunch of features that take only the values 0 and 1. There is a lot of correlation between some of them (for example, Male and Female are perfectly negatively correlated), but let's not worry about that for now.
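We can confirm the Male/Female case directly. Since every person is exactly one of the two, the dummy columns (named Sex_Female and Sex_Male by get_dummies' default naming) should have a correlation of exactly -1:
# Complementary indicator columns are perfectly negatively correlated
print(binary_data[["Sex_Female", "Sex_Male"]].corr())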
In [13]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    binary_data.drop("Target", axis=1), binary_data["Target"], train_size=0.70)
scaler = preprocessing.StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = scaler.transform(X_test)
In [16]:
cls = linear_model.LogisticRegression()
cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
plt.figure(figsize=(20, 20))
plt.subplot(2, 1, 1)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=encoders["Target"].classes_, yticklabels=encoders["Target"].classes_)
plt.ylabel("Real value")
plt.xlabel("Predicted value")
print("F1 score: %f" % metrics.f1_score(y_test, y_pred))
coefs = pd.Series(cls.coef_[0], index=X_train.columns)
coefs = coefs.sort_values()
ax = plt.subplot(2, 1, 2)
coefs.plot(kind="bar")
plt.show()
We managed to improve the F1 score significantly by converting the data to dummy variables, and it seems we also uncovered some interesting insights with our model. The features that impact a person's income most positively are Capital Gain, Married-civ-spouse, Age, Hours per week and Exec-managerial. The features that impact it most negatively are Never-married, Own-child, Priv-house-serv, Divorced and, unfortunately, Female: one more piece of evidence of gender inequality in our society.
As you can see, we not only built a machine learning model that we can use to classify new data, but we also uncovered some interesting insights along the way. This is one of the nice properties of linear models: unlike, for example, neural networks, they are not "black boxes" and allow us to see exactly what the model is doing.
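For example, classifying previously unseen data is a one-liner once it is encoded and scaled the same way as the training data; here is a small sketch that reuses the first row of the held-out test set as a stand-in for a new person:
# Predict the income class (0 = <=50K, 1 = >50K) for one unseen sample
print(cls.predict(X_test[:1]))
print(cls.predict_proba(X_test[:1]))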
You can download the IPython notebook used to generate the above analysis from here.