Exploratory analysis of the census income data set

Read the data

pd.read_csv has many tuning parameters that can make reading (and later saving) the data more efficient
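
For example, a minimal sketch of a few of these options (the specific values below are illustrative, not what this notebook actually uses):

import pandas as pd

df = pd.read_csv(
    "adult.csv",
    index_col=0,               # use the unnamed first column as the index instead of dropping it below
    usecols=None,              # or a list of column names, to load only what you need
    dtype={"age": "int32"},    # smaller dtypes reduce memory usage
    nrows=None,                # or an integer, to read just the first rows while prototyping
)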


In [2]:
import pandas as pd 
df = pd.read_csv("adult.csv").drop('Unnamed: 0',axis=1)

In [3]:
len(df)


Out[3]:
32561

In [4]:
df.dtypes


Out[4]:
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
target            object
dtype: object

Explore the data

Check how many classes we have and whether the data is imbalanced


In [5]:
df.target.value_counts()


Out[5]:
<=50K    24720
>50K      7841
Name: target, dtype: int64

In [6]:
df.target.value_counts()/len(df)


Out[6]:
<=50K    0.75919
>50K     0.24081
Name: target, dtype: float64

View a few rows of the data


In [7]:
df.head()


Out[7]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country target
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

Some statistics for the numerical variables - if you like numbers


In [8]:
df.describe()


Out[8]:
age fnlwgt education-num capital-gain capital-loss hours-per-week
count 32561.000000 3.256100e+04 32561.000000 32561.000000 32561.000000 32561.000000
mean 38.581647 1.897784e+05 10.080679 1077.648844 87.303830 40.437456
std 13.640433 1.055500e+05 2.572720 7385.292085 402.960219 12.347429
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.178270e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.783560e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.370510e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.484705e+06 16.000000 99999.000000 4356.000000 99.000000

In [9]:
numerical_columns = df.describe().columns
numerical_columns


Out[9]:
Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week'],
      dtype='object')

Some boxplots - if you like visualizations


In [10]:
%matplotlib inline
import seaborn as sns
from matplotlib import pyplot as plt
sns.set_style("darkgrid")
plt.figure(figsize=(30,5))
for i,col in enumerate(numerical_columns):
    plt.subplot(1,len(numerical_columns),i+1)
    df[[col]].boxplot(fontsize=20)


View the most frequent categories for each categorical feature - note the missing values, marked with '?'


In [11]:
def describe_categorical_values(df, non_interesting_columns=[], num_categories=5):
    values_df = pd.DataFrame()
    for i, column in enumerate(df.columns):
        if column in non_interesting_columns:
            continue
        top_values0 = ["{}: {}%".format(x,int(round(100*y/len(df))))
                       for x, y in zip(df[column].value_counts(dropna=False).head(num_categories).index,
                                       df[column].value_counts(dropna=False).head(num_categories).values)]
        if len(top_values0) < num_categories:
            top_values = [None]*num_categories
            top_values[:len(top_values0)] = top_values0
        else:
            top_values = top_values0
        values_df[column] = top_values
    return values_df.transpose()

describe_categorical_values(df, non_interesting_columns=numerical_columns, num_categories=6)


Out[11]:
0 1 2 3 4 5
workclass Private: 70% Self-emp-not-inc: 8% Local-gov: 6% ?: 6% State-gov: 4% Self-emp-inc: 3%
education HS-grad: 32% Some-college: 22% Bachelors: 16% Masters: 5% Assoc-voc: 4% 11th: 4%
marital-status Married-civ-spouse: 46% Never-married: 33% Divorced: 14% Separated: 3% Widowed: 3% Married-spouse-absent: 1%
occupation Prof-specialty: 13% Craft-repair: 13% Exec-managerial: 12% Adm-clerical: 12% Sales: 11% Other-service: 10%
relationship Husband: 41% Not-in-family: 26% Own-child: 16% Unmarried: 11% Wife: 5% Other-relative: 3%
race White: 85% Black: 10% Asian-Pac-Islander: 3% Amer-Indian-Eskimo: 1% Other: 1% None
sex Male: 67% Female: 33% None None None None
native-country United-States: 90% Mexico: 2% ?: 2% Philippines: 1% Germany: 0% Canada: 0%
target <=50K: 76% >50K: 24% None None None None

Visualize the distributions of the numerical variables per class using KDE (Kernel Density Estimation) plots


In [12]:
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
for col in numerical_columns:
    plt.figure(figsize=(30,3))
    sns.kdeplot(df.loc[df.target=='<=50K',col], label='<=50K')
    sns.kdeplot(df.loc[df.target=='>50K',col], label='>50K')
    plt.legend(fontsize=22)
    plt.title(col, fontsize=22);


KDE plots don't always work so well, especially for integer-valued features - let's try a bubble plot!

Bubble plot


In [13]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt


def get_point(x, digits=2):
    if not isinstance(x,pd.Interval):
        a, b = x.split(',') 
        a = float(a.strip("("))
        b = float(b.strip("]"))
    else:
        a,b = x.left, x.right
    c = (a+b)/2
    return np.round(c,digits)


def bubble_plot(df, x, y, z_boolean=None, ordered_x_values=None, ordered_y_values=None, bins_x=10,
                bins_y=10, fontsize=16, figsize=(10,5), maximal_bubble_size=3500,
                normalization_by_all=False, log=False):
    """
    :param df: dataframe
    :param x:  name of first numerical/categorical field (string) (for x-axis)
    :param y: name of second numerical/categorical field (string) (for y-axis)
    :param z_boolean: name of categorical field with two categories / boolean field (for coloring)
    :param ordered_x_values: the values of the categorical x variable, in the order we would like to present them
    :param ordered_y_values: the values of the categorical y variable, in the order we would like to present them
    :param bins_x: the bins for the x values if x is numeric
    :param bins_y: the bins for the y values if y is numeric
    :param normalization_by_all: True - shows the joint distribution p(x,y), False - shows the conditional distribution p(y|x)
    :param maximal_bubble_size: if the bubbles are too big or too small, this is the parameter to change!
    :param log: whether to apply log to the counts (influences the size of the bubbles)
    :return: a nice bubble plot, bubble size is proportional to the frequency of the bucket :)
    """
    plt.figure(figsize=figsize)
    x_is_numeric = df[x].dtype in (float, int) and ordered_x_values is None
    y_is_numeric = df[y].dtype in (float, int) and ordered_y_values is None 
    count_table = pd.concat([pd.cut(df[x], bins=bins_x) if x_is_numeric else df[x],
                             pd.cut(df[y], bins=bins_y) if y_is_numeric else df[y]], axis=1)
    count_table = count_table.groupby(x)[y].value_counts().unstack().fillna(0)
    ordered_x_values = count_table.index.values if ordered_x_values is None else ordered_x_values
    ordered_y_values = count_table.columns if ordered_y_values is None else ordered_y_values
    if z_boolean is not None:
        count_table_long, xticks, yticks, xticklabels, yticklabels = plot_with_z(df, x, y, z_boolean, bins_x, bins_y, x_is_numeric, y_is_numeric, ordered_x_values, ordered_y_values, maximal_bubble_size, 
                                                                                 normalization_by_all=normalization_by_all)
    else:
        count_table_long, xticks, yticks, xticklabels, yticklabels = plot_without_z(df, x, y, z_boolean, count_table, bins_x, bins_y, x_is_numeric, y_is_numeric, ordered_x_values, ordered_y_values, 
                                                                                    normalization_by_all=normalization_by_all, log=log, maximal_bubble_size=maximal_bubble_size )
    plt.xticks(xticks, xticklabels,fontsize=fontsize)
    plt.yticks(yticks, yticklabels,fontsize=fontsize)
    plt.xlabel(x, fontsize=fontsize)
    plt.ylabel(y, fontsize=fontsize)
    if z_boolean is None:
        plt.title("{} vs {} ".format(y,x),fontsize=fontsize+4);
    else:
        plt.title("{} vs {} and {} (in colors)".format(y,x, z_boolean),fontsize=fontsize+4);

def plot_without_z(df, x, y, z, count_table, bins_x, bins_y, x_is_numeric, y_is_numeric, ordered_x_values, ordered_y_values, normalization_by_all=False, log=False, maximal_bubble_size=4000):
    if normalization_by_all:
        count_table /= count_table.sum().sum()
    else:
        count_table = count_table.transpose()
        for col in count_table.columns:
            count_table[col] /= count_table[col].sum()
        count_table = count_table.transpose()
    if log:
        count_table = np.log(count_table)
        maximal_bubble_size /= 2
    size_factor = maximal_bubble_size/count_table.max().max()
    count_table_long = pd.melt(count_table.reset_index(), id_vars=x) 
    x_values_dict = {x:i for i, x in enumerate(ordered_x_values)} \
        if not x_is_numeric else {xx:get_point(xx) for xx in ordered_x_values}
    y_values_dict = {x:i for i, x in enumerate(ordered_y_values)} \
        if not y_is_numeric else {xx: get_point(xx) for xx in ordered_y_values}
    xticks = np.arange(count_table.shape[0]) if not x_is_numeric else [get_point(xx) for xx in ordered_x_values]
    yticks = np.arange(count_table.shape[1]) if not y_is_numeric else [get_point(xx) for xx in ordered_y_values]
    xticklabels = ordered_x_values if not x_is_numeric else [get_point(xx) for xx in ordered_x_values]
    yticklabels = ordered_y_values if not y_is_numeric else [get_point(xx) for xx in ordered_y_values]
    count_table_long[x] = count_table_long[x].map(x_values_dict)
    count_table_long[y] = count_table_long[y].map(y_values_dict) 
    plt.scatter(count_table_long[x], count_table_long[y], s=size_factor*count_table_long['value'],
                c=count_table_long['value'], cmap='cool')
 
    return count_table_long, xticks, yticks, xticklabels, yticklabels
    
def plot_with_z(df, x, y, z_boolean, bins_x, bins_y, x_is_numeric, y_is_numeric, ordered_x_values, 
                ordered_y_values, maximal_bubble_size=4000, normalization_by_all=False):
    count_table = pd.concat([pd.cut(df[x], bins=bins_x) if x_is_numeric else df[x],
                         pd.cut(df[y], bins=bins_y) if y_is_numeric else df[y], df[z_boolean]], axis=1)
    count_table = count_table.groupby([x,z_boolean])[y].value_counts().unstack().fillna(0)
    count_table = count_table.unstack().unstack()
    count_table_long = count_table.reset_index().rename(columns={0: 'value'}) #pd.melt(count_table.reset_index(), id_vars=x)
    z_boolean_values = count_table_long[z_boolean].unique()
    ratio = pd.DataFrame({'ratio':count_table_long.set_index([x,y,z_boolean]).unstack()['value'][z_boolean_values[1]] / (
    count_table_long.set_index([x,y,z_boolean]).unstack()['value'].sum(axis=1) )})
    count_table_long = count_table_long.set_index([x, y ])[['value']].merge(ratio, left_index=True, right_index=True).reset_index()
    size_factor = maximal_bubble_size/count_table_long['value'].max()
    x_values_dict = {x:i for i, x in enumerate(ordered_x_values)} \
        if not x_is_numeric else {xx:get_point(xx) for xx in ordered_x_values}
    y_values_dict = {x:i for i, x in enumerate(ordered_y_values)} \
        if not y_is_numeric else {xx: get_point(xx) for xx in ordered_y_values}
    xticks = np.arange(len(ordered_x_values)) if not x_is_numeric else [get_point(xx) for xx in ordered_x_values]
    yticks = np.arange(len(ordered_y_values)) if not y_is_numeric else [get_point(xx) for xx in ordered_y_values]
    xticklabels = ordered_x_values if not x_is_numeric else [get_point(xx) for xx in ordered_x_values]
    yticklabels = ordered_y_values if not y_is_numeric else [get_point(xx) for xx in ordered_y_values]
    count_table_long[x] = count_table_long[x].map(x_values_dict)
    count_table_long[y] = count_table_long[y].map(y_values_dict)
    plt.scatter(count_table_long[x], count_table_long[y], s=size_factor*count_table_long['value'],
                c=count_table_long['ratio'],  alpha=0.5,
                cmap='cool')
    return count_table_long, xticks, yticks, xticklabels, yticklabels

Look at one variable vs. the target


In [14]:
%matplotlib inline
bubble_plot(df,'age','target', normalization_by_all=False)
bubble_plot(df,'hours-per-week','target', normalization_by_all=False)
bubble_plot(df,'education-num','target', normalization_by_all=False)


Look at two variables vs. the target (the color indicates the target ratio)


In [15]:
bubble_plot(df, x='age', y='hours-per-week', z_boolean='target', maximal_bubble_size=8000)


Impute missing values for the categorical columns with the most common values

How to deal with missing values?

(1) Missing at random - impute, either with common values or by estimating the value with a regression model

(2) Missing not at random - treat the missing value as a separate indicator variable
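
A minimal sketch of option (2), keeping '?' as its own category and adding an explicit indicator column (the column list below is illustrative):

# the raw values carry a leading space, hence the strip()
for col in ['workclass', 'occupation', 'native-country']:
    df[col + '_missing'] = (df[col].str.strip() == '?').astype(int)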


In [16]:
df = pd.read_csv("adult.csv").drop('Unnamed: 0',axis=1)
categorical_columns = [x for x in df.columns if x not in numerical_columns and x!='target']
for col in categorical_columns:
    # replace the ' ?' marker with the column's most frequent value (its mode)
    df.loc[df[col]==' ?',col] = df[col].value_counts().index.values[0]

Remove infrequent categorical values


In [17]:
def remove_infrequent_values(df, categorical_columns, frequency_threshold=20):
    # Note: this reports the infrequent values and collects the names of the dummy
    # columns they would create, but returns the dataframe itself unchanged.
    infrequent_cols_to_remove = []
    for cat_col in categorical_columns:
        col_count = df[cat_col].value_counts()
        infrequent_values = col_count[col_count<frequency_threshold].index.values
        if not col_count[col_count<frequency_threshold].empty:
            print("removing columns: \n{} \n".format(col_count[col_count<frequency_threshold]))
        infrequent_cols_to_remove += ["{}_{}".format(cat_col, x) for x in infrequent_values]
    return df

df = remove_infrequent_values(df, categorical_columns)


removing columns: 
 Without-pay     14
 Never-worked     7
Name: workclass, dtype: int64 

removing columns: 
 Armed-Forces    9
Name: occupation, dtype: int64 

removing columns: 
 Trinadad&Tobago               19
 Cambodia                      19
 Laos                          18
 Thailand                      18
 Yugoslavia                    16
 Outlying-US(Guam-USVI-etc)    14
 Hungary                       13
 Honduras                      13
 Scotland                      12
 Holand-Netherlands             1
Name: native-country, dtype: int64 
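
Note that the function above only reports the infrequent values - it returns the dataframe unchanged. A minimal sketch of actually dropping the corresponding dummy columns after get_dummies (the helper below is an illustration, not part of the original notebook):

def infrequent_dummy_columns(df, categorical_columns, frequency_threshold=20):
    # build the "column_value" dummy names whose value appears fewer than frequency_threshold times
    to_drop = []
    for cat_col in categorical_columns:
        counts = df[cat_col].value_counts()
        rare = counts[counts < frequency_threshold].index
        to_drop += ["{}_{}".format(cat_col, v) for v in rare]
    return to_drop

X_dummies = pd.get_dummies(df[categorical_columns])
rare_cols = [c for c in infrequent_dummy_columns(df, categorical_columns) if c in X_dummies.columns]
X_dummies = X_dummies.drop(rare_cols, axis=1)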

Convert categorical variables to dummy variables


In [18]:
df = pd.read_csv("adult.csv").drop('Unnamed: 0',axis=1)
y = df.target==">50K"
features = [x for x in df.columns if x!='target']
X = pd.get_dummies(df[features])

In [19]:
X.head()


Out[19]:
age fnlwgt education-num capital-gain capital-loss hours-per-week workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 39 77516 13 2174 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 50 83311 13 0 0 13 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 38 215646 9 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 53 234721 7 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 28 338409 13 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 108 columns

Split into train and test sets using StratifiedKFold, to preserve the target distribution in both


In [20]:
from sklearn.cross_validation import StratifiedKFold
kf = StratifiedKFold(y, 5, random_state=12345, shuffle=True)
train_index, test_index = list(kf)[0]
train, test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y[train_index], y[test_index]
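
Note: sklearn.cross_validation (and sklearn.grid_search, used further below) was deprecated in scikit-learn 0.18 and removed in 0.20; both now live in sklearn.model_selection, whose CV iterators also have a different interface. A minimal equivalent of the split above with the newer API (a sketch, not what was actually run here):

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=12345)
train_index, test_index = next(skf.split(X, y))
train, test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y[train_index], y[test_index]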


/Users/slador/anaconda3/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [21]:
df.target.value_counts()/len(df)


Out[21]:
<=50K    0.75919
>50K     0.24081
Name: target, dtype: float64

In [22]:
pd.Series(y_train).value_counts()/len(y_train)


Out[22]:
False    0.759214
True     0.240786
Name: target, dtype: float64

In [23]:
pd.Series(y_test).value_counts()/len(y_test)


Out[23]:
False    0.759097
True     0.240903
Name: target, dtype: float64

Check the correlation between the variables, and between the variables and the target

This is especially important for logistic regression, where correlation between the variables can contribute to a numerically unstable solution


In [24]:
def get_most_correlated_variables(corr, num_pairs=10):
    correlation_melted = pd.melt(corr.reset_index().rename(columns={"index": "var_1"}), id_vars=("var_1"),var_name='var_2')
    correlation_melted = correlation_melted[correlation_melted.var_1!=correlation_melted.var_2]
    correlation_melted['var_couple'] = correlation_melted[['var_1','var_2']].apply(lambda x:tuple(sorted([x[0],x[1]])), axis=1)
    correlation_melted = correlation_melted.drop_duplicates(subset='var_couple').drop(['var_couple'],axis=1)
    correlation_melted['abs_value'] = correlation_melted['value'].abs().round(3)
    return correlation_melted.sort_values(by='abs_value').tail(num_pairs).drop('abs_value', axis=1).reset_index(drop=True)

def plot_correlation_matrix(X, features2):
    corr = X[features2].corr()
    # return the most correlated variables
    most_correlated_variables = get_most_correlated_variables(corr, num_pairs=10)
    max_correlation = 1.25*most_correlated_variables['value'].abs().max()
    # Generate a mask for the upper triangle
    mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated in newer numpy
    mask[np.triu_indices_from(mask)] = True
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))
    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    ax.set_yticklabels(features2, fontsize=18)
    ax.set_xticklabels(features2, rotation='vertical', fontsize=18)
    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=max_correlation, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .8})
    return most_correlated_variables
if 'target' not in X:
    X['target'] = y
features2 = ['target'] + X.columns[:20].tolist()
plot_correlation_matrix(X, features2)


Out[24]:
var_1 var_2 value
0 age target 0.234037
1 workclass_ Private workclass_ Federal-gov -0.264369
2 education_ 10th education-num -0.272428
3 education_ 5th-6th education-num -0.279766
4 workclass_ Self-emp-inc workclass_ Private -0.285747
5 workclass_ State-gov workclass_ Private -0.309064
6 education-num target 0.335154
7 workclass_ Private workclass_ ? -0.370780
8 workclass_ Private workclass_ Local-gov -0.397547
9 workclass_ Self-emp-not-inc workclass_ Private -0.441289

In [25]:
features2 =['target']+[x for x in X.columns if x.startswith('education') ]
plot_correlation_matrix(X, features2)


Out[25]:
var_1 var_2 value
0 education_ Doctorate education-num 0.260786
1 education_ 10th education-num -0.272428
2 education_ 5th-6th education-num -0.279766
3 education_ HS-grad education-num -0.289817
4 education_ HS-grad education_ Bachelors -0.306098
5 education-num target 0.335154
6 education_ 7th-8th education-num -0.336268
7 education_ Masters education-num 0.360101
8 education_ Some-college education_ HS-grad -0.370598
9 education_ Bachelors education-num 0.503435

In [26]:
features2 =['target']+[x for x in X.columns if x.startswith('relationship_') or x.startswith('marital')]
plot_correlation_matrix(X, features2)


Out[26]:
var_1 var_2 value
0 marital-status_ Married-civ-spouse marital-status_ Divorced -0.366837
1 relationship_ Own-child marital-status_ Married-civ-spouse -0.380070
2 relationship_ Husband target 0.401035
3 marital-status_ Married-civ-spouse target 0.444696
4 relationship_ Not-in-family relationship_ Husband -0.482936
5 relationship_ Own-child marital-status_ Never-married 0.509226
6 relationship_ Not-in-family marital-status_ Married-civ-spouse -0.537588
7 relationship_ Husband marital-status_ Never-married -0.576730
8 marital-status_ Never-married marital-status_ Married-civ-spouse -0.644866
9 relationship_ Husband marital-status_ Married-civ-spouse 0.893210

Find features which are highly correlated to each other


In [27]:
if 'target' in X:
    X = X.drop('target', axis=1)
corr = X.corr()
most_correlated_variables = get_most_correlated_variables(corr, num_pairs=10)

In [28]:
most_correlated_variables


Out[28]:
var_1 var_2 value
0 marital-status_ Never-married age -0.534359
1 relationship_ Not-in-family marital-status_ Married-civ-spouse -0.537588
2 relationship_ Husband marital-status_ Never-married -0.576730
3 sex_ Female relationship_ Husband -0.580135
4 sex_ Male relationship_ Husband 0.580135
5 marital-status_ Never-married marital-status_ Married-civ-spouse -0.644866
6 race_ White race_ Black -0.788747
7 relationship_ Husband marital-status_ Married-civ-spouse 0.893210
8 occupation_ ? workclass_ ? 0.997985
9 sex_ Male sex_ Female -1.000000

In [29]:
most_correlated_variables.tail(5).var_2.values


Out[29]:
array(['marital-status_ Married-civ-spouse', 'race_ Black',
       'marital-status_ Married-civ-spouse', 'workclass_ ?', 'sex_ Female'], dtype=object)
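
One member of each highly correlated pair could be dropped before modeling; a minimal sketch (dropping the var_2 side of the top pairs is an arbitrary choice made here for illustration):

columns_to_drop = most_correlated_variables.tail(5).var_2.unique()
X_reduced = X.drop(columns_to_drop, axis=1)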

Find features which are highly correlated with the target


In [30]:
if 'target' not in X:
    X['target'] = y
corr = X.corr()
corr_with_target = corr[['target']].drop('target', axis=0)
corr_with_target['abs_value'] = corr_with_target.abs()
corr_with_target.sort_values(by='abs_value').tail(10).drop('abs_value',axis=1)


Out[30]:
target
sex_ Male 0.215980
sex_ Female -0.215980
capital-gain 0.223329
relationship_ Own-child -0.228532
hours-per-week 0.229689
age 0.234037
marital-status_ Never-married -0.318440
education-num 0.335154
relationship_ Husband 0.401035
marital-status_ Married-civ-spouse 0.444696

In [31]:
best_features_by_correlation = corr_with_target[corr_with_target.abs_value>0.05].index.values

In [32]:
len(best_features_by_correlation)


Out[32]:
43
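
These columns are not used again below, but they could serve as a reduced feature set; a minimal sketch restricting the earlier train/test split to this subset:

# best_features_by_correlation does not include 'target', so this is safe to feed to a model
train_by_corr = train[best_features_by_correlation]
test_by_corr = test[best_features_by_correlation]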

Dimension reduction with PCA

The data is projected into a new space where the transformed features have no mutual correlation.


In [33]:
from sklearn.decomposition import PCA
pca = PCA(n_components=train.shape[1])
train_pca = pca.fit_transform(train)
test_pca = pca.transform(test)

In [34]:
plt.figure(figsize=(8,5))
plt.scatter(train_pca[(y_train==0).values,0], train_pca[(y_train==0).values,1], color='r', label='<=50K')
plt.scatter(train_pca[(y_train==1).values,0], train_pca[(y_train==1).values,1], color='b', label='>50K')
plt.legend()


Out[34]:
<matplotlib.legend.Legend at 0x11bd47390>

Plot the explained variance of each component


In [35]:
plt.figure(figsize=(30,5))
pd.DataFrame(pca.explained_variance_ratio_).plot.bar(log=False, figsize=(35,7), fontsize=16)
plt.xlabel('n_components',fontsize=16)
plt.ylabel('explained_variance_', fontsize=16);


<matplotlib.figure.Figure at 0x1166c8ac8>

In [36]:
plt.figure(figsize=(30,5))
pd.DataFrame(pca.explained_variance_ratio_).plot.bar(log=True, figsize=(35,7), fontsize=16)
plt.xlabel('n_components',fontsize=16)
plt.ylabel('explained_variance_', fontsize=16);
plt.ylim([0,10]);
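
Instead of keeping all components, PCA can also select just enough components to reach a target explained-variance ratio; a minimal sketch (the 0.95 threshold is illustrative):

pca95 = PCA(n_components=0.95, svd_solver='full')  # keep enough components to explain ~95% of the variance
train_pca95 = pca95.fit_transform(train)
test_pca95 = pca95.transform(test)
print(pca95.n_components_)  # how many components were actually kept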


<matplotlib.figure.Figure at 0x11bb358d0>

Train a logistic regression model and evaluate its results


In [37]:
%matplotlib inline
import seaborn as sns
from sklearn.metrics import roc_auc_score, classification_report, roc_curve, confusion_matrix
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline

def find_best_threshold(thresholds, fpr, tpr):
    """
    Find the best threshold from the ROC curve, i.e. the threshold of the point closest to (fpr=0, tpr=1).
    """
    fpr_tpr = pd.DataFrame({'thresholds': thresholds, 'fpr': fpr, 'tpr': tpr})
    fpr_tpr['dist'] = (fpr_tpr['fpr'])**2 + (fpr_tpr['tpr']-1)**2
    return fpr_tpr.loc[fpr_tpr.dist.idxmin(), 'thresholds']


def get_model_results(model, train, test, y_train, y_test):
    probabilities = model.predict_proba(test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, probabilities)
    threshold = find_best_threshold(thresholds, fpr, tpr)
    predictions = probabilities>threshold
    plt.figure()
    plt.plot(fpr, tpr, label='test')
    roc_auc = roc_auc_score(y_test, probabilities)
    probabilities = model.predict_proba(train)[:,1]
    fpr, tpr, thresholds = roc_curve(y_train, probabilities)
    plt.plot(fpr, tpr, label='train')
    plt.plot([0, 1], [0, 1], 'r--', label='random guess')
    plt.title("area under the ROC curve = {}".format(roc_auc), fontsize=18);
    print(classification_report(y_test, predictions))
    plt.legend()

model = make_pipeline(LogisticRegression())
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.79      0.60      0.69      4944
       True       0.29      0.50      0.37      1569

avg / total       0.67      0.58      0.61      6513


Logistic regression with normalization


In [38]:
model = make_pipeline(MinMaxScaler(),LogisticRegression())
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.94      0.81      0.87      4944
       True       0.59      0.84      0.69      1569

avg / total       0.85      0.82      0.83      6513


In [39]:
feature_importance = pd.Series(index=train.columns , data= model.named_steps['logisticregression'].coef_[0])
feature_importance = feature_importance.sort_values()
plt.figure(figsize=(14,5))
feature_importance.tail(10).plot.barh(fontsize=16,color='b' )
plt.figure(figsize=(14,5))
feature_importance.head(10).plot.barh(fontsize=16, figsize=(14,7), color='r')


Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x116028cc0>

Logistic regression with PCA


In [40]:
model = make_pipeline(PCA(n_components=30,whiten=False),LogisticRegression())
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.94      0.80      0.87      4944
       True       0.57      0.84      0.68      1569

avg / total       0.85      0.81      0.82      6513


Logistic regression with Polynomial features


In [41]:
model = make_pipeline(PolynomialFeatures(),MinMaxScaler(), LogisticRegression())
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.94      0.81      0.87      4944
       True       0.58      0.84      0.69      1569

avg / total       0.85      0.81      0.82      6513

Logistic regression with Polynomial features only for the numerical features


In [42]:
poly = PolynomialFeatures(2)
train_poly_features = poly.fit_transform(train[numerical_columns])
test_poly_features = poly.transform(test[numerical_columns])
train_with_poly_features = pd.concat([train.reset_index(drop=True),pd.DataFrame(train_poly_features)],axis=1)
test_with_poly_features = pd.concat([test.reset_index(drop=True),pd.DataFrame(test_poly_features)],axis=1)

In [43]:
model = make_pipeline(MinMaxScaler(),LogisticRegression())
model.fit(train_with_poly_features, y_train)
get_model_results(model, train_with_poly_features, test_with_poly_features, y_train, y_test)


             precision    recall  f1-score   support

      False       0.94      0.82      0.87      4944
       True       0.59      0.84      0.70      1569

avg / total       0.86      0.82      0.83      6513


Explore the logistic regression coefficients


In [74]:
feature_importance=pd.DataFrame(index=train_with_poly_features.columns, 
                             data=model.named_steps['logisticregression'].coef_[0])
feature_importance['abs_value']=feature_importance[0].abs()
feature_importance.sort_values(by='abs_value')[0].tail(15).plot.barh()


Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a21ef2e48>

Hyper-parameter tuning for logistic regression


In [76]:
from sklearn.linear_model import LogisticRegression
# sklearn.grid_search is deprecated (merged into sklearn.model_selection in newer versions)
from sklearn.grid_search import GridSearchCV
params = {#'logisticregression__penalty': ['l2','l1'], 'logisticregression__C': [1, 10, 100, 1000],
          'logisticregression__penalty': ['l1'], 'logisticregression__C': [1,],
#                  'solver': ['newton-cg'],
#                  'logisticregression__fit_intercept': [False, True]
         }
gscv = GridSearchCV(make_pipeline(MinMaxScaler(), LogisticRegression()), params, scoring='roc_auc', cv=3)
gscv.fit(train, y_train)
get_model_results(gscv, train, test, y_train, y_test)
print(gscv.best_params_)


             precision    recall  f1-score   support

      False       0.94      0.81      0.87      4944
       True       0.58      0.85      0.69      1569

avg / total       0.86      0.82      0.83      6513

{'logisticregression__C': 1, 'logisticregression__penalty': 'l1'}

Decision tree classifier


In [77]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier() 
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.76      1.00      0.86      4944
       True       0.00      0.00      0.00      1569

avg / total       0.58      0.76      0.66      6513

/Users/slador/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

min_samples_split=10 to reduce overfitting


In [78]:
model = DecisionTreeClassifier(min_samples_split=10) 
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.91      0.80      0.85      4944
       True       0.54      0.75      0.63      1569

avg / total       0.82      0.79      0.80      6513


min_samples_split=50 to reduce overfitting


In [79]:
model = DecisionTreeClassifier(min_samples_split=50) 
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.93      0.81      0.86      4944
       True       0.57      0.80      0.67      1569

avg / total       0.84      0.81      0.82      6513


Random forest classifier


In [80]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth=3, n_estimators=100) # max_depth=4
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.92      0.82      0.87      4944
       True       0.58      0.79      0.67      1569

avg / total       0.84      0.81      0.82      6513

Plot one of the random forest trees


In [81]:
from sklearn import tree
from os import system

tree_dot_path = "models/tree1.dot"
tree_png_path = "models/tree1.png"
dotfile = open(tree_dot_path, 'w')
tree.export_graphviz(model.estimators_[10], out_file = dotfile, 
                     feature_names = train.columns, 
                     filled=True)
dotfile.close()
system("dot -Tpng {tree_dot_path} -o {tree_png_path}".format(
        tree_dot_path=tree_dot_path, tree_png_path=tree_png_path));
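
To view the rendered tree inline (assuming the models/ directory already exists and Graphviz's dot binary is installed, so that the cell above produced models/tree1.png):

from IPython.display import Image
Image(filename=tree_png_path)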

XGBoost - Extreme gradient boosted trees

https://github.com/dmlc/xgboost


In [82]:
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.95      0.82      0.88      4944
       True       0.60      0.86      0.71      1569

avg / total       0.86      0.83      0.84      6513

Feature importance for XGBoost


In [83]:
import xgboost as xgb
feature_importance = pd.Series(index=train.columns , data= model.feature_importances_)
feature_importance = feature_importance.sort_values()
feature_importance.tail(20).plot.barh(fontsize=16, figsize=(14,7))


Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a262d84a8>

XGBoost with polynomial features


In [84]:
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(train_with_poly_features, y_train)
get_model_results(model, train_with_poly_features, test_with_poly_features, y_train, y_test)


             precision    recall  f1-score   support

      False       0.95      0.83      0.88      4944
       True       0.61      0.85      0.71      1569

avg / total       0.87      0.83      0.84      6513

XGBoost classifier with the 15 strongest features


In [85]:
best_features = feature_importance.tail(15).index.values
from xgboost import XGBClassifier
# strip the spaces that get_dummies left in the column names
rename_map = {x: x.replace(" ", "") for x in best_features}
model = XGBClassifier()
model.fit(train[best_features].rename(columns=rename_map), y_train)
get_model_results(model, train[best_features].rename(columns=rename_map),
                  test[best_features].rename(columns=rename_map), y_train, y_test)


             precision    recall  f1-score   support

      False       0.95      0.83      0.88      4944
       True       0.61      0.85      0.71      1569

avg / total       0.87      0.84      0.84      6513


Plot one of the XGBoost trees


In [88]:
import matplotlib
from xgboost import plot_tree
from matplotlib.pylab import rcParams
ax = plot_tree(model,num_trees=10, rankdir='LR')
fig = ax.figure
fig.set_size_inches(20, 20)


Visualization of the data with t-SNE


In [89]:
%%time
from sklearn.manifold import TSNE
tsne = TSNE() #n_iter=200)
# the t-SNE fit itself is slow, so the transformed data was computed once and pickled:
# train_transformed = tsne.fit_transform(train[best_features])
# pd.DataFrame(train_transformed).to_pickle("train_transformed.pkl")
train_transformed = pd.read_pickle("train_transformed.pkl").values
plt.scatter(train_transformed[(y_train==0).values,0], train_transformed[(y_train==0).values,1], color='r', label='<=50K')
plt.scatter(train_transformed[(y_train==1).values,0], train_transformed[(y_train==1).values,1], color='b', label='>50K')
plt.legend()


CPU times: user 85 ms, sys: 8.57 ms, total: 93.6 ms
Wall time: 106 ms

Bonus: try encoding the variables based on the conditional distribution P(y=1 | x=xi)


In [90]:
def encode_column_by_frequency(train, test, col, log=True):
    country_by_target = train.groupby(col)['target'].value_counts().unstack().fillna(0)
    if log:
        country_probability = np.log((country_by_target['>50K']+1e-5)/(country_by_target.sum(axis=1)+1e-5))
    else:
        country_probability = (country_by_target['>50K']+1e-5)/(country_by_target.sum(axis=1)+1e-5)
    for val in test[col].unique():    
        if val in country_probability.index.values: 
            continue
        country_probability[val] = country_probability.median()
    train[col] = train[col].map(country_probability)
    test[col] = test[col].map(country_probability)
    return train, test

def encode_numeric_column_by_frequency(train, test, col, log=True, plot_flag=False, window_size=3):
    country_by_target = train.groupby(col)['target'].value_counts().unstack().fillna(0)
    if log:
        country_probability = np.log((country_by_target['>50K']+1e-5)/(country_by_target.sum(axis=1)+1e-5))
    else:
        country_probability = (country_by_target['>50K']+1e-5)/(country_by_target.sum(axis=1)+1e-5)
    values = np.unique(train[col].unique().tolist()+test[col].unique().tolist())
    resolution = min(1,np.min(np.diff(values)))
    start = np.min(country_probability.index.unique().tolist())
    end = np.max(country_probability.index.unique().tolist())
    country_probability = pd.merge(pd.DataFrame(country_probability),
                                   pd.DataFrame(index=np.arange(start, end+resolution, resolution)),
                                   left_index=True, right_index=True, how='outer').interpolate(method='linear')
    if plot_flag:
        plt.figure()
        country_probability[0].plot(marker='o')
    country_probability=((country_probability[0].cumsum() - country_probability[0].shift(window_size).cumsum())/window_size).shift(-1).fillna(0)
    if plot_flag:
        country_probability.plot()
        plt.title(col)
    train[col] = train[col].map(country_probability)
    test[col] = test[col].map(country_probability)
    return train, test

In [91]:
def get_data_with_encoded_features(plot_flag=False):
    df = pd.read_csv("adult.csv").drop('Unnamed: 0',axis=1)#.drop('fnlwgt', axis=1)
    categorical_columns = [x for x in df.columns if x not in numerical_columns and x!='target']
    columns_to_encode_with_frequencies=categorical_columns
    for col in categorical_columns:
        df.loc[df[col]==' ?',col] = df[col].value_counts().index.values[0]
    features = [x for x in df if x != 'target']
    X = df[features]
    y = df.target==">50K"
    kf = StratifiedKFold(y, 5, random_state=12345, shuffle=True)
    train_index, test_index = list(kf)[0]
    # .copy() avoids SettingWithCopyWarning when the encoding functions assign to these slices
    train, test = df.iloc[train_index].copy(), df.iloc[test_index].copy()
    y_train, y_test = y[train_index], y[test_index]
    for col in categorical_columns:
        train, test = encode_column_by_frequency(train, test, col, log=False)
    for col in numerical_columns: 
        if col=='fnlwgt': continue
        train, test = encode_numeric_column_by_frequency(train, test, col,log=False, plot_flag=plot_flag)
    train=train.drop('target',axis=1)
    test=test.drop('target',axis=1)
    return train, test, y_train, y_test

Logistic regression with features replaced with conditional probabilities


In [92]:
train, test, y_train, y_test = get_data_with_encoded_features(plot_flag=False)
model = make_pipeline(MinMaxScaler(),LogisticRegression())
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test )


             precision    recall  f1-score   support

      False       0.94      0.83      0.88      4944
       True       0.61      0.84      0.71      1569

avg / total       0.86      0.83      0.84      6513


Logistic regression with features replaced with conditional probabilities + Polynomial features


In [93]:
model = make_pipeline(PolynomialFeatures(), MinMaxScaler(),LogisticRegression())
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.95      0.83      0.89      4944
       True       0.62      0.85      0.72      1569

avg / total       0.87      0.84      0.85      6513

XGBoost with features replaced with conditional probabilities


In [94]:
model = XGBClassifier()
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.95      0.83      0.89      4944
       True       0.62      0.87      0.72      1569

avg / total       0.87      0.84      0.85      6513

XGBoost with features replaced with conditional probabilities + polynomial features


In [96]:
model = make_pipeline(PolynomialFeatures(), XGBClassifier())
model.fit(train, y_train)
get_model_results(model, train, test, y_train, y_test)


             precision    recall  f1-score   support

      False       0.95      0.83      0.89      4944
       True       0.62      0.86      0.72      1569

avg / total       0.87      0.84      0.85      6513

What's next?

  • Understand which other cool features are missing - apply different transformations to the features, combine features, find more data sources that could help with the problem...
  • Analyze your model's errors and its feature importance
  • Use tools like LIME to understand your model's decisions and errors
  • Further tuning of the model - model selection, ensembling and stacking models, hyper-parameter tuning
  • Present the results clearly, according to the relevant business metrics

Questions?

For additional questions, you're welcome to stay in touch - meir.shir86@gmail.com