In [502]:

    
%matplotlib inline
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns



In [555]:

    
data_dict = pickle.load(open("../ud120-projects/final_project/final_project_dataset.pkl", "r") )

Holdout

Since the dataset for this project is so small, a hold-out set will not be used, and only k-fold testing and training splits will be used to measure accuracy.

This is because even with a stratified hold-out set of 20%, with only 146 data points, lots of missing data and and 18 poi's, there would be only 3 or so points to do a final test on. This does not give much confidence in the precision of the performance metrics on such a small hold-out set, while also negatively impacting the ability to create the model.

"when the number of samples is not large, a strong case can be made that a test set should be avoided because every sample may be needed for model building. (...) Additionally, the size of the test set may not have sufficient power or precision to make reasonable judgements. "

[1] Kuhn M., Kjell J.(2013). Applied Predictive Modeling. Springer. pp.67

Hawkins et al. (2003) concisely summarize this point:“holdout samples of tolerable size [. . . ] do not match the cross-validation itself for reliability in assessing model fit and are hard to motivate.”

[2] Hawkins D, Basak S, Mills D (2003). “Assessing Model Fit by Cross– Validation.” Journal of Chemical Information and Computer Sciences, 43(2), 579–586

This will be addressed with K-fold cross-validation resampling techniques.

Cross-Validation scheme

Stratified K-fold (10) - For each k/K split of the data (10 total)
1. For each 9-piece fold training set
  1. Create further Stratified K-folds for any grid searching/ model selection
  2. Tune model in k/K training set folds.
    1. Impute data (create function with training data)
    2. Perform Feature selection (create function with training data)
    3. Scale variables: (x_i - mean)/std (create function with training set)
    4. Pick final model parameters. (Select top 70th percentile)
  3. Create final model with best parameters.
  4. Test on 1/k test set.
2. For each remaining 1-piece test set
  1. Impute, Feature selection with training set function
  2. Fit training set model to test set
3. Assess Model metrics for each run through all 10 different variations.
Output final results/parameters for 'best' model.
Train final model on full dataset.

Version 2

Define sets of model parameters values to evaluate
for each parameter set in grid search DO
1. For each k-fold resampling iteration DO
  1. Hold-out 1/k samples/fold
  2. Pre-Process Data (Create functions on training set, apply to test set with same)
    1. Impute data (median)
    2. Scale features (x_i - mean))/std
    3. Perform any univariate feature selection (remove very low variation features)
    4. Modeling feature selection (ExtraTreesClassifier)
  3. Fit the model on the k/K training fold
  4. Predict the hold-out samples/fold
2. END
3. Calculate the average performance across hold-out predictions
END
Determine the optimal parameter set
Fit the final model to all training data using the optimal parameter set



In [567]:

    
data_dict['BELFER ROBERT'] = {'bonus': 'NaN',
                              'deferral_payments': 'NaN',
                              'deferred_income': -102500,
                              'director_fees': 102500,
                              'email_address': 'NaN',
                              'exercised_stock_options': 'NaN',
                              'expenses': 3285,
                              'from_messages': 'NaN',
                              'from_poi_to_this_person': 'NaN',
                              'from_this_person_to_poi': 'NaN',
                              'loan_advances': 'NaN',
                              'long_term_incentive': 'NaN',
                              'other': 'NaN',
                              'poi': False,
                              'restricted_stock': -44093,
                              'restricted_stock_deferred': 44093,
                              'salary': 'NaN',
                              'shared_receipt_with_poi': 'NaN',
                              'to_messages': 'NaN',
                              'total_payments': 3285,
                              'total_stock_value': 'NaN'}

data_dict['BHATNAGAR SANJAY'] = {'bonus': 'NaN',
                                 'deferral_payments': 'NaN',
                                 'deferred_income': 'NaN',
                                 'director_fees': 'NaN',
                                 'email_address': 'sanjay.bhatnagar@enron.com',
                                 'exercised_stock_options': 15456290,
                                 'expenses': 137864,
                                 'from_messages': 29,
                                 'from_poi_to_this_person': 0,
                                 'from_this_person_to_poi': 1,
                                 'loan_advances': 'NaN',
                                 'long_term_incentive': 'NaN',
                                 'other': 'NaN',
                                 'poi': False,
                                 'restricted_stock': 2604490,
                                 'restricted_stock_deferred': -2604490,
                                 'salary': 'NaN',
                                 'shared_receipt_with_poi': 463,
                                 'to_messages': 523,
                                 'total_payments': 137864,
                                 'total_stock_value': 15456290}



In [568]:

    
df = pd.DataFrame.from_dict(data_dict, orient='index')

'NaN' was imported as a string instead of a a missing value. We will convert these to NaN type and look how many missing values our data has.



In [498]:

    
# Replace 'NaN' strings with actual np.nan values
df = df.replace('NaN', np.nan)
# Replace email strings with True/False boolean as to whether an email was present or not
df['email_address'] = df['email_address'].fillna(0).apply(lambda x: x != 0, 1)



In [499]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 21 columns):
salary                       95 non-null float64
to_messages                  86 non-null float64
deferral_payments            39 non-null float64
total_payments               125 non-null float64
exercised_stock_options      102 non-null float64
bonus                        82 non-null float64
restricted_stock             110 non-null float64
shared_receipt_with_poi      86 non-null float64
restricted_stock_deferred    18 non-null float64
total_stock_value            126 non-null float64
expenses                     95 non-null float64
loan_advances                4 non-null float64
from_messages                86 non-null float64
other                        93 non-null float64
from_this_person_to_poi      86 non-null float64
poi                          146 non-null bool
director_fees                17 non-null float64
deferred_income              49 non-null float64
long_term_incentive          66 non-null float64
email_address                146 non-null bool
from_poi_to_this_person      86 non-null float64
dtypes: bool(2), float64(19)



In [500]:

    
df[df['salary'] > 1000000]
df[df.index == 'TOTAL']
df = df.drop('TOTAL', axis=0)



In [569]:

    
df.ix[11,:]









    Out[569]:





salary                                              NaN
to_messages                                         523
deferral_payments                                   NaN
total_payments                                   137864
exercised_stock_options                        15456290
bonus                                               NaN
restricted_stock                                2604490
shared_receipt_with_poi                             463
restricted_stock_deferred                      -2604490
total_stock_value                              15456290
expenses                                         137864
loan_advances                                       NaN
from_messages                                        29
other                                               NaN
from_this_person_to_poi                               1
poi                                               False
director_fees                                       NaN
deferred_income                                     NaN
long_term_incentive                                 NaN
email_address                sanjay.bhatnagar@enron.com
from_poi_to_this_person                               0
Name: BHATNAGAR SANJAY, dtype: object



In [570]:

    
df.ix[8,:]









    Out[570]:





salary                           NaN
to_messages                      NaN
deferral_payments                NaN
total_payments                  3285
exercised_stock_options          NaN
bonus                            NaN
restricted_stock              -44093
shared_receipt_with_poi          NaN
restricted_stock_deferred      44093
total_stock_value                NaN
expenses                        3285
loan_advances                    NaN
from_messages                    NaN
other                            NaN
from_this_person_to_poi          NaN
poi                            False
director_fees                 102500
deferred_income              -102500
long_term_incentive              NaN
email_address                    NaN
from_poi_to_this_person          NaN
Name: BELFER ROBERT, dtype: object



In [571]:

    
df = df.replace('NaN', np.nan)
df = df.apply(lambda x: x.fillna(0), axis=0)
df['total_stock_check'] = df['exercised_stock_options'] + df['restricted_stock'] + df['restricted_stock_deferred']



In [572]:

    
df['stocks_match'] = df['total_stock_check'] == df['total_stock_value']
stocks_df = df[['total_stock_check', 'total_stock_value', 'stocks_match']]
stocks_df[stocks_df.stocks_match==False]









    Out[572]:






  
    
      
      total_stock_check
      total_stock_value
      stocks_match



In [576]:



In [573]:

    
df['total_stock_check'] = df['exercised_stock_options'] + df['restricted_stock'] + df['restricted_stock_deferred']



In [573]:



In [574]:

    
df['total_pay'] = df['salary'] + df['bonus'] + df['long_term_incentive'] + df['deferred_income'] + \
df['deferral_payments'] + df['loan_advances'] + df['other'] + df['expenses'] + df['director_fees']



In [575]:

    
df['pay_matches'] = df['total_pay'] == df['total_payments']
pay_df = df[['total_pay', 'total_payments', 'pay_matches']]
pay_df[pay_df.pay_matches==False]









    Out[575]:






  
    
      
      total_pay
      total_payments
      pay_matches



In [565]:

    
data_dict['BHATNAGAR SANJAY']









    Out[565]:





{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 137864,
 'email_address': 'sanjay.bhatnagar@enron.com',
 'exercised_stock_options': 2604490,
 'expenses': 'NaN',
 'from_messages': 29,
 'from_poi_to_this_person': 0,
 'from_this_person_to_poi': 1,
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 137864,
 'poi': False,
 'restricted_stock': -2604490,
 'restricted_stock_deferred': 15456290,
 'salary': 'NaN',
 'shared_receipt_with_poi': 463,
 'to_messages': 523,
 'total_payments': 15456290,
 'total_stock_value': 'NaN'}



In [566]:

    
data_dict['BELFER ROBERT']









    Out[566]:





{'bonus': 'NaN',
 'deferral_payments': -102500,
 'deferred_income': 'NaN',
 'director_fees': 3285,
 'email_address': 'NaN',
 'exercised_stock_options': 3285,
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 44093,
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 102500,
 'total_stock_value': -44093}



In [ ]:



In [8]:

    
# df.pivot(index=df.index, columns='poi')
df.columns









    Out[8]:





Index([u'salary', u'to_messages', u'deferral_payments', u'total_payments', u'exercised_stock_options', u'bonus', u'restricted_stock', u'shared_receipt_with_poi', u'restricted_stock_deferred', u'total_stock_value', u'expenses', u'loan_advances', u'from_messages', u'other', u'from_this_person_to_poi', u'poi', u'director_fees', u'deferred_income', u'long_term_incentive', u'email_address', u'from_poi_to_this_person'], dtype='object')



In [ ]:



In [ ]:



In [ ]:



In [118]:

    
cols = [x for x in df.columns if x not in ['restricted_stock_deferred', 'loan_advances', 'director_fees']]



In [119]:

    
for each in cols:
    g = sns.FacetGrid(df, col='poi', margin_titles=True, size=6)
    g.map(plt.hist, each, color='steelblue')



In [114]:

    
print cols









    



['salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'email_address', 'from_poi_to_this_person']



In [10]:

    
df2 = df[['poi', 'exercised_stock_options']]
df2['log_eso'] = df['exercised_stock_options'].apply(np.log)



In [11]:

    
g = sns.FacetGrid(df2, col='poi', margin_titles=True, size=6)
g.map(plt.hist, 'log_eso', color='steelblue', bins=25)









    Out[11]:





<seaborn.axisgrid.FacetGrid at 0x189a52b0>



In [12]:

    
df.columns









    Out[12]:





Index([u'salary', u'to_messages', u'deferral_payments', u'total_payments', u'exercised_stock_options', u'bonus', u'restricted_stock', u'shared_receipt_with_poi', u'restricted_stock_deferred', u'total_stock_value', u'expenses', u'loan_advances', u'from_messages', u'other', u'from_this_person_to_poi', u'poi', u'director_fees', u'deferred_income', u'long_term_incentive', u'email_address', u'from_poi_to_this_person'], dtype='object')



In [13]:

    
df2 = df
df2['ratio_messages'] = df['from_messages']/df['to_messages']



In [14]:

    
g = sns.FacetGrid(df2, col='poi', margin_titles=True, size=6)
g.map(plt.hist, 'ratio_messages', color='steelblue', bins=20)









    Out[14]:





<seaborn.axisgrid.FacetGrid at 0x19f8f588>



In [8]:

    
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.base import TransformerMixin
from pandas import DataFrame
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import SelectKBest
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import f_classif
from sklearn.naive_bayes import GaussianNB



In [16]:

    
my_imputer = Imputer(missing_values = 'NaN', strategy='median', axis=0)



In [17]:

    
my_scaler = StandardScaler(with_mean=True, with_std=True)

By default, the GridSearchCV uses a 3-fold cross-validation. However, if it detects that a classifier is passed, rather than a regressor, it uses a stratified 3-fold.

http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html



In [18]:

    
from sklearn.base import TransformerMixin

class ColumnExtractor(TransformerMixin):
    '''
    Columns extractor transformer for sklearn pipelines.
    Inherits fit_transform() from TransformerMixin, but this is explicitly
    defined here for clarity.
    
    Methods to extract pandas dataframe columns are defined for this class.
    
    '''
    def __init__(self, columns=[]):
        self.column = columns
    
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
    
    def transform(self, X, **transform_params):
        '''
        Input: A pandas dataframe and a list of column names to extract.
        Output: A pandas dataframe containing only the columns of the names passed in.
        '''
        return X[self.columns]
    
    def fit(self, X, y=None, **fit_params):
        return self



In [19]:

    
class DenseTransformer(TransformerMixin):
    '''
    to_dense() transformer for sklearn pipelines.
    Inherits fit_transform() from TransformerMixin, but this is explicitly
    defined here for clarity.
    
    Methods to apply to_dense to the pandas dataframe are defined for this class.
    '''
    def transform(self, X, y=None, **fit_params):
        '''
        Input: A pandas dataframe.
        Output: A pandas dataframe with to_dense applied.
        '''
        return X.todense()
    
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
    
    def fit(self, X, y=None, **fit_params):

        return self



In [20]:

    
from pandas import DataFrame

class ModelTransformer(TransformerMixin):
    '''
    Transformer for sklearn pipeline which applies a prediction model to the input.
    
    Inherits fit_transform() from TransformerMixin.
    
    Methods to apply a model transformation to the input are defined for this class.
    
    e.g. Apply Kmeans clustering to X data and output clusters to be used as features
    for a subsequent model. 
    
    '''
    def __init__(self, model):
        '''
        Initialize with model to be used fit/transform methods.
        '''
        self.model = model
        
    def fit(self, *args, **kwargs):
        '''
        Fit model using models' required *args.
        '''
        self.model.fit(*args, **kwargs)
        return self
    
    def transform(self, X, **transform_params):
        '''
        Input: pandas DataFrame.
        Output: pandas DataFrame of predictions from model.
        '''
        return DataFrame(self.model.predict(X))



In [21]:

    
class HoursOfDayTransformer(TransformerMixin):
    '''
    Transformer for sklearn pipeline which extracts the hours from a 'DateTime' column
    of a pandas dataframe.
    
    Inherits fit_transform() from TransformerMixin.
    
    Methods to extract the hours from a 'DateTime' column of pandas dataframe are defined
    for this class.
    '''
    def transform(self, X, **transform_params):
        '''
        Input: pandas DataFrame with ['datetime] column to extract hours from.
        Output: pandas DataFrame of the extracted hours.
        '''
        hours = DataFrame(X['datetime'].apply(lambda x: x.hour))
        return hours
    
    def fit(self, X, y=None, **fit_params):
        '''
        Nothing to fit, return self.
        '''
        return self



In [21]:



In [21]:



In [22]:

    
from sklearn.preprocessing import Binarizer



In [23]:

    
df.columns









    Out[23]:





Index([u'salary', u'to_messages', u'deferral_payments', u'total_payments', u'exercised_stock_options', u'bonus', u'restricted_stock', u'shared_receipt_with_poi', u'restricted_stock_deferred', u'total_stock_value', u'expenses', u'loan_advances', u'from_messages', u'other', u'from_this_person_to_poi', u'poi', u'director_fees', u'deferred_income', u'long_term_incentive', u'email_address', u'from_poi_to_this_person', u'ratio_messages'], dtype='object')



In [24]:

    
binarizer = Binarizer(threshold=1000000)



In [25]:

    
binarizer.fit_transform(df['exercised_stock_options'].fillna(0))









    Out[25]:





array([ 1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,
        1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        1.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
        1.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,
        0.,  0.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,
        0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
        1.,  0.])



In [45]:









    Out[45]:





ALLEN PHILLIP K            8.563992
BADUM JAMES P                   NaN
BANNANTINE JAMES M      8482.509434
BAXTER JOHN C             25.011209
BAY FRANKLIN R                  NaN
BAZELIDES PHILIP J        19.793128
BECK SALLY W                    NaN
BELDEN TIMOTHY N           4.453927
BELFER ROBERT                   NaN
BERBERIAN DAVID            7.500143
BERGSIEKER RICHARD P            NaN
BHATNAGAR SANJAY                NaN
BIBI PHILIPPE A            6.861248
BLACHMAN JEREMY M          3.079160
BLAKE JR. NORMAN P              NaN
...
UMANOFF ADAM S                 NaN
URQUHART JOHN A                NaN
WAKEHAM JOHN                   NaN
WALLS JR ROBERT H        12.172091
WALTERS GARETH W               NaN
WASAFF GEORGE             6.416483
WESTFAHL RICHARD K             NaN
WHALEY DAVID A                 NaN
WHALLEY LAWRENCE G        6.432585
WHITE JR THOMAS E         4.084641
WINOKUR JR. HERBERT S          NaN
WODRASKA JOHN                  NaN
WROBEL BRUCE                   NaN
YEAGER F SCOTT           52.451986
YEAP SOON                      NaN
Name: exercise_sal, Length: 145, dtype: float64



In [71]:

    
X_df = df2.drop('poi', axis=1)
y_df = df2['poi']



In [12]:



In [13]:



In [14]:



In [15]:

    
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.cross_validation import LeaveOneOut
from sklearn.cross_validation import StratifiedShuffleSplit



In [23]:

    
sk_fold = StratifiedShuffleSplit(y_df, n_iter=10, test_size=0.1)
for train_index, test_index in sk_fold:
    X_train, X_test = X_df.irow(train_index), X_df.irow(test_index)
    y_train, y_test = y_df[train_index], y_df[test_index]
    
    pipeline = Pipeline([
      ('imputer', Imputer(missing_values = 'NaN', axis=0)),
      ('standardizer', StandardScaler(with_mean=True, with_std=True)),
      ('low_var_remover', VarianceThreshold(threshold=.10)),
      # ('pca', PCA()),
      ('ET', ExtraTreesClassifier(oob_score=True, bootstrap=True) )
    ])
    
    params = {'ET__n_estimators': [1500],
              'ET__max_features': ['auto', None, 3, 5, 10, 15],
              'ET__min_samples_split': [2, 4, 10],
              'ET__min_samples_leaf': [1, 2, 5],
              'ET__criterion' : ['gini', 'entropy'],
              'imputer__strategy': ['median', 'mean']}
    
    grid_search = GridSearchCV(pipeline, param_grid=params, n_jobs = -1, cv=3, scoring='f1')
    grid_search.fit(X_train, y=y_train)
    test_pred = grid_search.predict(X_test)
    #print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
    print "Best Estimator: ", grid_search.best_estimator_
    print "F1: ", f1_score(y_test, test_pred)
    print "Confusion Matrix: "
    print confusion_matrix(y_test, test_pred)
    print "Accuracy Score: ", accuracy_score(y_test, test_pred)
    print "Best Params: ", grid_search.best_params_
    print ""









    



Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,...les_split=2, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
F1:  0.5
Confusion Matrix: 
[[12  1]
 [ 1  1]]
Accuracy Score:  0.866666666667
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'gini', 'ET__max_features': None, 'imputer__strategy': 'median', 'ET__min_samples_split': 2, 'ET__min_samples_leaf': 1}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,...les_split=2, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
F1:  0.0
Confusion Matrix: 
[[12  1]
 [ 2  0]]
Accuracy Score:  0.8
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'gini', 'ET__max_features': 15, 'imputer__strategy': 'median', 'ET__min_samples_split': 2, 'ET__min_samples_leaf': 1}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,...es_split=10, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
F1:  0.666666666667
Confusion Matrix: 
[[13  0]
 [ 1  1]]
Accuracy Score:  0.933333333333
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'gini', 'ET__max_features': None, 'imputer__strategy': 'median', 'ET__min_samples_split': 10, 'ET__min_samples_leaf': 1}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
 ...les_split=4, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
F1:  0.4
Confusion Matrix: 
[[11  2]
 [ 1  1]]
Accuracy Score:  0.8
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'gini', 'ET__max_features': None, 'imputer__strategy': 'mean', 'ET__min_samples_split': 4, 'ET__min_samples_leaf': 1}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
 ...les_split=2, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
F1:  0.0
Confusion Matrix: 
[[13  0]
 [ 2  0]]
Accuracy Score:  0.866666666667
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'entropy', 'ET__max_features': 15, 'imputer__strategy': 'mean', 'ET__min_samples_split': 2, 'ET__min_samples_leaf': 1}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
 ...les_split=4, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
F1:  0.0
Confusion Matrix: 
[[13  0]
 [ 2  0]]
Accuracy Score:  0.866666666667
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'gini', 'ET__max_features': 'auto', 'imputer__strategy': 'mean', 'ET__min_samples_split': 4, 'ET__min_samples_leaf': 1}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
 ...es_split=10, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
F1:  0.0
Confusion Matrix: 
[[13  0]
 [ 2  0]]
Accuracy Score:  0.866666666667
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'entropy', 'ET__max_features': 15, 'imputer__strategy': 'mean', 'ET__min_samples_split': 10, 'ET__min_samples_leaf': 1}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,...les_split=2, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
F1:  0.0
Confusion Matrix: 
[[13  0]
 [ 2  0]]
Accuracy Score:  0.866666666667
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'gini', 'ET__max_features': None, 'imputer__strategy': 'median', 'ET__min_samples_split': 2, 'ET__min_samples_leaf': 2}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
 ...les_split=2, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
F1:  0.666666666667
Confusion Matrix: 
[[13  0]
 [ 1  1]]
Accuracy Score:  0.933333333333
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'gini', 'ET__max_features': 'auto', 'imputer__strategy': 'mean', 'ET__min_samples_split': 2, 'ET__min_samples_leaf': 1}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,...les_split=2, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
F1:  0.0
Confusion Matrix: 
[[13  0]
 [ 2  0]]
Accuracy Score:  0.866666666667
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'gini', 'ET__max_features': 'auto', 'imputer__strategy': 'median', 'ET__min_samples_split': 2, 'ET__min_samples_leaf': 1}



In [ ]:

    
F1:  0.666666666667
Confusion Matrix: 
[[13  0]
 [ 1  1]]
Accuracy Score:  0.933333333333
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'gini', 'ET__max_features': None, 'imputer__strategy': 'median', 'ET__min_samples_split': 10, 'ET__min_samples_leaf': 1}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
 ...les_split=4, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])



In [ ]:

    
F1:  0.666666666667
Confusion Matrix: 
[[13  0]
 [ 1  1]]
Accuracy Score:  0.933333333333
Best Params:  {'ET__n_estimators': 1500, 'ET__criterion': 'gini', 'ET__max_features': 'auto', 'imputer__strategy': 'mean', 'ET__min_samples_split': 2, 'ET__min_samples_leaf': 1}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,...les_split=2, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])



In [163]:

    
n_iter = 100
sk_fold = StratifiedShuffleSplit(y_df, n_iter=n_iter, test_size=0.1)
f1_avg = []
for train_index, test_index in sk_fold:
    X_train, X_test = X_df.irow(train_index), X_df.irow(test_index)
    y_train, y_test = y_df[train_index], y_df[test_index]
    
    grid_search = Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), 
                               ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), 
                               ('low_var_remover', VarianceThreshold(threshold=0.1)),
                               ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None, 
                                                           n_estimators=1500, n_jobs=1, criterion='gini',
                                                           oob_score=True, random_state=None, verbose=0,
                                                           max_features=None, min_samples_split=10,
                                                           min_samples_leaf=1))])
    
    #params = {'ET__n_estimators': [1500], 
    #          'ET__criterion': ['gini'], 
    #          'ET__max_features': [None], 
    #          'imputer__strategy': ['median'],
    #          'ET__min_samples_split': [10], 
    #          'ET__min_samples_leaf': [1]}
    
    #grid_search = GridSearchCV(pipeline, param_grid=params, n_jobs = -1, cv=3, scoring='f1')
    grid_search.fit(X_train, y=y_train)
    test_pred = grid_search.predict(X_test)
    #print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
    #print "Best Estimator: ", grid_search.best_estimator_
    f1_avg.append(f1_score(y_test, test_pred))
    #print "F1: ", f1_score(y_test, test_pred)
    #print "Confusion Matrix: "
    #print confusion_matrix(y_test, test_pred)
    #print "Accuracy Score: ", accuracy_score(y_test, test_pred)
    #print "Best Params: ", grid_search.best_params_
    #print ""
    
print sum(f1_avg)/n_iter









    



0.278333333333

Impute, standardize, remove low variance, linearSVC, ExtraTrees



In [204]:

    
from sklearn.svm import LinearSVC
sk_fold = StratifiedShuffleSplit(y_df, n_iter=10, test_size=0.2) 
        
pipeline = Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)),
                           ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)),
                           ('low_var_remover', VarianceThreshold(threshold=0.1)), 
                           ('feature_selection', LinearSVC()),
                           ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
                                                       criterion='gini', n_estimators=1500, n_jobs=1,
                                                       oob_score=True, random_state=None, verbose=0,
                                                       max_features='auto', min_samples_split=2,
                                                       min_samples_leaf=1))])
    
params = {'ET__n_estimators': [1500],
          'ET__max_features': ['auto', None],
          'ET__min_samples_split': [2, 4, 10],
          'ET__min_samples_leaf': [1, 2, 5],
          'feature_selection__C': [1, 10, 100],
          #'ET__criterion' : ['gini', 'entropy'],
          #'imputer__strategy': ['median', 'mean'],
          'low_var_remover': [0, 0.1, .25, .50]}
    
grid_search = GridSearchCV(pipeline, param_grid=params, cv=sk_fold, n_jobs = -1, scoring='f1')
grid_search.fit(X_df, y=y_df)
#test_pred = grid_search.predict(X_test)
#print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
print "Best Estimator: ", grid_search.best_estimator_
    #f1_avg.append(f1_score(y_test, test_pred))
#print "F1: ", f1_score(y_test, test_pred)
#print "Confusion Matrix: "
#print confusion_matrix(y_test, test_pred)
#print "Accuracy Score: ", accuracy_score(y_test, test_pred)
print "Best Params: ", grid_search.best_params_
#print ""
    
#print sum(f1_avg)/n_iter









    



Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('feature_selection', LinearSVC(C=10, class_weight=None, dual=True, f...les_split=4, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
Best Params:  {'ET__n_estimators': 1500, 'ET__max_features': None, 'feature_selection__C': 10, 'low_var_remover': 0.5, 'ET__min_samples_split': 4, 'ET__min_samples_leaf': 1}



In [ ]:



In [ ]:

Cross validation of final model



In [212]:

    
n_iter = 100
sk_fold = StratifiedShuffleSplit(y_df, n_iter=n_iter, test_size=0.1)
f1_avg = []
for train_index, test_index in sk_fold:
    X_train, X_test = X_df.irow(train_index), X_df.irow(test_index)
    y_train, y_test = y_df[train_index], y_df[test_index]

    grid_search.best_estimator_.fit(X_train, y=y_train)
    test_pred = grid_search.predict(X_test)
    #print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
    #print "Best Estimator: ", grid_search.best_estimator_
    #print f1_score(y_test, test_pred)
    f1_avg.append(f1_score(y_test, test_pred))
print sum(f1_avg)/n_iter









    



0.263952380952

Imputation



In [276]:

    
from sklearn.neighbors import KNeighborsRegressor



In [277]:

    
income_imputer = KNeighborsRegressor(n_neighbors=1)



In [353]:

    
df_salary = df_50[df_50.salary.isnull()==False]
df_null_salary = df_50[df_50.salary.isnull()==True]



In [354]:

    
import seaborn as sns
df_salary.corr()
sns.set(style='darkgrid')

f, ax = plt.subplots(figsize=(14, 14))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
sns.corrplot(df_salary.corr(), annot=True, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()



In [355]:

    
df_salary.corr().ix[: ,0] # Pick the first column which we are predicting.









    Out[355]:





salary                     1.000000
to_messages                0.187047
total_payments             0.579260
exercised_stock_options    0.607324
bonus                      0.523190
restricted_stock           0.550824
shared_receipt_with_poi    0.284995
total_stock_value          0.614736
expenses                   0.145364
from_messages             -0.003541
other                      0.606903
from_this_person_to_poi    0.021288
poi                        0.264976
email_address              0.234134
from_poi_to_this_person    0.179055
Name: salary, dtype: float64



In [440]:

    
def kcluster_null(df=None, cols=None):
    # Asssert values are passed in. Very lax check here for now.
    #assert df not None, "Please prove a pandas dataframe"
    #assert cols not None, "please provide a list of columns to impute"
    
    # Create a KNN regression estimator for 
    income_imputer = KNeighborsRegressor(n_neighbors=1)
    # Loops through the columns passed in to impute each one sequentially.
    # Ideally these should be somewhat correlated since they will be used in KNN to
    # predict each other, one column at a time.
    for each in cols:
        # Create a temp list that does not include the column being predicted.
        temp_cols = [col for col in cols if col != each]
        # Create a dataframe that contains no missing values in the columns being predicted.
        # This will be used to train the KNN estimator
        df_col = df[df[each].isnull()==False]
        # Create a dataframe with all of the nulls in the column being predicted.
        df_null_col = df[df[each].isnull()==True]
        
        # Create a temp dataframe filling in the medians for each column being used to
        # predict that is missing values.
        # This step is needed since we have so many missing values distributed through 
        # all of the columns.
        temp_df_medians = df_col[temp_cols].apply(lambda x: x.fillna(x.median()), axis=0)
        # Fit our KNN imputer to this dataframe now that we have values for every column.
        income_imputer.fit(temp_df_medians, df_col[each])
        
        # Fill the df (that has null values being predicted) with medians in the other
        # columns not being predicted.
        # ** This currently uses its own medians and should ideally use the predictor df's
        # ** median values to fill in NA's of columns being used to predict.
        temp_null_medians = df_null_col[temp_cols].apply(lambda x: x.fillna(x.median()), axis=0)
        
        # Predict the null values for the current 'each' variable.
        new_values = income_imputer.predict(temp_null_medians[temp_cols])
        # Replace the null values of the original null dataframe with the predicted values.
        df_null_col[each] = new_values
        # Append the new predicted nulls dataframe to the dataframe which containined
        # no null values.
        # Overwrite the original df with this one containing predicted columns. 
        # Index order will not be preserved since it is rearranging each time by 
        # null values.
        df = df_col.append(df_null_col)
    # Returned final dataframe sorted by the index names.
    return df.sort_index(axis=0)



In [447]:

    
cols = ['salary', 'other', 'total_stock_value', 'exercised_stock_options', 
        'total_payments', 'restricted_stock']

imputed_df = kcluster_null(df_50, cols = cols)
imputed_df.salary.plot()









    Out[447]:





<matplotlib.axes._subplots.AxesSubplot at 0x2acb2240>



In [448]:

    
df_50_salary_done.sort_index(axis=0).salary.plot()









    Out[448]:





<matplotlib.axes._subplots.AxesSubplot at 0x2ae9b2b0>



In [450]:

    
# Same values as manually imputing! Yay
len(df_50_salary_done.sort_index(axis=0).salary == imputed_df.salary)









    Out[450]:





144



In [418]:

    
# Find some the highest correlated columns to predict salary.
cols = ['other', 'total_stock_value', 'exercised_stock_options', 
        'total_payments', 'restricted_stock']
# Temporarily fill in any missing values in our predictors with the median.
# Medians will be more robust for our skewed data.

## median_imputer = Imputer(missing_values = 'NaN', axis=0)
## median_imputer.fit(df_salary[cols])
## temp_df_medians = median_imputer.transform(df_salary[cols])
#temp_df_medians = df_salary[cols].apply(lambda x: x.fillna(x.median()), axis=0)


df_col_medians = list(df_salary[cols].apply(lambda x: x.median(), axis=0))


income_imputer.fit(temp_df_medians, df_salary.salary)









    Out[418]:





KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_neighbors=1, p=2, weights='uniform')



In [419]:

    
## temp_null_medians[cols] = median_imputer.transform(df_null_salary[cols])
temp_null_medians = df_null_salary[cols].apply(lambda x: x.fillna(x.median()), axis=0)

new_values = income_imputer.predict(temp_null_medians[cols])



In [420]:

    
df_null_salary['salary'] = new_values



In [421]:

    
df_50_salary_done = df_salary.append(df_null_salary)



In [422]:

    
df_50_salary_done.salary.plot(figsize=(14,12), label='kcluster')
#df.salary.fillna(df.salary.median()).plot()
#df.salary.plot(label='missing')









    Out[422]:





<matplotlib.axes._subplots.AxesSubplot at 0x1b646550>



In [390]:

    
df_50_salary_done.info()









    



<class 'pandas.core.frame.DataFrame'>
Index: 144 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 15 columns):
salary                     144 non-null float64
to_messages                86 non-null float64
total_payments             124 non-null float64
exercised_stock_options    101 non-null float64
bonus                      81 non-null float64
restricted_stock           109 non-null float64
shared_receipt_with_poi    86 non-null float64
total_stock_value          125 non-null float64
expenses                   94 non-null float64
from_messages              86 non-null float64
other                      92 non-null float64
from_this_person_to_poi    86 non-null float64
poi                        144 non-null bool
email_address              144 non-null bool
from_poi_to_this_person    86 non-null float64
dtypes: bool(2), float64(13)



In [451]:

    
df_50_salary_done.salary.nunique()









    Out[451]:





93



In [453]:

    
imputed_df.info()









    



<class 'pandas.core.frame.DataFrame'>
Index: 144 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 15 columns):
salary                     144 non-null float64
to_messages                86 non-null float64
total_payments             144 non-null float64
exercised_stock_options    144 non-null float64
bonus                      81 non-null float64
restricted_stock           144 non-null float64
shared_receipt_with_poi    86 non-null float64
total_stock_value          144 non-null float64
expenses                   94 non-null float64
from_messages              86 non-null float64
other                      144 non-null float64
from_this_person_to_poi    86 non-null float64
poi                        144 non-null bool
email_address              144 non-null bool
from_poi_to_this_person    86 non-null float64
dtypes: bool(2), float64(13)



In [493]:

    
df_50 = df_50.apply(lambda x: x.fillna(0), axis=0)

df_50['total_adj_compensation'] = df_50['salary'] + df_50['bonus'] + \
df_50['expenses'] + df_50['restricted_stock'] + df_50['other'] + \
df_50['exercised_stock_options']



df_50[['poi', 'total_adj_compensation']].sort('total_adj_compensation', ascending = False).head(50)









    Out[493]:






  
    
      
      poi
      total_adj_compensation
    
  
  
    
      LAY KENNETH L
        True
       67641960
    
    
      SKILLING JEFFREY K
        True
       32856388
    
    
      HIRKO JOSEPH
        True
       30846898
    
    
      PAI LOU L
       False
       26941313
    
    
      FREVERT MARK A
       False
       25197725
    
    
      RICE KENNETH D
        True
       24934964
    
    
      WHITE JR THOMAS E
       False
       17078482
    
    
      BAXTER JOHN C
       False
       14761863
    
    
      LAVORATO JOHN J
       False
       13557521
    
    
      YEAGER F SCOTT
        True
       12245058
    
    
      DERRICK JR. JAMES V
       False
       11970274
    
    
      DIMICHELE RICHARD G
       False
        9991071
    
    
      WHALLEY LAWRENCE G
       False
        9948365
    
    
      HANNON KEVIN P
        True
        8179747
    
    
      REDMOND BRIAN L
       False
        8001853
    
    
      KEAN STEVEN J
       False
        7601164
    
    
      IZZO LAWRENCE L
       False
        7487076
    
    
      OVERDYKE JR JERE C
       False
        7421545
    
    
      ELLIOTT STEVEN
       False
        7291189
    
    
      HORTON STANLEY C
       False
        7256648
    
    
      WALLS JR ROBERT H
       False
        7157026
    
    
      DELAINEY DAVID W
        True
        7067259
    
    
      SHERRIFF JOHN R
       False
        6909948
    
    
      BELDEN TIMOTHY N
        True
        6802756
    
    
      BANNANTINE JAMES M
       False
        6725010
    
    
      ALLEN PHILLIP K
       False
        6246543
    
    
      CHRISTODOULOU DIOMEDES
       False
        6077885
    
    
      FALLON JAMES B
       False
        5634392
    
    
      THORN TERENCE H
       False
        5512663
    
    
      MARTIN AMANDA K
       False
        5246458
    
    
      BUY RICHARD B
       False
        5075588
    
    
      MCMAHON JEFFREY
       False
        5067764
    
    
      REYNOLDS LAWRENCE
       False
        4749015
    
    
      MCCONNELL MICHAEL S
       False
        4648221
    
    
      TAYLOR MITCHELL S
       False
        4610262
    
    
      SHANKMAN JEFFREY A
       False
        4556315
    
    
      SHELBY REX
        True
        4501668
    
    
      CAUSEY RICHARD A
        True
        4255821
    
    
      KITCHEN LOUISE
       False
        4018284
    
    
      FASTOW ANDREW S
        True
        3868495
    
    
      OLSON CINDY K
       False
        3750604
    
    
      LINDHOLM TOD A
       False
        3561022
    
    
      BIBI PHILIPPE A
       False
        3521688
    
    
      KOENIG MARK E
        True
        3207476
    
    
      KOPPER MICHAEL J
        True
        3034973
    
    
      DETMERING TIMOTHY J
       False
        3031793
    
    
      RIEKER PAULA H
        True
        2903309
    
    
      HAEDICKE MARK E
       False
        2785595
    
    
      MULLER MARK S
       False
        2769449
    
    
      BERBERIAN DAVID
       False
        2722090



In [486]:

    
g = sns.FacetGrid(df_50, col='poi', margin_titles=True, size=6)
g.map(plt.hist, 'total_adj_compensation', color='steelblue')









    Out[486]:





<seaborn.axisgrid.FacetGrid at 0x2ba08a58>

Remove columns with less than 50% of entries present.

Remove rows with no non-NA values



In [291]:

    
low_var_remover = VarianceThreshold(threshold=.5)



In [336]:

    
# Remove columns with more than 50% NA's
df_50 = df.dropna(axis=1, thresh=len(df)/2)
# Since email_address and poi are True/False, every record should have at least 2 non-NA.
# We'll next remove any rows that don't have at least 3 non-NA values.
# This will only remove records that are completely blank except for poi/email_address.
df_50 = df_50.dropna(axis=0, thresh=3)
# One record was removed
df_50.info()









    



<class 'pandas.core.frame.DataFrame'>
Index: 144 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 15 columns):
salary                     94 non-null float64
to_messages                86 non-null float64
total_payments             124 non-null float64
exercised_stock_options    101 non-null float64
bonus                      81 non-null float64
restricted_stock           109 non-null float64
shared_receipt_with_poi    86 non-null float64
total_stock_value          125 non-null float64
expenses                   94 non-null float64
from_messages              86 non-null float64
other                      92 non-null float64
from_this_person_to_poi    86 non-null float64
poi                        144 non-null bool
email_address              144 non-null bool
from_poi_to_this_person    86 non-null float64
dtypes: bool(2), float64(13)



In [337]:

    
pd.value_counts(df_50.poi)









    Out[337]:





False    126
True      18
dtype: int64



In [455]:

    
imputed_df.corr()
sns.set(style='darkgrid')

f, ax = plt.subplots(figsize=(14, 14))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
sns.corrplot(imputed_df.corr(), annot=True, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()



In [300]:

    
help(df.dropna)









    



Help on method dropna in module pandas.core.frame:

dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
    Return object with labels on given axis omitted where alternately any
    or all of the data are missing
    
    Parameters
    ----------
    axis : {0, 1}, or tuple/list thereof
        Pass tuple or list to drop on multiple axes
    how : {'any', 'all'}
        * any : if any NA values are present, drop that label
        * all : if all values are NA, drop that label
    thresh : int, default None
        int value : require that many non-NA values
    subset : array-like
        Labels along other axis to consider, e.g. if you are dropping rows
        these would be a list of columns to include
    inplace : boolean, defalt False
        If True, do operation inplace and return None.
    
    Returns
    -------
    dropped : DataFrame

Feature Importances



In [186]:

    
## Add features
df2 = df.copy()
df2 = df2.apply(lambda x: x.fillna(x.median()), axis=0)
df2['EXERCISE_SALARY'] = df2['exercised_stock_options']/df2['salary']
df2['LOG_EX_STOCK_OPT'] = df2['exercised_stock_options']
df2['DEFFERED_STOCK'] = df2['deferred_income']/df2['total_stock_value']
df2['EXERCISE_TOTAL_STOCK'] = df2['exercised_stock_options']/df2['total_stock_value']
df2['EXERCISE_DEFFERED'] = df2['exercised_stock_options']/df2['deferred_income']
#df2['TWO_FROM'] = df2['from_poi_to_this_person']/df2['from_this_person_to_poi']
X_df = df2.drop('poi', axis=1)
y_df = df2['poi']



In [222]:

    
#print X_noNA
rf = ExtraTreesClassifier(n_estimators=2000, min_samples_split=4, 
                          max_features=None, min_samples_leaf=1)
rf.fit(X_df, y_df)
rf.feature_importances_









    Out[222]:





array([  1.87941663e-02,   1.77841193e-02,   2.03552876e-02,
         2.91209831e-02,   9.39857160e-02,   5.05642266e-02,
         3.34005427e-02,   5.80753065e-02,   2.86125730e-03,
         5.41747300e-02,   5.87527473e-02,   1.84430721e-04,
         1.63960459e-02,   5.01092647e-02,   6.31635285e-02,
         6.44387166e-06,   7.74600005e-02,   4.55756682e-02,
         6.67689742e-03,   2.86507662e-02,   9.11663567e-02,
         1.27086049e-02,   9.36818482e-02,   4.17423414e-02,
         3.46087200e-02])



In [223]:

    
importances = rf.feature_importances_
sorted_idx = np.argsort(importances)



In [224]:

    
padding = np.arange(len(X_df.columns)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, X_df.columns[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable IMportance")
plt.show()

# http://nbviewer.ipython.org/github/yhat/DataGotham2013/blob/master/notebooks/7%20-%20Feature%20Engineering.ipynb



In [225]:

    
pd.crosstab(df.poi, df.email_address)









    Out[225]:






  
    
      email_address
      False
      True
    
    
      poi
      
      
    
  
  
    
      False
       34
       93
    
    
      True 
        0
       18



In [ ]:

Test New Methods



In [ ]:

    
from sklearn.svm import LinearSVC
sk_fold = StratifiedShuffleSplit(y_df, n_iter=10, test_size=0.2) 
        
pipeline = Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)),
                           ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)),
                           ('low_var_remover', VarianceThreshold(threshold=0.1)), 
                           ('feature_selection', LinearSVC()),
                           ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
                                                       criterion='gini', n_estimators=1500, n_jobs=1,
                                                       oob_score=True, random_state=None, verbose=0,
                                                       max_features='auto', min_samples_split=2,
                                                       min_samples_leaf=1))])
    
params = {'ET__n_estimators': [1500],
          'ET__max_features': ['auto', None],
          'ET__min_samples_split': [2, 4, 10],
          'ET__min_samples_leaf': [1, 2, 5],
          'feature_selection__C': [1, 10, 100],
          #'ET__criterion' : ['gini', 'entropy'],
          #'imputer__strategy': ['median', 'mean'],
          'low_var_remover': [0, 0.1, .25, .50]}
    
grid_search = GridSearchCV(pipeline, param_grid=params, cv=sk_fold, n_jobs = -1, scoring='f1')
grid_search.fit(X_df, y=y_df)
#test_pred = grid_search.predict(X_test)
#print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
print "Best Estimator: ", grid_search.best_estimator_
    #f1_avg.append(f1_score(y_test, test_pred))
#print "F1: ", f1_score(y_test, test_pred)
#print "Confusion Matrix: "
#print confusion_matrix(y_test, test_pred)
#print "Accuracy Score: ", accuracy_score(y_test, test_pred)
print "Best Params: ", grid_search.best_params_



In [ ]:

    
n_iter = 100
sk_fold = StratifiedShuffleSplit(y_df, n_iter=n_iter, test_size=0.1)
f1_avg = []
for train_index, test_index in sk_fold:
    X_train, X_test = X_df.irow(train_index), X_df.irow(test_index)
    y_train, y_test = y_df[train_index], y_df[test_index]

    grid_search.best_estimator_.fit(X_train, y=y_train)
    test_pred = grid_search.predict(X_test)
    #print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
    #print "Best Estimator: ", grid_search.best_estimator_
    #print f1_score(y_test, test_pred)
    f1_avg.append(f1_score(y_test, test_pred))
print sum(f1_avg)/n_iter



In [ ]:

Just pipeline



In [20]:

    
pipeline = Pipeline([
    ('imputer', Imputer(missing_values = 'NaN', axis=0)),
    ('standardizer', StandardScaler(with_mean=True, with_std=True)),
    ('low_var_remover', VarianceThreshold(threshold=.10)),
    # ('pca', PCA()),
    ('ET', ExtraTreesClassifier(oob_score=True, bootstrap=True) )
])
    
params = {'ET__n_estimators': [500],
          'ET__max_features': ['auto', None, 3, 5, 10, 15],
          'ET__min_samples_split': [2, 4, 10],
          'ET__min_samples_leaf': [1, 2, 5],
          'ET__criterion' : ['gini', 'entropy'],
          'imputer__strategy': ['median', 'mean']}
    
grid_search = GridSearchCV(pipeline, param_grid=params, n_jobs = -1, cv=3, scoring='f1', verbose=1)
grid_search.fit(X_train, y=y_train)
test_pred = grid_search.predict(X_test)
#print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
print "Best Estimator: ", grid_search.best_estimator_
print "F1: ", f1_score(y_test, test_pred)
print "Confusion Matrix: "
print confusion_matrix(y_test, test_pred)
print "Accuracy Score: ", accuracy_score(y_test, test_pred)
print "Best Params: ", grid_search.best_params_
print ""









    



[Parallel(n_jobs=-1)]: Done   1 jobs       | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done  50 jobs       | elapsed:    5.3s
[Parallel(n_jobs=-1)]: Done 200 jobs       | elapsed:   16.1s
[Parallel(n_jobs=-1)]: Done 450 jobs       | elapsed:   34.9s
[Parallel(n_jobs=-1)]: Done 634 out of 648 | elapsed:   48.6s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done 648 out of 648 | elapsed:   49.5s finished






    



Fitting 3 folds for each of 216 candidates, totalling 648 fits
Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
 ...ples_split=2, n_estimators=500, n_jobs=1, oob_score=True,
           random_state=None, verbose=0))])
F1:  0.285714285714
Confusion Matrix: 
[[42  0]
 [ 5  1]]
Accuracy Score:  0.895833333333
Best Params:  {'ET__n_estimators': 500, 'ET__criterion': 'gini', 'ET__max_features': 15, 'imputer__strategy': 'mean', 'ET__min_samples_split': 2, 'ET__min_samples_leaf': 1}



In [ ]:



In [ ]:

    
# Logistic Regression Baseline



In [55]:

    
from sklearn.linear_model import LogisticRegression

sk_fold = StratifiedKFold(y_df, n_folds=4, shuffle=True)
for train_index, test_index in sk_fold:
    X_train, X_test = X_df.irow(train_index), X_df.irow(test_index)
    y_train, y_test = y_df[train_index], y_df[test_index]
    
    pipeline = Pipeline([
      ('imputer', Imputer(missing_values = 'NaN', strategy='median', axis=0)),
      ('standardizer', StandardScaler(with_mean=True, with_std=True)),
      ('low_var_remover', VarianceThreshold(threshold=.10)),
      ('LR', LogisticRegression() )
    ])
    
    params = {'LR__C': [.01, 1, 10]}
    
    grid_search = GridSearchCV(pipeline, param_grid=params, n_jobs = -1, cv=None, scoring='f1')
    grid_search.fit(X_train, y=y_train)
    test_pred = grid_search.predict(X_test)
    #print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
    print "Best Estimator: ", grid_search.best_estimator_
    print "F1: ", f1_score(y_test, test_pred)
    print "Confusion Matrix: "
    print confusion_matrix(y_test, test_pred)
    print "Accuracy Score: ", accuracy_score(y_test, test_pred)
    print "Best Params: ", grid_search.best_params_
    print ""









    



Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('LR', LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001))])
F1:  0.285714285714
Confusion Matrix: 
[[31  1]
 [ 4  1]]
Accuracy Score:  0.864864864865
Best Params:  {'LR__C': 0.01}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('LR', LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001))])
F1:  0.333333333333
Confusion Matrix: 
[[27  5]
 [ 3  2]]
Accuracy Score:  0.783783783784
Best Params:  {'LR__C': 0.01}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('LR', LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001))])
F1:  0.0
Confusion Matrix: 
[[28  4]
 [ 4  0]]
Accuracy Score:  0.777777777778
Best Params:  {'LR__C': 0.01}

Best Estimator:  Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('LR', LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001))])
F1:  0.666666666667
Confusion Matrix: 
[[31  0]
 [ 2  2]]
Accuracy Score:  0.942857142857
Best Params:  {'LR__C': 0.01}



In [54]:

    
## Baseline braindead scores for returning the highest frequency 0 as a guess.









    Out[54]:





{'LR__C': 1}



In [40]:

    
sk_fold = StratifiedKFold(y_df, n_folds=4, shuffle=True)
for train_index, test_index in sk_fold:
    X_train, X_test = X_df.irow(train_index), X_df.irow(test_index)
    y_train, y_test = y_df[train_index], y_df[test_index]
    
    test_pred = np.zeros(y_test.shape)
    print "F1: ", f1_score(y_test, test_pred)
    print "Confusion Matrix: "
    print confusion_matrix(y_test, test_pred)
    print "Accuracy Score: ", accuracy_score(y_test, test_pred)
    print ""









    



F1:  0.0
Confusion Matrix: 
[[32  0]
 [ 5  0]]
Accuracy Score:  0.864864864865

F1:  0.0
Confusion Matrix: 
[[32  0]
 [ 5  0]]
Accuracy Score:  0.864864864865

F1:  0.0
Confusion Matrix: 
[[32  0]
 [ 4  0]]
Accuracy Score:  0.888888888889

F1:  0.0
Confusion Matrix: 
[[31  0]
 [ 4  0]]
Accuracy Score:  0.885714285714



In [ ]:



In [ ]:



In [ ]:

	poi	total_adj_compensation
LAY KENNETH L	True	67641960
SKILLING JEFFREY K	True	32856388
HIRKO JOSEPH	True	30846898
PAI LOU L	False	26941313
FREVERT MARK A	False	25197725
RICE KENNETH D	True	24934964
WHITE JR THOMAS E	False	17078482
BAXTER JOHN C	False	14761863
LAVORATO JOHN J	False	13557521
YEAGER F SCOTT	True	12245058
DERRICK JR. JAMES V	False	11970274
DIMICHELE RICHARD G	False	9991071
WHALLEY LAWRENCE G	False	9948365
HANNON KEVIN P	True	8179747
REDMOND BRIAN L	False	8001853
KEAN STEVEN J	False	7601164
IZZO LAWRENCE L	False	7487076
OVERDYKE JR JERE C	False	7421545
ELLIOTT STEVEN	False	7291189
HORTON STANLEY C	False	7256648
WALLS JR ROBERT H	False	7157026
DELAINEY DAVID W	True	7067259
SHERRIFF JOHN R	False	6909948
BELDEN TIMOTHY N	True	6802756
BANNANTINE JAMES M	False	6725010
ALLEN PHILLIP K	False	6246543
CHRISTODOULOU DIOMEDES	False	6077885
FALLON JAMES B	False	5634392
THORN TERENCE H	False	5512663
MARTIN AMANDA K	False	5246458
BUY RICHARD B	False	5075588
MCMAHON JEFFREY	False	5067764
REYNOLDS LAWRENCE	False	4749015
MCCONNELL MICHAEL S	False	4648221
TAYLOR MITCHELL S	False	4610262
SHANKMAN JEFFREY A	False	4556315
SHELBY REX	True	4501668
CAUSEY RICHARD A	True	4255821
KITCHEN LOUISE	False	4018284
FASTOW ANDREW S	True	3868495
OLSON CINDY K	False	3750604
LINDHOLM TOD A	False	3561022
BIBI PHILIPPE A	False	3521688
KOENIG MARK E	True	3207476
KOPPER MICHAEL J	True	3034973
DETMERING TIMOTHY J	False	3031793
RIEKER PAULA H	True	2903309
HAEDICKE MARK E	False	2785595
MULLER MARK S	False	2769449
BERBERIAN DAVID	False	2722090