Enron Scandal: Identifying Persons of Interest

Identification of Enron employees who may have committed fraud

Supervised Learning: Binary Classification

Data: Enron financial dataset from Udacity


In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import helper
import keras

helper.info_gpu()
#sns.set_palette("Reds")
helper.reproducible(seed=0)  # set up reproducible results from run to run with Keras

%matplotlib inline
%load_ext autoreload
%autoreload


Using TensorFlow backend.
/device:GPU:0
Keras		v2.1.4
TensorFlow	v1.4.1

1. Data Processing and Exploratory Data Analysis

Load the Data


In [2]:
data_path = 'data/enron_financial_data.pkl'
target = ['poi']

df = pd.read_pickle(data_path)
df = pd.DataFrame.from_dict(df, orient='index')

Explore the Data


In [3]:
helper.info_data(df, target)


Samples: 	146 
Features: 	20
Target: 	poi
Binary target: 	{False: 128, True: 18}
Ratio 		7.1 : 1.0
Dummy accuracy:	0.88

Imbalanced target: the evaluation metric used for this problem is the Area Under the ROC Curve (ROC-AUC)
poi = person of interest (boolean)
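The 0.88 dummy accuracy is simply the majority-class rate; a quick sanity check (a sketch, using the df loaded above):

n_poi = df['poi'].sum()        # 18 persons of interest
print(1 - n_poi / len(df))     # 128/146 ≈ 0.88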


In [4]:
df.head(3)


Out[4]:
salary to_messages deferral_payments total_payments loan_advances bonus email_address restricted_stock_deferred deferred_income total_stock_value ... from_poi_to_this_person exercised_stock_options from_messages other from_this_person_to_poi poi long_term_incentive shared_receipt_with_poi restricted_stock director_fees
ALLEN PHILLIP K 201955 2902 2869717 4484442 NaN 4175000 phillip.allen@enron.com -126027 -3081055 1729541 ... 47 1729541 2195 152 65 False 304805 1407 126027 NaN
BADUM JAMES P NaN NaN 178980 182466 NaN NaN NaN NaN NaN 257817 ... NaN 257817 NaN NaN NaN False NaN NaN NaN NaN
BANNANTINE JAMES M 477 566 NaN 916197 NaN NaN james.bannantine@enron.com -560222 -5104 5243487 ... 39 4046157 29 864523 0 False NaN 465 1757552 NaN

3 rows × 21 columns

Transform the data


In [5]:
# delete the 'TOTAL' row (a spreadsheet aggregate at the bottom, not a person)
if 'TOTAL' in df.index:
    df.drop('TOTAL', axis='index', inplace=True)

# convert the dataframe values (objects) to numeric; there are no categorical features
df = df.apply(pd.to_numeric, errors='coerce')

Missing values


In [6]:
helper.missing(df)


Even features with many missing values, like 'loan_advances', are needed to obtain better models; they are kept and imputed below
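helper.missing reports per-feature missingness; an equivalent check in plain pandas (a sketch, not the helper's actual code):

# fraction of missing values per feature, highest first
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head())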

Remove irrelevant features


In [7]:
df.drop('email_address', axis='columns', inplace=True)

Classify variables


In [8]:
num = list(df.select_dtypes(include=[np.number]))

df = helper.classify_data(df, target, numerical=num)

helper.get_types(df)


Numerical features: 	19
Categorical features: 	0
Target: 		poi (category)
Out[8]:
salary to_messages deferral_payments total_payments loan_advances bonus restricted_stock_deferred deferred_income total_stock_value expenses from_poi_to_this_person exercised_stock_options from_messages other from_this_person_to_poi long_term_incentive shared_receipt_with_poi restricted_stock director_fees poi
Type float32 float32 float32 float32 float32 float32 float32 float32 float32 float32 float32 float32 float32 float32 float32 float32 float32 float32 float32 category

Fill missing values


In [9]:
# Replace NaN values with the median of each column
df.fillna(df.median(), inplace=True)
#helper.fill_simple(df, target, inplace=True) # same result

Visualize the data


In [10]:
df.describe(percentiles=[0.5]).astype(int)


Out[10]:
salary to_messages deferral_payments total_payments loan_advances bonus restricted_stock_deferred deferred_income total_stock_value expenses from_poi_to_this_person exercised_stock_options from_messages other from_this_person_to_poi long_term_incentive shared_receipt_with_poi restricted_stock director_fees
count 145 145 145 145 145 145 145 145 145 145 145 145 145 145 145 145 145 145 145
mean 275172 1722 383687 2402823 2537413 1002369 -50907 -293981 3040758 51503 52 2455073 377 314211 27 567548 999 972059 104361
std 142866 2029 708602 8785497 6606450 1097889 1305242 575095 6112358 37235 68 4646612 1441 1122664 78 597613 930 1972272 14229
min 477 57 -102500 148 400000 70000 -1787380 -3504386 -44093 148 0 3285 12 2 0 69223 2 -2604490 3285
50% 258741 1211 221063 1100246 2000000 750000 -140264 -151927 1095040 46547 35 1297049 41 51984 8 422158 740 441096 106164
max 1111258 15149 6426990 103559792 81525000 8000000 15456290 -833 49110080 228763 528 34348384 14368 10359729 609 5145434 5521 14761694 137864

Numerical features


In [11]:
helper.show_numerical(df, kde=True, ncols=5)


Target vs Numerical features


In [12]:
helper.show_target_vs_numerical(df, target, jitter=0.05, point_size=50, ncols=5)


Total stock value vs some features


In [13]:
# df.plot.scatter(x='salary', y='total_stock_value')
# df.plot.scatter(x='long_term_incentive', y='total_stock_value')

# sns.lmplot(x="salary", y="total_stock_value", hue='poi', data=df)
# sns.lmplot(x="long_term_incentive", y="total_stock_value", hue='poi', data=df)

g = sns.PairGrid(
    df,
    y_vars=["total_stock_value"],
    x_vars=["salary", "long_term_incentive", "from_this_person_to_poi"],
    hue='poi',
    size=4)
g.map(sns.regplot).add_legend()
plt.ylim(ymin=0, ymax=0.5e8)

#sns.pairplot(df, hue='poi', vars=['long_term_incentive', 'total_stock_value', 'from_poi_to_this_person'], kind='reg', size=3)


Out[13]:
(0, 50000000.0)

Persons of interest seem to have a higher stock value relative to salary and long-term incentive, especially when the stock value is high. There is no clear dependency between being a POI and the number of emails exchanged with another person of interest.

Correlation between numerical features and target


In [14]:
helper.correlation(df, target)


2. Neural Network model

Select the features


In [15]:
droplist = []  # features to drop from the model

# the model works on a copy, 'data', instead of 'df'
data = df.copy()
data.drop(droplist, axis='columns', inplace=True)
data.head(3)


Out[15]:
salary to_messages deferral_payments total_payments loan_advances bonus restricted_stock_deferred deferred_income total_stock_value expenses from_poi_to_this_person exercised_stock_options from_messages other from_this_person_to_poi long_term_incentive shared_receipt_with_poi restricted_stock director_fees poi
ALLEN PHILLIP K 201955.0 2902.0 2869717.0 4484442.0 2000000.0 4175000.0 -126027.0 -3081055.0 1729541.0 13868.0 47.0 1729541.0 2195.0 152.0 65.0 304805.0 1407.0 126027.0 106164.5 False
BADUM JAMES P 258741.0 1211.0 178980.0 182466.0 2000000.0 750000.0 -140264.0 -151927.0 257817.0 3486.0 35.0 257817.0 41.0 51984.5 8.0 422158.0 740.5 441096.0 106164.5 False
BANNANTINE JAMES M 477.0 566.0 221063.5 916197.0 2000000.0 750000.0 -560222.0 -5104.0 5243487.0 56301.0 39.0 4046157.0 29.0 864523.0 0.0 422158.0 465.0 1757552.0 106164.5 False

Scale numerical features

Shift and scale the numerical variables to zero mean and unit variance. The scaling parameters are saved for use at prediction time.


In [16]:
data, scale_param = helper.scale(data)
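helper.scale is a project helper; a minimal sketch of what it is assumed to do (standardize each numerical column and keep the parameters for later predictions):

# assumed equivalent of helper.scale
num_cols = data.select_dtypes(include=[np.number]).columns
scale_param = {'mean': data[num_cols].mean(), 'std': data[num_cols].std()}
data[num_cols] = (data[num_cols] - scale_param['mean']) / scale_param['std']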

There are no categorical variables to encode

Split the data into training and test sets

Data leakage: the test set is hidden while training the model, but it was seen when preprocessing the dataset (imputation and scaling were fit on all samples); a leakage-free alternative is sketched after the split below

No validation set (small dataset)


In [17]:
test_size = 0.4
random_state = 9

x_train, y_train, x_test, y_test = helper.simple_split(data, target, True, test_size,
                                                       random_state)
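To remove the preprocessing leakage noted above, the scaler (and imputer) would have to be fit on the training split only, e.g. with sklearn (a sketch, not applied here):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train)   # statistics from the training set only
x_train_s = scaler.transform(x_train)
x_test_s = scaler.transform(x_test)      # reuse the training statistics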

Encode the output


In [18]:
y_train, y_test = helper.one_hot_output(y_train, y_test)
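helper.one_hot_output presumably one-hot encodes the boolean target into two columns (hence the (n, 2) shapes below); a sketch with Keras:

# assumed equivalent of helper.one_hot_output
from keras.utils import to_categorical
y_train = to_categorical(y_train.astype(int), num_classes=2)
y_test = to_categorical(y_test.astype(int), num_classes=2)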

In [19]:
print("train size \t X:{} \t Y:{}".format(x_train.shape, y_train.shape))
print("test size  \t X:{} \t Y:{} ".format(x_test.shape, y_test.shape))


train size 	 X:(87, 19) 	 Y:(87, 2)
test size  	 X:(58, 19) 	 Y:(58, 2) 

Build a dummy classifier


In [20]:
helper.dummy_clf(x_train, y_train, x_test, y_test)


Confusion matrix: 
 [[51  0]
 [ 7  0]]
Out[20]:
Loss Accuracy Precision Recall ROC-AUC F1-score
Dummy 4.17 0.88 0.0 0 0 0
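helper.dummy_clf is a project helper; a plain-sklearn equivalent of this majority-class baseline (a sketch):

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent').fit(x_train, y_train[:, 1])
print(dummy.score(x_test, y_test[:, 1]))   # ≈ 0.88, accuracy of always predicting non-POI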

Build the Neural Network for Binary Classification


In [21]:
# class weights for the imbalanced target

cw = helper.get_class_weight(y_train[:,1])


{0: 0.5723684210526315, 1: 3.9545454545454546}
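The printed weights match sklearn's 'balanced' heuristic, n_samples / (n_classes * n_samples_in_class); a sketch of the assumed computation:

# assumed equivalent of helper.get_class_weight
y = y_train[:, 1]
cw = {0: len(y) / (2. * (y == 0).sum()),
      1: len(y) / (2. * (y == 1).sum())}   # {0: 0.57, 1: 3.95}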

In [29]:
model_path = os.path.join("models", "enron_scandal.h5")

model = None
model = helper.build_nn_clf(x_train.shape[1], y_train.shape[1], dropout=0.3, summary=True)

helper.train_nn(model, x_train, y_train, class_weight=cw, path=model_path)

from sklearn.metrics import roc_auc_score
y_pred_train = model.predict(x_train, verbose=0)
print('\nROC_AUC train:\t{:.2f} \n'.format(roc_auc_score(y_train, y_pred_train)))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_5 (Dense)              (None, 19)                380       
_________________________________________________________________
dropout_3 (Dropout)          (None, 19)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 40        
=================================================================
Total params: 420
Trainable params: 420
Non-trainable params: 0
_________________________________________________________________
Training ....
time: 	 0.5 s
Training loss:  	0.5039

Training accuracy: 	0.793

Model saved at models/enron_scandal.h5

ROC_AUC train:	0.89 
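Judging from the layer summary and parameter counts, helper.build_nn_clf appears to build a single hidden Dense layer with dropout; a hedged reconstruction (the activations, optimizer, and loss are assumptions):

# assumed equivalent of helper.build_nn_clf
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_nn_clf_sketch(n_input, n_output, dropout=0.3):
    model = Sequential([
        Dense(n_input, activation='relu', input_shape=(n_input,)),  # 19*19+19 = 380 params
        Dropout(dropout),
        Dense(n_output, activation='softmax')                       # 19*2+2 = 40 params
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model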

Evaluate the model


In [30]:
# Dataset too small for separate train, validation, and test sets. More data is needed for a proper evaluation
y_pred = model.predict(x_test, verbose=0)

helper.binary_classification_scores(y_test[:, 1], y_pred[:, 1], return_dataframe=True, index="DNN")


Confusion matrix: 
 [[40 11]
 [ 2  5]]
Out[30]:
Loss Accuracy Precision Recall ROC-AUC F1-score
DNN 0.52 0.78 0.31 0.71 0.77 0.43

Compare with non-neural network models


In [31]:
helper.ml_classification(x_train, y_train[:,1], x_test, y_test[:,1])


Naive Bayes
AdaBoost
Decision Tree
Random Forest
Extremely Randomized Trees
Out[31]:
Time (s) Loss Accuracy Precision Recall ROC-AUC F1-score
Decision Tree 0.00 5.36 0.84 0.38 0.43 0.67 0.40
Random Forest 0.08 0.31 0.84 0.00 0.00 0.00 0.00
Extremely Randomized Trees 0.08 0.30 0.84 0.00 0.00 0.00 0.00
AdaBoost 0.06 0.50 0.81 0.25 0.29 0.73 0.27
Naive Bayes 0.00 4.88 0.74 0.25 0.57 0.83 0.35
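helper.ml_classification presumably fits a list of sklearn classifiers and scores them on the test set; a minimal sketch of such a loop (model list and metrics abridged):

# assumed sketch of the comparison loop
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score

for name, clf in [('Naive Bayes', GaussianNB()),
                  ('AdaBoost', AdaBoostClassifier()),
                  ('Random Forest', RandomForestClassifier())]:
    clf.fit(x_train, y_train[:, 1])
    proba = clf.predict_proba(x_test)[:, 1]
    print('{}\tROC-AUC: {:.2f}'.format(name, roc_auc_score(y_test[:, 1], proba)))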