Exercise 08

  • Fraud Detection Dataset from Microsoft Azure: data

Fraud detection is one of the earliest industrial applications of data mining and machine learning. It is typically handled as a binary classification problem, but the class distribution is highly imbalanced because fraudulent transactions are very rare compared to the overall volume of transactions. Moreover, once fraudulent transactions are discovered, the business typically blocks the affected accounts from transacting to prevent further losses.


In [1]:
import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:
    # Read the CSV directly from the zip archive; the first column is the row index
    with z.open('15_fraud_detection.csv') as f:
        data = pd.read_csv(f, index_col=0)
data.head()


Out[1]:
accountAge digitalItemCount sumPurchaseCount1Day sumPurchaseAmount1Day sumPurchaseAmount30Day paymentBillingPostalCode - LogOddsForClass_0 accountPostalCode - LogOddsForClass_0 paymentBillingState - LogOddsForClass_0 accountState - LogOddsForClass_0 paymentInstrumentAgeInAccount ipState - LogOddsForClass_0 transactionAmount transactionAmountUSD ipPostalCode - LogOddsForClass_0 localHour - LogOddsForClass_0 Label
0 2000 0 0 0.00 720.25 5.064533 0.421214 1.312186 0.566395 3279.574306 1.218157 599.00 626.164650 1.259543 4.745402 0
1 62 1 1 1185.44 2530.37 0.538996 0.481838 4.401370 4.500157 61.970139 4.035601 1185.44 1185.440000 3.981118 4.921349 0
2 2000 0 0 0.00 0.00 5.064533 5.096396 3.056357 3.155226 0.000000 3.314186 32.09 32.090000 5.008490 4.742303 0
3 1 1 0 0.00 0.00 5.064533 5.096396 3.331154 3.331239 0.000000 3.529398 133.28 132.729554 1.324925 4.745402 0
4 1 1 0 0.00 132.73 5.412885 0.342945 5.563677 4.086965 0.001389 3.529398 543.66 543.660000 2.693451 4.876771 0

In [2]:
# Separate the features from the target column and check the class balance
X = data.drop(['Label'], axis=1)
y = data['Label']
y.value_counts(normalize=True)


Out[2]:
0    0.994255
1    0.005745
Name: Label, dtype: float64
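
Only about 0.57% of the transactions are labelled as fraud, so a trivial model that always predicts the majority class already reaches roughly 0.994 accuracy. As a quick sanity check (not part of the original exercise), a minimal sketch using scikit-learn's DummyClassifier:

In [ ]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Baseline that always predicts the majority class (non-fraud)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_tr, y_tr)
y_base = baseline.predict(X_te)

# Accuracy looks excellent, yet F1 on the fraud class is 0
print(accuracy_score(y_te, y_base), f1_score(y_te, y_base))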

Exercise 08.1

Estimate Logistic Regression, Gaussian Naive Bayes, k-nearest neighbors, and Decision Tree classifiers

Evaluate using the following metrics:

  • Accuracy
  • F1-Score
  • F_Beta-Score (Beta=10)

Comment on the results

Combine the classifiers and comment on the combined performance


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

In [11]:
models = {'lr': LogisticRegression(),
          'dt': DecisionTreeClassifier(),
          'nb': GaussianNB(),
          'nn': KNeighborsClassifier()}

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
# Train all the models
for model in models.keys():
    models[model].fit(X_train, y_train)
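
Because fraud cases are so rare, an unstratified split can leave the test set with a slightly different fraud rate than the training set. A sketch of the same split with stratification, shown here only as an alternative to the split used above:

In [ ]:
from sklearn.model_selection import train_test_split

# stratify=y preserves the ~0.57% fraud rate in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2, stratify=y)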

In [12]:
# predict test for each model
y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())
for model in models.keys():
    y_pred[model] = models[model].predict(X_test)
y_pred.sample(10)


Out[12]:
lr dt nb nn
111018 0 0 0 0
120018 0 0 0 0
24895 0 0 0 0
23525 0 0 0 0
29535 0 0 0 0
52150 0 0 1 0
127077 0 0 0 0
83261 0 0 0 0
26716 0 0 0 0
45260 0 0 0 0

In [13]:
# Majority vote: flag a transaction as fraud only when more than half of the four models agree (at least 3 of 4)
y_pred_ensemble1 = (y_pred.mean(axis=1) > 0.5).astype(int)

In [14]:
# Fraction of test transactions flagged as fraud by the majority vote
y_pred_ensemble1.mean()


Out[14]:
0.00020183962400161472

In [15]:
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

In [16]:
stats = {'acc': accuracy_score,
         'f1': f1_score,
         'rec': recall_score,
         'pre': precision_score}
res = pd.DataFrame(index=models.keys(), columns=stats.keys())

In [17]:
for model in models.keys():
    for stat in stats.keys():
        res.loc[model, stat] = stats[stat](y_test, y_pred[model])

In [18]:
res


Out[18]:
acc f1 rec pre
lr 0.993829 0 0 0
dt 0.987918 0.121593 0.136792 0.109434
nb 0.923647 0.0314557 0.20283 0.01705
nn 0.993714 0.0840336 0.0471698 0.384615

In [19]:
res.loc['ensemble1'] = 0
for stat in stats.keys():
    res.loc['ensemble1', stat] = stats[stat](y_test, y_pred_ensemble1)

In [20]:
res


Out[20]:
acc f1 rec pre
lr 0.993829 0 0 0
dt 0.987918 0.121593 0.136792 0.109434
nb 0.923647 0.0314557 0.20283 0.01705
nn 0.993714 0.0840336 0.0471698 0.384615
ensemble1 0.994031 0.0547945 0.0283019 0.857143
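
The prompt also asks for the F_beta score with beta=10, which weights recall far more heavily than precision; it is not computed above. A minimal sketch using sklearn.metrics.fbeta_score, reusing y_test, y_pred and y_pred_ensemble1 from the cells above:

In [ ]:
from sklearn.metrics import fbeta_score

# beta=10 emphasises recall: missing a fraud is treated as far worse than a false alarm
for model in models.keys():
    print(model, fbeta_score(y_test, y_pred[model], beta=10))
print('ensemble1', fbeta_score(y_test, y_pred_ensemble1, beta=10))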

Exercise 08.2

Apply random undersampling with a target percentage of 0.5 (i.e. a balanced 50/50 class split in the training data)

How do the results change?
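
A minimal sketch of one way to undersample with plain pandas, keeping every fraud case in the training set and drawing an equally sized random sample of non-fraud cases. The variable names (X_train_under, y_train_under) are introduced here for illustration; a dedicated library such as imbalanced-learn could be used instead:

In [ ]:
# Keep all fraudulent training examples ...
fraud_idx = y_train[y_train == 1].index
# ... and an equally sized random sample of non-fraudulent ones (target percentage 0.5)
nonfraud_idx = y_train[y_train == 0].sample(n=len(fraud_idx), random_state=2).index

under_idx = fraud_idx.union(nonfraud_idx)
X_train_under = X_train.loc[under_idx]
y_train_under = y_train.loc[under_idx]

# Re-train the same models on the balanced sample; evaluation still uses the untouched test set
for model in models.keys():
    models[model].fit(X_train_under, y_train_under)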


In [ ]:

Exercise 08.3

For each model, estimate a BaggingClassifier with 100 estimators using the under-sampled dataset
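
A sketch of how the bagging step could look, assuming the under-sampled training set (X_train_under, y_train_under) from the sketch in Exercise 08.2; the bagged_models name is introduced here for illustration. The base estimator is passed positionally because its keyword name (base_estimator vs. estimator) changed across scikit-learn versions:

In [ ]:
from sklearn.ensemble import BaggingClassifier

# Wrap each base model in a bagging ensemble of 100 estimators
bagged_models = {name: BaggingClassifier(model, n_estimators=100)
                 for name, model in models.items()}

for name, bagged in bagged_models.items():
    bagged.fit(X_train_under, y_train_under)
    print(name, 'bagged F1:', f1_score(y_test, bagged.predict(X_test)))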


In [ ]:

Exercise 08.4

Using the under-sampled dataset

Evaluate a RandomForestClassifier and compare the results

Change n_estimators to 100. What happens?
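
A sketch of the comparison, again assuming X_train_under and y_train_under from the Exercise 08.2 sketch; n_estimators is set explicitly so the jump from 10 to 100 trees is visible regardless of the scikit-learn default:

In [ ]:
from sklearn.ensemble import RandomForestClassifier

for n in (10, 100):
    rf = RandomForestClassifier(n_estimators=n, random_state=2)
    rf.fit(X_train_under, y_train_under)
    print(n, 'trees -> F1:', f1_score(y_test, rf.predict(X_test)))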


In [ ]: