Exercise 15

Fraud Detection

Introduction

  • Fraud Detection Dataset from Microsoft Azure: data

Fraud detection is one of the earliest industrial applications of data mining and machine learning. It is typically framed as a binary classification problem, but the classes are highly imbalanced: fraudulent transactions are very rare compared to the overall volume of transactions. Moreover, once fraudulent transactions are discovered, the business typically blocks the affected accounts from transacting to prevent further losses.


In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [3]:
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/15_fraud_detection.csv.zip'
df = pd.read_csv(url, index_col=0)
df.head()


Out[3]:
accountAge digitalItemCount sumPurchaseCount1Day sumPurchaseAmount1Day sumPurchaseAmount30Day paymentBillingPostalCode - LogOddsForClass_0 accountPostalCode - LogOddsForClass_0 paymentBillingState - LogOddsForClass_0 accountState - LogOddsForClass_0 paymentInstrumentAgeInAccount ipState - LogOddsForClass_0 transactionAmount transactionAmountUSD ipPostalCode - LogOddsForClass_0 localHour - LogOddsForClass_0 Label
0 2000 0 0 0.00 720.25 5.064533 0.421214 1.312186 0.566395 3279.574306 1.218157 599.00 626.164650 1.259543 4.745402 0
1 62 1 1 1185.44 2530.37 0.538996 0.481838 4.401370 4.500157 61.970139 4.035601 1185.44 1185.440000 3.981118 4.921349 0
2 2000 0 0 0.00 0.00 5.064533 5.096396 3.056357 3.155226 0.000000 3.314186 32.09 32.090000 5.008490 4.742303 0
3 1 1 0 0.00 0.00 5.064533 5.096396 3.331154 3.331239 0.000000 3.529398 133.28 132.729554 1.324925 4.745402 0
4 1 1 0 0.00 132.73 5.412885 0.342945 5.563677 4.086965 0.001389 3.529398 543.66 543.660000 2.693451 4.876771 0

In [6]:
df.shape, df.Label.sum(), df.Label.mean()


Out[6]:
((138721, 16), 797, 0.0057453449730033666)

Exercise 15.1

Estimate a Logistic Regression and a Decision Tree

Evaluate using the following metrics:

  • Accuracy
  • F1-Score
  • F-Beta Score (beta = 10)

Comment on the results


In [ ]:
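A possible starting point is sketched below. It trains both models and computes the three requested metrics, but on a synthetic imbalanced dataset built with `make_classification` standing in for the fraud data; to use the real data, substitute `X = df.drop('Label', axis=1)` and `y = df.Label`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, fbeta_score

# Synthetic stand-in for the fraud data: ~1% positives
X, y = make_classification(n_samples=10000, n_features=15,
                           weights=[0.99], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}
results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'f10': fbeta_score(y_test, y_pred, beta=10),
    }
for name, res in results.items():
    print(name, res)
```

Note that with ~1% positives, accuracy is high even for a model that predicts everything as legitimate, which is why the F1 and F-beta scores (which weight recall heavily when beta = 10) matter here.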

Exercise 15.2

Under-sample the negative class using random under-sampling

Which value of target_percentage did you choose? How do the results change?

Apply under-sampling only to the training set, and evaluate on the full test set


In [ ]:
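One way to implement random under-sampling by hand is sketched below; `random_under_sample` is our own helper, not a library function, and `target_percentage` follows the exercise's naming for the desired share of positives. It expects NumPy arrays (use `df.drop('Label', axis=1).values` and `df.Label.values`), and should be applied only to the training split.

```python
import numpy as np

def random_under_sample(X, y, target_percentage=0.5, seed=42):
    """Randomly drop negatives until positives make up target_percentage."""
    rng = np.random.RandomState(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # negatives to keep so that pos / (pos + neg) == target_percentage
    n_neg = int(len(pos_idx) * (1 - target_percentage) / target_percentage)
    keep_neg = rng.choice(neg_idx, size=min(n_neg, len(neg_idx)), replace=False)
    idx = np.concatenate([pos_idx, keep_neg])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Demo on synthetic data (10 positives out of 1000), standing in for a training split
rng = np.random.RandomState(0)
X_demo = rng.randn(1000, 3)
y_demo = np.zeros(1000, dtype=int)
y_demo[:10] = 1
X_res, y_res = random_under_sample(X_demo, y_demo, target_percentage=0.5)
print(len(y_res), y_res.mean())   # 20 rows, half of them positive
```

All positives are kept; only the negatives are subsampled, so information is discarded from the majority class.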

Exercise 15.3

Repeat the same analysis using random over-sampling


In [ ]:
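A matching hand-rolled sketch for random over-sampling follows; again `random_over_sample` and `target_percentage` are our own names, and the resampling should only ever touch the training split.

```python
import numpy as np

def random_over_sample(X, y, target_percentage=0.5, seed=42):
    """Duplicate positives (sampling with replacement) until they
    make up target_percentage of the data."""
    rng = np.random.RandomState(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # positives needed so that pos / (pos + neg) == target_percentage
    n_pos = int(len(neg_idx) * target_percentage / (1 - target_percentage))
    extra = rng.choice(pos_idx, size=max(n_pos - len(pos_idx), 0), replace=True)
    idx = np.concatenate([neg_idx, pos_idx, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Demo: 990 negatives + 10 positives -> balanced set of 1980 rows
rng = np.random.RandomState(0)
X_demo = rng.randn(1000, 3)
y_demo = np.zeros(1000, dtype=int)
y_demo[:10] = 1
X_res, y_res = random_over_sample(X_demo, y_demo, target_percentage=0.5)
print(len(y_res), y_res.mean())
```

Unlike under-sampling, no rows are discarded, but the exact duplicates of minority rows make overfitting on them more likely, especially for decision trees.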

Exercise 15.4 (3 points)

Evaluate the results using SMOTE

Which parameters did you choose?


In [ ]:
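In practice you would likely use `imblearn.over_sampling.SMOTE`; the sketch below is a minimal hand-rolled version of the same idea (interpolating between a minority sample and one of its k nearest minority neighbours), demonstrated on synthetic clusters rather than the fraud data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X, y, n_synthetic=100, k=5, seed=42):
    """Minimal SMOTE sketch: synthesize positives by interpolating between
    minority samples and their k nearest minority neighbours."""
    rng = np.random.RandomState(seed)
    X_min = X[y == 1]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)                  # column 0 = the point itself
    base = rng.randint(0, len(X_min), size=n_synthetic)
    partner = neigh[base, rng.randint(1, k + 1, size=n_synthetic)]
    gap = rng.rand(n_synthetic, 1)                   # random point on the segment
    X_new = X_min[base] + gap * (X_min[partner] - X_min[base])
    return (np.vstack([X, X_new]),
            np.concatenate([y, np.ones(n_synthetic, dtype=y.dtype)]))

# Demo on two Gaussian clusters (200 negatives, 30 positives)
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(200, 2), rng.randn(30, 2) + 3])
y_demo = np.concatenate([np.zeros(200, dtype=int), np.ones(30, dtype=int)])
X_res, y_res = smote(X_demo, y_demo, n_synthetic=100, k=5)
print(X_res.shape, int(y_res.sum()))
```

The main parameters to report for the exercise are k (the neighbourhood size) and the number (or ratio) of synthetic samples generated.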

Exercise 15.5 (3 points)

Evaluate the results using Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN)

  • http://www.ele.uri.edu/faculty/he/PDFfiles/adasyn.pdf
  • https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.ADASYN.html#rf9172e970ca5-1


In [ ]:
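Again, `imblearn.over_sampling.ADASYN` is the natural choice in practice; the sketch below is a simplified hand-rolled version of the idea in the linked paper, namely SMOTE-style interpolation where the number of synthetic points per minority sample grows with the share of majority samples in its neighbourhood.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn(X, y, n_synthetic=100, k=5, seed=42):
    """Minimal ADASYN sketch: like SMOTE, but allocate more synthetic points
    to minority samples whose k-neighbourhoods contain more majority samples."""
    rng = np.random.RandomState(seed)
    X_min = X[y == 1]
    # r_i = share of majority samples among each minority point's k neighbours
    _, neigh_all = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X_min)
    r = (y[neigh_all[:, 1:]] == 0).mean(axis=1)
    if r.sum() == 0:
        return X, y                                   # nothing near the boundary
    counts = rng.multinomial(n_synthetic, r / r.sum())
    # interpolate towards random minority neighbours, as in SMOTE
    _, neigh_min = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    new = []
    for i, c in enumerate(counts):
        for _ in range(c):
            j = neigh_min[i, rng.randint(1, k + 1)]
            new.append(X_min[i] + rng.rand() * (X_min[j] - X_min[i]))
    return (np.vstack([X, new]),
            np.concatenate([y, np.ones(len(new), dtype=y.dtype)]))

# Demo on overlapping clusters (200 negatives, 30 positives)
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(200, 2), rng.randn(30, 2) + 0.5])
y_demo = np.concatenate([np.zeros(200, dtype=int), np.ones(30, dtype=int)])
X_res, y_res = adasyn(X_demo, y_demo, n_synthetic=100, k=5)
print(X_res.shape, int(y_res.sum()))
```

Compared with plain SMOTE, ADASYN concentrates the synthetic points near the class boundary, where the classifier most needs them.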

Exercise 15.6 (3 points)

Compare and comment on the results


In [ ]: