Exercise 08

  • Fraud Detection Dataset from Microsoft Azure: data

Fraud detection is one of the earliest industrial applications of data mining and machine learning. It is typically handled as a binary classification problem, but the class distribution is highly imbalanced because fraudulent transactions are very rare compared to the overall volume of transactions. Moreover, once fraudulent transactions are discovered, the business typically blocks the affected accounts from transacting to prevent further losses.


In [1]:
import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/fraud_detection.csv.zip', 'r') as z:
    # Read the CSV directly from the zip archive; the first column is the row index
    with z.open('15_fraud_detection.csv') as f:
        data = pd.read_csv(f, index_col=0)
data.head()


Out[1]:
accountAge digitalItemCount sumPurchaseCount1Day sumPurchaseAmount1Day sumPurchaseAmount30Day paymentBillingPostalCode - LogOddsForClass_0 accountPostalCode - LogOddsForClass_0 paymentBillingState - LogOddsForClass_0 accountState - LogOddsForClass_0 paymentInstrumentAgeInAccount ipState - LogOddsForClass_0 transactionAmount transactionAmountUSD ipPostalCode - LogOddsForClass_0 localHour - LogOddsForClass_0 Label
0 2000 0 0 0.00 720.25 5.064533 0.421214 1.312186 0.566395 3279.574306 1.218157 599.00 626.164650 1.259543 4.745402 0
1 62 1 1 1185.44 2530.37 0.538996 0.481838 4.401370 4.500157 61.970139 4.035601 1185.44 1185.440000 3.981118 4.921349 0
2 2000 0 0 0.00 0.00 5.064533 5.096396 3.056357 3.155226 0.000000 3.314186 32.09 32.090000 5.008490 4.742303 0
3 1 1 0 0.00 0.00 5.064533 5.096396 3.331154 3.331239 0.000000 3.529398 133.28 132.729554 1.324925 4.745402 0
4 1 1 0 0.00 132.73 5.412885 0.342945 5.563677 4.086965 0.001389 3.529398 543.66 543.660000 2.693451 4.876771 0

In [2]:
# Separate the features from the target column and check the class balance
X = data.drop(['Label'], axis=1)
y = data['Label']
y.value_counts(normalize=True)


Out[2]:
0    0.994255
1    0.005745
Name: Label, dtype: float64
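
Only about 0.57% of the transactions are labelled as fraud, so a trivial model that always predicts the majority class already reaches roughly 0.994 accuracy. As a quick sanity check (not part of the original exercise), a minimal sketch using scikit-learn's DummyClassifier:

In [ ]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Baseline that always predicts the majority class (non-fraud)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_tr, y_tr)
y_base = baseline.predict(X_te)

# Accuracy looks excellent, yet F1 on the fraud class is 0
print(accuracy_score(y_te, y_base), f1_score(y_te, y_base))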

Exercise 08.1

Estimate Logistic Regression, Gaussian Naive Bayes, k-nearest neighbors, and Decision Tree classifiers

Evaluate using the following metrics:

  • Accuracy
  • F1-Score
  • F_Beta-Score (Beta=10)

Comment on the results

Combine the classifiers and comment on the combined performance


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

In [11]:
models = {'lr': LogisticRegression(),
          'dt': DecisionTreeClassifier(),
          'nb': GaussianNB(),
          'nn': KNeighborsClassifier()}

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
# Train all the models
for model in models.keys():
    models[model].fit(X_train, y_train)
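
Because fraud cases are so rare, an unstratified split can leave the test set with a slightly different fraud rate than the training set. A sketch of the same split with stratification, shown here only as an alternative to the split used above:

In [ ]:
from sklearn.model_selection import train_test_split

# stratify=y preserves the ~0.57% fraud rate in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2, stratify=y)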

In [12]:
# predict test for each model
y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())
for model in models.keys():
    y_pred[model] = models[model].predict(X_test)
y_pred.sample(10)


Out[12]:
lr dt nb nn
111018 0 0 0 0
120018 0 0 0 0
24895 0 0 0 0
23525 0 0 0 0
29535 0 0 0 0
52150 0 0 1 0
127077 0 0 0 0
83261 0 0 0 0
26716 0 0 0 0
45260 0 0 0 0

In [13]:
# Majority vote: flag a transaction as fraud only when more than half of the four models agree (at least 3 of 4)
y_pred_ensemble1 = (y_pred.mean(axis=1) > 0.5).astype(int)

In [14]:
# Fraction of test transactions flagged as fraud by the majority vote
y_pred_ensemble1.mean()


Out[14]:
0.00020183962400161472

In [15]:
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

In [16]:
stats = {'acc': accuracy_score,
         'f1': f1_score,
         'rec': recall_score,
         'pre': precision_score}
res = pd.DataFrame(index=models.keys(), columns=stats.keys())

In [17]:
for model in models.keys():
    for stat in stats.keys():
        res.loc[model, stat] = stats[stat](y_test, y_pred[model])

In [18]:
res


Out[18]:
acc f1 rec pre
lr 0.993829 0 0 0
dt 0.987918 0.121593 0.136792 0.109434
nb 0.923647 0.0314557 0.20283 0.01705
nn 0.993714 0.0840336 0.0471698 0.384615

In [19]:
res.loc['ensemble1'] = 0
for stat in stats.keys():
    res.loc['ensemble1', stat] = stats[stat](y_test, y_pred_ensemble1)

In [20]:
res


Out[20]:
acc f1 rec pre
lr 0.993829 0 0 0
dt 0.987918 0.121593 0.136792 0.109434
nb 0.923647 0.0314557 0.20283 0.01705
nn 0.993714 0.0840336 0.0471698 0.384615
ensemble1 0.994031 0.0547945 0.0283019 0.857143
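
The prompt also asks for the F_beta score with beta=10, which weights recall far more heavily than precision; it is not computed above. A minimal sketch using sklearn.metrics.fbeta_score, reusing y_test, y_pred and y_pred_ensemble1 from the cells above:

In [ ]:
from sklearn.metrics import fbeta_score

# beta=10 emphasises recall: missing a fraud is treated as far worse than a false alarm
for model in models.keys():
    print(model, fbeta_score(y_test, y_pred[model], beta=10))
print('ensemble1', fbeta_score(y_test, y_pred_ensemble1, beta=10))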

Exercise 08.2

Apply random undersampling with a target percentage of 0.5 (i.e. a balanced 50/50 class split in the training data)

How do the results change?
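
A minimal sketch of one way to undersample with plain pandas, keeping every fraud case in the training set and drawing an equally sized random sample of non-fraud cases. The variable names (X_train_under, y_train_under) are introduced here for illustration; a dedicated library such as imbalanced-learn could be used instead:

In [ ]:
# Keep all fraudulent training examples ...
fraud_idx = y_train[y_train == 1].index
# ... and an equally sized random sample of non-fraudulent ones (target percentage 0.5)
nonfraud_idx = y_train[y_train == 0].sample(n=len(fraud_idx), random_state=2).index

under_idx = fraud_idx.union(nonfraud_idx)
X_train_under = X_train.loc[under_idx]
y_train_under = y_train.loc[under_idx]

# Re-train the same models on the balanced sample; evaluation still uses the untouched test set
for model in models.keys():
    models[model].fit(X_train_under, y_train_under)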


In [ ]:

Exercise 08.3

For each model, estimate a BaggingClassifier with 100 estimators using the under-sampled dataset
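
A sketch of how the bagging step could look, assuming the under-sampled training set (X_train_under, y_train_under) from the sketch in Exercise 08.2; the bagged_models name is introduced here for illustration. The base estimator is passed positionally because its keyword name (base_estimator vs. estimator) changed across scikit-learn versions:

In [ ]:
from sklearn.ensemble import BaggingClassifier

# Wrap each base model in a bagging ensemble of 100 estimators
bagged_models = {name: BaggingClassifier(model, n_estimators=100)
                 for name, model in models.items()}

for name, bagged in bagged_models.items():
    bagged.fit(X_train_under, y_train_under)
    print(name, 'bagged F1:', f1_score(y_test, bagged.predict(X_test)))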


In [ ]:

Exercise 08.4

Using the under-sampled dataset

Evaluate a RandomForestClassifier and compare the results

Change n_estimators to 100. What happens?
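
A sketch of the comparison, again assuming X_train_under and y_train_under from the Exercise 08.2 sketch; n_estimators is set explicitly so the jump from 10 to 100 trees is visible regardless of the scikit-learn default:

In [ ]:
from sklearn.ensemble import RandomForestClassifier

for n in (10, 100):
    rf = RandomForestClassifier(n_estimators=n, random_state=2)
    rf.fit(X_train_under, y_train_under)
    print(n, 'trees -> F1:', f1_score(y_test, rf.predict(X_test)))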


In [ ]: