al.bahnsen@gmail.com
http://github.com/albahnsen
http://linkedin.com/in/albahnsen
@albahnsen
Estimate the probability of a transaction being fraudulent by analyzing customer patterns and recent fraudulent behavior
Issues when constructing a fraud detection system:
Different machine learning methods are used in practice and in the literature: logistic regression, neural networks, discriminant analysis, genetic programming, decision trees, and random forests, among others
Formally, a fraud detection system is a statistical model that estimates the probability of transaction $i$ being a fraud ($y_i=1$):
$$\hat p_i=P(y_i=1|\mathbf{x}_i)$$
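In scikit-learn terms, $\hat p_i$ is the positive-class column of a classifier's predicted probabilities; a minimal sketch, assuming a fitted scikit-learn-style classifier clf and a feature matrix X (both hypothetical at this point):

# \hat p_i: estimated probability that each transaction is fraudulent
p_hat = clf.predict_proba(X)[:, 1]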
In [1]:
import pandas as pd
import numpy as np
from costcla import datasets
In [2]:
from costcla.datasets.base import Bunch

def load_fraud(cost_mat_parameters=dict(Ca=10)):
    # data_ = pd.read_pickle("trx_fraud_data.pk")
    data_ = pd.read_pickle("/home/al/DriveAl/EasySol/Projects/DetectTA/Tests/trx_fraud_data_v3_agg.pk")
    target = data_['fraud'].values
    data = data_.drop('fraud', axis=1)
    n_samples = data.shape[0]
    # Cost matrix columns: [C_FP, C_FN, C_TP, C_TN]
    cost_mat = np.zeros((n_samples, 4))
    cost_mat[:, 0] = cost_mat_parameters['Ca']  # C_FP: administrative cost
    cost_mat[:, 1] = data['amount']             # C_FN: transaction amount
    cost_mat[:, 2] = cost_mat_parameters['Ca']  # C_TP: administrative cost
    cost_mat[:, 3] = 0.0                        # C_TN: no cost
    return Bunch(data=data.values, target=target, cost_mat=cost_mat,
                 target_names=['Legitimate Trx', 'Fraudulent Trx'], DESCR='',
                 feature_names=data.columns.values, name='FraudDetection')

datasets.load_fraud = load_fraud
In [3]:
data = datasets.load_fraud()
In [4]:
print(data.keys())
print('Number of examples ', data.target.shape[0])
In [5]:
target = pd.Series(data.target).value_counts().to_frame('Frequency')
target['Percentage'] = (target['Frequency'] / target['Frequency'].sum()) * 100
target.index = ['Negative (Legitimate Trx)', 'Positive (Fraud Trx)']
target.loc['Total Trx'] = [data.target.shape[0], 100.]
print(target)
In [6]:
pd.DataFrame(data.feature_names[:4], columns=('Features',))
Out[6]:
In [7]:
df = pd.DataFrame(data.data[:, :4], columns=data.feature_names[:4])
df.head(10)
Out[7]:
In [8]:
df = pd.DataFrame(data.data[:, 4:], columns=data.feature_names[4:])
df.head(10)
Out[8]:
In [9]:
from sklearn.model_selection import train_test_split

X = data.data[:, [2, 3] + list(range(4, data.data.shape[1]))].astype(float)
X_train, X_test, y_train, y_test, cost_mat_train, cost_mat_test = \
    train_test_split(X, data.target, data.cost_mat, test_size=0.33, random_state=10)
In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

classifiers = {"RF": {"f": RandomForestClassifier()},
               "DT": {"f": DecisionTreeClassifier()}}
ci_models = ['DT', 'RF']

# Fit the classifiers using the training dataset
for model in classifiers.keys():
    classifiers[model]["f"].fit(X_train, y_train)
    classifiers[model]["c"] = classifiers[model]["f"].predict(X_test)
    classifiers[model]["p"] = classifiers[model]["f"].predict_proba(X_test)
    classifiers[model]["p_train"] = classifiers[model]["f"].predict_proba(X_train)
In [11]:
import warnings
warnings.filterwarnings('ignore')
In [12]:
%matplotlib inline
import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize
import seaborn as sns
colors = sns.color_palette()
figsize(12, 8)
In [13]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

measures = {"F1Score": f1_score, "Precision": precision_score,
            "Recall": recall_score, "Accuracy": accuracy_score}

results = pd.DataFrame(columns=measures.keys())
for model in ci_models:
    results.loc[model] = [measures[measure](y_test, classifiers[model]["c"])
                          for measure in measures.keys()]
In [14]:
def fig_acc():
    plt.bar(np.arange(results.shape[0]) - 0.3, results['Accuracy'], 0.6,
            label='Accuracy', color=colors[0])
    plt.xticks(range(results.shape[0]), results.index)
    plt.tick_params(labelsize=22)
    plt.title('Accuracy', size=30)
    plt.show()
In [15]:
fig_acc()
In [16]:
def fig_f1():
    plt.bar(np.arange(results.shape[0]) - 0.3, results['Precision'], 0.2,
            label='Precision', color=colors[0])
    plt.bar(np.arange(results.shape[0]) - 0.1, results['Recall'], 0.2,
            label='Recall', color=colors[1])
    plt.bar(np.arange(results.shape[0]) + 0.1, results['F1Score'], 0.2,
            label='F1Score', color=colors[2])
    plt.xticks(range(results.shape[0]), results.index)
    plt.tick_params(labelsize=22)
    plt.ylim([0, 1])
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), fontsize=22)
    plt.show()
In [17]:
fig_f1()
| | Actual Positive ($y_i=1$) | Actual Negative ($y_i=0$) |
|---|---|---|
| Pred. Positive ($c_i=1$) | $C_{TP_i}=C_a$ | $C_{FP_i}=C_a$ |
| Pred. Negative ($c_i=0$) | $C_{FN_i}=Amt_i$ | $C_{TN_i}=0$ |
Where $C_a$ is the administrative cost of investigating a transaction and $Amt_i$ is the amount of transaction $i$. For example, a transaction of $Amt_i = 200$ with $C_a = 10$ has the cost row $[C_{FP_i}, C_{FN_i}, C_{TP_i}, C_{TN_i}] = [10, 200, 10, 0]$.
For more info see [Correa Bahnsen et al., 2014]
In [18]:
# The cost matrix is already calculated for the dataset
# Columns: [C_FP, C_FN, C_TP, C_TN]
print(data.cost_mat[[10, 17, 50]])
The financial cost of using a classifier $f$ on $\mathcal{S}$ is calculated by
$$ Cost(f(\mathcal{S})) = \sum_{i=1}^N y_i(1-c_i)C_{FN_i} + (1-y_i)c_i C_{FP_i}.$$
Then the financial savings are defined as the cost reduction achieved by the algorithm relative to using no algorithm at all:
$$ Savings(f(\mathcal{S})) = \frac{ Cost_l(\mathcal{S}) - Cost(f(\mathcal{S}))} {Cost_l(\mathcal{S})},$$
where $Cost_l(\mathcal{S})$ is the cost of the costless class, i.e., the lower of the costs of predicting all transactions as legitimate or all as fraudulent.
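The costcla.metrics functions used below compute these quantities directly; as a minimal sketch of the arithmetic (total_cost and manual_savings are hypothetical helper names, assuming the cost-matrix column layout [C_FP, C_FN, C_TP, C_TN] of this dataset and the general form of the cost that includes all four entries):

def total_cost(y_true, y_pred, cost_mat):
    # Each example contributes the cost-matrix entry of its (true, predicted) pair
    fp = (1 - y_true) * y_pred * cost_mat[:, 0]
    fn = y_true * (1 - y_pred) * cost_mat[:, 1]
    tp = y_true * y_pred * cost_mat[:, 2]
    tn = (1 - y_true) * (1 - y_pred) * cost_mat[:, 3]
    return (fp + fn + tp + tn).sum()

def manual_savings(y_true, y_pred, cost_mat):
    # Cost_l(S): the cheaper of predicting everything negative or everything positive
    cost_l = min(total_cost(y_true, np.zeros_like(y_true), cost_mat),
                 total_cost(y_true, np.ones_like(y_true), cost_mat))
    return (cost_l - total_cost(y_true, y_pred, cost_mat)) / cost_l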
In [19]:
# Calculation of the cost and savings
from costcla.metrics import savings_score, cost_loss
In [20]:
# Evaluate the savings for each model
results["Savings"] = np.zeros(results.shape[0])
for model in ci_models:
    results.loc[model, "Savings"] = savings_score(y_test, classifiers[model]["c"], cost_mat_test)
In [21]:
# Plot the results
def fig_sav():
    plt.bar(np.arange(results.shape[0]) - 0.4, results['Precision'], 0.2,
            label='Precision', color=colors[0])
    plt.bar(np.arange(results.shape[0]) - 0.2, results['Recall'], 0.2,
            label='Recall', color=colors[1])
    plt.bar(np.arange(results.shape[0]), results['F1Score'], 0.2,
            label='F1Score', color=colors[2])
    plt.bar(np.arange(results.shape[0]) + 0.2, results['Savings'], 0.2,
            label='Savings', color=colors[3])
    plt.xticks(range(results.shape[0]), results.index)
    plt.tick_params(labelsize=22)
    plt.ylim([0, 1])
    plt.xlim([-0.5, results.shape[0] - 0.5])
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), fontsize=22)
    plt.show()
In [22]:
fig_sav()
Convert a classifier into a cost-sensitive one by selecting a proper decision threshold on the training instances according to the savings, as sketched after the parameter listing below
costcla.models.ThresholdingOptimization(calibration=True)
fit(y_prob_train, cost_mat, y_true_train)
predict(y_prob)
Parameters
Returns
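Conceptually, the optimization is a one-dimensional search for the threshold with the highest training savings. A brute-force sketch under that assumption (optimize_threshold is a hypothetical helper reusing manual_savings from the earlier sketch; the actual costcla implementation can additionally calibrate the probabilities first):

def optimize_threshold(y_prob_train, cost_mat_train, y_true_train):
    best_t, best_s = 0.5, -np.inf
    for t in np.unique(y_prob_train[:, 1]):  # candidate thresholds
        c = (y_prob_train[:, 1] >= t).astype(int)
        s = manual_savings(y_true_train, c, cost_mat_train)
        if s > best_s:
            best_t, best_s = t, s
    return best_t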
In [23]:
from costcla.models import ThresholdingOptimization

for model in ci_models:
    classifiers[model+"-TO"] = {"f": ThresholdingOptimization()}
    # Fit
    classifiers[model+"-TO"]["f"].fit(classifiers[model]["p_train"], cost_mat_train, y_train)
    # Predict
    classifiers[model+"-TO"]["c"] = classifiers[model+"-TO"]["f"].predict(classifiers[model]["p"])
In [24]:
print('New thresholds')
for model in ci_models:
    print(model + '-TO - ' + str(classifiers[model+'-TO']['f'].threshold_))
In [25]:
for model in ci_models:
    # Evaluate
    results.loc[model+"-TO"] = 0
    results.loc[model+"-TO", list(measures.keys())] = \
        [measures[measure](y_test, classifiers[model+"-TO"]["c"]) for measure in measures.keys()]
    results.loc[model+"-TO", "Savings"] = savings_score(y_test, classifiers[model+"-TO"]["c"], cost_mat_test)
In [26]:
fig_sav()
Cost-sensitive classification usually refers to class-dependent costs, where the cost depends on the class but is assumed constant across examples.
In fraud detection, different transactions involve different amounts, which means the costs are example-dependent rather than constant.
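To make the distinction concrete, a small illustrative sketch (the amounts and the value 100 are made-up numbers): a class-dependent cost matrix repeats the same row for every example, while an example-dependent one lets $C_{FN_i}$ vary with each transaction.

amounts = np.array([20., 150., 75., 300., 10.])  # hypothetical transaction amounts
n = amounts.shape[0]

# Class-dependent: one [C_FP, C_FN, C_TP, C_TN] row repeated for all examples
class_dependent = np.tile([10., 100., 10., 0.], (n, 1))

# Example-dependent: C_FN equals each transaction's amount, as in this dataset
example_dependent = np.column_stack(
    (np.full(n, 10.), amounts, np.full(n, 10.), np.zeros(n)))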
The BMR classifier is a decision model based on quantifying tradeoffs between various decisions using probabilities and the costs that accompany such decisions.
In particular, the risk of each decision is quantified as
$$ R(c_i=0|\mathbf{x}_i)=C_{TN_i}(1-\hat p_i)+C_{FN_i} \cdot \hat p_i $$
and
$$ R(c_i=1|\mathbf{x}_i)=C_{TP_i} \cdot \hat p_i + C_{FP_i}(1- \hat p_i), $$
and transaction $i$ is predicted as fraudulent whenever $R(c_i=1|\mathbf{x}_i) \le R(c_i=0|\mathbf{x}_i)$.
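A minimal sketch of this decision rule, assuming the cost-matrix column layout [C_FP, C_FN, C_TP, C_TN] used throughout this notebook (bmr_decision is a hypothetical helper, not the costcla API):

def bmr_decision(y_prob, cost_mat):
    # y_prob: array of shape (n, 2) with class probabilities; column 1 is \hat p_i
    p = y_prob[:, 1]
    risk_neg = cost_mat[:, 3] * (1 - p) + cost_mat[:, 1] * p  # R(c_i=0|x_i)
    risk_pos = cost_mat[:, 2] * p + cost_mat[:, 0] * (1 - p)  # R(c_i=1|x_i)
    return (risk_pos <= risk_neg).astype(int)  # pick the cheaper decision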
costcla.models.BayesMinimumRiskClassifier(calibration=True)
fit(y_true_cal=None, y_prob_cal=None)
predict(y_prob, cost_mat)
Parameters
Returns
In [27]:
from costcla.models import BayesMinimumRiskClassifier

for model in ci_models:
    classifiers[model+"-BMR"] = {"f": BayesMinimumRiskClassifier()}
    # Fit (calibration should properly be done on a separate validation set)
    classifiers[model+"-BMR"]["f"].fit(y_test, classifiers[model]["p"])
    # Predict
    classifiers[model+"-BMR"]["c"] = classifiers[model+"-BMR"]["f"].predict(classifiers[model]["p"], cost_mat_test)
In [28]:
for model in ci_models:
    # Evaluate
    results.loc[model+"-BMR"] = 0
    results.loc[model+"-BMR", list(measures.keys())] = \
        [measures[measure](y_test, classifiers[model+"-BMR"]["c"]) for measure in measures.keys()]
    results.loc[model+"-BMR", "Savings"] = savings_score(y_test, classifiers[model+"-BMR"]["c"], cost_mat_test)
In [29]:
fig_sav()
Why is it so important to focus on the recall? Because a missed fraud (false negative) costs the full transaction amount, while a false alarm only costs the administrative cost $C_a$; the next two cells compare the two.
In [30]:
# Average amount of the fraudulent transactions (cost of a false negative)
print(data.data[data.target == 1, 2].mean())
In [31]:
# Average administrative cost C_a (cost of a false positive)
print(data.cost_mat[:, 0].mean())
In [33]:
from costcla.models import CostSensitiveDecisionTreeClassifier
from costcla.models import CostSensitiveRandomForestClassifier

classifiers = {"CSDT": {"f": CostSensitiveDecisionTreeClassifier()},
               "CSRF": {"f": CostSensitiveRandomForestClassifier(combination='majority_bmr')}}

# Fit the classifiers using the training dataset
for model in classifiers.keys():
    classifiers[model]["f"].fit(X_train, y_train, cost_mat_train)
    if model == "CSRF":
        # The majority_bmr combination needs the cost matrix at prediction time
        classifiers[model]["c"] = classifiers[model]["f"].predict(X_test, cost_mat_test)
    else:
        classifiers[model]["c"] = classifiers[model]["f"].predict(X_test)
In [34]:
for model in ['CSDT', 'CSRF']:
    # Evaluate
    results.loc[model] = 0
    results.loc[model, list(measures.keys())] = \
        [measures[measure](y_test, classifiers[model]["c"]) for measure in measures.keys()]
    results.loc[model, "Savings"] = savings_score(y_test, classifiers[model]["c"], cost_mat_test)
In [35]:
fig_sav()
CostCla is an open-source Python library for cost-sensitive classification, built on top of scikit-learn, pandas and NumPy.
Source code, binaries and documentation are distributed under the 3-Clause BSD license on the website http://albahnsen.com/CostSensitiveClassification/
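The library can be installed from PyPI in the usual way:

pip install costcla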
Cost-proportionate over-sampling [Elkan, 2001]
SMOTE [Chawla et al., 2002]
Cost-proportionate rejection-sampling [Zadrozny et al., 2003]
Thresholding optimization [Sheng and Ling, 2006]
Bayes minimum risk [Correa Bahnsen et al., 2014a]
Cost-sensitive logistic regression [Correa Bahnsen et al., 2014b]
Cost-sensitive decision trees [Correa Bahnsen et al., 2015a]
Cost-sensitive ensemble methods: cost-sensitive bagging, cost-sensitive pasting, cost-sensitive random forest and cost-sensitive random patches [Correa Bahnsen et al., 2015c]
Credit Scoring1 - Kaggle credit competition [Data], cost matrix: [Correa Bahnsen et al., 2014]
Credit Scoring 2 - PAKDD2009 Credit [Data], cost matrix: [Correa Bahnsen et al., 2014a]
Direct Marketing - UCI Bank Marketing [Data], cost matrix: [Correa Bahnsen et al., 2014b]
Churn Modeling, soon
Fraud Detection, soon
You can find the presentation and the IPython Notebook here:
In [36]:
#Format from https://github.com/ellisonbg/talk-2013-scipy
from IPython.display import display, HTML
s = """
<style>
.rendered_html {
font-family: "proxima-nova", helvetica;
font-size: 100%;
line-height: 1.3;
}
.rendered_html h1 {
margin: 0.25em 0em 0.5em;
color: #015C9C;
text-align: center;
line-height: 1.2;
page-break-before: always;
}
.rendered_html h2 {
margin: 1.1em 0em 0.5em;
color: #26465D;
line-height: 1.2;
}
.rendered_html h3 {
margin: 1.1em 0em 0.5em;
color: #002845;
line-height: 1.2;
}
.rendered_html li {
line-height: 1.5;
}
.prompt {
font-size: 120%;
}
.CodeMirror-lines {
font-size: 120%;
}
.output_area {
font-size: 120%;
}
#notebook {
background-image: url('files/images/witewall_3.png');
}
h1.bigtitle {
margin: 4cm 1cm 4cm 1cm;
font-size: 300%;
}
h3.point {
font-size: 200%;
text-align: center;
margin: 2em 0em 2em 0em;
color: #26465D;
}
.logo {
margin: 20px 0 20px 0;
}
a.anchor-link {
display: none;
}
h1.title {
font-size: 250%;
}
</style>
"""
display(HTML(s))