An important part of the profitability of a credit card product is the issuer's ability to detect and deny fraud. Purchase fraud can cost as much as 0.10% of purchase volume, a loss the issuer absorbs directly. Every fraudulent transaction prevented is cost removed from the book, so many issuers spend tremendous analysis resources on detecting and denying fraudulent transactions.
Available on Kaggle is a dataset of purchase transactions with several attributes, including a flag for fraud (the Credit Card Fraud Detection dataset). You can read more about it on its Kaggle page.
We'll start by importing the necessary packages and the dataset.
In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.gridspec as gridspec
import xgboost as xgb
from sklearn.model_selection import train_test_split
In [2]:
transactions = pd.read_csv('creditcard.csv')
Alright, now that we have our transactions loaded, let's take a look at the data.
In [3]:
transactions.head(n=10)
Out[3]: [table: first 10 rows, with columns Time, V1-V28, Amount, and Class]
In [4]:
transactions.describe()
Out[4]: [table: summary statistics for each column]
Looks like the data has all been transformed and renamed, probably to anonymize the fields. This will make our work much less interpretable. Oh well, onward!
In [5]:
transactions.isnull().sum()
Out[5]: [series: 0 missing values for every column]
Awesome, no missing values. Wish real life was this clean.
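Before going further, it's worth pinning down just how imbalanced the classes are, since everything downstream (metric choice, model tuning) depends on it. A quick check:
In [ ]:
# Class balance: fraud is a tiny fraction of all rows in this dataset
# (well under 1%), which will matter when we choose evaluation metrics.
print(transactions['Class'].value_counts())
print('Fraud rate: {:.4%}'.format(transactions['Class'].mean()))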
Let's see how Amount varies by Fraud / Not Fraud.
In [6]:
print('Fraud')
print(transactions.Amount[transactions.Class==1].describe())
print()
print('Not Fraud')
print(transactions.Amount[transactions.Class==0].describe())
print()
In [7]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 4))
bins = 50
ax1.hist(transactions.Amount[transactions.Class == 1], bins=bins)
ax1.set_title('Fraud')
ax2.hist(transactions.Amount[transactions.Class == 0], bins=bins)
ax2.set_title('Not Fraud')
# Apply the log scale and label to both panels, not just the last one drawn.
for ax in (ax1, ax2):
    ax.set_yscale('log')
    ax.set_ylabel('Number of Transactions')
plt.xlabel('Amount ($)')
plt.show()
Fraudulent transactions are larger, on average, than non-fraudulent transactions, despite the much longer tail on non-fraudulent transactions.
Okay, next let's see how cyclical fraudulent and non-fraudulent transactions are, respectively.
In [8]:
print('Fraud')
print(transactions.Time[transactions.Class==1].describe())
print()
print('Not Fraud')
print(transactions.Time[transactions.Class==0].describe())
print()
In [14]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 4))
bins = 50
ax1.hist(transactions.Time[transactions.Class == 1], bins=bins)
ax1.set_title('Fraud')
ax2.hist(transactions.Time[transactions.Class == 0], bins=bins)
ax2.set_title('Not Fraud')
# Apply the log scale and label to both panels, not just the last one drawn.
for ax in (ax1, ax2):
    ax.set_yscale('log')
    ax.set_ylabel('Number of Transactions')
plt.xlabel('Time (seconds from first transaction)')
plt.show()
Fraud is less cyclical than legitimate activity. Legitimate transaction volume drops dramatically twice, presumably overnight. This may come in handy later.
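To make that cycle easier to see, we can fold Time into an hour-of-day view. A minimal sketch; note the assumption that hour 0 is simply the start of recording, since the dataset doesn't tell us what clock time the first transaction occurred at:
In [ ]:
# Fold Time (seconds since the first transaction, spanning two days)
# into a 24-hour cycle. Hour 0 = start of recording, not necessarily midnight.
hour = (transactions.Time // 3600) % 24
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 4))
ax1.hist(hour[transactions.Class == 1], bins=24)
ax1.set_title('Fraud')
ax2.hist(hour[transactions.Class == 0], bins=24)
ax2.set_title('Not Fraud')
plt.xlabel('Hour of recording (mod 24)')
plt.show()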
Next let's consider both time and amount.
In [9]:
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(transactions.Time[transactions.Class == 0],
            transactions.Amount[transactions.Class == 0],
            c='b', label='Legit')
ax1.scatter(transactions.Time[transactions.Class == 1],
            transactions.Amount[transactions.Class == 1],
            c='g', label='Fraud')
plt.xlabel('Time (in Seconds)')
plt.ylabel('Amount ($)')
plt.legend(loc='upper left')
plt.show()
Yeah that wasn't very useful. Let's look now at the anonymized data.
In [10]:
# Select only the anonymized features (columns V1-V28).
# .ix has been removed from pandas; .iloc does the positional slice.
v_features = transactions.iloc[:, 1:29].columns
plt.figure(figsize=(12, 28*4))
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(transactions[v_features]):
    ax = plt.subplot(gs[i])
    # (sns.distplot is deprecated in newer seaborn; histplot/kdeplot are the
    # modern equivalents, but distplot works on the versions used here.)
    sns.distplot(transactions[cn][transactions.Class == 1], bins=100, label='fraud')
    sns.distplot(transactions[cn][transactions.Class == 0], bins=100, label='legit')
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))
    ax.legend(loc='upper left')
plt.show()
Okay, I now have a good sense of which variables might be important in detecting fraud. Note that if the data were not anonymized and transformed, I would also use intuition in this step to choose which variables I expect to "pop".
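To back that visual impression with a number, one option is a two-sample Kolmogorov-Smirnov statistic per feature (a sketch; scipy is assumed to be available):
In [ ]:
# Rank each anonymized feature by how differently it is distributed
# for fraud vs. legit rows (higher KS statistic = more separation).
from scipy.stats import ks_2samp
separation = {cn: ks_2samp(transactions[cn][transactions.Class == 1],
                           transactions[cn][transactions.Class == 0])[0]
              for cn in v_features}
print(pd.Series(separation).sort_values(ascending=False).head(10))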
Now that we've analyzed the data, let's move on to building an actual model. Given that the independent variables are all non-null continuous variables and the outcome is binary, this is a perfect opportunity to use xgboost's XGBClassifier.
In [22]:
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(
    transactions[v_features], transactions['Class'],
    test_size=test_size, random_state=seed)
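One aside: with so few positive examples, an unstratified split can leave the train and test fraud rates slightly different. Passing stratify keeps them identical; not used above, but worth knowing:
In [ ]:
# Optional variant (not used for the results below): stratify on Class
# so train and test have the same fraud rate.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    transactions[v_features], transactions['Class'],
    test_size=test_size, random_state=seed, stratify=transactions['Class'])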
In [23]:
# Train the model on the training data
model = xgb.XGBClassifier()
model.fit(X_train,y_train)
print(model)
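Another knob worth knowing about for data this imbalanced (a sketch, not what we trained above): XGBoost's scale_pos_weight upweights the positive class, and the negative-to-positive ratio in the training data is a common starting point.
In [ ]:
# Hedged variant: reweight the minority class. Not used for the results below.
ratio = float((y_train == 0).sum()) / (y_train == 1).sum()
model_weighted = xgb.XGBClassifier(scale_pos_weight=ratio)
model_weighted.fit(X_train, y_train)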
In [21]:
from sklearn.metrics import accuracy_score
# Make predictions for the test data
y_pred = model.predict(X_test)
# predict() already returns 0/1 class labels, so rounding is just a safeguard.
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
In [38]:
xgb.plot_importance(model)
plt.show()
for name, importance in zip(v_features, model.feature_importances_):
    print(name, importance)
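Worth noting: xgb.plot_importance defaults to importance_type='weight' (how often a feature is split on), which can disagree with gain-based rankings. Asking for gain directly is a one-liner:
In [ ]:
# 'gain' ranks features by the average loss reduction their splits achieve.
xgb.plot_importance(model, importance_type='gain')
plt.show()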
In [50]:
def calc_lift(x, y, clf, bins=10):
    """
    Takes input arrays and a trained sklearn-style classifier and returns
    a Pandas DataFrame with the average lift generated by the model in each bin

    Parameters
    -------------------
    x: Numpy array or Pandas DataFrame with shape = [n_samples, n_features]
    y: A 1-d Numpy array or Pandas Series with shape = [n_samples]
       IMPORTANT: Code is only configured for a binary target variable
       of 1 for success and 0 for failure
    clf: A trained sklearn-style classifier object
    bins: Number of equal-sized buckets to divide observations across
          Default value is 10
    """
    # Actual value of y
    y_actual = y
    # Predicted probability that y = 1
    y_prob = clf.predict_proba(x)
    # Predicted value of y
    y_pred = clf.predict(x)
    cols = ['ACTUAL', 'PROB_POSITIVE', 'PREDICTED']
    data = [y_actual, y_prob[:, 1], y_pred]
    df = pd.DataFrame(dict(zip(cols, data)))

    # Observations where y = 1
    total_positive_n = df['ACTUAL'].sum()
    # Total observations
    total_n = df.index.size
    natural_positive_prob = total_positive_n / float(total_n)

    # Create bins; qcut labels them 0..bins-1, so the LAST bin holds the
    # observations with the highest predicted probability that y = 1.
    # (If many probabilities tie, qcut can fail on duplicate bin edges;
    # passing duplicates='drop' is one way around that.)
    df['BIN_POSITIVE'] = pd.qcut(df['PROB_POSITIVE'], bins, labels=False)
    pos_group_df = df.groupby('BIN_POSITIVE')
    # Percentage of observations in each bin where y = 1
    lift_positive = pos_group_df['ACTUAL'].sum() / pos_group_df['ACTUAL'].count()
    lift_index_positive = (lift_positive / natural_positive_prob) * 100

    # Consolidate results into the output DataFrame
    lift_df = pd.DataFrame({'LIFT_POSITIVE': lift_positive,
                            'LIFT_POSITIVE_INDEX': lift_index_positive,
                            'BASELINE_POSITIVE': natural_positive_prob})
    return lift_df
In [52]:
lift = calc_lift(X_test,y_test,model,bins=10)
In [54]:
lift
Out[54]: [table: one row per probability decile, with BASELINE_POSITIVE, LIFT_POSITIVE, and LIFT_POSITIVE_INDEX columns]
A LIFT_POSITIVE_INDEX of 100 means a bin performs at the baseline fraud rate; the further above 100 the top bin sits, the better the model concentrates fraud where it predicts fraud.