In this notebook we will use GraphLab Create to identify a large majority of fraud cases in real-world data from an online retailer. Starting from a simple fraud classifier, we will optimize it for the best available performance.
The dataset is highly sensitive, so it has been anonymized and cannot be shared.
The notebook is organized into the following sections:
This notebook is presented in the Detecting Credit Card Fraud webinar, one of many interesting webinars given by Turi. Check out upcoming webinars here.
In [1]:
import graphlab as gl
In [2]:
data = gl.SFrame('fraud_detection.sf')
In [3]:
data.head(3)
Out[3]:
In [4]:
len(data)
Out[4]:
In [5]:
data.show()
We see that the data is highly categorical and highly unbalanced.
Let's visualize part of the data.
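To quantify that imbalance, the class distribution can be counted directly. This is a minimal sketch in plain Python with made-up labels; in the notebook itself, `data['fraud'].show()` conveys the same information visually:

```python
from collections import Counter

# Hypothetical stand-in for the 'fraud' label column (1 = fraud)
fraud_labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

counts = Counter(fraud_labels)
fraud_rate = counts[1] / float(len(fraud_labels))
print('fraud cases: %d of %d (%.1f%%)' % (counts[1], len(fraud_labels), fraud_rate * 100))
```

A fraud rate this low is why plain accuracy is a misleading metric here, as the logistic regression baseline below will show.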
In [6]:
# Tell GraphLab to display canvas in the notebook itself
gl.canvas.set_target('ipynb')
In [7]:
data.show(view='BoxWhisker Plot', x='fraud', y='payment amount')
In [8]:
# Transform string date into datetime type.
# This will help us further along to compare dates.
data['transaction date'] = data['transaction date'].str_to_datetime(str_format='%d.%m.%Y')
# Split date into its components and set them as categorical features
data.add_columns(data['transaction date'].split_datetime(limit=['year','month','day'], column_name_prefix='transaction'))
data['transaction.year'] = data['transaction.year'].astype(str)
data['transaction.month'] = data['transaction.month'].astype(str)
data['transaction.day'] = data['transaction.day'].astype(str)
In [9]:
# Create day of week feature and set it as a categorical feature
data['transaction week day'] = data['transaction date'].apply(lambda x: x.weekday())
data['transaction week day'] = data['transaction week day'].astype(str)
In [10]:
data.head(3)
Out[10]:
In [11]:
# Create new features and transform them into true/false indicators
data['same country'] = (data['customer country'] == data['business country']).astype(str)
data['same person'] = (data['customer'] == data['cardholder']).astype(str)
data['expiration near'] = (data['credit card expiration year'] == data['transaction.year']).astype(str)
In [12]:
counts = data.groupby('transaction id',
                      {'unique cards per transaction': gl.aggregate.COUNT_DISTINCT('credit card number'),
                       'unique cardholders per transaction': gl.aggregate.COUNT_DISTINCT('cardholder'),
                       'tries per transaction': gl.aggregate.COUNT()})
counts.head(3)
Out[12]:
In [13]:
counts.show()
We see that although most transactions have been paid for by a single credit card, some transactions have as many as 29 unique credit cards!
Let's join the counts back into our dataset so we can visualize the number of unique cards per transaction vs fraud.
In [14]:
data = data.join(counts)
In [15]:
data.show(view='BoxWhisker Plot', x='fraud', y='unique cards per transaction')
In [16]:
print 'Number of columns', len(data.column_names())
In total we created 9 new features. One could engineer any number of additional features to build a better fraud detector, for example historical user features such as the number of transactions in a given timeframe.
For the purposes of the webinar these features will be enough.
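As a rough illustration of such a historical feature, here is a sketch in plain Python with a hypothetical transaction log; in GraphLab this would be expressed as a `groupby` over a date-filtered SFrame, and the names below are assumptions:

```python
from datetime import datetime

# Hypothetical transaction log: (customer, transaction date)
log = [
    ('alice', datetime(2015, 1, 10)),
    ('alice', datetime(2015, 1, 20)),
    ('alice', datetime(2015, 5, 1)),
    ('bob',   datetime(2015, 1, 15)),
]

def transactions_in_window(log, customer, start, end):
    """Count a customer's transactions inside a [start, end) window."""
    return sum(1 for who, when in log
               if who == customer and start <= when < end)

print(transactions_in_window(log, 'alice',
                             datetime(2015, 1, 1), datetime(2015, 2, 1)))  # 2
```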
First we will have to split the data into a training set and a testing set so we can evaluate our models. We will split it based on the date column, where the test set will be composed of the last six months of transactions.
In [17]:
from datetime import datetime
split = data['transaction date'] > datetime(2015, 6, 1)
data.remove_column('transaction date')
train = data[split == 0]
test = data[split == 1]
In [18]:
print 'Training set fraud'
train['fraud'].show()
In [19]:
print 'Test set fraud'
test['fraud'].show()
In [20]:
logreg_model = gl.logistic_classifier.create(train,
                                             target='fraud',
                                             validation_set=None)
In [21]:
print 'Logistic Regression Accuracy', logreg_model.evaluate(test)['accuracy']
print 'Logistic Regression Confusion Matrix\n', logreg_model.evaluate(test)['confusion_matrix']
Not a single fraud case was detected by the logistic regression model!
As indicated while training the logistic regression model, some features are highly categorical, and when expanded result in many coefficients. We could address this by removing these features from the dataset, or by transforming these features into a more manageable form (e.g. Count Thresholder). For this webinar, we will leave these features as-is and will move on to a stronger classifier.
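To sketch the idea behind count thresholding (a plain-Python illustration with hypothetical data; GraphLab Create's feature engineering toolkit provides a transformer for this, which would be the idiomatic route):

```python
from collections import Counter

# Hypothetical high-cardinality categorical column
countries = ['US', 'US', 'US', 'DE', 'DE', 'FR', 'XX', 'YY']

def count_threshold(values, threshold=2, other='__rare__'):
    """Replace categories seen fewer than `threshold` times with one bucket."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else other for v in values]

print(count_threshold(countries))
```

Collapsing rare categories into a single bucket keeps the expanded coefficient count manageable for linear models like logistic regression.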
In [22]:
boosted_trees_model = gl.boosted_trees_classifier.create(train,
                                                         target='fraud',
                                                         validation_set=None)
In [23]:
print 'Boosted trees Accuracy', boosted_trees_model.evaluate(test)['accuracy']
print 'Boosted trees Confusion Matrix\n', boosted_trees_model.evaluate(test)['confusion_matrix']
29 out of 33 fraud cases were detected by the boosted trees model.
Let's tune the parameters of the model so we can squeeze extra performance out of it. In this example I chose parameters that were evaluated beforehand, but GraphLab offers the functionality to do a distributed search across a grid of parameters. To learn more click here.
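The idea behind such a search can be sketched as a plain grid enumeration (illustrative only; GraphLab's actual distributed search API is not shown here, and the parameter values are assumptions):

```python
from itertools import product

# Hypothetical parameter grid for a boosted trees classifier
grid = {
    'max_iterations': [10, 20, 40],
    'max_depth': [6, 9, 12],
}

names = sorted(grid)
combos = [dict(zip(names, values))
          for values in product(*(grid[n] for n in names))]

# Each combination would be trained and scored on a validation set;
# the best-scoring one is kept.
print(len(combos))  # 9 candidate models
```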
In [24]:
boosted_trees_model = gl.boosted_trees_classifier.create(train,
                                                         target='fraud',
                                                         validation_set=None,
                                                         max_iterations=40,
                                                         max_depth=9,
                                                         class_weights='auto')
In [25]:
print 'Boosted trees Accuracy', boosted_trees_model.evaluate(test)['accuracy']
print 'Boosted trees Confusion Matrix\n', boosted_trees_model.evaluate(test)['confusion_matrix']
The tuned model found one more fraud case than the previous untuned model, at the price of a few more false positives. The desired balance between false positives and false negatives depends on the application. In fraud detection we may want to minimize false negatives, since each missed fraud case costs money directly, while a false positive merely costs a fraud detection expert some time inspecting a transaction flagged by our model.
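This trade-off can be quantified from the confusion-matrix counts as precision and recall (a sketch with hypothetical counts, not the actual numbers from this run):

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 30, 12, 3, 9955

precision = tp / float(tp + fp)   # of flagged transactions, how many were fraud
recall = tp / float(tp + fn)      # of fraud cases, how many were caught

print('precision: %.2f' % precision)  # 0.71
print('recall: %.2f' % recall)        # 0.91
```

Loosening class weights or the decision threshold typically raises recall (fewer missed fraud cases) at the cost of precision (more flagged transactions to inspect).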
In [26]:
# Inspect the features most used by the boosted trees model
boosted_trees_model.get_feature_importance()
Out[26]:
To connect to AWS, you will have to set your own AWS credentials by calling:
gl.aws.set_credentials(<your public key>,
                       <your private key>)
In [27]:
state_path = 's3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5'
ps = gl.deploy.predictive_service.load(state_path)
In [28]:
# Pickle and send the model over to the server.
ps.add('fraud', boosted_trees_model)
ps.apply_changes()
In [29]:
# Predictive services must be displayed in a browser
gl.canvas.set_target('browser')
ps.show()
In [30]:
ps.query('fraud', method='predict', data={'dataset' : test[0]})
Out[30]:
In [31]:
test[0]['fraud']
Out[31]: