Detecting Credit Card Fraud

In this notebook we will use GraphLab Create to identify a large majority of fraud cases in real-world data from an online retailer. Starting from a simple fraud classifier, we will optimize it for the best available performance.
The dataset is highly sensitive, so it has been anonymized and cannot be shared.

The notebook is organized into the following sections:

This notebook is presented in the Detecting Credit Card Fraud webinar, one of many interesting webinars given by Turi. Check out upcoming webinars here.

Load and explore the data


In [1]:
import graphlab as gl

In [2]:
data = gl.SFrame('fraud_detection.sf')


2016-03-24 12:26:26,697 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.5 started. Logging: C:\Users\Alon\AppData\Local\Temp\graphlab_server_1458815185.log.0
This commercial license of GraphLab Create is assigned to engr@turi.com.

In [3]:
data.head(3)


Out[3]:
transaction status | fraud | payment lead days | days to event | currency | is customer email free | blacklisted | review by payment gateway
S1 | yes | 59.0 | 165.0 | A7 | yes | no | no
S1 | yes | 8.0  | 91.0  | A7 | no  | no | no
S1 | yes | 2.0  | 16.0  | A7 | no  | no | no

quote amount | payment amount | secure payment | fully paid | business type | credit card company
2963.0 | 1482.0 | no | yes | E74 | None
1440.0 | 1441.0 | no | yes | E98 | T2
983.0  | 492.0  | no | yes | E98 | T2

business country | customer country | credit card expiration month | credit card expiration year | ip | transaction date
C75  | C75 | 04 | 2015 | 209.149.63.230 | 17.10.2013
C75  | C75 | 08 | 2016 | 184.80.149.174 | 15.08.2013
None | C75 | 08 | 2014 | 47.251.202.177 | 11.05.2013

(The remaining columns -- transaction id, customer id, customer, cardholder, customer email domain, business email domain, credit card number -- contain anonymized hash values, truncated in the display.)
[3 rows x 27 columns]


In [4]:
len(data)


Out[4]:
135967

In [5]:
data.show()


Canvas is accessible via web browser at the URL: http://localhost:55435/index.html
Opening Canvas in default web browser.

We see that the data is highly categorical and highly unbalanced.
Let's visualize part of the data.


In [6]:
# Tell GraphLab to display canvas in the notebook itself
gl.canvas.set_target('ipynb')

In [7]:
data.show(view='BoxWhisker Plot', x='fraud', y='payment amount')


Create new features

Date features


In [8]:
# Transform string date into datetime type.
# This will help us further along to compare dates.
data['transaction date'] = data['transaction date'].str_to_datetime(str_format='%d.%m.%Y')

# Split date into its components and set them as categorical features 
data.add_columns(data['transaction date'].split_datetime(limit=['year','month','day'], column_name_prefix='transaction'))
data['transaction.year'] = data['transaction.year'].astype(str)
data['transaction.month'] = data['transaction.month'].astype(str)
data['transaction.day'] = data['transaction.day'].astype(str)

In [9]:
# Create day of week feature and set it as a categorical feature
data['transaction week day'] = data['transaction date'].apply(lambda x: x.weekday())
data['transaction week day'] = data['transaction week day'].astype(str)
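Outside GraphLab, the same date transformation can be sketched with the standard library. This is a minimal illustration of what the two cells above do per value, not the SFrame implementation:

```python
from datetime import datetime

def date_features(date_str):
    """Parse a 'dd.mm.yyyy' string and return categorical date features."""
    d = datetime.strptime(date_str, "%d.%m.%Y")
    return {
        "transaction.year": str(d.year),
        "transaction.month": str(d.month),
        "transaction.day": str(d.day),
        "transaction week day": str(d.weekday()),  # Monday == 0, Sunday == 6
    }

# First transaction date from the data above
print(date_features("17.10.2013"))
```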

In [10]:
data.head(3)


Out[10]:
The first 27 columns match Out[3] above, except that 'transaction date' is now a datetime (e.g. 2013-10-17 00:00:00). The four new columns are:

transaction.year | transaction.month | transaction.day | transaction week day
2013 | 10 | 17 | 3
2013 | 8  | 15 | 3
2013 | 5  | 11 | 5
[3 rows x 31 columns]

Indicator features


In [11]:
# Create new features and transform them into true/false indicators
data['same country'] = (data['customer country'] == data['business country']).astype(str)
data['same person'] = (data['customer'] == data['cardholder']).astype(str)
data['expiration near'] = (data['credit card expiration year'] == data['transaction.year']).astype(str)
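The same indicators can be sketched row-by-row in plain Python. Note the SFrame version stringifies the boolean comparison of whole columns; this hypothetical helper works on a single row represented as a dict:

```python
def indicator_features(row):
    """Derive true/false indicator strings from one transaction row (a dict)."""
    return {
        "same country": str(row["customer country"] == row["business country"]),
        "same person": str(row["customer"] == row["cardholder"]),
        "expiration near": str(row["credit card expiration year"] == row["transaction.year"]),
    }

# Hypothetical row with anonymized-style values
row = {"customer country": "C75", "business country": "C75",
       "customer": "a1b2", "cardholder": "c3d4",
       "credit card expiration year": "2015", "transaction.year": "2013"}
print(indicator_features(row))
```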

Count features


In [12]:
counts = data.groupby('transaction id', {'unique cards per transaction' : gl.aggregate.COUNT_DISTINCT('credit card number'),
                                         'unique cardholders per transaction' : gl.aggregate.COUNT_DISTINCT('cardholder'),
                                         'tries per transaction' : gl.aggregate.COUNT()})
counts.head(3)


Out[12]:
transaction id | unique cards per transaction | unique cardholders per transaction | tries per transaction
ebce93534b35b56ea3cfae1d53786008... | 1 | 1 | 1
3dc6c7c573bad62c4b36f76bd695da66... | 1 | 1 | 1
c41ea2458fa6e961dadd498bf8528419... | 1 | 1 | 1
[3 rows x 4 columns]
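The groupby above can be mimicked with the standard library. This sketch computes the same three aggregates (COUNT_DISTINCT twice, COUNT once) over rows given as dicts:

```python
from collections import defaultdict

def transaction_counts(rows):
    """Per-transaction counts: distinct cards, distinct cardholders, total tries."""
    cards = defaultdict(set)
    holders = defaultdict(set)
    tries = defaultdict(int)
    for r in rows:
        tid = r["transaction id"]
        cards[tid].add(r["credit card number"])
        holders[tid].add(r["cardholder"])
        tries[tid] += 1
    return {tid: {"unique cards per transaction": len(cards[tid]),
                  "unique cardholders per transaction": len(holders[tid]),
                  "tries per transaction": tries[tid]}
            for tid in tries}

# Hypothetical rows: two payment attempts on t1 with different cards
rows = [
    {"transaction id": "t1", "credit card number": "c1", "cardholder": "h1"},
    {"transaction id": "t1", "credit card number": "c2", "cardholder": "h1"},
    {"transaction id": "t2", "credit card number": "c3", "cardholder": "h2"},
]
print(transaction_counts(rows))
```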


In [13]:
counts.show()


We see that although most transactions were paid for with a single credit card, some transactions have as many as 29 unique credit cards!
Let's join the counts back into our dataset so we can visualize the number of unique cards per transaction versus fraud.


In [14]:
data = data.join(counts)

In [15]:
data.show(view='BoxWhisker Plot', x='fraud', y='unique cards per transaction')



In [16]:
print 'Number of columns', len(data.column_names())


Number of columns 37

In total we created 10 new features (the column count grew from 27 to 37). Many more could be engineered to improve the fraud detector, for example historical user features such as the number of transactions in a given timeframe.
For the purposes of the webinar these features will be enough.

Split data into train and test sets

First we have to split the data into a training set and a test set so we can evaluate our models. We will split on the date column, with the test set composed of the last six months of transactions.


In [17]:
from datetime import datetime

split = data['transaction date'] > datetime(2015, 6, 1)
data.remove_column('transaction date')

train = data[split == 0]
test = data[split == 1]
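The same temporal split can be sketched without SFrames, using a boolean test on each row's date. A sketch with rows as dicts:

```python
from datetime import datetime

def temporal_split(rows, cutoff):
    """Time-based split: rows up to the cutoff train, later rows test."""
    train = [r for r in rows if r["transaction date"] <= cutoff]
    test = [r for r in rows if r["transaction date"] > cutoff]
    return train, test

# Hypothetical rows on either side of the 2015-06-01 cutoff used above
rows = [{"transaction date": datetime(2015, 3, 1)},
        {"transaction date": datetime(2015, 9, 1)}]
train, test = temporal_split(rows, datetime(2015, 6, 1))
print(len(train), len(test))
```

Splitting by time, rather than randomly, avoids leaking future information into training and mirrors how the model would be used in production.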

In [18]:
print 'Training set fraud'
train['fraud'].show()


Training set fraud

In [19]:
print 'Test set fraud'
test['fraud'].show()


Test set fraud

Create model to predict if a given transaction is fraudulent

Logistic Regression baseline


In [20]:
logreg_model = gl.logistic_classifier.create(train,
                                             target='fraud',
                                             validation_set=None)


WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Logistic regression:
--------------------------------------------------------
Number of examples          : 125557
Number of classes           : 2
Number of feature columns   : 35
Number of unpacked features : 35
Number of coefficients    : 475596
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy |
+-----------+----------+-----------+--------------+-------------------+
| 1         | 3        | 0.000008  | 1.434012     | 0.992322          |
| 2         | 5        | 1.000000  | 1.806275     | 0.995094          |
| 3         | 6        | 1.000000  | 2.040439     | 0.999570          |
| 4         | 7        | 1.000000  | 2.272605     | 0.999992          |
| 5         | 8        | 1.000000  | 2.503765     | 0.999992          |
| 6         | 9        | 1.000000  | 2.745938     | 0.999992          |
| 10        | 13       | 1.000000  | 3.658582     | 0.999992          |
+-----------+----------+-----------+--------------+-------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.

In [21]:
print 'Logistic Regression Accuracy', logreg_model.evaluate(test)['accuracy']
print 'Logistic Regression Confusion Matrix\n', logreg_model.evaluate(test)['confusion_matrix']


Logistic Regression Accuracy 0.996733909702
Logistic Regression Confusion Matrix
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      no      |        no       | 10376 |
|     yes      |        no       |   33  |
|      no      |       yes       |   1   |
+--------------+-----------------+-------+
[3 rows x 3 columns]

Not a single fraud case was detected by the logistic regression model!
As the warning during training indicated, some features are highly categorical, and when expanded they produce a very large number of coefficients. We could address this by removing those features from the dataset, or by transforming them into a more manageable form (e.g. with a Count Thresholder). For this webinar, we will leave these features as-is and move on to a stronger classifier.
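Accuracy alone hides this failure; with only 33 fraud cases among 10,410 test transactions, predicting "no" everywhere is already ~99.7% accurate. Precision and recall, computed directly from the confusion-matrix counts above, make it explicit:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts; None when undefined."""
    precision = tp / float(tp + fp) if (tp + fp) else None
    recall = tp / float(tp + fn) if (tp + fn) else None
    return precision, recall

# Counts from the logistic regression confusion matrix above:
# tp = 0 (yes/yes), fp = 1 (no/yes), fn = 33 (yes/no)
p, r = precision_recall(tp=0, fp=1, fn=33)
print(p, r)  # 0.0 0.0 -- no fraud caught at all
```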

Boosted Trees Classifier


In [22]:
boosted_trees_model = gl.boosted_trees_classifier.create(train, 
                                                         target='fraud',
                                                         validation_set=None)


WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Boosted trees classifier:
--------------------------------------------------------
Number of examples          : 125557
Number of classes           : 2
Number of feature columns   : 35
Number of unpacked features : 35
+-----------+--------------+-------------------+-------------------+
| Iteration | Elapsed Time | Training-accuracy | Training-log_loss |
+-----------+--------------+-------------------+-------------------+
| 1         | 0.285198     | 0.996878          | 0.439840          |
| 2         | 0.562394     | 0.997125          | 0.299877          |
| 3         | 0.842592     | 0.997244          | 0.211750          |
| 4         | 1.118787     | 0.997396          | 0.152803          |
| 5         | 1.389978     | 0.997483          | 0.111957          |
| 6         | 1.668174     | 0.997563          | 0.083166          |
| 10        | 2.756944     | 0.997650          | 0.028914          |
+-----------+--------------+-------------------+-------------------+

In [23]:
print 'Boosted trees Accuracy', boosted_trees_model.evaluate(test)['accuracy']
print 'Boosted trees Confusion Matrix\n', boosted_trees_model.evaluate(test)['confusion_matrix']


Boosted trees Accuracy 0.998366954851
Boosted trees Confusion Matrix
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      no      |        no       | 10364 |
|     yes      |        no       |   4   |
|     yes      |       yes       |   29  |
|      no      |       yes       |   13  |
+--------------+-----------------+-------+
[4 rows x 3 columns]

29 out of 33 fraud cases were detected by the boosted trees model.

Let's tune the model's parameters to squeeze extra performance out of it. In this example I chose parameters that were evaluated beforehand, but GraphLab also offers functionality to run a distributed search across a grid of parameters. To learn more click here.
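A grid search simply enumerates every combination of candidate parameter values and scores each on a validation set. GraphLab's own distributed search utility handles this for you; the enumeration step can be sketched with `itertools.product` (the grid values below are hypothetical, not the ones used in this notebook's search):

```python
from itertools import product

def parameter_grid(grid):
    """Expand a dict of candidate values into a list of parameter dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

# Hypothetical search space; each combination would train one model
grid = {"max_iterations": [10, 20, 40], "max_depth": [6, 9, 12]}
combos = parameter_grid(grid)
print(len(combos))  # 9 combinations
```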


In [24]:
boosted_trees_model = gl.boosted_trees_classifier.create(train, 
                                                         target='fraud',
                                                         validation_set=None,
                                                         max_iterations=40,
                                                         max_depth=9,
                                                         class_weights='auto')


WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Boosted trees classifier:
--------------------------------------------------------
Number of examples          : 125557
Number of classes           : 2
Number of feature columns   : 35
Number of unpacked features : 35
+-----------+--------------+-------------------+-------------------+
| Iteration | Elapsed Time | Training-accuracy | Training-log_loss |
+-----------+--------------+-------------------+-------------------+
| 1         | 0.387272     | 0.981757          | 0.460638          |
| 2         | 0.764539     | 0.976711          | 0.330015          |
| 3         | 1.163821     | 0.979027          | 0.245213          |
| 4         | 1.560101     | 0.979787          | 0.188401          |
| 5         | 1.962385     | 0.980019          | 0.148694          |
| 6         | 2.347657     | 0.982517          | 0.119373          |
| 10        | 3.932776     | 0.988897          | 0.058315          |
| 11        | 4.448140     | 0.991498          | 0.050472          |
| 15        | 6.315458     | 0.996765          | 0.033629          |
| 20        | 8.432953     | 0.996934          | 0.025129          |
| 25        | 10.362315    | 0.997038          | 0.020979          |
| 30        | 12.290677    | 0.997179          | 0.018338          |
| 35        | 14.220039    | 0.997299          | 0.016659          |
| 40        | 16.114376    | 0.997383          | 0.015333          |
+-----------+--------------+-------------------+-------------------+

In [25]:
print 'Boosted trees Accuracy', boosted_trees_model.evaluate(test)['accuracy']
print 'Boosted trees Confusion Matrix\n', boosted_trees_model.evaluate(test)['confusion_matrix']


Boosted trees Accuracy 0.997502401537
Boosted trees Confusion Matrix
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      no      |        no       | 10354 |
|     yes      |        no       |   3   |
|     yes      |       yes       |   30  |
|      no      |       yes       |   23  |
+--------------+-----------------+-------+
[4 rows x 3 columns]

The tuned model found one more fraud case than the previous un-tuned model, at the price of a few more false positives. The desired balance between false positives and false negatives depends on the application. In fraud detection we typically want to minimize false negatives, since each missed fraud costs real money, while a false positive merely costs extra time for the fraud expert who inspects transactions flagged by our model.
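The `class_weights='auto'` option used above counteracts the class imbalance by weighting each class during training. A common convention for "auto" weighting is inverse class frequency, normalized so that a balanced dataset gets weight 1.0 per class; the exact formula GraphLab uses may differ, and the class counts below are hypothetical:

```python
def auto_class_weights(label_counts):
    """Inverse-frequency class weights: weight = total / (num_classes * count).
    One common 'auto' convention -- an assumption, not GraphLab's documented formula."""
    total = sum(label_counts.values())
    k = len(label_counts)
    return {label: total / float(k * count) for label, count in label_counts.items()}

# Hypothetical imbalance of the same order as this training set
weights = auto_class_weights({"no": 124000, "yes": 1557})
print(weights)  # rare 'yes' class gets a much larger weight
```

The effect is that each misclassified fraud case contributes far more to the training loss, pushing the model toward higher recall at some cost in precision, which is exactly the shift seen in the tuned confusion matrix.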


In [26]:
# Inspect the features most used by the boosted trees model
boosted_trees_model.get_feature_importance()


Out[26]:
name | index | count
payment amount | None | 216
days to event | None | 190
payment lead days | None | 137
quote amount | None | 135
blacklisted | no | 43
ip | 184.80.149.174 | 28
customer country | C75 | 27
business type | E43 | 27
credit card company |  | 26
credit card expiration year | 2016 | 25
[475623 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Deploying the model into a resilient & elastic service

To connect to AWS, you will have to set your own AWS credentials by calling:

gl.aws.set_credentials(<your public key>,
                       <your private key>)

In [27]:
state_path = 's3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5'

ps = gl.deploy.predictive_service.load(state_path)


2016-03-24 12:28:24,033 [WARNING] graphlab.deploy.predictive_service, 384: Overwriting existing Predictive Service "demolab-ps-one-eight-five" in local session.

In [28]:
# Pickle and send the model over to the server.
ps.add('fraud', boosted_trees_model)
ps.apply_changes()


2016-03-24 12:28:27,285 [INFO] graphlab.deploy._predictive_service._predictive_service, 1450: Endpoint 'fraud' is added. Use apply_changes() to deploy all pending changes, or continue with other modification.
2016-03-24 12:28:27,286 [INFO] graphlab.deploy._predictive_service._predictive_service, 1725: Persisting endpoint changes.
2016-03-24 12:28:27,578 [INFO] graphlab.util.file_util, 189: Uploading local path c:\users\alon\appdata\local\temp\predictive_object_japony to s3 path: s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5/predictive_objects/fraud/1
upload: c:\users\alon\appdata\local\temp\predictive_object_japony\f3d116da-9e09-4208-9830-7b08fa911200\dir_archive.ini to s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5/predictive_objects/fraud/1/f3d116da-9e09-4208-9830-7b08fa911200/dir_archive.ini
upload: c:\users\alon\appdata\local\temp\predictive_object_japony\f3d116da-9e09-4208-9830-7b08fa911200\m_6c06b43b3c85d13c.sidx to s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5/predictive_objects/fraud/1/f3d116da-9e09-4208-9830-7b08fa911200/m_6c06b43b3c85d13c.sidx
upload: c:\users\alon\appdata\local\temp\predictive_object_japony\f3d116da-9e09-4208-9830-7b08fa911200\m_6c06b43b3c85d13c.frame_idx to s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5/predictive_objects/fraud/1/f3d116da-9e09-4208-9830-7b08fa911200/m_6c06b43b3c85d13c.frame_idx
upload: c:\users\alon\appdata\local\temp\predictive_object_japony\pickle_archive to s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5/predictive_objects/fraud/1/pickle_archive
upload: c:\users\alon\appdata\local\temp\predictive_object_japony\version to s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5/predictive_objects/fraud/1/version
upload: c:\users\alon\appdata\local\temp\predictive_object_japony\f3d116da-9e09-4208-9830-7b08fa911200\m_6c06b43b3c85d13c.0000 to s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5/predictive_objects/fraud/1/f3d116da-9e09-4208-9830-7b08fa911200/m_6c06b43b3c85d13c.0000
Completed 10 of 10 part(s) with 1 file(s) remaining
2016-03-24 12:34:12,993 [INFO] graphlab.util.file_util, 244: Successfully uploaded to s3 path s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5/predictive_objects/fraud/1
upload: c:\users\alon\appdata\local\temp\predictive_object_japony\f3d116da-9e09-4208-9830-7b08fa911200\objects.bin to s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5/predictive_objects/fraud/1/f3d116da-9e09-4208-9830-7b08fa911200/objects.bin

In [29]:
# Predictive services must be displayed in a browser
gl.canvas.set_target('browser')

ps.show()


Canvas is accessible via web browser at the URL: http://localhost:55975/index.html
Opening Canvas in default web browser.
2016-03-24 12:35:38,621 [INFO] graphlab.deploy._predictive_service._predictive_service, 2530: retrieving metrics from predictive service...

RESTfully query the service


In [30]:
ps.query('fraud', method='predict', data={'dataset' : test[0]})


Out[30]:
{u'from_cache': False,
 u'model': u'fraud',
 u'response': [u'yes'],
 u'uuid': u'5bc151bf-e47a-4f84-b29f-11bb7d3ba4db',
 u'version': 1}

In [31]:
test[0]['fraud']


Out[31]:
'yes'