Essentially, all models are wrong, but some are useful.
-George E. P. Box
Scikit-Learn is a Python library that provides a consistent and relatively painless API for a number of common machine learning algorithms. For our purposes, you can think of machine learning as a mechanism to create mathematical models that learn from data.
The Scikit-Learn package provides interfaces for supervised and unsupervised techniques like:

- Classification (e.g., logistic regression, support vector machines)
- Regression (e.g., ordinary least squares)
- Clustering (e.g., k-means)
- Dimensionality reduction (e.g., principal component analysis)

And supporting functions for things like:

- Preprocessing and feature extraction (scaling, text vectorization)
- Model selection (train/test splitting, cross-validation, grid search)
- Evaluation metrics (accuracy, ROC curves, and so on)
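To give a feel for that consistent API, here is a minimal sketch (the toy numbers are made up for illustration, not taken from anything below) showing that two very different estimators are constructed, fit(), and used to predict() in exactly the same way.
In [ ]:
# A minimal sketch of the shared estimator interface:
# construct the estimator, fit() it to data, then predict()
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # toy features
y = [0, 0, 1, 1]                                      # toy labels
for model in (LogisticRegression(), DecisionTreeClassifier()):
    model.fit(X, y)                      # same call on every estimator
    print(model.predict([[2.5, 2.5]]))   # same call on every estimator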
I am not an expert on machine learning (or on anything else for that matter), so you should view all my claims with suspicion.
We will not be going into the principles behind these algorithms. If you want to learn about machine learning from a mathematical perspective, take a look at Fundamentals of Machine Learning for Predictive Data Analytics.
If you plan to use machine learning (or models in any capacity) for bank work, you need Model Risk Management's blessing so you don't run afoul of OCC guidance on the matter.
This looks complicated, but it's not hard once you get used to it.
There are two fundamental steps: the creation of a predictive model and the use of that model to make a prediction.
Ignore everything on the green pathway for now. What we are doing conceptually is taking a set of features (characteristics of each sample) and matching them up with labels (the target values we will train our model to determine).
So say we want to make a classifier that tells Great White Sharks ('SHARK') from cod ('COD'). We go out, catch some cod and some sharks, and record each fish's weight and length.
| Weight (kilos) | Length (meters) | Target |
|---|---|---|
| 700 | 5 | SHARK |
| 100 | 2 | COD |
| 650 | 4 | SHARK |
| 90 | 2 | COD |
| 600 | 4 | SHARK |
In other words, we take data and translate it into 5 separate [weight, length] feature vectors [[700, 5], [100, 2], [650, 4], [90, 2], [600, 4]]. This is the same length as our target vector with labels: [SHARK, COD, SHARK, COD, SHARK].
For the purposes of this section, we will use X to refer to the feature matrix (one row per sample, one column per feature) and y to refer to the label/target vector.
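To make that notation concrete, here is a minimal sketch (assuming numpy imported as np, as in the cells below) that builds X and y from the shark/cod table and checks their shapes.
In [ ]:
import numpy as np
# X: one row per sample, one column per feature (weight, length)
X = np.array([[700, 5], [100, 2], [650, 4], [90, 2], [600, 4]])
# y: one label per sample, so its length matches the number of rows in X
y = np.array(['SHARK', 'COD', 'SHARK', 'COD', 'SHARK'])
print(X.shape)  # (5, 2): 5 samples, 2 features
print(y.shape)  # (5,): 5 labels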
We plug our feature vectors (X) and our labels (y) into our machine learning algorithm, along with any hyperparameters, which are non-data settings that control how the algorithm itself operates. Our output is a predictive model.
The data we use to train the algorithm is called the training data set. It is separate from our testing data set.
With this model, we can then take a feature vector and predict what class/target/label it will belong to. So for example, if we had
| Weight (kilos) | Length (meters) |
|---|---|
| 600 | 3 |
| 80 | 1 |
| 650 | 5 |
Intuitively, we can see we should get out [SHARK, COD, SHARK].
The data we use to make predictions is the test data set. We keep training and testing data separate so that our evaluation measures how the model performs on data it has never seen, rather than rewarding a model that has merely memorized (overfit) its training data.
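As a minimal sketch of that separation (illustrative only; in this notebook's older sklearn the helper lives in sklearn.cross_validation, while newer versions move it to sklearn.model_selection):
In [ ]:
# Hold some samples out; the model never sees them during fitting
from sklearn.cross_validation import train_test_split
X = [[700, 5], [100, 2], [650, 4], [90, 2], [600, 4]]
y = ['SHARK', 'COD', 'SHARK', 'COD', 'SHARK']
# Keep 40% of the samples aside purely for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
print(len(X_train), 'training samples;', len(X_test), 'testing samples')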
First, we take our inputs and, if necessary, fit_transform() them into feature vectors that the machine learning algorithm can digest. Certain types of data, such as text, must be transformed into numbers, and some algorithms work better when the data is normalized. Generally, models will take numpy array objects filled with numbers.
Second, we fit() our algorithm to our data (X) and labels (y), which gives us our model.
Then, to make a prediction, we take the inputs for the items we want to predict. We transform() them into vectors if necessary, using the transformer we already fit on the training data so both sets are on the same scale, and then we use predict() to determine the predicted classification.
In [ ]:
## Here we will use a simple logistic regression.
# https://en.wikipedia.org/wiki/Logistic_regression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import sklearn
%matplotlib inline
matplotlib.style.use('fivethirtyeight')
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
###
# First we create the model
###
# First we instantiate/create the model we want to use
# You usually supply model parameters at creation
log_reg = LogisticRegression(random_state=0)
# Then we add a scaler (optional)
scaler = StandardScaler()
# Input data manually (never do this) as 5 samples of 2 features
train_data = np.array([[700.0, 5.0],
                       [100.0, 2.0],
                       [650.0, 4.0],
                       [ 90.0, 2.0],
                       [600.0, 4.0]])
# Normalize data for analysis.
train_vectors = scaler.fit_transform(train_data)
print('Normalized data:\n')
print(train_vectors)
# Set up our targets manually (never do this)
targets = np.array(['SHARK', 'COD', 'SHARK', 'COD', 'SHARK'])
# Then we fit our model to the data
log_reg.fit(train_vectors, targets)
In [ ]:
###
# Second we make our predictions
###
# Put above data into vectors manually
test_data = np.array([[600.0, 3.6],
                      [ 80.0, 1.5],
                      [650.0, 4.5]])
# Transform with the scaler we already fit on the training data, so inputs are on the same scale
test_vectors = scaler.transform(test_data)
# Run a basic prediction of our three entries
prediction = log_reg.predict(test_vectors)
# Get probabilities of our three entries
probabilities = log_reg.predict_proba(test_vectors)
# Get classes
classes = log_reg.classes_
# Get coefficients
coefficients = log_reg.coef_
print(prediction, '\n')
print(log_reg.coef_, '\n')
pd.DataFrame(data=probabilities, columns=log_reg.classes_)
In [ ]:
# Plot decision boundary to help visualize.
x = np.arange(-3, 4, .1)
y = np.arange(-3, 4, .1)
# Create meshgrid
xx, yy = np.meshgrid(x, y)
# Calculate for each value on the grid
Z = log_reg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
# Map the string labels to integers for plotting (COD=1, SHARK=0)
Z = (Z == 'COD').astype(int)
# Graph mesh and scatter.
plt.figure(1, figsize=(6, 6))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Pastel2)
plt.scatter(train_vectors.T[0], train_vectors.T[1])
# Add text for clarity
plt.annotate('Sharks, yo.', (0, 2), (1,2), arrowprops=dict(facecolor='black', shrink=0.05))
plt.annotate('DAT COD, THO.', (-2, -2))
plt.gca().set_xlabel('Weight Scaled Score')
plt.gca().set_ylabel('Length Scaled Score')
plt.gca().set_xlim(-3, 3)
plt.gca().set_ylim(-3, 3)
You will interact with most sklearn estimators in a standard way: you fit() or fit_transform(), and then you predict(). Because these estimators share that interface, we can chain them into "pipelines" that can be treated as single functional units.
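As a quick sketch before the CFPB example (this block just repackages the shark/cod scaler and logistic regression from above; the step names here are illustrative), wrapping the two steps in a Pipeline means one fit() call scales the data and fits the classifier, and one predict() call applies the already-fitted scaler before classifying.
In [ ]:
# A minimal sketch: the earlier scaler + logistic regression as one pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
fish_pipeline = Pipeline([('scale', StandardScaler()),
                          ('clf', LogisticRegression())])
X = np.array([[700.0, 5.0], [100.0, 2.0], [650.0, 4.0], [90.0, 2.0], [600.0, 4.0]])
y = np.array(['SHARK', 'COD', 'SHARK', 'COD', 'SHARK'])
# One call scales the training data and fits the classifier
fish_pipeline.fit(X, y)
# One call scales the new data with the fitted scaler and predicts
print(fish_pipeline.predict(np.array([[600.0, 3.0], [80.0, 1.0], [650.0, 5.0]])))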
To show you the benefits of pipelines, we will create a text classifier from publicly available CFPB data.
In [ ]:
import pandas as pd
# Create df
df = pd.read_csv('data/cfpb_complaints_with_fictitious_data.csv')
# Let's use Product as the target and the complaint narrative as the feature
df = df[['Product', 'Consumer complaint narrative']]
df.head(5)
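Before fitting anything, it is worth a quick look at how complaints are distributed across products. This is a minimal sketch using plain pandas (value_counts() is a pandas method, not part of sklearn).
In [ ]:
# How many complaints fall under each Product label?
df['Product'].value_counts()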
In [ ]:
###
# Create Pipeline
###
# First we import the Pipeline class
from sklearn.pipeline import Pipeline
# Import vectorizer
# This translates documents into word-count vectors (one dimension per word)
from sklearn.feature_extraction.text import CountVectorizer
# Import TFIDF scaler: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
# This helps adjust for word frequency and importance
from sklearn.feature_extraction.text import TfidfTransformer
# Import Stochastic Gradient Descent classifier
from sklearn.linear_model import SGDClassifier
# Now make our pipeline
# Supply the steps as (name, estimator) pairs, in order
pipeline = Pipeline([('vec', CountVectorizer()),
                     ('trans', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='log'))])
pipeline
In [ ]:
###
# Fit our text to our pipeline
###
# Complaint text (a pandas Series; sklearn handles the pandas-to-numpy translation)
text_vec = df['Consumer complaint narrative']
# Type of product vector
target_vec = df['Product']
# These pipeline meta-estimators only require fit() and predict()
pipeline.fit(text_vec, target_vec)
In [ ]:
dummy_complaints = ['Bank of America are such jerks. They screwed up my mortgage payments and then put me out of my home. I tried to get a loan modification program under HAMP but they screwed that up to.',
'JPMorgan Chase is the devil. My credit card was fraudulently used and my line of credit was screwed up and my credit score.',
'Wells Fargo\'s logo is a stupid color. Red and yellow is an ugly combo for a bank.']
pipeline.predict(dummy_complaints)
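Because the final step uses loss='log', the pipeline also exposes predict_proba(). Here is a minimal sketch (assuming the cells above have been run) showing class probabilities for the same dummy complaints.
In [ ]:
# Class probabilities for the dummy complaints (available because loss='log')
probs = pipeline.predict_proba(dummy_complaints)
pd.DataFrame(data=probs, columns=pipeline.named_steps['clf'].classes_)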
In [ ]:
# Import convenience method for splitting a dataset
from sklearn.cross_validation import train_test_split
# Will be sklearn.model_selection in future versions
# Import basic accuracy score
from sklearn.metrics import accuracy_score
# Randomly divide into train and test
train, test = train_test_split(df, test_size=.2)
train_X = train['Consumer complaint narrative']
train_y = train['Product']
test_X = test['Consumer complaint narrative']
test_y = test['Product']
print('{} training samples.'.format(len(train_X)))
print('{} testing samples.'.format(len(test_X)))
# We need to re-fit our pipeline to our training data
pipeline.fit(train_X, train_y)
# Conduct a simple accuracy score. First predict
predicted_test_y = pipeline.predict(test_X)
true_test_y = test_y
# Types of scores here: http://scikit-learn.org/stable/modules/model_evaluation.html
score = accuracy_score(true_test_y, predicted_test_y)
print('Our accuracy score was {:.1%}.'.format(score))
# This is essentially calculating the percent of correct predictions.
pd.DataFrame({'Predicted': predicted_test_y,
'True': true_test_y}).head(5)
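Accuracy is only one of the scores listed at that link. As a minimal sketch reusing the predictions above, classification_report gives a per-class breakdown of precision, recall, and F1.
In [ ]:
# Per-class precision, recall, and F1 for the same test predictions
from sklearn.metrics import classification_report
print(classification_report(true_test_y, predicted_test_y))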
In [ ]:
# We can also use the built in cross validation tools
# https://en.wikipedia.org/wiki/Cross-validation_(statistics)
# Import cross validation score
from sklearn.cross_validation import cross_val_score
# Will be sklearn.model_selection in future versions
# Because this will fit and test our model 5 times, use a data subset to keep it quick
subset = df.sample(frac=.5)
# This takes our model and data, outputs series of scores
cvs = cross_val_score(# model
pipeline,
# Features
subset['Consumer complaint narrative'],
# Target
subset['Product'],
# Number of folds
cv=5,
# Scoring methodology
scoring='accuracy')
print(cvs)
print(cvs.mean())
In [ ]:
# We can also do fancy stuff like receiver operating characteristics
# https://en.wikipedia.org/wiki/Receiver_operating_characteristic
# Here we do a ROC curve for Debt Collection
# We can do multiple curves if we want more than a single category.
# Import ROC curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# Split data
train_X, test_X, train_y, test_y = train_test_split(df['Consumer complaint narrative'],
df['Product'],
test_size=.2)
# Fit model
pipeline.fit(train_X, train_y)
# Create probabilities
probs = pipeline.predict_proba(test_X)
# Pick out the probability column for 'Debt collection'
classes = pipeline.named_steps['clf'].classes_
debt_probs = probs[:, list(classes).index('Debt collection')]
# Convert Debt collection to True, everything else to False
masked = test_y == 'Debt collection'
# Calculate the ROC curve
false_pos, true_pos, threshold = roc_curve(masked, debt_probs)
# Calculate ROC AUC
auc_score = roc_auc_score(masked, debt_probs)
# Plot ROC
plt.plot(false_pos, true_pos, label='ROC AUC = {:.1%}'.format(auc_score))
plt.plot([0, 1], [0, 1], color='black', label='Line of No Discrimination', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.gca().set_xlim(-.01, 1)
plt.gca().set_ylim(0, 1.02)
plt.legend(loc='lower right', fontsize='medium')
In [ ]:
# This takes a long time. Skipping.
# from sklearn.grid_search import GridSearchCV
# sklearn.model_selection in future versions.
# Remember what our pipeline looks like:
# pipeline = Pipeline([('vec', CountVectorizer()),
# ('trans', TfidfTransformer()),
# ('clf', SGDClassifier(loss='log'))])
# Put all the parameters we want to search through in a dict
# parameters = {'vec__ngram_range': [(1, 1), (1, 2)],
# 'trans__use_idf': (True, False),
# 'clf__loss': ['hinge']}
# Set up a meta-classifier that searches over these parameters
# optimized_clf = GridSearchCV(pipeline, parameters, n_jobs=1)
# Fit metaclassifier
# optimized_clf.fit(df['Consumer complaint narrative'],
# df['Product'])
# Use best model as a predictor
# print(optimized_clf.predict(['I hate Citibank because they screwed up my mortgage']))
# Get best parameters
# print(optimized_clf.best_params_)