Scikit-learn (http://scikit-learn.org/) is an open-source machine learning library for Python that offers a variety of regression, classification and clustering algorithms.
In this section we'll perform a fairly simple classification exercise with scikit-learn. In the next section we'll leverage the machine learning strength of scikit-learn to perform natural language classifications.
conda install scikit-learn
or
pip install -U scikit-learn
Scikit-learn additionally requires that NumPy and SciPy be installed. For more info visit http://scikit-learn.org/stable/install.html
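To verify the installation, you can check the library version from Python:
import sklearn
print(sklearn.__version__)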
For this exercise we'll be using the SMSSpamCollection dataset from the UCI Machine Learning Repository, which contains more than 5,000 SMS phone messages.
You can check out the sms_readme file for more info.
The file is a tab-separated values (TSV) file with four columns:
label - every message is labeled as either ham or spam
message - the message itself
length - the number of characters in each message
punct - the number of punctuation characters in each message
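For reference, the last two columns could have been derived from the raw message text with something like this (a sketch; the example string is hypothetical):
import string
msg = 'Hello world! How are you today?'              # hypothetical example message
length = len(msg)                                    # character count, as in the 'length' column
punct = sum(ch in string.punctuation for ch in msg)  # punctuation count, as in the 'punct' column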
In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('../TextFiles/smsspamcollection.tsv', sep='\t')
df.head()
Out[1]:
In [2]:
len(df)
Out[2]:
In [3]:
df.isnull().sum()
Out[3]:
In [4]:
df['label'].unique()
Out[4]:
In [5]:
df['label'].value_counts()
Out[5]:
We see that 4825 out of 5572 messages, or 86.6%, are ham.
This means that any machine learning model we build has to perform **better than 86.6%** to beat the baseline of simply labeling every message as ham.
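As a sanity check, this baseline accuracy can be computed directly from the label column:
# fraction of the majority class ('ham'), i.e. the accuracy of always predicting 'ham'
df['label'].value_counts(normalize=True).max()   # ≈ 0.866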
In [6]:
df['length'].describe()
Out[6]:
The length values are extremely skewed: the mean is 80.5, yet the maximum length is 910 characters. Let's plot this on a logarithmic x-axis.
In [7]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.xscale('log')                  # message lengths span several orders of magnitude
bins = 1.15**(np.arange(0,50))     # logarithmically spaced bin edges
plt.hist(df[df['label']=='ham']['length'], bins=bins, alpha=0.8)
plt.hist(df[df['label']=='spam']['length'], bins=bins, alpha=0.8)
plt.legend(('ham','spam'))
plt.show()
It looks like there's a small range of values where a message is more likely to be spam than ham.
Now let's look at the punct column:
In [8]:
df['punct'].describe()
Out[8]:
In [9]:
plt.xscale('log')
bins = 1.5**(np.arange(0,15))
plt.hist(df[df['label']=='ham']['punct'],bins=bins,alpha=0.8)
plt.hist(df[df['label']=='spam']['punct'],bins=bins,alpha=0.8)
plt.legend(('ham','spam'))
plt.show()
This looks even worse - there's no range of punctuation counts where a message is more likely to be spam than ham. We'll still try to build a machine learning classification model, but we should expect poor results.
If we wanted to divide the DataFrame into two smaller sets, we could use
from sklearn.model_selection import train_test_split
train, test = train_test_split(df)
For our purposes let's also set up our Features (X) and Labels (y). The Label is simple - we're trying to predict the label column in our data. For Features we'll use the length and punct columns. By convention, X is capitalized and y is lowercase.
There are two ways to build a feature set from the columns we want. If the number of features is small, then we can pass those in directly:
X = df[['length','punct']]
If the number of features is large, then it may be easier to drop the Label and any other unwanted columns:
X = df.drop(['label','message'], axis=1)
These operations make copies of df, but do not change the original DataFrame in place. All the original data is preserved.
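A quick way to confirm that df keeps all of its columns (a sketch):
features = df.drop(['label','message'], axis=1)   # returns a new DataFrame
print(df.columns.tolist())                        # the original columns are all still present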
In [10]:
# Create Feature and Label sets
X = df[['length','punct']] # note the double set of brackets
y = df['label']
In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print('Training Data Shape:', X_train.shape)
print('Testing Data Shape: ', X_test.shape)
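Since the classes are imbalanced, it's worth knowing that train_test_split can also preserve the ham/spam ratio in both splits (an optional variant, not used in this section):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)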
Now we can pass these sets into a series of different training & testing algorithms and compare their results.
One of the simplest classification tools is logistic regression. Scikit-learn's implementation offers a variety of algorithmic solvers; we'll use L-BFGS.
In [12]:
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(solver='lbfgs')
lr_model.fit(X_train, y_train)
Out[12]:
In [13]:
from sklearn import metrics
# Create a prediction set:
predictions = lr_model.predict(X_test)
# Print a confusion matrix
print(metrics.confusion_matrix(y_test,predictions))
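Since the labels sort alphabetically, the rows are the true classes (ham, spam) and the columns are the predictions. Treating spam as the positive class, the four cells can be unpacked like this (a sketch):
# row-major order: true ham/pred ham, true ham/pred spam, true spam/pred ham, true spam/pred spam
tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions, labels=['ham','spam']).ravel()
print(tn, fp, fn, tp)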
In [14]:
# You can make the confusion matrix less confusing by adding labels.
# Store it in a new variable so we don't overwrite our original df:
df_cm = pd.DataFrame(metrics.confusion_matrix(y_test,predictions), index=['ham','spam'], columns=['ham','spam'])
df_cm
Out[14]:
These results are terrible! More spam messages were misclassified as ham (241) than were correctly identified as spam (5), although a relatively small number of ham messages (46) were misclassified as spam.
In [15]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))
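The report's precision (TP/(TP+FP)) and recall (TP/(TP+FN)) for the spam class can also be computed individually:
print(metrics.precision_score(y_test, predictions, pos_label='spam'))
print(metrics.recall_score(y_test, predictions, pos_label='spam'))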
In [16]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))
This model performed *worse* than a classifier that simply labeled every message as ham would have!
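To make that comparison concrete, scikit-learn's DummyClassifier can stand in for the all-ham strategy (a sketch):
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')   # always predicts the majority class
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))                  # roughly the 86.6% baseline on this split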
One of the most common - and successful - classifiers is naïve Bayes.
In [17]:
from sklearn.naive_bayes import MultinomialNB
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
Out[17]:
In [18]:
predictions = nb_model.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions))
The total number of misclassifications dropped from **287** (241+46) to **256** (246+10).
In [19]:
print(metrics.classification_report(y_test,predictions))
In [20]:
print(metrics.accuracy_score(y_test,predictions))
Better, but still below the 86.6% baseline.
Among the SVM options available, we'll use C-Support Vector Classification (SVC).
In [21]:
from sklearn.svm import SVC
svc_model = SVC(gamma='auto')
svc_model.fit(X_train,y_train)
Out[21]:
In [22]:
predictions = svc_model.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions))
The total number of misclassifications dropped even further, to **209**.
In [23]:
print(metrics.classification_report(y_test,predictions))
In [24]:
print(metrics.accuracy_score(y_test,predictions))
And finally we have a model that performs *slightly* better than the all-ham baseline.
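As a wrap-up, here's a quick side-by-side comparison, assuming all three models above have been fitted:
for name, model in [('LogisticRegression', lr_model), ('MultinomialNB', nb_model), ('SVC', svc_model)]:
    print(name, model.score(X_test, y_test))   # mean accuracy on the test set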