From the scikit-learn documentation:
Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.
We will use CountVectorizer to "convert text into a matrix of token counts":
In [5]:
from sklearn.feature_extraction.text import CountVectorizer
In [6]:
# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'help']
In [10]:
# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(simple_train)
# vect.get_feature_names()
vect.vocabulary_
Out[10]:
In [11]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
Out[11]:
In [6]:
# print the sparse matrix
print(simple_train_dtm)
In [14]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
Out[14]:
In [16]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
Out[16]:
In [9]:
# create a document-term matrix on your own
simple_train = ["call call Sorry, Ill later",
"K Did you me call ah just now",
"I call you later, don't have network. If urgnt, sms me"]
In [10]:
# complete your work below
# instantiate vectorizer
# fit
# transform
# convert to dense matrix
vec2 = CountVectorizer(binary=True)
vec2.fit(simple_train)
my_dtm2 = vec2.transform(simple_train)
pd.DataFrame(my_dtm2.toarray(), columns=vec2.get_feature_names())
Out[10]:
From the scikit-learn documentation:
In this scheme, features and samples are defined as follows:
- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
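The quoted paragraph also mentions a "Bag of n-grams" representation. As a small aside (not in the original notebook), CountVectorizer can build that representation through its ngram_range parameter; the sketch below reuses the toy documents from earlier and simply shows that adjacent word pairs are added to the vocabulary:
In [ ]:
# minimal sketch: unigrams + bigrams ("Bag of n-grams")
from sklearn.feature_extraction.text import CountVectorizer
docs = ['call you tonight', 'Call me a cab']
vect_ngram = CountVectorizer(ngram_range=(1, 2))
vect_ngram.fit(docs)
# vocabulary now holds single words and adjacent word pairs such as 'call you'
print(vect_ngram.get_feature_names())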
In [10]:
vect.get_feature_names()
Out[10]:
In [11]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me devon"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
Out[11]:
In [12]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
Out[12]:
Summary:
- vect.fit(train) learns the vocabulary of the training data
- vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
- vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
In [13]:
# read tab-separated file
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
col_names = ['label', 'message']
sms = pd.read_table(url, sep='\t', header=None, names=col_names)
print(sms.shape)
In [14]:
sms.head(5)
Out[14]:
In [15]:
sms.label.value_counts()
Out[15]:
In [16]:
# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})
In [17]:
# define X and y
X = sms.message
y = sms.label
In [21]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
In [27]:
# instantiate the vectorizer
vect = CountVectorizer()
In [28]:
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm
Out[28]:
In [29]:
# alternative: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm
Out[29]:
In [30]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
Out[30]:
In [31]:
# store token names
X_train_tokens = vect.get_feature_names()
In [32]:
# first 50 tokens
print(X_train_tokens[:50])
In [33]:
# last 50 tokens
print(X_train_tokens[-50:])
In [34]:
# view X_train_dtm as a dense matrix
X_train_dtm.toarray()
Out[34]:
In [35]:
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts
Out[35]:
In [36]:
# create a DataFrame of tokens with their counts
pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts}).sort_values(by='count', ascending=True)
Out[36]:
In [29]:
# create separate DataFrames for ham and spam
sms_ham = sms[sms.label==0] # ham
sms_spam = sms[sms.label==1] # spam
In [30]:
# learn the vocabulary of ALL messages and save it
vect.fit(sms.message)
all_tokens = vect.get_feature_names()
In [31]:
# create document-term matrices for ham and spam
ham_dtm = vect.transform(sms_ham.message)
spam_dtm = vect.transform(sms_spam.message)
In [32]:
ham_dtm.shape, spam_dtm.shape
Out[32]:
In [33]:
# count how many times EACH token appears across ALL ham messages
ham_counts = np.sum(ham_dtm.toarray(), axis=0)
In [34]:
ham_counts
Out[34]:
In [35]:
# count how many times EACH token appears across ALL spam messages
spam_counts = np.sum(spam_dtm.toarray(), axis=0)
In [36]:
spam_counts
Out[36]:
In [37]:
all_tokens[0:5]
Out[37]:
In [38]:
# create a DataFrame of tokens with their separate ham and spam counts
token_counts = pd.DataFrame({'token':all_tokens, 'ham':ham_counts, 'spam':spam_counts})
In [39]:
token_counts
Out[39]:
In [40]:
# add one to ham and spam counts to avoid dividing by zero (in the step that follows)
token_counts['ham'] = token_counts.ham + 1
token_counts['spam'] = token_counts.spam + 1
In [41]:
# calculate ratio of spam-to-ham for each token
token_counts['spam_ratio'] = token_counts.spam / token_counts.ham
token_counts.sort_values(by='spam_ratio', ascending=False)
Out[41]:
In [43]:
# observe spam messages that contain the word 'claim'
claim_messages = sms.message[sms.message.str.contains('claim')]
for message in claim_messages[0:5]:
    print(message, '\n')
We will use Multinomial Naive Bayes:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
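To make the link between integer token counts and the classifier concrete, here is a self-contained toy sketch (not part of the original notebook); feature_log_prob_ is the fitted attribute holding the smoothed per-class log probabilities of each token:
In [ ]:
# minimal sketch: MultinomialNB on a tiny count matrix
import numpy as np
from sklearn.naive_bayes import MultinomialNB
X_toy = np.array([[2, 0, 1],   # document 0, class 0
                  [1, 1, 0],   # document 1, class 0
                  [0, 3, 2]])  # document 2, class 1
y_toy = np.array([0, 0, 1])
nb_toy = MultinomialNB()       # alpha=1.0 by default (Laplace smoothing)
nb_toy.fit(X_toy, y_toy)
print(nb_toy.feature_log_prob_)               # smoothed log P(token | class)
print(nb_toy.predict(np.array([[0, 2, 1]])))  # predicted class for new counts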
In [37]:
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB, GaussianNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
Out[37]:
In [38]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
In [39]:
# calculate accuracy of class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
In [41]:
print(metrics.classification_report(y_test, y_pred_class))
In [43]:
metrics.confusion_matrix(y_test, y_pred_class)
Out[43]:
In [47]:
?metrics.confusion_matrix
In [48]:
# confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))
In [49]:
# predict (poorly calibrated) probabilities
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
Out[49]:
In [50]:
# calculate AUC
print(metrics.roc_auc_score(y_test, y_pred_prob))
In [51]:
# print message text for the false positives
X_test[y_test < y_pred_class]
Out[51]:
In [52]:
# print message text for the false negatives
X_test[y_test > y_pred_class]
Out[52]:
In [ ]:
# what do you notice about the false negatives?
# X_test[3132]
In [ ]:
# Create a logistic regression
# import/instantiate/fit
In [ ]:
# class predictions and predicted probabilities
In [ ]:
# calculate accuracy and AUC
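One way to complete the exercise above (a sketch, reusing X_train_dtm, X_test_dtm, y_train, and y_test from the cells above; LogisticRegression is used with its default settings):
In [ ]:
# import/instantiate/fit
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

# class predictions and predicted probabilities
y_pred_class_lr = logreg.predict(X_test_dtm)
y_pred_prob_lr = logreg.predict_proba(X_test_dtm)[:, 1]

# calculate accuracy and AUC
print(metrics.accuracy_score(y_test, y_pred_class_lr))
print(metrics.roc_auc_score(y_test, y_pred_prob_lr))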