Text Learning with sklearn

This notebook gives a short overview of text learning with scikit-learn.

First, we will install and import the required Python packages.

Required packages and imports


In [1]:
%%capture
!pip install scikit-learn scipy numpy pandas matplotlib

import pandas as pd
import numpy as np
import math

%matplotlib inline
  • Scikit-learn is a machine learning package for Python built on top of SciPy, NumPy and matplotlib. It provides a large collection of machine learning techniques and many preprocessing methods.

  • SciPy and NumPy are libraries that add high-performance mathematical operations to Python. Thanks to NumPy, Python can efficiently operate on matrices and vectors. Because Python itself is an interpreted language, most of these operations are implemented in C.

  • Matplotlib adds a wide selection of plotting methods.

  • Pandas adds the dataframe structure to Python. Some people might recognize the similarity to R's data frame. You can easily read data from different sources into a dataframe, analyse it there, and apply a wide selection of manipulation methods to the data.

The Dataset

I chose a dataset of tagged SMS messages that was collected for SMS spam research. It contains a set of English SMS messages, each tagged as either harmless (ham) or spam.

So let's use the handy pandas method read_csv to import the CSV file into a dataframe.


In [2]:
data = pd.read_csv('spam.csv', encoding='latin-1')

In [3]:
# print the dimensions of the dataframe
print(data.shape)
data.head()


(5572, 5)
Out[3]:
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 ham Go until jurong point, crazy.. Available only ... NaN NaN NaN
1 ham Ok lar... Joking wif u oni... NaN NaN NaN
2 spam Free entry in 2 a wkly comp to win FA Cup fina... NaN NaN NaN
3 ham U dun say so early hor... U c already then say... NaN NaN NaN
4 ham Nah I don't think he goes to usf, he lives aro... NaN NaN NaN

As we can see, the data has 5 columns for 5572 SMS messages. We won't need the empty columns Unnamed: 2 to Unnamed: 4, so we're going to drop them.


In [4]:
data.drop(data.columns[[2, 3, 4]], axis=1, inplace=True)

In [5]:
data.head()


Out[5]:
v1 v2
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

Now we only have the interesting columns left. v1 needs some transformation too, so that we can work with numerical class labels. v2 contains the text we will use for training our classifiers. Let's rename the columns to something meaningful and convert v1 to numerical data before we begin preprocessing.


In [6]:
data.columns = ['class', 'message']
data['class'] = data['class'].map({'ham': 0, 'spam': 1})

In [7]:
print('Harmless messages in the dataset: {}\nSpam messages in the dataset: {}'
      .format(len(data[data['class'] == 0]),
              len(data[data['class'] == 1])
             )
     )


Harmless messages in the dataset: 4825
Spam messages in the dataset: 747

We have a total of 4825 harmless messages facing 747 spam messages, so roughly 13% of the dataset is spam.

Preprocessing

Currently we only have our two columns: the class label, which was already converted to numeric data, and the message text. Since we can't apply our algorithms to raw text, we need to convert the messages into a numerical representation. This is where the preprocessing capabilities of sklearn come in handy.

Let's look at some methods we can use here. But first, we need to split the data into training and test sets.

Train/Test split

We split the data into separate training and test sets, which gives us four variables: the input data X is split into X_train and X_test for training and testing our model, and the labels y are split into y_train and y_test in the same way.


In [8]:
X = data["message"]
y = data["class"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
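
By default, train_test_split holds out 25% of the samples for testing, and passing random_state=1 makes the split reproducible. A quick sanity check of the resulting split sizes:


In [ ]:
print(X_train.shape, X_test.shape)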

Bag of Words representation

The so-called Bag of Words representation is the standard strategy for converting text data into numerical vectors: each message becomes a vector of word counts over the whole vocabulary. We're going to convert our text using this strategy.

Occurrence count and vectorizing

The first step is converting the text into a numerical representation. This can be extremely memory-intensive, but thanks to sklearn we can use a sparse representation. Generally, we distinguish between a sparse and a dense matrix/vector:

  • sparse is a representation in which only the values and indexes of non-zero entries are stored. For example, the vector [1, 0, 0, 5, 0, 0, 9, 0] is stored as [(0, 1), (3, 5), (6, 9)].
  • dense is the opposite: all values are stored, including the zeros (see the sketch below).
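
A minimal sketch of the difference, using SciPy's csr_matrix on the example vector from the list above:


In [ ]:
from scipy.sparse import csr_matrix

dense = np.array([[1, 0, 0, 5, 0, 0, 9, 0]])
sparse = csr_matrix(dense)

print(sparse)            # prints only the non-zero entries with their positions
print(sparse.toarray())  # back to the dense form, zeros included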

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
# fit the vectorizer to our training data and vectorize it
X_train_cnt = count_vectorizer.fit_transform(X_train)
# since the count vectorizer is already fitted, we now only need to transform the test data
X_test_cnt = count_vectorizer.transform(X_test)

print(X_train_cnt[5])


  (0, 886)	1
  (0, 1388)	1
  (0, 4031)	1
  (0, 922)	1
  (0, 3369)	1
  (0, 1070)	1
  (0, 1148)	1
  (0, 6627)	1
  (0, 3416)	1
  (0, 1377)	1
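
Each line of this output is a (row, feature index) pair followed by the count of that feature. To see which token an index stands for, we can ask the fitted vectorizer (a sketch; in scikit-learn versions before 1.0 the method is called get_feature_names()):


In [ ]:
# map a feature index from the sparse output back to its token
print(count_vectorizer.get_feature_names_out()[886])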

Now we have converted our data into a numerical representation. Since there are many common words that carry little information, like the, it, a etc., we may need to transform the data further.
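
One simple option is to let CountVectorizer drop a built-in list of English stop words via its stop_words parameter (a minimal sketch; the _sw names are just illustrative). Another, complementary approach is the tf-idf re-weighting shown next.


In [ ]:
# count occurrences while ignoring common English stop words
count_vectorizer_sw = CountVectorizer(stop_words='english')
X_train_cnt_sw = count_vectorizer_sw.fit_transform(X_train)
X_test_cnt_sw = count_vectorizer_sw.transform(X_test)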

Tf-idf transformation

The Tf-idf (term frequency-inverse document frequency) transformer helps us re-weight the data into a more useful form: a term's weight increases with how often it occurs in a message, but decreases with how many messages contain it, so very common words lose influence.
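
With scikit-learn's defaults (smooth_idf=True, norm='l2'), the weight of term t in document d is tf(t, d) * idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1, n is the number of documents, and df(t) is the number of documents containing t; every resulting row is then L2-normalized. A minimal sketch checking this idf formula against the transformer's idf_ attribute, using an illustrative toy corpus:


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy = ['free entry win', 'free offer', 'see you soon']  # illustrative corpus
counts = CountVectorizer().fit_transform(toy)
transformer = TfidfTransformer().fit(counts)

# document frequency: in how many documents each term occurs
df = np.bincount(counts.indices, minlength=counts.shape[1])
manual_idf = np.log((1 + counts.shape[0]) / (1 + df)) + 1

print(np.allclose(manual_idf, transformer.idf_))  # should print True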


In [10]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()

# fit the transformer and convert our training data
X_train_tfidf = tfidf_transformer.fit_transform(X_train_cnt)
# transform test data
X_test_tfidf = tfidf_transformer.transform(X_test_cnt)

print(X_train_tfidf[5])


  (0, 1377)	0.453824589042
  (0, 3416)	0.235594989933
  (0, 6627)	0.351303369509
  (0, 1148)	0.273166267225
  (0, 1070)	0.224012703785
  (0, 3369)	0.275704138848
  (0, 922)	0.177379205347
  (0, 4031)	0.373014218532
  (0, 1388)	0.384504019259
  (0, 886)	0.309618445742
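
As a side note, sklearn also offers TfidfVectorizer, which combines both steps (occurrence counting and tf-idf re-weighting) in a single estimator; an equivalent sketch:


In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)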
