In [1]:
%%capture
!pip install scikit-learn scipy numpy pandas matplotlib
import pandas as pd
import numpy as np
import math
%matplotlib inline
Scikit-learn is a machine learning package for Python built on top of NumPy, SciPy and matplotlib. It provides a large collection of machine learning algorithms along with many preprocessing utilities.
NumPy and SciPy are libraries that add high-performance mathematical operations to Python. NumPy provides efficient operations on vectors and matrices. Because Python is an interpreted language, most of these operations are implemented in compiled languages such as C.
Matplotlib adds a wide selection of different plotting methods.
Pandas adds the DataFrame structure to Python. Some readers will recognize the similarity to the data frame in R. You can read data from many different sources into a DataFrame, analyse it there and apply a wide selection of manipulation methods to it.
In [2]:
data = pd.read_csv('spam.csv', encoding='latin-1')
In [3]:
# print the dimensions of the dataframe
print(data.shape)
data.head()
Out[3]:
As we can see, the data has 5 columns for 5572 SMS messages. We won't need the empty columns Unnamed: 2 to Unnamed: 4, so we're going to drop them.
In [4]:
data.drop(data.columns[[2, 3, 4]], axis=1, inplace=True)
In [5]:
data.head()
Out[5]:
Now only the relevant columns remain. v1 holds the label and still needs to be transformed so that we can classify into numerical classes. v2 contains the message text, which we're going to use for training our classifiers. Let's rename the columns to something meaningful and convert v1 to numerical data before we begin preprocessing.
In [6]:
data.columns = ['class', 'message']
data['class'] = data['class'].map({'ham': 0, 'spam': 1})
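One detail of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key in the dictionary becomes `NaN`. The sketch below uses a made-up three-element series (not the real dataset) with a deliberately mis-cased label to show how such a case could be detected.

```python
import pandas as pd

# Hypothetical labels; "Ham" is intentionally mis-cased to show the effect.
raw = pd.Series(["ham", "spam", "Ham"])

# Values that are not keys of the dictionary are mapped to NaN.
mapped = raw.map({"ham": 0, "spam": 1})

# Count how many labels could not be mapped.
n_unmapped = int(mapped.isna().sum())
print(mapped)
print("Unmapped labels:", n_unmapped)
```

Running the same `isna().sum()` check on `data['class']` after the mapping is a cheap way to confirm that every row received a numeric class.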
In [7]:
print('Harmless messages in the dataset: {}\nSpam messages in the dataset: {}'
      .format(len(data[data['class'] == 0]),
              len(data[data['class'] == 1])))
The dataset contains 4825 harmless messages and 747 spam messages.
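With 747 of 5572 messages labelled as spam, roughly 13.4% of the data belongs to the positive class, so the classes are clearly imbalanced. As a small sketch, `value_counts` reports both the absolute counts and the shares in one go. The label series below is a made-up stand-in, not the real data.

```python
import pandas as pd

# Toy stand-in for the label column (0 = ham, 1 = spam); values are made up.
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="class")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction per class

print(counts)
print(shares)
```

On the notebook's dataframe the same calls would be applied to `data['class']`.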
So far we have two columns: the label, which is already numeric, and the raw message text. Since our algorithms can't operate on text directly, we need to convert the text into a numerical representation. This is where the preprocessing capabilities of sklearn come in handy.
Let's look at some methods we can use here. The first important step, though, is splitting the data into training and test sets.
The split produces four variables.
The input data X is split into two sets, one for training and one for testing the model, and the labels y are split the same way. Since we don't pass test_size, train_test_split uses its default and holds out 25% of the data for testing.
In [8]:
X = data["message"]
y = data["class"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
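Because spam is the minority class, it can be useful to keep the class ratio identical in both sets. The following sketch is not part of the original workflow. It uses hypothetical toy data and the `stratify` parameter of `train_test_split` to show the four resulting variables and their class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 32 "messages", 8 of them spam (25%).
X_toy = np.array(["msg {}".format(i) for i in range(32)])
y_toy = np.array([1] * 8 + [0] * 24)

# stratify=y_toy keeps the spam share the same in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

print(len(X_tr), len(X_te))       # sizes of the two input sets
print(y_tr.mean(), y_te.mean())   # spam share in each label set
```

Whether stratification is needed for the SMS data is a judgment call; with several thousand rows a plain random split is usually close to the overall ratio anyway.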
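The so-called Bag of Words representation is a strategy for converting text data into numerical vectors: each message becomes a vector of word counts over a fixed vocabulary. We're going to convert our text using this strategy.
The first step is converting the text into counts. Stored as a dense matrix this can be extremely memory intensive, because there is one column per vocabulary word and most entries are zero. sklearn therefore returns a sparse representation. In general we distinguish between sparse and dense matrices/vectors: a dense one stores every entry, while a sparse one stores only the non-zero entries and their positions.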
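To make the difference between dense and sparse storage concrete, here is a minimal sketch using `scipy.sparse.csr_matrix`, the format `CountVectorizer` returns. The small count matrix is invented for illustration.

```python
import numpy as np
from scipy import sparse

# A small dense count matrix: 3 documents x 6 vocabulary terms, mostly zeros.
dense = np.array([
    [0, 2, 0, 0, 0, 1],
    [0, 0, 0, 3, 0, 0],
    [1, 0, 0, 0, 0, 0],
])

# CSR stores only the non-zero values plus their positions.
sp = sparse.csr_matrix(dense)

print(dense.size)     # number of cells held by the dense array
print(sp.nnz)         # number of values actually stored in the sparse matrix
print(sp.toarray())   # convert back to dense for inspection
```

For a real vocabulary with thousands of words and short messages, the gap between the two numbers grows accordingly.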
In [9]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
# fit the vectorizer to our training data and vectorize it
X_train_cnt = count_vectorizer.fit_transform(X_train)
# since the count vectorizer is already fitted, we now only need to transform the test data
X_test_cnt = count_vectorizer.transform(X_test)
print(X_train_cnt[5])
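Printing a row of the sparse matrix lists only the non-zero entries, which can be hard to read. As an illustration, the sketch below fits a `CountVectorizer` with default settings on two made-up messages and prints the learned vocabulary next to the dense counts. With those defaults, text is lowercased and only tokens of two or more characters are kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical messages, not taken from the dataset.
docs = ["free prize now", "call me now now"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
dense_counts = counts.toarray()

print(vocab)
print(dense_counts)
```

Each row is one message and each column is one vocabulary word, so the second row holds a 2 in the column for "now".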
Now we have converted our data into a numerical representation. Since there are many common words that carry little information, such as the, it, a etc., we might need to transform the data further.
The Tf-idf transformer helps here: it re-weights the counts so that words appearing in many messages count for less than words that are specific to a few messages.
In [10]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
# fit the transformer and convert our training data
X_train_tfidf = tfidf_transformer.fit_transform(X_train_cnt)
# transform test data
X_test_tfidf = tfidf_transformer.transform(X_test_cnt)
print(X_train_tfidf[5])
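To see what the re-weighting does, the sketch below recomputes tf-idf by hand on a small invented count matrix and compares the result with `TfidfTransformer`. It assumes the transformer's default settings (smoothed idf, L2 normalisation of each row, no sublinear tf scaling), so it illustrates those defaults rather than every configuration.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 3 documents x 3 terms. Term 0 appears in every document.
counts = np.array([[1, 1, 0],
                   [2, 0, 1],
                   [1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency

weighted = counts * idf                     # term frequency times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise rows

reference = TfidfTransformer().fit_transform(sparse.csr_matrix(counts)).toarray()
matches = bool(np.allclose(manual, reference))
print(matches)
```

Term 0 occurs in all three documents, so it receives the smallest idf weight, which is the intended effect for very common words.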