Spam classification using logistic regression

We have 4601 emails, each described by 57 features. The data can be found on GitHub, and the column names can be found here. We want to predict whether an email is spam, so we have a binary response, where 1 indicates spam and 0 indicates non-spam.

You can find some utility functions on GitHub.


In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
from classifiers import read_spam_data, transform_log, transform_binary
#from sklearn.linear_model import LogisticRegression # reference sklearn implementation
from classifiers import LogisticRegression

train_data, test_data = read_spam_data()
train_data.head()


Out[1]:
spam word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order ... word_freq_conference char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total
0 1 0.21 0.28 0.5 0.0 0.14 0.28 0.21 0.07 0.00 ... 0.0 0.0 0.132 0.0 0.372 0.180 0.048 5.114 101.0 1028.0
1 1 0.00 0.00 0.0 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.0 0.0 0.137 0.0 0.137 0.000 0.000 3.537 40.0 191.0
2 1 0.00 0.00 0.0 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.0 0.0 0.135 0.0 0.135 0.000 0.000 3.537 40.0 191.0
3 1 0.00 0.00 0.0 0.0 1.85 0.00 0.00 1.85 0.00 ... 0.0 0.0 0.223 0.0 0.000 0.000 0.000 3.000 15.0 54.0
4 1 0.00 0.00 0.0 0.0 1.92 0.00 0.00 0.00 0.00 ... 0.0 0.0 0.054 0.0 0.164 0.054 0.000 1.671 4.0 112.0

5 rows × 58 columns

In logistic regression, we let

\begin{equation} y \mid \mathbf{x} \sim \operatorname{Bernoulli}\left(\frac{1}{1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal \mathbf{x}\right)\right)}\right), \end{equation}

so if we make $N$ observations the likelihood is

\begin{equation} L\left(w_0, \mathbf{w}\right) = \prod_{i=1}^N\left(\frac{1}{1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)\right)}\right)^{y_i}\left(\frac{1}{1 + \exp\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)}\right)^{1-y_i}. \end{equation}

Define $\tilde{y}_i = 2y_i - 1$, so that $\tilde{y}_i = 1$ for spam and $\tilde{y}_i = -1$ otherwise. Writing $\sigma(z) = \left(1 + \exp(-z)\right)^{-1}$ and noting that $1 - \sigma(z) = \sigma(-z)$, both factors of the likelihood collapse into a single expression, and the negative log-likelihood that we want to minimize is

\begin{equation} J\left(w_0, \mathbf{w}\right) = \sum_{i=1}^N\log\left(1 + \exp\left(-\tilde{y}_i\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)\right)\right). \end{equation}

This objective is convex, but it is not always strictly convex, and when the classes are linearly separable the minimum is not attained at any finite $\left(w_0, \mathbf{w}\right)$, so in practice we add some regularization. Now, we minimize

\begin{equation} J_\lambda\left(w_0, \mathbf{w}\right) = \sum_{i=1}^N\log\left(1 + \exp\left(-\tilde{y}_i\left(w_0 + \mathbf{w}^\intercal \mathbf{x}_i\right)\right)\right) + \frac{\lambda}{2}\mathbf{w}^\intercal\mathbf{w}. \end{equation}

Note that there is no penalty on $w_0$. One way to minimize this objective is Newton's method, for which we need the gradient $\mathbf{g}$ and the Hessian $\mathbf{H}$. In exercise 3 (not shown), I show that

\begin{align} \mathbf{g}\left(w_0, \mathbf{w}\right) &= \tilde{\mathbf{X}}^\intercal\left(\boldsymbol\mu - \mathbf{y}\right) + \lambda\begin{pmatrix}0 \\ \mathbf{w}\end{pmatrix} \\ \mathbf{H}\left(w_0, \mathbf{w}\right) &= \tilde{\mathbf{X}}^\intercal \mathbf{M} \tilde{\mathbf{X}} + \lambda \begin{pmatrix} 0 & \mathbf{0} \\ \mathbf{0} & I \end{pmatrix}, \end{align}

where $\tilde{\mathbf{X}}$ is the data matrix with a column of 1s inserted as the first column. Also, we have defined

\begin{equation} \boldsymbol\mu = \begin{pmatrix} \left(1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal\mathbf{x}_1\right)\right)\right)^{-1}\\ \left(1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal\mathbf{x}_2\right)\right)\right)^{-1}\\ \vdots \\ \left(1 + \exp\left(-\left(w_0 + \mathbf{w}^\intercal\mathbf{x}_N\right)\right)\right)^{-1}\\ \end{pmatrix}, ~\text{and}~ \mathbf{M} = \operatorname{diag}\left( \mu_1(1-\mu_1), \mu_2(1-\mu_2), \ldots,\mu_N(1-\mu_N) \right). \end{equation}

Let $\boldsymbol\theta = \left(w_0, \mathbf{w}\right)^\intercal$ and choose an initial estimate $\boldsymbol\theta_0$. Then, by Newton's method, each iteration solves

\begin{equation} \mathbf{H}\left(\boldsymbol\theta_k\right)\left(\boldsymbol\theta_{k+1} - \boldsymbol\theta_{k}\right) = -\mathbf{g}\left(\boldsymbol\theta_k\right). \end{equation}
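To make the iteration concrete, here is a minimal NumPy sketch of the update above. The function name newton_logistic and its arguments are hypothetical; the actual implementation is the LogisticRegression class from the GitHub repository.

import numpy as np

def newton_logistic(X, y, lam=1.0, n_iter=30, tol=1e-8):
    # Hypothetical sketch of Newton's method for L2-regularized logistic
    # regression in which the intercept w0 is not penalized.
    N, D = X.shape
    X_tilde = np.hstack([np.ones((N, 1)), X])   # prepend a column of 1s
    theta = np.zeros(D + 1)                     # theta = (w0, w)
    penalty = np.eye(D + 1)
    penalty[0, 0] = 0.0                         # no penalty on w0
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))          # mu_i for each row
        g = X_tilde.T @ (mu - y) + lam * (penalty @ theta)     # gradient
        # X_tilde.T @ M @ X_tilde without forming the N x N diagonal matrix M
        H = (X_tilde * (mu * (1.0 - mu))[:, None]).T @ X_tilde + lam * penalty
        step = np.linalg.solve(H, g)            # solve H * step = g, so theta_{k+1} = theta_k - step
        theta = theta - step
        if np.linalg.norm(step) < tol:          # stop when the update is tiny
            break
    return theta[0], theta[1:]                  # (intercept w0, weights w)

Presumably the LogisticRegression class used below wraps an iteration like this one, storing the results as intercept_ and coef_.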

In [2]:
#logReg = LogisticRegression(solver='newton-cg', C=1) # sklearn implementation, only newton-cg converges
logReg = LogisticRegression(regularization=1)
logReg.fit(train_data.drop('spam', axis=1).to_numpy(), train_data['spam'].to_numpy())
print(logReg.intercept_)
print(logReg.coef_)
# training and test misclassification rates
print(1 - logReg.score(train_data.drop('spam', axis=1).to_numpy(), train_data['spam'].to_numpy()))
print(1 - logReg.score(test_data.drop('spam', axis=1).to_numpy(), test_data['spam'].to_numpy()))


-1.49981367068
[ -1.26600248e-01  -1.56094390e-01   2.54708222e-01   8.58581226e-01
   4.09630906e-01   8.96135831e-01   2.19302802e+00   7.60903407e-01
   6.71931927e-01   1.01103996e-01  -2.82796444e-01  -1.83924290e-01
  -1.11320045e-01   1.03485121e-01   5.54364725e-01   1.18430618e+00
   1.06659155e+00   1.31293171e-01   1.43401894e-01   7.11353892e-01
   2.31623776e-01   3.36640589e-01   1.87194517e+00   6.08773300e-01
  -1.82320650e+00  -5.14474564e-01  -4.14759176e+00  -1.98631296e-01
  -1.18020809e+00  -1.99879005e-01  -1.77496082e-01   1.48991710e-02
  -7.45306456e-01  -1.68448134e-01  -5.55552499e-01   8.47931331e-01
   3.72525690e-02  -6.12885760e-01  -1.00486600e+00  -1.36561188e-01
  -1.16561300e+00  -1.59350006e+00  -9.86968673e-01  -1.38680860e+00
  -7.55505052e-01  -1.82108130e+00  -6.55434252e-01  -1.61610798e+00
  -1.18107595e+00  -3.44812567e-01  -3.76344925e-01   2.25717319e-01
   4.05625095e+00   1.06168683e+00  -1.38267945e-02   8.02586523e-03
   4.62754722e-04]
0.0698205546493
0.0703125

The default settings work pretty well, giving roughly a 7% misclassification rate on both the training and test sets. Let us try a few transformations of the features:

  • Standardize: $\displaystyle\frac{\mathbf{x}_j - \hat{\mu}_j}{\hat{\sigma}_j}$
  • Log transform: $\log\left(\mathbf{x}_j + 0.1\right)$
  • Binarize: $\mathbb{I}\left(\mathbf{x}_j > 0\right)$

You can see the transformations and the logistic regression class that I wrote by hand on GitHub.
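For reference, transform_log and transform_binary are imported from that repository; a plausible column-wise implementation, inferred from the formulas above rather than copied from the repo, would be:

import numpy as np

def transform_log(column):
    # log transform: log(x_j + 0.1), applied elementwise to one feature column
    return np.log(column + 0.1)

def transform_binary(column):
    # binarize: indicator of whether each entry is strictly positive
    return (column > 0).astype(np.float64)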


In [3]:
from sklearn import preprocessing

# transform the data
ytrain = train_data['spam'].to_numpy()
ytest = test_data['spam'].to_numpy()
Xtrain_raw = train_data.drop('spam', axis=1).to_numpy()
Xtest_raw = test_data.drop('spam', axis=1).to_numpy()
Xtrain_standard = preprocessing.scale(Xtrain_raw, axis=0)
Xtest_standard = preprocessing.scale(Xtest_raw, axis=0)
Xtrain_log = np.apply_along_axis(transform_log, axis=0, arr=Xtrain_raw)
Xtest_log = np.apply_along_axis(transform_log, axis=0, arr=Xtest_raw)
Xtrain_binary = np.apply_along_axis(transform_binary, axis=0, arr=Xtrain_raw)
Xtest_binary = np.apply_along_axis(transform_binary, axis=0, arr=Xtest_raw)

data_transform = ['Raw', 'Standard', 'Log', 'Binary']
Xtrain = [Xtrain_raw, Xtrain_standard, Xtrain_log, Xtrain_binary]
Xtest = [Xtest_raw, Xtest_standard, Xtest_log, Xtest_binary]

# now run lots of models to find the regularization parameter
regularization = np.linspace(0, 20, num=41)
misclassification_rates = pd.DataFrame(dtype=np.float64,
                                       index=np.arange(len(regularization)),
                                       columns=['Regularization'] +
                                       [x + ' Train' for x in data_transform] +
                                       [x + ' Test' for x in data_transform])
for i in range(len(regularization)):
    misclassification_rates.loc[i, 'Regularization'] = regularization[i]
    if regularization[i] == 0:
        regularization[i] += 0.01  # hack: the unregularized fit does not converge
    logReg = LogisticRegression(regularization=regularization[i])
    for j in range(len(data_transform)):
        logReg.fit(Xtrain[j], ytrain)
        misclassification_rates.loc[i, data_transform[j] + ' Train'] = 1 - logReg.score(Xtrain[j], ytrain)
        misclassification_rates.loc[i, data_transform[j] + ' Test'] = 1 - logReg.score(Xtest[j], ytest)
misclassification_rates


Out[3]:
Regularization Raw Train Standard Train Log Train Binary Train Raw Test Standard Test Log Test Binary Test
0 0.0 0.070799 0.070147 0.051876 0.064600 0.076172 0.086589 0.058594 0.074219
1 0.5 0.069494 0.072431 0.051223 0.064274 0.070964 0.087891 0.059245 0.072266
2 1.0 0.069821 0.074062 0.050571 0.063295 0.070312 0.087240 0.059896 0.072266
3 1.5 0.071126 0.075367 0.049918 0.063295 0.070312 0.085938 0.060547 0.072917
4 2.0 0.072757 0.076020 0.050245 0.063295 0.069661 0.084635 0.061198 0.073568
5 2.5 0.073409 0.075367 0.050571 0.063295 0.071615 0.085286 0.059896 0.074219
6 3.0 0.072757 0.075693 0.051876 0.063948 0.072266 0.084635 0.059896 0.074219
7 3.5 0.074715 0.075367 0.051223 0.063295 0.075521 0.085286 0.059896 0.074219
8 4.0 0.075693 0.076020 0.051223 0.063622 0.075521 0.083984 0.059896 0.074219
9 4.5 0.076020 0.076020 0.051223 0.063295 0.076823 0.084635 0.059896 0.074219
10 5.0 0.075041 0.076020 0.050897 0.063622 0.076823 0.085286 0.059896 0.074870
11 5.5 0.074062 0.076346 0.050897 0.063622 0.077474 0.085938 0.059896 0.075521
12 6.0 0.074388 0.076998 0.050571 0.063948 0.076823 0.085286 0.059896 0.076172
13 6.5 0.074388 0.076998 0.050245 0.064600 0.075521 0.083984 0.058594 0.076172
14 7.0 0.074388 0.077651 0.050245 0.064600 0.076172 0.084635 0.058594 0.076172
15 7.5 0.074388 0.078303 0.050897 0.064600 0.076172 0.084635 0.057943 0.076172
16 8.0 0.074388 0.078303 0.051550 0.064927 0.076823 0.083984 0.057943 0.076172
17 8.5 0.074715 0.078630 0.051550 0.064927 0.077474 0.083984 0.059245 0.075521
18 9.0 0.075367 0.078630 0.051876 0.064600 0.076823 0.083984 0.059245 0.075521
19 9.5 0.074715 0.079282 0.051876 0.065253 0.076823 0.083984 0.059245 0.075521
20 10.0 0.074715 0.078956 0.051876 0.064600 0.076823 0.085286 0.059245 0.075521
21 10.5 0.074715 0.078956 0.052202 0.064600 0.076823 0.084635 0.059245 0.075521
22 11.0 0.074388 0.079282 0.051550 0.065253 0.077474 0.084635 0.059245 0.074219
23 11.5 0.075367 0.079282 0.051223 0.066232 0.077474 0.084635 0.059245 0.074870
24 12.0 0.075367 0.079608 0.051550 0.069168 0.077474 0.084635 0.059245 0.075521
25 12.5 0.075693 0.079608 0.051550 0.069821 0.077474 0.085286 0.059245 0.074219
26 13.0 0.075693 0.079608 0.051550 0.070147 0.076823 0.085286 0.059245 0.074219
27 13.5 0.075693 0.079608 0.051550 0.070147 0.076823 0.085286 0.059896 0.074219
28 14.0 0.076020 0.080587 0.052202 0.070147 0.076823 0.085286 0.059896 0.075521
29 14.5 0.076346 0.080587 0.052202 0.069821 0.077474 0.085286 0.060547 0.075521
30 15.0 0.076346 0.080587 0.051876 0.069821 0.079427 0.085286 0.060547 0.075521
31 15.5 0.076346 0.080914 0.051876 0.070473 0.079427 0.085286 0.060547 0.078125
32 16.0 0.076346 0.080914 0.052202 0.071778 0.080078 0.085286 0.061849 0.078776
33 16.5 0.076346 0.081240 0.052202 0.071778 0.080078 0.085286 0.061849 0.078776
34 17.0 0.077325 0.080914 0.051876 0.071452 0.080078 0.085938 0.061849 0.078776
35 17.5 0.077651 0.080914 0.051876 0.071452 0.080729 0.085938 0.062500 0.078776
36 18.0 0.077651 0.080914 0.051550 0.071452 0.080729 0.085938 0.062500 0.078776
37 18.5 0.077651 0.081240 0.051550 0.071452 0.080729 0.085938 0.062500 0.079427
38 19.0 0.077651 0.081566 0.051876 0.071452 0.081380 0.086589 0.062500 0.079427
39 19.5 0.077651 0.081566 0.052529 0.071778 0.081380 0.086589 0.062500 0.080078
40 20.0 0.077651 0.081566 0.052529 0.072431 0.081380 0.086589 0.062500 0.080078

Let's plot these results.


In [4]:
import matplotlib.pyplot as plt
colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3']
plt.figure(figsize=(12,8))
for i in range(len(data_transform)):
    plt.plot(misclassification_rates['Regularization'], misclassification_rates[data_transform[i] + ' Train'],
             color=colors[i], linestyle='--', linewidth=2, marker='.', label=data_transform[i] + ' Train')
    plt.plot(misclassification_rates['Regularization'], misclassification_rates[data_transform[i] + ' Test'],
             color=colors[i], linestyle='-', linewidth=2, marker='.', label=data_transform[i] + ' Test')
plt.grid(True)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), title="Dataset")
plt.ylabel("Misclassification Rate")
plt.xlabel(r"Regularization Parameter ($\lambda$)")
plt.title("Misclassification Rates for Various Transforms and Regularizations")
plt.show()


The problem tells us to select the regularization parameter with cross-validation, but the test curves above are flat enough that it seems unnecessary here. We see that the log transform performs best, with a test misclassification rate of about 5.8% when $\lambda = 8$.
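That said, for reference, here is a minimal sketch of how one could pick $\lambda$ by $K$-fold cross-validation on the log-transformed training data, reusing the hand-written LogisticRegression class. The helper cv_misclassification, the 5-fold split, and the grid are illustrative choices, not part of the original solution.

import numpy as np
from sklearn.model_selection import KFold

def cv_misclassification(X, y, lam, n_splits=5, seed=0):
    # average validation misclassification rate over K folds for one penalty value
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for train_idx, val_idx in kf.split(X):
        model = LogisticRegression(regularization=lam)
        model.fit(X[train_idx], y[train_idx])
        errors.append(1 - model.score(X[val_idx], y[val_idx]))
    return np.mean(errors)

# evaluate a grid of penalties (starting at 0.5 to avoid the lambda = 0 convergence hack)
cv_errors = {lam: cv_misclassification(Xtrain_log, ytrain, lam)
             for lam in np.linspace(0.5, 20, num=40)}
best_lambda = min(cv_errors, key=cv_errors.get)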