University of Texas at San Antonio



**Open Cloud Institute**


Email Classification


**Paul Rad, Ph.D.**

**Gonzalo De La Torre, Ph.D. Student**

*Open Cloud Institute, University of Texas at San Antonio, San Antonio, Texas, USA*
gonzalo.delatorreparra@utsa.edu, paul.rad@utsa.edu


Email Classification using Machine Learning

Email classification is a common beginner problem in Natural Language Processing (NLP). The idea is simple: given an email you have never seen before, determine whether that email is Spam or Ham (i.e., non-spam).

While classifying an email as spam or non-spam is easy for humans, it is much harder to write a program that does it correctly. In the following program, instead of telling the program which words we think are important, we will let it learn which words are actually important.

To tackle this problem, we start with a collection of sample emails (i.e., a text corpus). In this corpus, each email has already been labeled as Spam or Ham. Since we make use of these labels during the training phase, this is a supervised learning task: we are (in a sense) supervising the program as it learns what Spam emails look like and what Ham emails look like.

During the training phase, we present these emails and their labels to the program. For each email, the program predicts whether it is Spam or Ham. After the program makes a prediction, we tell it what the label actually was. The program then adjusts its configuration so as to make a better prediction the next time around. This process repeats until either the program cannot improve any further or we get impatient and tell it to stop.
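
To make this loop concrete, here is a toy, self-contained illustration of the idea (not the model we build below): one weight per word, nudged toward the true label whenever a guess is wrong. The two-email corpus and the update rule are purely illustrative.


In [ ]:
# toy illustration of the supervised loop described above (illustrative only)
corpus = [({"free", "winner"}, 1),     # 1 = Spam
          ({"meeting", "notes"}, 0)]   # 0 = Ham
w = {}                                 # one learned weight per word
for epoch in range(100):
    for words, label in corpus:
        # predict: Spam if the summed word weights are positive
        guess = 1 if sum(w.get(t, 0.0) for t in words) > 0 else 0
        # adjust the configuration whenever the guess was wrong
        for t in words:
            w[t] = w.get(t, 0.0) + 0.1 * (label - guess)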

Initial Steps

In this section we start by importing the necessary libraries into our machine learning program. The main one is tensorflow, which we will use to perform our deep learning computations. We then load the pre-labeled email data contained in the data.tar.gz file and set two variables: numFeatures, the number of word features per email, and numLabels, the number of classes (Ham or Spam).


In [ ]:
# import statements

from __future__ import division, print_function
import tensorflow as tf
import numpy as np
import tarfile
import os
import matplotlib.pyplot as plt
import time

# Display plots inline 
%matplotlib inline

# import email data

def csv_to_numpy_array(filePath, delimiter):
    # read a delimited text file into a NumPy array, letting NumPy infer the dtype
    return np.genfromtxt(filePath, delimiter=delimiter, dtype=None)

def import_data():
    if "data" not in os.listdir(os.getcwd()):
        # Untar directory of data if we haven't already
        tarObject = tarfile.open("data.tar.gz")  # archive assumed to be in the working directory
        tarObject.extractall()
        tarObject.close()
        print("Extracted tar to current directory")
    else:
        # we've already extracted the files
        pass

    print("loading training data")
    trainX = csv_to_numpy_array("data/trainX.csv", delimiter="\t")
    trainY = csv_to_numpy_array("data/trainY.csv", delimiter="\t")
    print("loading test data")
    testX = csv_to_numpy_array("data/testX.csv", delimiter="\t")
    testY = csv_to_numpy_array("data/testY.csv", delimiter="\t")
    return trainX, trainY, testX, testY

trainX, trainY, testX, testY = import_data()

# set parameters for training

# features, labels
numFeatures = trainX.shape[1]
numLabels = trainY.shape[1]
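
As a quick sanity check, you can print the shapes of the loaded arrays; the feature dimension of the train and test splits should agree.


In [ ]:
# optional sanity check: train/test feature dimensions should match
print("trainX:", trainX.shape, "trainY:", trainY.shape)
print("testX:", testX.shape, "testY:", testY.shape)
print("numFeatures = %d, numLabels = %d" % (numFeatures, numLabels))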

Defining Your Placeholders

A placeholder is simply a variable that we will assign data to at a later date. It allows us to create our operations and build our computation graph, without needing the data. In TensorFlow terminology, we then feed data into the graph through these placeholders.


In [ ]:
# define placeholders and variables for use in training

X = tf.placeholder(tf.float32, [None, numFeatures])
yGold = tf.placeholder(tf.float32, [None, numLabels])

# initialize weights and bias with small random values on a
# Xavier-style scale, sqrt(6 / (numFeatures + numLabels + 1))
weights = tf.Variable(tf.random_normal([numFeatures, numLabels],
                                       mean=0,
                                       stddev=np.sqrt(6 / (numFeatures +
                                                           numLabels + 1))),
                      name="weights")
bias = tf.Variable(tf.random_normal([1, numLabels],
                                    mean=0,
                                    stddev=np.sqrt(6 / (numFeatures + numLabels + 1))),
                   name="bias")

Initializing Variables

After defining our placeholders, we initialize all the variables and use the tensorflow library to define a feedforward algorithm, a cost function, an optimization algorithm, and an accuracy measure. At this point none of these operations are executed; they are only defined as nodes in the computation graph.


In [ ]:
# op that will initialize all variables (executed later, inside the session)
init_OP = tf.global_variables_initializer()

# define feedforward algorithm
y = tf.nn.sigmoid(tf.add(tf.matmul(X, weights, name="apply_weights"), bias, name="add_bias"), name="activation")

# define cost function and optimization algorithm (gradient descent)
# note: with global_step fixed at 1, exponential_decay returns a constant
# learning rate; an incrementing step counter would be needed for real decay
learningRate = tf.train.exponential_decay(learning_rate=0.0008,
                                          global_step=1,
                                          decay_steps=trainX.shape[0],
                                          decay_rate=0.95,
                                          staircase=True)
cost_OP = tf.nn.l2_loss(y - yGold, name="squared_error_cost")
training_OP = tf.train.GradientDescentOptimizer(learningRate).minimize(cost_OP)

# accuracy function
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(yGold,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
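
You can verify that these are, so far, only definitions: printing one of the operations shows a symbolic Tensor rather than a number.


In [ ]:
# nothing has been executed yet; this prints a symbolic Tensor description,
# something like: Tensor("squared_error_cost:0", shape=(), dtype=float32)
print(cost_OP)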

Creating and Starting Your Session

Now that we have defined all of our elements, we can create and start the session that will execute the operations declared above. In this section we first train our model on the data extracted from data.tar.gz, then feed the trained model the test data, and finally calculate its accuracy on that test set.


In [ ]:
numEpochs = 10000
# Launch the graph
errors = []
with tf.Session() as sess:
    sess.run(init_OP)
    print('Initialized Session.')
    for step in range(numEpochs):
        # run optimizer at each step in training
        sess.run(training_OP, feed_dict={X: trainX, yGold: trainY})
        # fill errors array with updated error values
        accuracy_value = accuracy.eval(feed_dict={X: trainX, yGold: trainY})
        errors.append(1 - accuracy_value)
    print('Optimization Finished!')
    
    # output final error
    print("Final error found during training: ", errors[-1])
    # output accuracy 
    print("Final accuracy on test set: %s" %str(sess.run(accuracy, 
                                                     feed_dict={X: testX, 
                                                                yGold: testY})))
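
If you want to keep the trained weights for later use, a tf.train.Saver can write a checkpoint; this minimal sketch would go inside the session block above, after training finishes (the checkpoint filename here is just an example).


In [ ]:
# inside the `with tf.Session() as sess:` block, after training finishes:
saver = tf.train.Saver()                             # handles all tf.Variables
save_path = saver.save(sess, "trained_model.ckpt")   # example filename
print("Model checkpoint written to %s" % save_path)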

In [ ]:
# plot a 50-epoch moving average of the error to see how it decreased
plt.plot([np.mean(errors[max(0, i - 49):i + 1]) for i in range(len(errors))])
plt.show()
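
An equivalent way to smooth the curve, assuming the same 50-epoch window, is a single convolution, which avoids recomputing the mean at every point.


In [ ]:
# equivalent smoothing with a length-50 averaging window
window = 50
smoothed = np.convolve(errors, np.ones(window) / window, mode="valid")
plt.plot(smoothed)
plt.xlabel("epoch")
plt.ylabel("training error (moving average)")
plt.show()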