Project Title : Author Labeling by Text Classification

Introduction :

Text classification is one of the major applications of Machine Learning, and most text classification projects are built around one of the standard Machine Learning algorithms. In this project we use the Naive Bayes algorithm to label each text with its author.

Input Data Preprocessing :

The student assignments from an English class are used as input for this project, and the task is to label each text with its author (the student). The data we received contained repetitive content for many students; such files were dropped from the input data, and students with too few files were dropped as well.

The resulting data is processed to generate the ".csv" file that is used as the input dataset for this project. It contains two columns: one with the student roll number and the other with the corresponding text.
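The preprocessing script itself is not part of this notebook; the sketch below illustrates one way such a ".csv" file could be generated. The directory layout (one folder per roll number containing .txt files), the minimum-file threshold and the duplicate check are assumptions for illustration only.

import os
import csv

# Assumed layout: assignments/<rollNo>/<file>.txt  (illustrative, not the original script)
def build_dataset(data_dir='assignments', out_csv='data.csv', min_files=5):
    seen_texts = set()
    with open(out_csv, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        for roll_no in sorted(os.listdir(data_dir)):
            student_dir = os.path.join(data_dir, roll_no)
            if not os.path.isdir(student_dir):
                continue
            files = [f for f in os.listdir(student_dir) if f.endswith('.txt')]
            # Drop students with too few files (threshold is an assumption)
            if len(files) < min_files:
                continue
            for fname in files:
                with open(os.path.join(student_dir, fname), encoding='utf-8') as fh:
                    text = fh.read().strip()
                # Drop empty files and exact repetitive content
                if not text or text in seen_texts:
                    continue
                seen_texts.add(text)
                writer.writerow([roll_no, text])

# build_dataset()  # would write the two-column data.csv loaded below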

Working Theme


In [14]:
# Importing pandas library

import pandas as pd

# Loading the dataset
df = pd.read_table('data.csv',
                   sep=',', 
                   header=None, 
                   names=['rollNo','textData'])

# Print the first 5 rows of the dataset

df.head()


Out[14]:
rollNo textData
1 1 ESSAY WRITINGQ1.Essay-1:1.Helping the people i...
1 10 Q. Make the sentences more concise:1. We certa...
1 11 Creative WritingA Photographer has got a chanc...
1 13 1)---------------------------------A. NareshIf...
1 14 To: Subject: Regarding Performance FeedbackHel...

The above table shows the first 5 rows of the dataset, which contains two columns: the roll number and the text of the assignment.


In [15]:
# shape gives the dimensions (rows, columns) of the dataset
df.shape


Out[15]:
(1028, 2)

The dataset contains 1028 entries (rows) and 2 columns as described above.

Splitting Training and testing sets

Split the dataset into training and testing sets using the train_test_split method from sklearn. The split produces the following variables:
-> X_train is our training data for the 'textData' column.
-> y_train is our training data for the 'rollNo' column.
-> X_test is our testing data for the 'textData' column.
-> y_test is our testing data for the 'rollNo' column.


In [19]:
# split into training and testing sets
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation is deprecated)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['textData'], 
                                                    df['rollNo'], 
                                                    random_state=1)

# Printing out the number of rows we have in each of our training and testing sets.
print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))


Number of rows in the total set: 1028
Number of rows in the training set: 771
Number of rows in the test set: 257

Applying Bag of Words processing to our dataset

Now that the data is split, we will generate the Bag of Words representation and convert our data into the desired matrix format using CountVectorizer() from the sklearn library.
-> First we fit our training data (X_train) to CountVectorizer() and return the matrix.
-> Then we transform our testing data (X_test) and return the matrix.

Here X_train is our training data for the 'textData' column in our dataset and we will be using this to train our model.

X_test is our testing data for the 'textData' column, and this is the data we will use (after transforming it to a matrix) to make predictions. We will then compare those predictions with y_test.


In [26]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the CountVectorizer method
count_vector = CountVectorizer(stop_words="english", token_pattern=u'(?u)\\b\\w\\w+\\b')

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)

For the training data we learn a vocabulary dictionary and transform the documents into a document-term matrix; for the testing data we only transform the documents into a document-term matrix, using the vocabulary learned from the training data.

We passed arguments to customize count_vector, namely removing English stop words and ignoring punctuation.
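As a small standalone illustration (toy documents, not the assignment dataset), the document-term matrix produced by CountVectorizer looks like this:

from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, purely for illustration
docs = ['the cat sat on the mat', 'the dog sat on the log']

cv = CountVectorizer(stop_words='english')
matrix = cv.fit_transform(docs)   # learn the vocabulary and count the words

print(cv.vocabulary_)             # maps each remaining term to its column index
print(matrix.toarray())           # one row of word counts per document

The testing data would only be passed through transform(), so any word not seen during fitting is simply ignored.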

Naive Bayes implementation using scikit-learn :

We will use scikit-learn's sklearn.naive_bayes module to make predictions on our dataset.

Specifically, we will use the multinomial Naive Bayes implementation which is suitable for classification with discrete features (such as in our case, word counts for text classification). It takes in integer word counts as its input.

The training data is already stored in the variable 'training_data' and the testing data in the variable 'testing_data'.

We import the MultinomialNB classifier and train it using fit(), passing in 'training_data' and 'y_train' from our split.


In [32]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)


Out[32]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Now that the algorithm has been trained on the training set, we can make predictions on the test data stored in 'testing_data' using predict().


In [33]:
predictions = naive_bayes.predict(testing_data)

Evaluating our model :

Computing the accuracy, precision, recall and F1 scores of our model using our test labels 'y_test' and the predictions we made earlier, stored in the 'predictions' variable.


In [35]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions,average="weighted")))
print('Recall score: ', format(recall_score(y_test, predictions,average="weighted")))
print('F1 score: ', format(f1_score(y_test, predictions,average="weighted")))


('Accuracy score: ', '0.221789883268')
('Precision score: ', '0.187706099766')
('Recall score: ', '0.221789883268')
('F1 score: ', '0.18274968629')

Conclusion :

The accuracy score of the model we developed is very low. From this we can conclude that the dataset used to build the model is very small, and there is also a possibility of shared content across students' assignment texts, which makes the authors hard to distinguish.