Importing the Dependencies and Loading the Data
In [4]:
import nltk
import pandas as pd
import numpy as np
In [18]:
data = pd.read_csv("original_train_data.csv", header=None, delimiter="\t", quoting=3, names=["Polarity", "TextFeed"])
In [127]:
#Data Visualization
data.head()
Out[127]:
Data preparation with the available data: the classes are combined so that they are highly imbalanced, which makes the problem suitable for anomaly detection.
In [145]:
data_positive = data.loc[data["Polarity"] == 1]
data_negative = data.loc[data["Polarity"] == 0]
# keep only two small random samples (10 each) of negative reviews so that class 0 becomes the rare "anomaly" class
anomaly_data = pd.concat([data_negative.sample(n=10), data_positive, data_negative.sample(n=10)])
anomaly_data.Polarity.value_counts()
Out[145]:
In [134]:
#Average number of words per sentence
print("Average number of words per sentence in the data:", np.mean([len(s.split(" ")) for s in anomaly_data.TextFeed]))
Data pre-processing - text analytics to create a corpus
1) Converting text to a matrix of token counts [bag of words]
Stemming - lowercasing, removing stop-words, removing punctuation and reducing words to their lexical roots
2) A stemmer and a tokenizer (which removes non-letters) are defined by ourselves and passed as parameters to sklearn's CountVectorizer.
3) Extracting the most important words and using them as input to the classifier
In [50]:
import re
from sklearn.feature_extraction.text import CountVectorizer
nltk.download("punkt")  # word_tokenize below needs the punkt tokenizer models
from nltk.stem.porter import PorterStemmer
In [ ]:
''' this code is taken from
http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
'''
# a stemmer widely used
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non-letters
    text = re.sub("[^a-zA-Z]", " ", text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # stem
    stems = stem_tokens(tokens, stemmer)
    return stems
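A quick sanity check of the tokenizer (the sentence below is made up purely for illustration) shows the kind of stemmed output it produces:
In [ ]:
# Made-up example sentence, only to illustrate what tokenize() returns
print(tokenize("The running dogs barked loudly at the strangers!"))
# roughly: ['the', 'run', 'dog', 'bark', 'loudli', 'at', 'the', 'stranger']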
The implementation below produces a sparse representation of the counts using scipy.sparse.csr_matrix.
Note: I am not using frequencies (TfidfTransformer, better suited to longer documents) because the texts here are short and can be handled with raw occurrences (CountVectorizer); a tf-idf alternative is sketched after the next cell for comparison.
In [146]:
#Max_Features selected as 90 - can be tuned for a better trade-off
vector_data = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = 'english',
    max_features = 90
)
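As mentioned above, tf-idf weighting would be the drop-in alternative for longer documents. A minimal sketch (not used in the rest of this notebook) with the same custom tokenizer and settings:
In [ ]:
# Sketch only: tf-idf alternative for longer documents, not used further in this notebook
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vector = TfidfVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = 'english',
    max_features = 90
)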
Fit_Transform: 1) fits the model and learns the vocabulary 2) transforms the data into feature vectors
In [177]:
#using only the "TextFeed" column to build the features
features = vector_data.fit_transform(anomaly_data.TextFeed.tolist())
#converting the sparse matrix into a dense numpy array
features = features.toarray()
features.shape
Out[177]:
In [178]:
#printing the words in the vocabulary
vocab = vector_data.get_feature_names()
print (vocab)
In [287]:
# Sum up the counts of each vocabulary word
dist = np.sum(features, axis=0)
# For each, print the vocabulary word and the number of times it
# appears in the data set
a = zip(vocab,dist)
print (list(a))
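To see which tokens dominate the corpus, the same counts can be sorted in descending order (a small convenience snippet, not part of the original flow):
In [ ]:
# Convenience snippet: print the 10 most frequent vocabulary words
for word, count in sorted(zip(vocab, dist), key=lambda x: x[1], reverse=True)[:10]:
    print(word, count)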
In [203]:
from sklearn.model_selection import train_test_split
#80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(
features,
anomaly_data.Polarity,
train_size=0.80,
random_state=1234)
In [204]:
print ("Training data - positive and negative values")
print (pd.value_counts(pd.Series(y_train)))
print ("Testing data - positive and negative values")
print (pd.value_counts(pd.Series(y_test)))
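The class weight used for the weighted SVM later on could also be derived from this training distribution instead of being hard-coded; a small sketch, assuming class 0 remains the minority (anomaly) class after the split:
In [ ]:
# Derive a candidate class weight from the training distribution (assumes class 0 is the minority)
train_counts = pd.Series(y_train).value_counts()
suggested_weight = train_counts[1] / train_counts[0]
print("Suggested weight for class 0:", suggested_weight)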
A text's polarity depends on which words appear in it, discarding grammar and word order but keeping multiplicity.
1) All of the above text processing leaves us with the same entries in our dataset.
2) Instead of being defined by the whole text, each entry is now defined by a series of counts of the most frequent words in the whole corpus.
3) These vectors are used as features to train a classifier; a toy illustration is sketched below.
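A toy illustration of the idea, using two made-up sentences just to show the shape of the count vectors:
In [ ]:
# Toy illustration (made-up sentences) of how CountVectorizer turns text into count vectors
toy = CountVectorizer(analyzer = 'word', lowercase = True)
toy_features = toy.fit_transform(["good movie good plot", "bad movie"])
print(toy.get_feature_names())  # ['bad', 'good', 'movie', 'plot']
print(toy_features.toarray())   # [[0 2 1 1], [1 0 1 0]]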
In [281]:
from sklearn.svm import SVC
# plain SVM with default settings
clf = SVC()
clf.fit(X=X_train, y=y_train)
# cost-sensitive SVM: misclassifying class 0 (the anomaly class) is penalised 20x more
wclf = SVC(class_weight={0: 20})
wclf.fit(X=X_train, y=y_train)
Out[281]:
In [282]:
y_pred = clf.predict(X_test)
y_pred_weighted = wclf.predict(X_test)
In [283]:
from sklearn.metrics import classification_report
print ("Basic SVM metrics")
print(classification_report(y_test, y_pred))
print ("Weighted SVM metrics")
print(classification_report(y_test, y_pred_weighted))
In [284]:
from sklearn.metrics import confusion_matrix
print ("Basic SVM Confusion Matrix")
print (confusion_matrix(y_test, y_pred))
print ("Weighted SVM Confusion Matrix")
print (confusion_matrix(y_test, y_pred_weighted))
In [285]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_weighted).ravel()
(tn, fp, fn, tp)
Out[285]:
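Since class 0 is the anomaly class here, its recall can be read directly from these four counts (with sklearn's default label ordering [0, 1], tn and fp are the true anomalies predicted as 0 and 1 respectively):
In [ ]:
# Recall on the anomaly class (label 0): fraction of true anomalies that were caught
anomaly_recall = tn / (tn + fp)
print("Anomaly (class 0) recall:", anomaly_recall)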
Interpretation:
As the above results show, to deal with anomalies we have to perform cost-sensitive learning using weighting methods (assigning a higher misclassification weight to the anomaly class).
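A related option, rather than hand-tuning the weight dictionary, is sklearn's built-in 'balanced' mode, which sets class weights inversely proportional to the class frequencies in the training data; a minimal sketch:
In [ ]:
# Sketch: let sklearn derive the class weights from the frequencies in y_train
balanced_clf = SVC(class_weight='balanced')
balanced_clf.fit(X=X_train, y=y_train)
print(classification_report(y_test, balanced_clf.predict(X_test)))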
In [ ]: