Detection of Anomalous Tweets Using Supervised Outlier Techniques

Importing the Dependencies and Loading the Data


In [4]:
import nltk
import pandas as pd
import numpy as np

In [18]:
data = pd.read_csv("original_train_data.csv", header=None, delimiter="\t",
                   quoting=3, names=["Polarity", "TextFeed"])

In [127]:
#Data Visualization
data.head()


Out[127]:
Polarity TextFeed
0 1 The Da Vinci Code book is just awesome.
1 1 this was the first clive cussler i've ever rea...
2 1 i liked the Da Vinci Code a lot.
3 1 i liked the Da Vinci Code a lot.
4 1 I liked the Da Vinci Code but it ultimatly did...

Data Preparation

Data preparation with the available data: I combined the classes so that they are highly imbalanced, making the dataset apt for an anomaly detection problem.


In [145]:
data_positive = data.loc[data["Polarity"] == 1]
data_negative = data.loc[data["Polarity"] == 0]
# keep all positives and only 20 sampled negatives -> highly imbalanced
# (a single sample(n=20) avoids the duplicate rows that two independent
#  sample(n=10) calls could produce)
anomaly_data = pd.concat([data_positive, data_negative.sample(n=20)])
anomaly_data.Polarity.value_counts()


Out[145]:
1    3995
0      20
Name: Polarity, dtype: int64

In [134]:
#Number of words per sentence
print("Average number of words per sentence in the train data:",
      np.mean([len(s.split(" ")) for s in anomaly_data.TextFeed]))


Average number of words per sentence in the train data: 10.5379825654

Data pre-processing - text analytics to create a corpus

1) Convert the text to a matrix of token counts [bag of words]: lowercasing, removing stop-words, removing punctuation, and stemming (reducing each word to its lexical root).
2) The stemmer and the tokenizer (which removes non-letters) are written by ourselves and passed as parameters to sklearn's CountVectorizer.
3) The most important words are extracted and used as input to the classifier.

Feature Engineering


In [50]:
import re
from sklearn.feature_extraction.text import CountVectorizer
nltk.download('punkt')  # tokenizer models needed by nltk.word_tokenize
from nltk.stem.porter import PorterStemmer

In [ ]:
''' this code is taken from
http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
'''
# a stemmer widely used
stemmer = PorterStemmer() 

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non letters
    text = re.sub("[^a-zA-Z]", " ", text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # stem
    stems = stem_tokens(tokens, stemmer)
    return stems
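
As a quick sanity check (not part of the original notebook), the sketch below applies the same regex-plus-Porter-stemmer steps to one sentence; `str.split` stands in for `nltk.word_tokenize` so it runs without the punkt download.

```python
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def tokenize_demo(text):
    # same steps as tokenize() above, but using str.split instead of
    # nltk.word_tokenize so no corpus download is required
    text = re.sub("[^a-zA-Z]", " ", text)
    return [stemmer.stem(tok) for tok in text.lower().split()]

print(tokenize_demo("I really liked the Da Vinci Code!"))
# note how "really" -> "realli" and "liked" -> "like", matching the
# stemmed vocabulary entries printed later in the notebook
```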

The below implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

Note: I am not using frequencies (TfidfTransformer, apt for longer documents) because the texts are short and can be handled with raw occurrence counts (CountVectorizer).
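
To illustrate that trade-off, here is a minimal sketch on a hypothetical two-sentence corpus (not taken from the dataset): CountVectorizer keeps raw counts, while TfidfVectorizer down-weights words shared by every document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = [
    "the da vinci code is awesome",
    "the da vinci code is awful",
]

# raw occurrence counts per document
counts = CountVectorizer().fit_transform(toy_corpus).toarray()
# idf-weighted frequencies: shared words get smaller weights
tfidf = TfidfVectorizer().fit_transform(toy_corpus).toarray()

print(counts)
print(tfidf.round(2))
```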


In [146]:
# max_features selected as 90 - can be changed for a better trade-off
vector_data = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = 'english',
    max_features = 90
)

fit_transform: 1) fits the model and learns the vocabulary, 2) transforms the data into feature vectors.


In [177]:
# using only the "TextFeed" column to build the features
features = vector_data.fit_transform(anomaly_data.TextFeed.tolist())
#converting the data into the array
features = features.toarray()
features.shape


Out[177]:
(4015, 90)

In [178]:
#printing the words in the vocabulary
vocab = vector_data.get_feature_names()
print (vocab)


['absolut', 'accept', 'anyon', 'awesom', 'beauti', 'becaus', 'becom', 'bitch', 'bonker', 'book', 'brokeback', 'care', 'catcher', 'code', 'commun', 'count', 'coz', 'da', 'dash', 'deep', 'desper', 'differ', 'don', 'dudee', 'escapad', 'excel', 'felicia', 'film', 'freakin', 'friend', 'fun', 'gon', 'good', 'got', 'grab', 'great', 'harri', 'hill', 'homosexu', 'imposs', 'jane', 'join', 'kate', 'key', 'know', 'leah', 'like', 'love', 'lubb', 'luv', 'make', 'man', 'mission', 'mom', 'mountain', 'movi', 'na', 'peopl', 'place', 'potter', 'read', 'realli', 'right', 'rock', 's', 'said', 'say', 'sentri', 'seri', 'silent', 'stand', 'start', 'stori', 't', 'thi', 'thing', 'think', 'thought', 'tom', 'turn', 'tye', 'vinci', 'virgin', 'wa', 'wait', 'want', 'watch', 'whi', 'worth', 'yeah']

In [287]:
# Sum up the counts of each vocabulary word
dist = np.sum(features, axis=0)
    
# For each, print the vocabulary word and the number of times it 
# appears in the data set
a = zip(vocab,dist)
print (list(a))


[('absolut', 93), ('accept', 81), ('anyon', 81), ('awesom', 1129), ('beauti', 128), ('becaus', 342), ('becom', 80), ('bitch', 81), ('bonker', 80), ('book', 150), ('brokeback', 1003), ('care', 82), ('catcher', 80), ('code', 1012), ('commun', 82), ('count', 81), ('coz', 80), ('da', 1010), ('dash', 80), ('deep', 80), ('desper', 81), ('differ', 84), ('don', 89), ('dudee', 80), ('escapad', 80), ('excel', 86), ('felicia', 160), ('film', 91), ('freakin', 81), ('friend', 85), ('fun', 81), ('gon', 80), ('good', 112), ('got', 92), ('grab', 80), ('great', 92), ('harri', 1093), ('hill', 81), ('homosexu', 80), ('imposs', 1000), ('jane', 80), ('join', 80), ('kate', 80), ('key', 80), ('know', 178), ('leah', 80), ('like', 1050), ('love', 1873), ('lubb', 80), ('luv', 82), ('make', 90), ('man', 82), ('mission', 1000), ('mom', 83), ('mountain', 1002), ('movi', 422), ('na', 83), ('peopl', 167), ('place', 83), ('potter', 1093), ('read', 199), ('realli', 186), ('right', 88), ('rock', 87), ('s', 364), ('said', 82), ('say', 92), ('sentri', 80), ('seri', 177), ('silent', 80), ('stand', 81), ('start', 161), ('stori', 170), ('t', 187), ('thi', 103), ('thing', 91), ('think', 94), ('thought', 89), ('tom', 95), ('turn', 82), ('tye', 80), ('vinci', 1010), ('virgin', 80), ('wa', 840), ('wait', 89), ('want', 254), ('watch', 102), ('whi', 168), ('worth', 82), ('yeah', 166)]

Train-Test Split


In [203]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in releases before 0.18
#80:20 ratio
X_train, X_test, y_train, y_test  = train_test_split(
        features, 
        anomaly_data.Polarity,
        train_size=0.80, 
        random_state=1234)
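
With only ~20 anomalies, an unstratified split can by chance leave very few (or even zero) anomalies in the test set. A hedged sketch on toy data (not the tweet features) of the `stratify` option, which preserves the class ratio in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(100, 1)
y_toy = np.array([0] * 5 + [1] * 95)  # 5% minority class

# stratify=y_toy keeps the 5:95 ratio in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.80, stratify=y_toy, random_state=1234)

print((y_tr == 0).sum(), (y_te == 0).sum())  # minority present in both splits
```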

In [204]:
print ("Training data - positive and negative values")
print (pd.value_counts(pd.Series(y_train)))
print ("Testing data - positive and negative values")
print (pd.value_counts(pd.Series(y_test)))


Training data - positive and negative values
1    3196
0      16
Name: Polarity, dtype: int64
Testing data - positive and negative values
1    799
0      4
Name: Polarity, dtype: int64

A text's polarity depends on which words appear in it, discarding grammar and word order but keeping multiplicity (the bag-of-words assumption).

1) All of the above text processing leaves us with the same entries in our dataset.

2) Instead of being defined by the whole text, each entry is now defined by a vector of counts of the most frequent words in the corpus.

3) These vectors are used as features to train a classifier.

Training the model


In [281]:
from sklearn.svm import SVC
clf = SVC()
clf.fit(X=X_train,y=y_train)

wclf = SVC(class_weight={0: 20})
wclf.fit(X=X_train,y=y_train)


Out[281]:
SVC(C=1.0, cache_size=200, class_weight={0: 20}, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [282]:
y_pred = clf.predict(X_test)
y_pred_weighted = wclf.predict(X_test)

In [283]:
from sklearn.metrics import classification_report
print ("Basic SVM metrics")
print(classification_report(y_test, y_pred))
print ("Weighted SVM metrics")
print(classification_report(y_test, y_pred_weighted))


Basic SVM metrics
             precision    recall  f1-score   support

          0       0.00      0.00      0.00         4
          1       1.00      1.00      1.00       799

avg / total       0.99      1.00      0.99       803

Weighted SVM metrics
             precision    recall  f1-score   support

          0       0.25      1.00      0.40         4
          1       1.00      0.98      0.99       799

avg / total       1.00      0.99      0.99       803

c:\users\manojkumar_meno\appdata\local\programs\python\python35\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

In [284]:
from sklearn.metrics import confusion_matrix
print ("Basic SVM Confusion Matrix")
print (confusion_matrix(y_test, y_pred))
print ("Weighted SVM Confusion Matrix")
print (confusion_matrix(y_test, y_pred_weighted))


Basic SVM Confusion Matrix
[[  0   4]
 [  0 799]]
Weighted SVM Confusion Matrix
[[  4   0]
 [ 12 787]]

In [285]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_weighted).ravel()
(tn, fp, fn, tp)


Out[285]:
(4, 0, 12, 787)
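
The classification-report numbers for the anomaly class can be recomputed by hand from these four counts. Note that `ravel()` treats class 1 as the positive class, so for class 0 (the anomaly) the relevant cells are tn, fn, and fp:

```python
# confusion-matrix entries from the weighted SVM above
tn, fp, fn, tp = 4, 0, 12, 787

# for class 0, "predicted anomaly" = tn + fn, "actual anomaly" = tn + fp
precision_anomaly = tn / (tn + fn)   # 4 / 16 = 0.25
recall_anomaly = tn / (tn + fp)      # 4 / 4  = 1.00
f1_anomaly = (2 * precision_anomaly * recall_anomaly
              / (precision_anomaly + recall_anomaly))

print(precision_anomaly, recall_anomaly, round(f1_anomaly, 2))  # 0.25 1.0 0.4
```

These match the 0.25 / 1.00 / 0.40 row for class 0 in the weighted-SVM report.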

Interpretation:

As seen above, the plain SVM never predicts the anomaly class (zero recall on class 0). To deal with such extreme imbalance we have to perform cost-sensitive learning via weighting, i.e. giving more weight to the anomaly class (here class_weight={0: 20}).
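
A possible alternative to hand-picking the weight, sketched on synthetic data rather than the tweet features: `class_weight='balanced'` derives each class weight as n_samples / (n_classes * class_count), producing a comparable up-weighting automatically.

```python
import numpy as np
from sklearn.svm import SVC

# synthetic imbalanced data: 95 majority points near (0, 0),
# 5 minority (anomaly) points near (4, 4)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(4, 1, (5, 2))])
y = np.array([1] * 95 + [0] * 5)

# weight for class c = n_samples / (n_classes * count_c), so the
# minority class here is weighted 100 / (2 * 5) = 10x
wclf_auto = SVC(class_weight='balanced')
wclf_auto.fit(X, y)

print((wclf_auto.predict(X) == 0).sum())  # minority class is not ignored
```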
