Web Mining Project: Detecting the Temporal Development of Topics Associated with Politicians and their Polarity on Twitter

Project undertaken by

Atef Belaaj

SeifAllah kouki

Haythem Sahbani

Outline

Crawl Twitter

Preprocessing the tweets for clustering

Polarity Classification

Topic classification

Evaluation

Crawl Twitter

Input: an English-speaking politician


In [7]:
screen_name = "BarackObama"
# "SenJohnMcCain"

Output: a JSON frame that contains, for each tweet, its id, creation date, polarity (to be determined), and topic (to be determined)


In [8]:
from twitter_crawl import TwitterCrawl

twt = TwitterCrawl()
json_tweets = twt.get_feed(screen_name)
print("number of tweets : %d" % len(json_tweet))


number of tweets : 3238

We can save the JSON frame to a JSON file as follows:


In [9]:
file = screen_name + "_tweets.json"
twt.save_tweets(file)

It can then be loaded again to avoid crawling the data from Twitter each time:


In [ ]:
json_tweets = twt.load_tweets(file)

Preprocessing the tweets for clustering

The tweets are processed as follows:

First, expand contractions: for example, don't => do not.

Second, tokenize with the NLTK regexp tokenizer, using regular expression patterns tailored to the needs of the project: for example, retweet markers and URLs are removed while hashtags are kept (an illustrative tokenizer sketch is given below).

Third, remove stop words: words like "will" are not relevant for topic classification.
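As an illustration only, a minimal tokenizer along these lines could look like the following sketch; the pattern is an assumption for this example, not the project's actual regex from the Preprocess module.


In [ ]:
import re
from nltk.tokenize import RegexpTokenizer

def tokenize_tweet(text):
    # illustrative pattern only, not the project's actual regex
    text = re.sub(r"http\S+", "", text)           # drop URLs
    text = re.sub(r"\bRT\b", "", text)            # drop retweet markers
    tokenizer = RegexpTokenizer(r"#?\w[\w'-]*")   # keep hashtags and plain words
    return tokenizer.tokenize(text)

print tokenize_tweet("RT @WhiteHouse: Health care works http://t.co/x #GetCovered")
# ['WhiteHouse', 'Health', 'care', 'works', '#GetCovered']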


In [10]:
from Preprocess import Preprocess, TweetTokenizer

tweet_text = twt.get_tweet_text(json_tweets)
print "tweet before preprocessing: " + tweet_text[1]

# expand contractions, tokenize, then drop stop words for every tweet
for i, tweet in enumerate(tweet_text):
    tweet_text[i] = " ".join(
        Preprocess().remove_stopwords(
            TweetTokenizer().tokenize(
                Preprocess().expand_contraction(tweet))))

print "tweet after preprocessing: " + tweet_text[1]


tweet before preprocessing: "We have to pass a budget that gives middle-class families the security they need to get ahead in the new economy." President Obama
tweet after preprocessing: pass budget middle-class families security ahead economy president obama

Polarity classification

In order to classify the extracted tweets according to their polarity, we built a naive Bayes classifier. Since our main focus is political tweets, we assume that the words in our tweets are spelled correctly; we therefore trained the classifier on the NLTK movie reviews corpus.
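For reference, a naive Bayes classifier assigns the label that maximizes the class prior times the product of per-word likelihoods, assuming the words are conditionally independent given the label: $$\hat{c} = \arg\max_{c \in \{\mathrm{pos},\, \mathrm{neg}\}} P(c) \prod_{w \in d} P(w \mid c)$$ where d is the document (review or tweet) to classify.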

Feature extraction

As a first step, we used a simplified bag-of-words model in which every word is a feature name with the value True. Here is the feature extraction method:


In [37]:
def bag_of_word_feats(words):
    return dict([(word, True) for word in words])
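For example, calling it on a short tokenized tweet (an illustrative input, not project data) yields one boolean feature per word:


In [ ]:
print bag_of_word_feats(["pass", "budget", "families"])
# {'pass': True, 'budget': True, 'families': True}  (key order may vary)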

Training and Testing sets

The movie reviews corpus contains 2000 reviews: half positive and half negative. To evaluate the classifier properly, part of the data must be kept unseen during training. We therefore build these two sets:


In [40]:
import collections
import nltk.metrics
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
neg_ids = movie_reviews.fileids('neg')
pos_ids = movie_reviews.fileids('pos')
neg_feats = [(bag_of_word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]
pos_feats = [(bag_of_word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]

neg_limit = len(neg_feats)*3/4
pos_limit = len(pos_feats)*3/4


trainfeats = neg_feats[:neg_limit] + pos_feats[:pos_limit]
testfeats = neg_feats[neg_limit:] + pos_feats[pos_limit:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))


train on 1500 instances, test on 500 instances

Training the classifier

At this point we have a training set, so all we need to do is instantiate a classifier and classify the test set.


In [41]:
classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)


accuracy: 0.728

Evaluation

Accuracy is not the only way to evaluate a classifier. For binary classifiers, two other metrics give us more information about the performance: precision and recall. Besides these two, the F-measure combines them as the weighted harmonic mean of precision and recall. We used the nltk.metrics module:
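For reference, with TP, FP and FN denoting the true positives, false positives and false negatives for a label: $$\mathrm{precision} = \frac{TP}{TP+FP}, \qquad \mathrm{recall} = \frac{TP}{TP+FN}, \qquad F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$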


In [42]:
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg'])


pos precision: 0.651595744681
pos recall: 0.98
pos F-measure: 0.782747603834
neg precision: 0.959677419355
neg recall: 0.476
neg F-measure: 0.636363636364

Stop Word Features

We assumed that the dataset contains many uninformative words, known as stop words. We used the NLTK stop word corpus to filter them out with this modified feature extractor:


In [ ]:
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))

def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])

In [43]:
import collections
import nltk.metrics
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords

neg_ids = movie_reviews.fileids('neg')
pos_ids = movie_reviews.fileids('pos')

stopset = set(stopwords.words('english'))

def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])

# rebuild the feature sets, this time filtering out stop words
neg_feats = [(stopword_filtered_word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]
pos_feats = [(stopword_filtered_word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]

neg_limit = len(neg_feats)*3/4
pos_limit = len(pos_feats)*3/4

trainfeats = neg_feats[:neg_limit] + pos_feats[:pos_limit]
testfeats = neg_feats[neg_limit:] + pos_feats[pos_limit:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))

classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg'])


train on 1500 instances, test on 500 instances
accuracy: 0.726
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0
pos precision: 0.649867374005
pos recall: 0.98
pos F-measure: 0.781499202552
neg precision: 0.959349593496
neg recall: 0.472
neg F-measure: 0.632707774799

Removing stop words slightly lowers the accuracy, so we conclude that stop words do add some information to the polarity analysis.

Bigram

We noticed that negations such as "not bad", which is a positive expression, can be misread by the bag-of-words model as negative, since it sees "bad" as a separate word.

Therefore we used nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures to compute bigram and individual word frequencies and keep the bigrams that occur together more often than the individual word frequencies would suggest:


In [36]:
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
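For example, on a short illustrative word list the extractor keeps both the single words and the selected bigrams, so the pair ('not', 'bad') becomes a feature of its own:


In [ ]:
print bigram_word_feats(["not", "bad", "at", "all"])
# features include each word and each selected bigram,
# in particular ('not', 'bad'): True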

Evaluate:


In [46]:
import collections
import itertools
import nltk.metrics
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

neg_ids = movie_reviews.fileids('neg')
pos_ids = movie_reviews.fileids('pos')

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

# rebuild the feature sets using unigrams plus the 200 best bigrams per review
neg_feats = [(bigram_word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]
pos_feats = [(bigram_word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]

neg_limit = len(neg_feats)*3/4
pos_limit = len(pos_feats)*3/4

trainfeats = neg_feats[:neg_limit] + pos_feats[:pos_limit]
testfeats = neg_feats[neg_limit:] + pos_feats[pos_limit:]

classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg'])


accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
pos F-measure: 0.836298932384
neg precision: 0.920212765957
neg recall: 0.692
neg F-measure: 0.7899543379

The results confirm our assumption about negative expressions: the bigram features give better precision, recall, and accuracy.

To update the polarity of our tweets, we created the class Polarity_classifier, from which we can call any of the three previous classifiers. This class assigns a polarity to each tweet by referring to the pickled files of the trained classifiers.


In [11]:
from Polarity_classifier import Polarity_classifier

pc = Polarity_classifier()

#Choose one of the three polarity classifiers
pc.set_polarity_bag_classifier(json_tweets)
#pc.set_polarity_bigram_classifier(json_tweets)
#pc.set_polarity_stop_classifier(json_tweets)

Topic Classification

We tried two classification methods:

Classifying tweets using a clustering algorithm

Classifying tweets using hashtags

Clustering using k-means

This section covers the techniques applied to extract features and cluster tweets into topics.

Most of the algorithms used are already implemented in the scikit-learn library, which significantly simplified the task.

Feature extraction method Tf-idf

To keep the computation and memory footprint manageable, the preprocessed data was reduced to a set of selected words, since the dataset was large.

Starting with a bag-of-words approach, which represents a tweet as a sparse vector of word occurrence counts, a set of the most significant words was chosen to describe the whole data, under the assumption that "the 5% most frequent words are relevant enough to represent the topics discussed in the dataset of tweets".

Next, the set of relevant words was weighted with term frequency-inverse document frequency (tf-idf), a numerical statistic that assigns to each word in the set a value reflecting its importance to a tweet in the corpus.

The weight of a word is the product of the term frequency in a tweet, tf(t, d), and the inverse document frequency idf(t, D), such that: $$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D: t \in d\}|}$$ where N is the total number of tweets in the corpus and $|\{d \in D: t \in d\}|$ is the number of tweets in which the term t appears.
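For instance, with the $N = 3238$ tweets crawled above, a term that appears in 100 tweets gets $\mathrm{idf}(t, D) = \log\frac{3238}{100} \approx 3.5$, while a term that appears in nearly every tweet gets an idf close to zero.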

The idf weights the importance of a term by indicating whether it is common or rare in the corpus; it tends to weigh down very common words like "the", "is" and "a", which are frequent but carry little information.

Thus, the tf-idf value tends to be high for words that are frequent within a group of tweets but rare across the whole corpus, and low for words that are common everywhere, which is significantly helpful in the clustering process.

The number of features was multiplied by the n-gram size, because using n-grams typically makes the same words appear both as single terms and inside tuples, which would otherwise noticeably reduce the number of distinct significant words.

The parameters max_df and min_df indicate, respectively, the maximum fraction of documents and the minimum number of documents in which a term may occur. In our case, terms that appear in fewer than two documents or in more than 95% of the documents are ignored, which keeps the vocabulary focused on informative terms.


In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

# self.ngram and self.n_features are attributes of the project's Clustering class
tfidf_vect = TfidfVectorizer(max_df=0.95, max_features=self.ngram*self.n_features,
                             min_df=2, stop_words='english', ngram_range=(1, self.ngram))
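As a standalone illustration outside the Clustering class (with example parameter values rather than the project's), the weighted matrix can be computed directly on the preprocessed tweets:


In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

# example settings: unigrams and bigrams, at most 1000 features
vect = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english',
                       ngram_range=(1, 2), max_features=1000)
tfidf_matrix = vect.fit_transform(tweet_text)   # sparse (n_tweets x n_terms) matrix
print tfidf_matrix.shape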

Gap Statistic

In order to apply the clustering algorithm, we need the number of clusters k, which is unknown in our project.

In the literature, only a few heuristics have been proposed for choosing k in k-means. We chose the gap statistic because it has shown better performance than other methods.

The gap statistic is a method that, while increasing k in the k-means algorithm, selects the first number of clusters such that adding another cluster does not give a much better modeling of the data.

To compute this gap, the implemented algorithm compares the change in within-cluster dispersion with that expected under an appropriate null reference dispersion, then selects the first k that satisfies: $$\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$$

where $s_{k+1}$ is the standard deviation of the reference dispersion.

The null reference dispersion is generated by sampling uniformly from a box aligned with the original dataset's principal components.
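A minimal sketch of the idea, assuming a dense feature matrix X and a simple axis-aligned bounding box for the reference samples (the project's Clustering.gap_statistic may differ in details such as the PCA-aligned box):


In [ ]:
import numpy as np
from sklearn.cluster import KMeans

def log_dispersion(X, k):
    # log of the within-cluster sum of squares for a k-means fit
    return np.log(KMeans(n_clusters=k, init='k-means++', n_init=10).fit(X).inertia_)

def gap(X, k, n_refs, rng):
    # Gap(k) = mean reference dispersion - observed dispersion, plus its std term
    disp = log_dispersion(X, k)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref = np.array([log_dispersion(rng.uniform(mins, maxs, size=X.shape), k)
                    for _ in range(n_refs)])
    s = ref.std() * np.sqrt(1 + 1.0 / n_refs)
    return ref.mean() - disp, s

def gap_statistic_k(X, kmin=2, kmax=10, n_refs=10):
    rng = np.random.RandomState(0)
    gaps = dict((k, gap(X, k, n_refs, rng)) for k in range(kmin, kmax + 1))
    for k in range(kmin, kmax):
        # stop at the first k where adding a cluster does not help enough
        if gaps[k][0] >= gaps[k + 1][0] - gaps[k + 1][1]:
            return k
    return kmax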


In [12]:
%matplotlib inline

In [19]:
from Clustering import Clustering

clr = Clustering()

best_k = clr.gap_statistic(tweet_text, kmin=2, kmax=10)


best k found by gap statistic =3

K-means clustering

The weighted matrix computed by TfidfVectorizer is fed to the k-means algorithm, which clusters the data points into k topics.

K-means was launched with the k-means++ initialization to choose the initial centroids in a near-optimal way.
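A minimal sketch of this step, assuming the tfidf_matrix and vect objects from the illustrative vectorizer above (the project's Clustering.best_kmeans encapsulates a similar procedure):


In [ ]:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=best_k, init='k-means++', n_init=10)
labels = km.fit_predict(tfidf_matrix)

# show the highest-weighted terms of each cluster centroid
terms = vect.get_feature_names()
order = km.cluster_centers_.argsort()[:, ::-1]
for i in range(best_k):
    print "Cluster %d:" % (i + 1), ", ".join(terms[j] for j in order[i, :5])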


In [22]:
clr.best_kmeans(best_k, tweet_text)


#####################   Clustering with best matching k=3  ####################
Top terms per cluster:
 Cluster 1: health, care, getcovered, time, insurance, 

 Cluster 2: president, obama, watch, america, opportunityforall, 

 Cluster 3: ofa, immigration, reform, actonreform, volunteers, 

Plot of the clusters and centroids (adapted from the scikit-learn tutorial)

Evaluation

The gap statistic relies on many estimations and approximate measures, which makes it rather imprecise: when executed several times on the same data it can return different estimates of k.

The k-means algorithm prefers clusters of approximately similar size, since it always assigns an object to the nearest centroid, which can lead to incorrect borders between clusters.

When the number of clusters is too large or too small, the k-means clusters show similarity between different clusters and heterogeneity within the same cluster, which underlines the importance of predicting a good number of clusters.

Even with a good number of clusters, the implemented clustering algorithm gives good results for the large topics but struggles to extract topics covered by only a few tweets.


In [25]:
from plot import present_all_data
clr.set_tweet_topic(json_tweets)
present_all_data(best_k, json_tweets)


Clustering using hashtags (hashtag = topic)


In [26]:
from hashtag_classification import HtagClassifier, plot
plot(json_tweets)


('Most frequent tweets:', [u'#RaiseTheWage', u'#ActOnClimate', u'#OpportunityForAll', u'#SOTU', u'#GetCovered'])

In [27]:
from hashtag_classification import HtagClassifier
dic = HtagClassifier().htag_classifier(json_tweets)
print("Number of different htags", len(dic.keys()))
print("Number of tweets with no htags: ", len(dic["no_htags_tweet"]))


('Number of different htags', 430)
('Number of tweets with no htags: ', 1256)

About 39% of the tweets (1256 of 3238) contain no hashtags, so they are not classified.

The 5 most used hashtags each appear in more than 100 tweets.

Hashtags like #house and #houses are treated as separate topics even though they are essentially the same.

=> Stemming the hashtags can get rid of this; a sketch is given below.
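A minimal sketch of the suggested fix, assuming tweets are plain strings; hashtags are grouped by their Porter stem so that #house and #houses fall under the same topic (the helper below is hypothetical, not part of HtagClassifier):


In [ ]:
import re
from collections import defaultdict
from nltk.stem import PorterStemmer

def group_hashtags_by_stem(tweets):
    stemmer = PorterStemmer()
    topics = defaultdict(list)
    for tweet in tweets:
        for htag in re.findall(r"#(\w+)", tweet):
            topics[stemmer.stem(htag.lower())].append(tweet)
    return topics

topics = group_hashtags_by_stem(["#house prices rise", "New #houses planned"])
print topics.keys()  # ['hous'] -> both tweets end up under the same stemmed topic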

Overall Evaluation

The clustering algorithm is made to work with a large dataset (more than 1000 tweets), and it does not detect small topics. With 3000 tweets we cover more than one year of Twitter activity, and the classifier shows at most 10 topics, which are the most discussed ones.

The clustering we implemented works better on large datasets. The +: accuracy can be improved with feature engineering. The -: it takes a long time to run.

The hashtag classifier also works on small datasets. The +: no learning time and no preprocessing of the tweets. The -: close to 40% of the tweets do not contain any hashtag.

The approach could be improved by merging the clustering and the hashtag classifiers.

Using unsupervised learning for topic classification also scales well over time, because the topics are dynamic; supervised learning would have more trouble due to its need for labels.