Atef Belaaj
SeifAllah Kouki
Haythem Sahbani
Crawl Twitter
Preprocessing the tweets for clustering
Polarity Classification
Topic classification
Evaluation
Input: an English-speaking politician's Twitter screen name
In [7]:
screen_name = "BarackObama"
# "SenJohnMcCain"
Output: a JSON frame that contains the tweet id, creation date, polarity (to be determined), and topic (to be determined)
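The project's TwitterCrawl class wraps this crawling step. For illustration only, here is a minimal sketch of how such a frame could be built with the tweepy library; the credentials and field names below are placeholders and not the project's actual implementation.
import tweepy

# Illustrative sketch, not TwitterCrawl.get_feed; credentials are placeholders.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

json_tweets = []
for status in api.user_timeline(screen_name="BarackObama", count=200):
    json_tweets.append({
        "id": status.id,                        # tweet id
        "created_at": str(status.created_at),   # date of creation
        "text": status.text,
        "polarity": None,                       # to be determined
        "topic": None,                          # to be determined
    })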
In [8]:
from twitter_crawl import TwitterCrawl
twt = TwitterCrawl()
json_tweets = twt.get_feed(screen_name)
print("number of tweets : %d" % len(json_tweet))
We can save the JSON frame in a JSON file as follows:
In [9]:
file = screen_name + "_tweets.json"
twt.save_tweets(file)
And then load it to avoid crawling the data from Twitter each time.
In [ ]:
json_tweets = twt.load_tweets(file)
The tweets are processed as follows:
First, expand contractions: for example, don't => do not.
Second, tokenize with the NLTK regexp tokenizer, using regular expression patterns tailored to the needs of the project. For example, we removed retweet markers and URLs and kept hashtags (a minimal tokenizer sketch follows this list).
Third, remove stop words: words like "will" are not relevant for topic classification.
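As an illustration of the second step, here is a minimal sketch of a hashtag-preserving tokenizer built on NLTK's RegexpTokenizer; the pattern and the filtering below are illustrative only and are not the project's actual Preprocess/TweetTokenizer code.
# Illustrative sketch only; the project's TweetTokenizer uses its own pattern.
from nltk.tokenize import RegexpTokenizer

pattern = r'''(?x)          # verbose regex
    https?://\S+            # URLs (matched so they can be filtered out below)
  | \#\w+                   # hashtags, kept as single tokens
  | @\w+                    # user mentions
  | [A-Za-z]+(?:'[a-z]+)?   # ordinary words (contractions are expanded earlier)
'''
tokenizer = RegexpTokenizer(pattern)

tweet = "RT @user: do not miss the #ClimateChange speech http://t.co/abc"
tokens = [t for t in tokenizer.tokenize(tweet)
          if t != 'RT' and not t.startswith(('@', 'http'))]
print(tokens)  # ['do', 'not', 'miss', 'the', '#ClimateChange', 'speech']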
In [10]:
from Preprocess import Preprocess, TweetTokenizer
tweet_text = twt.get_tweet_text(json_tweets)
print "tweet before preprocessing: " + tweet_text[1]
for i, tweet in enumerate(tweet_text):
    tweet_text[i] = " ".join(
        Preprocess().remove_stopwords(
            TweetTokenizer().tokenize(
                Preprocess().expand_contraction(tweet))))
print "tweet after preprocessing: " + tweet_text[1]
In order to classify the extracted tweets according to their polarity, we built a naive Bayes classifier. As our main focus is political tweets, we assume that the words in our tweets are spelled correctly, so we decided to train our classifier on the NLTK movie reviews corpus.
As a first step, we used a simplified bag-of-words model where every word is a feature name with a value of True. Here is the feature extraction method:
In [37]:
def bag_of_word_feats(words):
    return dict([(word, True) for word in words])
The movie reviews corpus contains 2000 reviews: half positive and half negative. To make sure that we can evaluate our classifier properly, we must keep part of the data unseen during training. Therefore we have these two sets:
In [40]:
import collections
import nltk.metrics
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
neg_ids = movie_reviews.fileids('neg')
pos_ids = movie_reviews.fileids('pos')
neg_feats = [(bag_of_word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]
pos_feats = [(bag_of_word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]
neg_limit = len(neg_feats)*3/4
pos_limit = len(pos_feats)*3/4
trainfeats = neg_feats[:neg_limit] + pos_feats[:pos_limit]
testfeats = neg_feats[neg_limit:] + pos_feats[pos_limit:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))
At this point, we have a training set, so all we need to do is instantiate a classifier and classify the test instances.
In [41]:
classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
Accuracy is not the only way to evaluate a classifier. For binary classifiers, two other metrics give us more information about performance: precision and recall. Besides these two metrics, there is another measure that combines them: the F-measure, the weighted harmonic mean of precision and recall. We used the nltk.metrics module:
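With equal weights, this combined measure reduces to the harmonic mean of the two: $$F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$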
In [42]:
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg'])
Our dataset contains many words that carry little information; these are the so-called stop words. We used the NLTK stop word corpus to filter them out with this modified feature extractor:
In [ ]:
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])
In [43]:
import collections
import nltk.metrics
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
neg_ids = movie_reviews.fileids('neg')
pos_ids = movie_reviews.fileids('pos')
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])
neg_feats = [(stopword_filtered_word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]
pos_feats = [(stopword_filtered_word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]
neg_limit = len(neg_feats)*3/4
pos_limit = len(pos_feats)*3/4
trainfeats = neg_feats[:neg_limit] + pos_feats[:pos_limit]
testfeats = neg_feats[neg_limit:] + pos_feats[pos_limit:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))
classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg'])
We conclude that stop words add information to the polarity analysis.
We also noticed that a negation like "not bad" is a positive expression that the bag-of-words model can interpret as negative, since it sees "bad" as a separate word.
Therefore we used nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures to calculate bigram and individual word frequencies and determine whether a bigram occurs about as frequently as its individual words:
In [36]:
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
In [46]:
import collections
import nltk.metrics
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
neg_ids = movie_reviews.fileids('neg')
pos_ids = movie_reviews.fileids('pos')
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
neg_feats = [(bigram_word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]
pos_feats = [(bigram_word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]
neg_limit = len(neg_feats)*3/4
pos_limit = len(pos_feats)*3/4
trainfeats = neg_feats[:neg_limit] + pos_feats[:pos_limit]
testfeats = neg_feats[neg_limit:] + pos_feats[pos_limit:]
classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg'])
The results confirm our assumption about negative expressions, as the bigram features give better precision, recall, and accuracy.
To update our tweets' polarity, we created the class Polarity_classifier, from which we can call one of the three previous classifiers. This class assigns a polarity to each tweet by loading the pickle file of the corresponding trained classifier.
In [11]:
from Polarity_classifier import Polarity_classifier
pc = Polarity_classifier()
#Choose one of the three polarity classifiers
pc.set_polarity_bag_classifier(json_tweets)
#pc.set_polarity_bigram_classifier(json_tweets)
#pc.set_polarity_stop_classifier(json_tweets)
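For illustration, here is a minimal sketch of what such an assignment could look like internally, assuming a classifier previously saved with pickle; the file name and the 'text'/'polarity' keys are placeholders, not necessarily the project's actual ones.
import pickle

def bag_of_word_feats(words):
    return dict([(word, True) for word in words])

# Load a previously trained and pickled NaiveBayesClassifier (hypothetical file name).
with open('bag_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)

# Assign a 'pos'/'neg' label to every tweet in the JSON frame.
for tweet in json_tweets:
    words = tweet['text'].split()                 # assumes a 'text' field
    tweet['polarity'] = classifier.classify(bag_of_word_feats(words))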
We tried two classification methods:
Classifying tweets using a clustering algorithm
Classifying tweets using htags
This section covers the techniques applied to extract features and cluster tweets into topics.
Most of the algorithms used are already implemented in the scikit-learn library, which significantly facilitated the task.
To keep the computation manageable in memory, the preprocessed data was reduced to a set of selected words, since the dataset was too large to process in full.
Starting with a bag-of-words approach, which represents a tweet as a sparse vector of word occurrence counts, a set of the most significant words was chosen to describe the whole data, following the assumption that "the 5% most frequent words are relevant enough to represent the topics discussed in the dataset of tweets".
In the next step, the set of relevant words was weighted with term frequency-inverse document frequency (tf-idf), a numerical statistic that assigns to each word in the set a value reflecting its importance to a tweet in the corpus.
The weight of a word is the product of its frequency in a tweet, tf(t, d), and its inverse document frequency, idf(t, D), such that: $$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D: t \in d\}|}$$ where N is the total number of tweets in the corpus and
$|\{d \in D: t \in d\}|$ is the number of tweets in which the term t appears.
The idf weights a term's importance by indicating whether the term is common or rare in the corpus; it tends to weigh down very common words like "the", "is", and "a", which are frequent but carry little information.
Thus, the tf-idf value tends to be high for words that are frequent within a group of tweets and low for words that are either very common or very rare across the whole corpus, which is significantly helpful in the clustering process.
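For instance (with hypothetical numbers and the natural logarithm): in a corpus of N = 3000 tweets, a term that appears in 30 tweets gets $\mathrm{idf} = \log(3000/30) \approx 4.6$, while a term that appears in 2850 tweets gets $\mathrm{idf} = \log(3000/2850) \approx 0.05$ and is thus strongly down-weighted.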
The maximum number of features was multiplied by the n-gram order, because using n-grams tends to make the same words appear both as single terms and inside tuples, which would otherwise noticeably reduce the number of significant words retained.
The parameters max_df and min_df indicate, respectively, the maximum proportion of documents and the minimum number of documents in which a term must occur. In our case, terms that appear in fewer than two documents or in more than 95% of documents are ignored, which restricts the vocabulary to informative terms.
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(max_df=0.95, max_features=self.ngram * self.n_features,
                             min_df=2, stop_words='english', ngram_range=(1, self.ngram))
In order to apply the clustering algorithm, we need to know the number k of clusters, which is unknown in our project.
In the literature, a few heuristics have been proposed to estimate k for k-means. We chose the Gap Statistic because it has shown better performance than other methods.
The Gap Statistic increases k in the k-means algorithm and selects the first number of clusters such that adding another cluster does not yield a much better model of the data.
To compute this gap, the implemented algorithm compares the change in within-cluster dispersion with that expected under an appropriate null reference dispersion, and then selects the first k that satisfies: $$\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$$
where $s_{k+1}$ is the standard deviation of the reference dispersion.
The null reference dispersion is generated by sampling uniformly from a rectangle formed from the original dataset's principal components.
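The project's Clustering.gap_statistic implements this procedure; the following is a minimal, simplified sketch of the idea using scikit-learn's KMeans, with a plain bounding box as the reference region instead of the principal-component rectangle (all names and defaults are illustrative):
# Simplified Gap Statistic sketch (Tibshirani et al., 2001); illustrative only.
# X is assumed to be a dense numpy feature matrix (n_samples x n_features).
import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, k):
    """Within-cluster sum of squared distances (k-means inertia) for k clusters."""
    return KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X).inertia_

def gap_statistic(X, kmin=2, kmax=10, n_refs=10):
    """Return the first k with Gap(k) >= Gap(k+1) - s_{k+1}."""
    rng = np.random.RandomState(0)
    mins, maxs = X.min(axis=0), X.max(axis=0)        # bounding box of the data
    gaps, sks = [], []
    for k in range(kmin, kmax + 1):
        # Dispersion of uniform reference datasets drawn inside the bounding box.
        ref = np.array([within_dispersion(rng.uniform(mins, maxs, X.shape), k)
                        for _ in range(n_refs)])
        log_ref = np.log(ref)
        gaps.append(log_ref.mean() - np.log(within_dispersion(X, k)))
        sks.append(log_ref.std() * np.sqrt(1.0 + 1.0 / n_refs))
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return kmin + i
    return kmax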
In [12]:
%matplotlib inline
In [19]:
from Clustering import Clustering
clr = Clustering()
best_k = clr.gap_statistic(tweet_text, kmin=2, kmax=10)
The weighted matrix computed by TfidfVectorizer serves as input to the k-means algorithm, which clusters the data points into k topics.
K-means was launched with the k-means++ option, which chooses the initial cluster centroids in a near-optimal way.
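A minimal sketch of this step, assuming tweet_text is the list of preprocessed tweet strings and best_k comes from the gap statistic; this is not the project's Clustering.best_kmeans implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Build the tf-idf weighted tweet-term matrix (parameters mirror the ones above).
tfidf_vect = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = tfidf_vect.fit_transform(tweet_text)

# Cluster the tweets into best_k topics with k-means++ initialization.
km = KMeans(n_clusters=best_k, init='k-means++', n_init=10)
labels = km.fit_predict(X)          # one topic label per tweet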
In [22]:
clr.best_kmeans(best_k, tweet_text)
Draw clusters and centroids (taken from the scikit-learn tutorial)
The Gap Statistic relies on many estimations and approximate measures, which makes it rather imprecise: when executed several times on the same data, it can return different estimates of k.
The k-means algorithm prefers clusters of approximately similar size, since it always assigns a point to the nearest centroid, which can lead to incorrect borders between clusters.
For very large or very small numbers of clusters, the k-means result shows similarity between different clusters and heterogeneity within the same cluster, which reflects the importance of predicting a good number of clusters.
But even with a good number of clusters, the implemented clustering algorithm achieves good results on large topics while showing weakness at extracting topics with few tweets.
In [25]:
from plot import present_all_data
clr.set_tweet_topic(json_tweets)
present_all_data(best_k, json_tweets)
In [26]:
from hashtag_classification import HtagClassifier, plot
plot(json_tweets)
In [27]:
from hashtag_classification import HtagClassifier
dic = HtagClassifier().htag_classifier(json_tweets)
print("Number of different htags", len(dic.keys()))
print("Number of tweets with no htags: ", len(dic["no_htags_tweet"]))
Almost 30% of the tweets have no htags, so they are not classified.
The 5 most used htags are present in more than 100 tweets.
Htags like #house and #houses are taken as separate topics, although they are pretty much the same.
=> Stemming the htags can get rid of this (see the sketch below).
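A small sketch of the suggested fix, using NLTK's Porter stemmer to map hashtags such as #house and #houses onto the same key (illustrative only, not part of the current HtagClassifier):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize_htag(htag):
    # Strip the '#' and stem, so '#house' and '#houses' collapse to one topic key.
    return stemmer.stem(htag.lstrip('#').lower())

print(normalize_htag('#houses'))  # 'hous'
print(normalize_htag('#house'))   # 'hous' -> same topic key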
The clustering algorithm is made to work with a big data set (more than 1000 tweets) and does not capture small topics. With 3000 tweets we crawl more than one year of Twitter activity, and the classifier shows at most 10 topics; these are the most discussed ones.
The clustering we implemented works better on big data sets. The +: accuracy can be enhanced with feature engineering. The -: it takes too long to run.
The hashtag classifier can work on a small data set. The +: no learning time and no preprocessing of the tweets. The -: about 30% of the tweets do not contain any htag.
The algorithm can be improved by merging the clustering and htag classifiers.
Moreover, unsupervised learning for topic classification scales well over time because topics are dynamic, whereas supervised learning would have more problems due to its need for labels.