In this section we will see how to extract features from tweets and use a classifier to classify each tweet as positive or negative.
We will use pandas DataFrames (http://pandas.pydata.org/) to store and process the tweets. Pandas DataFrames are very powerful Python data structures, a bit like Excel spreadsheets with the power of Python.
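As a quick standalone illustration (with made-up data, not part of the tweet processing below), a DataFrame can be built from a list of dictionaries, one row per dictionary, and then queried column by column:

import pandas as pd

# toy example: two rows, two columns
toy_df = pd.DataFrame([{'name': 'alice', 'followers': 10},
                       {'name': 'bob', 'followers': 25}])
print(toy_df)
print(toy_df['followers'].mean())  # column-wise operations, like a spreadsheet formula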
In [4]:
# Let's create a DataFrame with each tweet using pandas
import pandas as pd
import json
import numpy as np

def getTweetID(tweet):
    """ If properly included, get the ID of the tweet """
    return tweet.get('id')

def getUserIDandScreenName(tweet):
    """ If properly included, get the tweet
        user ID and Screen Name """
    user = tweet.get('user')
    if user is not None:
        uid = user.get('id')
        screen_name = user.get('screen_name')
        return uid, screen_name
    else:
        return (None, None)

filename = 'AI2.txt'
# create a list of dictionaries with the data that interests us
tweet_data_list = []
with open(filename, 'r') as fopen:
    # each line corresponds to a tweet
    for line in fopen:
        if line != '\n':
            tweet = json.loads(line.strip('\n'))
            tweet_id = getTweetID(tweet)
            user_id = getUserIDandScreenName(tweet)[0]
            text = tweet.get('text')
            if tweet_id is not None:
                tweet_data_list.append({'tweet_id': tweet_id,
                                        'user_id': user_id,
                                        'text': text})

# put everything in a DataFrame
tweet_df = pd.DataFrame.from_dict(tweet_data_list)
In [5]:
print(tweet_df.shape)
print(tweet_df.columns)
# print the first 5 elements of one of the columns
print(tweet_df.text.iloc[:5])
# or
print(tweet_df['text'].iloc[:5])
In [6]:
# show the first 10 rows
tweet_df.head(10)
Out[6]:
This part uses concepts from Natural Language Processing. We will use a tweet tokenizer I built based on the TweetTokenizer from NLTK (http://www.nltk.org/). You can see how it works by opening the file TwSentiment.py. The goal is to process any tweet and extract a list of words, taking into account usernames, hashtags, URLs, emoticons and all the informal text we can find in tweets. We also want to reduce the number of features by applying some transformations, such as putting all the words in lower case.
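If you are curious about what the underlying NLTK TweetTokenizer does on its own, here is a minimal sketch (the extra normalization of mentions and URLs is added on top of it in TwSentiment.py):

from nltk.tokenize import TweetTokenizer

# NLTK's tokenizer already treats emoticons, hashtags and @mentions as single tokens
nltk_tok = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=False)
print(nltk_tok.tokenize('@someone this is SO cooooool!!! #AI http://example.com :-D'))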
In [7]:
from TwSentiment import CustomTweetTokenizer
In [8]:
tokenizer = CustomTweetTokenizer(preserve_case=False, # do not keep upper case letters
                                 reduce_len=True, # reduce repetitions of a letter to a maximum of three
                                 strip_handles=False, # do not remove usernames (@mentions)
                                 normalize_usernames=True, # replace all mentions with "@USER"
                                 normalize_urls=True, # replace all URLs with "URL"
                                 keep_allupper=True) # keep upper case for words that are entirely in upper case
In [9]:
# example
tweet_df.text.iloc[0]
Out[9]:
In [10]:
tokenizer.tokenize(tweet_df.text.iloc[0])
Out[10]:
In [9]:
# other examples
tokenizer.tokenize('Hey! This is SO cooooooooooooooooool! :)')
Out[9]:
In [11]:
tokenizer.tokenize('Hey! This is so cooooooool! :)')
Out[11]:
We will use the occurrence of words and pairs of words (bigrams) as features.
This corresponds to a bag-of-words representation (https://en.wikipedia.org/wiki/Bag-of-words_model): we simply count each word (or n-gram) without taking its order into account. For document classification, the frequency of occurrence of each word is usually taken as a feature. Tweets are so short that we can just count each word once.
Using pairs of words allows us to capture some of the context in which each word appears. This helps capture the correct meaning of words.
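To make this concrete, a feature extractor of this kind could look roughly like the following (a minimal sketch, not the exact implementation of bag_of_words_and_bigrams used below):

def simple_bag_of_words_and_bigrams(tokens):
    # mark each word, and each pair of consecutive words, as present (True),
    # ignoring word order and repetitions
    features = {word: True for word in tokens}
    for bigram in zip(tokens, tokens[1:]):
        features[bigram] = True
    return features

print(simple_bag_of_words_and_bigrams(['not', 'happy', 'today']))
# {'not': True, 'happy': True, 'today': True, ('not', 'happy'): True, ('happy', 'today'): True}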
In [12]:
from TwSentiment import bag_of_words_and_bigrams
# this will return a dictionary of features,
# we just list the features present in this tweet
bag_of_words_and_bigrams(tokenizer.tokenize(tweet_df.text.iloc[0]))
Out[12]:
You can download the classifier I trained here: https://www.dropbox.com/s/09rw6a85f7ezk31/sklearn_SGDLogReg_.pickle.zip?dl=1
I trained this classifier on this dataset: http://help.sentiment140.com/for-students/, following the approach from this paper: http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf
This is a set of 14 million tweets with emoticons. Tweets containing "sad" emoticons (7 million) are considered negative and tweets with "happy" emoticons (7 million) are considered positive.
I used a Logistic Regression classifier with L2 regularization that I optimized with a 10-fold cross-validation, using the $F_1$ score as a metric.
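For reference, training a similar classifier yourself could look roughly like this (a sketch only: train_texts and train_labels are hypothetical names for the labelled tweets and their 0/1 classes, and the exact settings I used may differ):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# extract and vectorize the features of the training tweets
train_features = [bag_of_words_and_bigrams(tokenizer.tokenize(t)) for t in train_texts]
X_train = DictVectorizer().fit_transform(train_features)

# logistic regression trained by SGD with an L2 penalty
# (in recent scikit-learn versions the loss is called 'log_loss' instead of 'log')
logreg = SGDClassifier(loss='log', penalty='l2')

# choose the regularization strength alpha by 10-fold cross-validation on the F1 score
grid = GridSearchCV(logreg, {'alpha': [1e-6, 1e-5, 1e-4]}, scoring='f1', cv=10)
grid.fit(X_train, train_labels)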
In [14]:
# the classifier is saved in a "pickle" file
import pickle

with open('sklearn_SGDLogReg_.pickle', 'rb') as fopen:
    classifier_dict = pickle.load(fopen)
In [15]:
# classifier_dict contains the classifier and label mappers
# that I added so that we remember how the classes are
# encoded
classifier_dict
Out[15]:
The classifier is in fact contained in a pipeline. A sklearn pipeline allows you to assemble several transformations of your data (http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
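For illustration, a two-step pipeline like this one could be assembled as follows (a sketch, not the code that produced the saved classifier):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier

# calling .fit / .predict_proba on the pipeline applies the steps in order:
# first vectorize the feature dictionaries, then classify the resulting vectors
example_pipeline = Pipeline([('vectorizer', DictVectorizer()),
                             ('classifier', SGDClassifier(loss='log'))])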
In [16]:
pipeline = classifier_dict['sklearn_pipeline']
In our case we have two steps:
In [17]:
pipeline.steps
Out[17]:
In [18]:
# this is the step that transforms a list of textual features into a vector of zeros and ones
dict_vect = pipeline.steps[0][1]
In [19]:
dict_vect.feature_names_
Out[19]:
In [20]:
# number of features
len(dict_vect.feature_names_)
Out[20]:
In [21]:
# a little example
text = 'Hi all, I am very happy today'
# first tokenize
tokens = tokenizer.tokenize(text)
print(tokens)
# list features
features = bag_of_words_and_bigrams(tokens)
print(features)
# vectorize features
X = dict_vect.transform(features)
print(X.shape)
In [22]:
# X is a special, sparse kind of array: because it contains mostly zeros,
# it can be encoded to take much less space in memory.
# if we want to see it fully, we can use .toarray()
# number of non-zero values in X:
X.toarray().sum()
Out[22]:
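You can also inspect the sparse encoding directly, without converting X to a dense array:

print(type(X))  # a scipy sparse matrix
print(X.nnz)    # number of stored (non-zero) entries
print(X)        # (row, column) coordinates of the non-zero entries and their values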
The mapping between the list of features and the vector of zeros and ones is done when you train the pipeline with its .fit
method.
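A small standalone illustration of this fit-time mapping, with toy feature dictionaries:

from sklearn.feature_extraction import DictVectorizer

toy_vect = DictVectorizer()
# .fit_transform learns the feature -> column mapping and applies it
toy_X = toy_vect.fit_transform([{'good': True, 'day': True},
                                {'bad': True, 'day': True}])
print(toy_vect.feature_names_)  # ['bad', 'day', 'good']
print(toy_X.toarray())          # [[0., 1., 1.], [1., 1., 0.]]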
In [23]:
classifier = pipeline.steps[1][1]
In [24]:
classifier
Out[24]:
In [25]:
# access the weights of the logistic regression
classifier.coef_
Out[25]:
In [26]:
# we have as many weights as features
classifier.coef_.shape
Out[26]:
In [27]:
# plus the intercept
classifier.intercept_
Out[27]:
In [28]:
# let's check the weight associated with a given feature
x = dict_vect.transform({'bad': True})
_, ind = np.where(x.todense())
classifier.coef_[0,ind]
Out[28]:
In [29]:
# find the probability for a specific tweet
classifier.predict_proba(X)
Out[29]:
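To connect the weights with these probabilities: for a binary logistic regression, the probability of the positive class is the logistic sigmoid of the linear score $w \cdot x + b$. A quick sanity check with the X vector computed above (assuming the saved classifier is a plain logistic regression):

# linear score w.x + b, then the logistic sigmoid
score = X.dot(classifier.coef_.T) + classifier.intercept_
prob_pos = 1.0 / (1.0 + np.exp(-score))
print(prob_pos)  # should match the second column of classifier.predict_proba(X)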
Using the sklearn pipeline to chain the last two steps:
In [30]:
pipeline.predict_proba(features)
Out[30]:
We see two numbers: the first one is the probability of the tweet being negative (sad), the second one is the probability of it being positive (happy).
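To double-check which column corresponds to which class, you can look at the classes_ attribute of the classifier (the label mappers stored in classifier_dict are there to remember how these class values are encoded):

print(classifier.classes_)  # column order used by predict_proba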
In [31]:
# note that:
pipeline.predict_proba(features).sum()
Out[31]:
In [32]:
from TwSentiment import TweetClassifier
In [33]:
twClassifier = TweetClassifier(pipeline,
                               tokenizer=tokenizer,
                               feature_extractor=bag_of_words_and_bigrams)
In [34]:
# example
text = 'Hi all, I am very happy today'
twClassifier.classify_text(text)
Out[34]:
In [35]:
# the classify_text method also accepts a list of texts as input
twClassifier.classify_text(['great day today!', 'bad day today...'])
# try also:
# twClassifier.classify_text(['not sad', 'not happy'])
Out[35]:
In [36]:
emo_clas, prob = twClassifier.classify_text(tweet_df.text.tolist())
In [37]:
# add the results to the DataFrame
In [38]:
tweet_df['pos_class'] = (emo_clas == 'pos')
tweet_df['pos_prob'] = prob[:,1]
In [39]:
tweet_df.head()
Out[39]:
In [40]:
# plot the distribution of probability
import matplotlib.pyplot as plt
%matplotlib inline
h = plt.hist(tweet_df.pos_prob, bins=50)
We want to classify users based on the class of their tweets. Pandas allows us to easily group tweets per user using the groupby method of DataFrames:
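As a small aside, the grouped DataFrame also gives, for instance, the number of tweets per user directly:

# count how many tweets each user posted
tweets_per_user = tweet_df.groupby('user_id').size().sort_values(ascending=False)
print(tweets_per_user.head())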
In [41]:
user_group = tweet_df.groupby('user_id')
In [42]:
print(type(user_group))
In [43]:
# let's look at one of the group
groups = user_group.groups
uid = list(groups.keys())[5]
user_group.get_group(uid)
Out[43]:
In [44]:
# we need a function that takes the DataFrame of tweets grouped by user
# and returns the class of the user
def get_user_emo(group):
    num_pos = group.pos_class.sum()
    num_tweets = group.pos_class.size
    if num_pos/num_tweets > 0.5:
        return 'pos'
    elif num_pos/num_tweets < 0.5:
        return 'neg'
    else:
        return 'NA'
In [45]:
# apply the function to each group
user_df = user_group.apply(get_user_emo)
In [46]:
# This is a pandas Series where the index is the user_id
user_df.head(10)
Out[46]:
In [48]:
import networkx as nx

G = nx.read_graphml('twitter_lcc_AI2.graphml', node_type=int)
for n in G.nodes_iter():
    if n in user_df.index:
        # here we look at the value of the user_df Series at the position
        # where the index is equal to the user_id of the node
        G.node[n]['emotion'] = user_df.loc[user_df.index == n].values[0]
In [49]:
# we have added an attribute 'emotion' to the nodes
G.node[n]
Out[49]:
In [50]:
# save the graph to open it with Gephi
nx.write_graphml(G, 'twitter_lcc_emo_AI2.graphml')
In [ ]: