We used two main datasets for training:
Sentiment Analysis Dataset: 1.6 million labelled sentences, with '0' for negative and '1' for positive. This dataset is used to train a two-class classifier.
Sample:
[['0' 'I missed the New Moon trailer...']
['1' 'omg its already 7:30 :O']
['0' ".. Omgaga. Im sooo im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just
get a crown put on (30mins)..."]
['0' ' i think mi bf is cheating on me!!! T_T']
['0' ' or i just worry too much? ']]
Amazon Dataset: 30k Amazon product reviews, each paired with the rating the user gave to the product. This dataset is used to train multi-class classifiers and regressors.
Sample:
[[2, "It is good if you have internet than you can download the stuff, else, you can't "],
[5, "The RIO rocks! It is so great that Diamond Multimedia prevailed in their fight against the forces of pure evil in the music industry and allowed us, the public, to have the RIO! This little baby holds your MP3's and plays them with outrageous quality and no moving parts! You simply cannot make the music "skip". Take it jogging, bob sledding, whatever! The Rio is cute and compact, battery lasts forever, runs great and is really simple to use. Works well with the PC linkup, etc. A hot item! "],
[4, 'I had high hopes for the Diamond Rio and it certainly lived up 2 the hype. Lightweight and excellent quality with some good connecting software. My only gripe can be with memory. You definately need another 32Mb to store your music. If you want one, my advice is to wait for the new upgraded version with 64Mb and a graphic equaliser! '], [5, "Diamond's RIO is the current, silicon-state nightmare for monopolistic entertainment industries. When the first audio recording-devices entered the consumer market decades ago, the idea of a controlled "charge-per-copy" business model in the music industry was doomed. Although traditional copyrights could never be totally enforced, the record and music-producing industry neglected these threads existing in the shades of multi-billion profits.<p>Mp3 and the Internet raided the existing markets with their "natural" power like a cruise-missile against a frying-pan. Like other good examples for "Killer-Apps", the RIO is designed to Web-Specs (mp3). This makes its use and performance comparable with other CD-quality playing devices while putting the power of the internet into the palm of your hand. If you want to make a statement of "being digital", the RIO is a must! "],
[5, "Remember when the Walkman hit the market years ago? At first it was fad, then it became a craze before finally becoming as ubiquitous as taxi cabs in Manhattan. The Diamond Rio is the first product with the potential to eventually REPLACE the Walkman. Think about it - portable music that sounds as clear as a compact disc from a device that looks and feels like a pager. Unshockable, no matter how much you jump around or how bumpy the bus ride. This is the perfect gift for your favorite "gadget junkie", as long as they have a PC to download from. Pssst...here's a secret: you can record your own CD's into MP3 format and download the songs into your RIO - it isn't just for internet music!! Great product. "]]
Facebook Dataset: in order to study the reactions of the Facebook audience to a post, we needed to scrape the comments of that post.
Here are excerpts from the three functions used to retrieve the information from Facebook.
postReactions(postId, filename): takes the post id and the name of the corresponding dataset we want to generate. Returns a table with: the post message, the post id, the post time, the number of likes of the post, the number of comments, and for each comment its id, message, time, and number of likes.
lastPostsReactions(page, filename, n): takes the page id, the name of the corresponding dataset we want to generate, and the minimum number of posts we want from the page. It returns the same output as postReactions(), but for several posts.
getAllComments(postId, nb_comments_per_post, serie, fb): this function is used by the two functions above to get all the comments of a post. If a post has more than 25 comments, we need to paginate through the comments to retrieve all of them.
In [ ]:
def postReactions(postId, filename):
    ...
    fields = 'id,created_time,message,likes.limit(0).summary(1),comments.limit(0).summary(1)'
    url = 'https://graph.facebook.com/{}?fields={}&access_token={}'.format(postId, fields, token)
    fb = pd.DataFrame(columns=['post message', 'post id', 'post time', 'post likes', 'nb of comments',
                               'comment id', 'comment message', 'comment time', 'comment likes'])
    serie = {'post message': [], 'post id': [], 'post time': [], 'post likes': [], 'nb of comments': [],
             'comment id': [], 'comment message': [], 'comment time': [], 'comment likes': []}
    post = requests.get(url).json()
    ...
    try:
        # Only work with posts whose comments contain text.
        nb_comments_per_post = post['comments']['summary']['total_count']
        # IndexError if the post has no comment: only work with posts
        # which have at least 1 comment.
        x = post['comments']['data'][0]['message']
        serie['post message'] = post_message  # post_message is extracted in the elided part above
        serie['post time'] = post['created_time']
        serie['post likes'] = post['likes']['summary']['total_count']
        serie['nb of comments'] = post['comments']['summary']['total_count']
        serie['post id'] = post['id']
        fb = getAllComments(postId, nb_comments_per_post, serie, fb)
    except (IndexError, KeyError):
        pass  # skip posts without usable comments or with missing fields
    ...
    fb['post time'] = fb['post time'].apply(convert_time)
    fb['comment time'] = fb['comment time'].apply(convert_time)
    ...
    return fb
In [ ]:
def lastPostsReactions(page, filename, n):
    ...
    i = 0
    while i < n:
        posts = requests.get(url).json()
        # extract the information for each of the received posts
        for post in posts['data']:
            [...]
        try:
            url = posts['paging']['next']
        except KeyError:
            # no next page of posts
            break
    print("Number of posts: ", i)
    return fb
In [ ]:
def getAllComments(postId, nb_comments_per_post, serie, fb):
    fields_comments = 'comment_count,like_count,created_time,message'
    url_comments = 'https://graph.facebook.com/{}/comments/?fields={}&access_token={}'.format(postId, fields_comments, token)
    comment = 0
    while comment <= nb_comments_per_post + 1:
        post_comments = requests.get(url_comments).json()
        i = 0
        for com in post_comments['data']:
            i = i + 1
            comment_message = com['message']
            serie['comment message'] = comment_message
            serie['comment time'] = com['created_time']
            serie['comment likes'] = com['like_count']
            serie['comment id'] = com['id']
            fb = fb.append(serie, ignore_index=True)
            comment = comment + 1
        try:
            url_comments = post_comments['paging']['next']
        except KeyError:
            break
    return fb
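Putting the three functions together, a typical scraping run looks roughly like this (the access token is a placeholder; the post and page ids are taken from the examples further down, and postReactions() is called with the two-argument signature shown above):
In [ ]:
token = '<GRAPH_API_ACCESS_TOKEN>'  # placeholder: a valid Graph API token is required

# Scrape one post and all of its comments.
fb_post = postReactions('10154812470324641', 'CNN_OS1')

# Scrape the reactions to at least 20 posts of a page.
fb_page = lastPostsReactions('cnninternational', 'CNN', 20)
fb_page.head()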
For classification, the comments are transformed into vectors that represent the features of the text. This is the bag-of-words model: each word is mapped to an integer index, and the vector counts how many times the word appears in the document.
Removing stopwords: using the English stopword list.
SelectKBest model: selects the K features most related to the labels, scored with the chi2 statistic.
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def compute_bag_of_words(text, stopwords, vocab=None):
    vectorizer = CountVectorizer(stop_words=stopwords, vocabulary=vocab)
    vectors = vectorizer.fit_transform(text)
    vocabulary = vectorizer.get_feature_names()
    return vectors, vocabulary

KBestModel = SelectKBest(chi2, k=1000).fit(bow, Y_sat)
indices = KBestModel.get_support(True)
bow_transformed = KBestModel.transform(bow)
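To see which words are kept, the indices returned by get_support() can be mapped back to the vocabulary. The sketch below assumes that bow and vocabulary come from a compute_bag_of_words() call on the training text; writing the selected words out is one way the Best_Features_SAT.txt file loaded further down could have been produced (an assumption on our part):
In [ ]:
import codecs

# Map the indices kept by SelectKBest back to the corresponding words.
# vocabulary: the word list returned by compute_bag_of_words() on the training text.
best_features = [vocabulary[i] for i in indices]
print(best_features[:20])

# Save them for reuse at prediction time.
with codecs.open('Best_Features_SAT.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(best_features))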
Some of the informative features of our dataset:
In [ ]:
import codecs
from nltk.corpus import stopwords  # assumed source of the stopwords.words() call

def Transform_To_Input_Format_SAT_Classifiers(X):
    with codecs.open('Best_Features_SAT.txt', 'r', encoding='utf8') as f:
        features = f.readlines()
    features = [x.strip("\n") for x in features]
    X_transformed, vocab = compute_bag_of_words(X, stopwords.words(), features)
    return X_transformed, features

def Transform_To_Input_Format_Amazon(X):
    with open('Best_Features_Amazon.txt') as f:
        features = f.readlines()
    features = [x.strip("\n") for x in features]
    X_transformed, vocab = compute_bag_of_words(X, stopwords.words(), features)
    return X_transformed
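As an illustration of how these helpers plug into the pipeline, scraped Facebook comments can be transformed and fed to an already trained two-class classifier (in this sketch, clf is assumed to be a fitted classifier and fb a table returned by postReactions()):
In [ ]:
# clf: a classifier already fitted on the sentiment dataset (assumption);
# fb: a table of scraped comments returned by postReactions().
comments = fb['comment message'].values
X_comments, _ = Transform_To_Input_Format_SAT_Classifiers(comments)
predicted_sentiment = clf.predict(X_comments)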
In [ ]:
partial_fit_classifiers = {
'SGD': SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0,
learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
penalty='l1', power_t=0.5, random_state=None, shuffle=True,
verbose=0, warm_start=False),
'Perceptron': Perceptron(),
'NB Multinomial': MultinomialNB(alpha=0.01),
'Passive-Aggressive': PassiveAggressiveClassifier(C=1.0, n_iter=50, shuffle=True,
verbose=0, loss='hinge',
warm_start=False),
'NB Bernoulli': BernoulliNB(alpha=0.01),
}
In [ ]:
partial_fit_Regressors = {
'SGD Regressor':SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001, l1_ratio=0,
fit_intercept=True, n_iter=1000, shuffle=True, verbose=0, epsilon=0.01, random_state=None,
learning_rate='invscaling', eta0=0.01, power_t=0.25, warm_start=False, average=False),
'Passive-Aggressive Regressor' : PassiveAggressiveRegressor(),
}
The dataset is too large to train the classifiers on in one pass (it raises a MemoryError), which is why we used incremental learning:
Libraries used:
1. scikit-learn (sklearn)
2. NumPy
3. SciPy
Learning Process:
Divide Data into Batches
minibatch_size = 10000
batch = bow_transformed[start:start+minibatch_size]
Partially train the classifier on the batch
cls.partial_fit(X_train, Y_train, classes = y_all)
Cross-Validation
kf = KFold(n_splits = 10)
for train_index,test_index in kf.split(X):
X_train,X_test = X[train_index],X[test_index]
Y_train,Y_test = Y[train_index],Y[test_index]
Accuracy Check:
Classifiers:
100*sklearn.metrics.accuracy_score(Y_test, train_pred)
Regressors:
sklearn.metrics.mean_squared_error(Y_test, train_pred)
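Putting the steps above together, the incremental training loop looks roughly as follows (a sketch; the real notebook's bookkeeping may differ slightly):
In [ ]:
import numpy as np
import sklearn.metrics
from sklearn.base import clone
from sklearn.model_selection import KFold

minibatch_size = 10000
X, Y = bow_transformed, Y_sat          # features and labels of the sentiment dataset (as numpy/sparse arrays)
y_all = np.unique(Y)                   # all class labels, needed by partial_fit
kf = KFold(n_splits=10)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    for name, prototype in partial_fit_classifiers.items():
        cls = clone(prototype)         # fresh classifier for every fold
        # Feed the training fold to the classifier one minibatch at a time.
        for start in range(0, X_train.shape[0], minibatch_size):
            cls.partial_fit(X_train[start:start + minibatch_size],
                            Y_train[start:start + minibatch_size],
                            classes=y_all)
        test_pred = cls.predict(X_test)
        print(name, 100 * sklearn.metrics.accuracy_score(Y_test, test_pred))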
In [23]:
#CNN article about orlando shooting's victims suing twitter and facebook
#page='cnninternational'
#postId= '10154812470324641'
#fb= postReactions(postId,1,'CNN_OS1')
filename='facebookCNN_OS1.sqlite'
getAllGraphsForPost(filename,50)
CNN article about John McCain saying that Obama is responsible for the Orlando shooting
In [25]:
#postId= '10154219265779641'
#fb= postReactions(postId,1,'CNN_OS2')
filename='facebookCNN_OS2.sqlite'
getAllGraphsForPost(filename,50)
CNN article about gun sales after the Orlando shooting
In [26]:
#postId= '10154210843679641'
#fb= postReactions(postId,1,'CNN_OS3')
filename='facebookCNN_OS3.sqlite'
getAllGraphsForPost(filename,50)
In [28]:
#postId= '10154817313449641'
#fb= postReactions(postId,1,'CNN_Ab1')
filename='facebookCNN_Ab1.sqlite'
getAllGraphsForPost(filename,50)
CNN article about the Texas fetus burial obligation
In [30]:
#postId= '10154738564069641'
#fb= postReactions(postId,1,'CNN_Ab2')
filename='facebookCNN_Ab2.sqlite'
getAllGraphsForPost(filename,50)
CNN article about the pope's forgiveness of abortion
In [32]:
#postId= '10154708071624641'
#fb= postReactions(postId,1,'CNN_Ab3')
filename='facebookCNN_Ab3.sqlite'
getAllGraphsForPost(filename,50)
In [33]:
#page = 'DavidCameronOfficial'
#postId= '1216426805048302'
#fb= postReactions(postId,1,'DC1')
filename='facebookDC1.sqlite'
getAllGraphsForPost(filename,50)
David Cameron 'thank you' note to the voters
In [34]:
#postId= '1218229621534687'
#fb= postReactions(postId,0,'DC2')
filename='facebookDC2.sqlite'
getAllGraphsForPost(filename,50)
Hillary Clinton asking for votes (before election)
In [35]:
#page = 'hillaryclinton'
#postId= '1322830897773436'
#fb= postReactions(postId,1,'HC3')
filename='facebookHC3.sqlite'
getAllGraphsForPost(filename,50)
Donald Trump speaks about the defeat of the Democrats
In [45]:
#page = 'DonaldTrump'
#postId= '10158423167515725'
#fb= postReactions(postId,0,'DT3')
filename='facebookDT3.sqlite'
getAllGraphsForPost(filename,50)
Donald Trump speaks about Obamacare
In [48]:
#postId= '10158417912085725'
#fb= postReactions(postId,0,'DT4')
filename='facebookDT4.sqlite'
getAllGraphsForPost(filename,50)
In [49]:
#page='ajplusenglish'
#postId='843345759140266'
#fb= postReactions(postId,1,'ENV1')
filename='facebookENV1.sqlite'
getAllGraphsForPost(filename,50)
AJ+ post on celebrations after the U.S. Army Corps of Engineers refused to grant an easement allowing the Dakota Access Pipeline to go under Lake Oahe
In [50]:
#postId='852012618273580'
#fb= postReactions(postId,0,'ENV2')
filename='facebookENV2.sqlite'
getAllGraphsForPost(filename,50)
In [51]:
#postId='864548553686653'
#fb= postReactions(postId,1,'ENV3')
filename='facebookENV3.sqlite'
getAllGraphsForPost(filename,50)
AJ+ post on Donald Trump's pick to lead the Environmental Protection Agency not believing in man-made climate change.
In [52]:
#postId='855199717954870'
#fb= postReactions(postId,1,'ENV4')
filename='facebookENV4.sqlite'
getAllGraphsForPost(filename,50)
In [53]:
#page='cnninternational'
#postId='10154249259664641'
#fb= postReactions(postId,1,'HM1')
filename='facebookHM1.sqlite'
getAllGraphsForPost(filename,50)
CNN post on the fact that more than half of British Muslims think homosexuality should be illegal
In [54]:
#postId='10154056181289641'
#fb= postReactions(postId,1,'HM2')
filename='facebookHM2.sqlite'
getAllGraphsForPost(filename,50)
In [37]:
#page='cnninternational'
#lastPostsReactions(page,'CNN',20)
filename='facebookCNN.sqlite'
fb=getTable('facebookCNN.sqlite')
getAllGraphsForPage(filename,50)