Experimenting with Gensim/Word2Vec on tweets collected by the folks at the Discursive project. Also making use of the word2vec model built from ~400 million Twitter posts by Fréderic Godin (available at http://www.fredericgodin.com/software/)


In [1]:
import gensim
import pymongo
import json
import numpy as np
import pandas as pd
from pymongo import MongoClient


/Users/wwymak/anaconda/lib/python3.5/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn("Pattern library is not installed, lemmatization won't be available.")

In [8]:
import requests

In [55]:
from gensim import corpora, models, similarities

In [2]:
mongoClient = MongoClient()
db = mongoClient.data4democracy
tweets_collection = db.tweets

In [19]:
from gensim.models.word2vec import Word2Vec
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import smart_open, simple_preprocess
def tokenize(text):
    # simple_preprocess lowercases the text and strips punctuation and very short tokens;
    # then drop common English stopwords
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

In [5]:
tweets_model = Word2Vec.load_word2vec_format('../../../../Volumes/SDExternal2/word2vec_twitter_model/word2vec_twitter_model.bin', binary=True, unicode_errors='ignore')

In [14]:
#now calculate word similarities on the Twitter data, e.g.
tweets_model.most_similar('jewish')


Out[14]:
[('Jewish', 0.6926181316375732),
 ('hispanic', 0.6038353443145752),
 ('muslim', 0.5737464427947998),
 ('armenian', 0.5712549686431885),
 ('iranian', 0.5708979368209839),
 ('mormon', 0.567450761795044),
 ('mexican', 0.5669983625411987),
 ('protestant', 0.5593392252922058),
 ('Chaldean', 0.5580775737762451),
 ('asian', 0.5575110912322998)]
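
A couple more quick queries against the pre-trained model, as a sanity check. This is a hedged sketch: 'king', 'woman' and 'man' are just illustrative probe words and are assumed to be in the 400-million-tweet vocabulary.

In [ ]:
# cosine similarity between two terms in the pre-trained model
print(tweets_model.similarity('jewish', 'muslim'))
# the classic analogy test, using the older gensim API returned by load_word2vec_format
print(tweets_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=5))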

In [9]:
#to remind myself what a tweet is like:
r = requests.get('https://s3-us-west-2.amazonaws.com/discursive/2017/1/10/18/tweets-25.json')

In [10]:
tweets_collection = r.json()
print(tweets_collection[0])
#for text analysis, the 'text' field is the one of interest


{'description': "I Fuck Up... Just don't forget you Fuck Up Too.", 'original_name': 'Linda Suhler, Ph.D.', 'created': '2017-01-10 18:14:08', 'id_str': '818883641640177665', 'name': 'VFL2013', 'loc': None, 'retweet': 'Y', 'text': "RT @LindaSuhler: Can we hear from #MSM here?\n@MTV's @Ira Madison III Calls Jeff Sessions' Granddaughter 'Prop' Stolen from Toys R Us… ", 'original_id': 347627434, 'followers': 3098, 'hashtags': '["MSM"]', 'user_created': '2012-12-29 17:54:08', 'friends_count': 979, 'retweet_count': 0}

In [13]:
#the tweet text is in the 'text' field
print(tweets_collection[0]['text'])


RT @LindaSuhler: Can we hear from #MSM here?
@MTV's @Ira Madison III Calls Jeff Sessions' Granddaughter 'Prop' Stolen from Toys R Us… 

The following is a bit of experimentation/learning with gensim -- following along some tutorials on the gensim site to vectorize the text, compute TF-IDF weights, etc.


In [15]:
tweets_text_documents = [x['text'] for x in tweets_collection]

In [16]:
#quick check that the mapping was done correctly
tweets_text_documents[0]


Out[16]:
"RT @LindaSuhler: Can we hear from #MSM here?\n@MTV's @Ira Madison III Calls Jeff Sessions' Granddaughter 'Prop' Stolen from Toys R Us… "

In [20]:
#quick check of the tokenize function -- stopword removal included
tokenize(tweets_text_documents[0])


Out[20]:
['rt',
 'lindasuhler',
 'hear',
 'msm',
 'mtv',
 'ira',
 'madison',
 'iii',
 'calls',
 'jeff',
 'sessions',
 'granddaughter',
 'prop',
 'stolen',
 'toys']

In [36]:
tokenized_tweets = [[word for word in tokenize(x) if word != 'rt'] for x in tweets_text_documents]

In [37]:
tokenized_tweets[0]


Out[37]:
['lindasuhler',
 'hear',
 'msm',
 'mtv',
 'ira',
 'madison',
 'iii',
 'calls',
 'jeff',
 'sessions',
 'granddaughter',
 'prop',
 'stolen',
 'toys']

In [38]:
#construct a dictionary of the words in the tweets using gensim
# the dictionary is a mapping between words and their ids
tweets_dictionary = corpora.Dictionary(tokenized_tweets)

In [44]:
#save the dictionary for future reference
tweets_dictionary.save('temp/tweets_dictionary.dict')

In [49]:
#just a quick view of words and ids
dict(list(tweets_dictionary.token2id.items())[0:20])


Out[49]:
{'agend': 453,
 'aware': 865,
 'big': 1273,
 'coming': 908,
 'declare': 1042,
 'derekf': 575,
 'est': 1671,
 'hximdj': 570,
 'jb': 1127,
 'nabs': 1056,
 'nationalists': 321,
 'plan': 596,
 'qx': 880,
 'rw': 1347,
 'suspect': 1069,
 'thought': 752,
 'tlot': 378,
 'tries': 185,
 'vikingriver': 1448,
 'wdiemokb': 1504}

In [50]:
#convert tokenized documents to bag-of-words vectors
# compile corpus (each vector counts the number of times each word id appears)
tweet_corpus = [tweets_dictionary.doc2bow(x) for x in tokenized_tweets]
corpora.MmCorpus.serialize('temp/tweets_corpus.mm', tweet_corpus) # save for future ref
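
As a hedged aside, the saved dictionary and corpus can be reloaded in a later session like this (the paths match the save calls above):

In [ ]:
# reload the persisted dictionary and bag-of-words corpus
reloaded_dictionary = corpora.Dictionary.load('temp/tweets_dictionary.dict')
reloaded_corpus = corpora.MmCorpus('temp/tweets_corpus.mm')
print(reloaded_dictionary)
print(reloaded_corpus)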

In [51]:
tweets_tfidf_model = gensim.models.TfidfModel(tweet_corpus, id2word = tweets_dictionary)

In [53]:
tweets_tfidf_model[tweet_corpus]


Out[53]:
<gensim.interfaces.TransformedCorpus at 0x2a3b6bc88>
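
The TransformedCorpus above is lazy, so nothing is computed until it is iterated over. A hedged sketch of inspecting the TF-IDF weights for the first tweet, mapping word ids back to tokens:

In [ ]:
# TF-IDF weights for the first tweet, highest weight first
first_tweet_tfidf = tweets_tfidf_model[tweet_corpus[0]]
print(sorted(((tweets_dictionary[word_id], round(weight, 3)) for word_id, weight in first_tweet_tfidf),
             key=lambda pair: -pair[1]))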

In [56]:
#Create similarity matrix of all tweets
'''note from gensim docs: The class similarities.MatrixSimilarity is only appropriate when 
   the whole set of vectors fits into memory. For example, a corpus of one million documents 
   would require 2GB of RAM in a 256-dimensional LSI space, when used with this class.
   Without 2GB of free RAM, you would need to use the similarities.Similarity class.
   This class operates in fixed memory, by splitting the index across multiple files on disk, 
   called shards. It uses similarities.MatrixSimilarity and similarities.SparseMatrixSimilarity internally,
   so it is still fast, although slightly more complex.'''
index = similarities.MatrixSimilarity(tweets_tfidf_model[tweet_corpus]) 
index.save('temp/tweetsSimilarity.index')
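
For a corpus too big to fit in memory, the sharded similarities.Similarity class mentioned in the docstring above would be the safer choice. A minimal sketch, assuming 'temp/shard' is an acceptable prefix for the on-disk shard files:

In [ ]:
# disk-backed alternative to MatrixSimilarity; num_features must match the dictionary size
sharded_index = similarities.Similarity('temp/shard', tweets_tfidf_model[tweet_corpus],
                                        num_features=len(tweets_dictionary))
sharded_index.save('temp/tweetsSimilaritySharded.index')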

In [62]:
#get similarity matrix between docs: https://groups.google.com/forum/#!topic/gensim/itYEaOYnlEA
#and check that the similarity matrix is what you expect
tweets_similarity_matrix = np.array(index)
print(tweets_similarity_matrix.shape)


(500, 500)
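
One more hedged check: MatrixSimilarity normalises the vectors, so each non-empty tweet should have similarity close to 1 with itself, which makes the diagonal a quick sanity check.

In [ ]:
# min/max of the diagonal -- values near 1 mean each tweet matches itself as expected
diag = np.diag(tweets_similarity_matrix)
print(diag.min(), diag.max())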

In [70]:
#save the similarity matrix and associated tweets to json
#work in progress -- use t-SNE to visualise the tweets to see if there's any clustering
outputDict = {'tweets' : [{'text': x['text'], 'id': x['id_str'], 'user': x['original_name']} for x in tweets_collection], 'matrix': tweets_similarity_matrix.tolist()}
with open('temp/tweetSimilarity.json', 'w') as f:
    json.dump(outputDict, f)
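
A rough sketch of the t-SNE idea mentioned above, assuming scikit-learn is available: convert the cosine similarities to a crude distance (1 - similarity) and embed with the precomputed metric. This is exploratory only, not part of the pipeline above.

In [ ]:
from sklearn.manifold import TSNE

# clip to [0, 1] then turn cosine similarity into a distance matrix
distance_matrix = 1 - np.clip(tweets_similarity_matrix, 0, 1)
tsne = TSNE(n_components=2, metric='precomputed', init='random', random_state=42)
tweet_coords = tsne.fit_transform(distance_matrix)
print(tweet_coords.shape)  # one 2D point per tweet, ready for plotting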

In [77]:
#back to the word2vec idea, use min_count=1 since corpus is tiny
tweets_collected_model = gensim.models.Word2Vec(tokenized_tweets, min_count=1)

In [79]:
#looking again at the term 'jewish', this time in the model trained on our small tweet collection (only ~500 tweets, so the neighbours below are noisy)
tweets_collected_model.most_similar('jewish')


Out[79]:
[('blocked', 0.3967747688293457),
 ('wall', 0.3779504895210266),
 ('blackanddecker', 0.36672842502593994),
 ('white', 0.33267900347709656),
 ('campaign', 0.3270014524459839),
 ('product', 0.3262655735015869),
 ('jail', 0.32027339935302734),
 ('nytimes', 0.3143633008003235),
 ('maga', 0.3098933696746826),
 ('community', 0.30887705087661743)]

The next step is to loop through the data on S3 and build up a bigger corpus of tweets from the Discursive project.
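
A hedged sketch of what that loop might look like, assuming the other files follow the same tweets-N.json naming under the same S3 prefix as the file fetched earlier (the prefix and range here are guesses):

In [ ]:
# accumulate tokenized tweets from several S3 files and retrain word2vec on the bigger corpus
base_url = 'https://s3-us-west-2.amazonaws.com/discursive/2017/1/10/18/tweets-{}.json'
all_tokenized_tweets = []
for i in range(1, 26):
    resp = requests.get(base_url.format(i))
    if resp.status_code != 200:
        continue  # skip files that don't exist
    all_tokenized_tweets.extend(
        [word for word in tokenize(tweet['text']) if word != 'rt'] for tweet in resp.json())

bigger_tweets_model = gensim.models.Word2Vec(all_tokenized_tweets, min_count=5)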