This notebook compares the documents published by the Executive Office of the President (of the United States of America) from January 20, 2017, to December 8, 2017, with President Trump's tweets during the same time period. The data wrangling steps can be found in this GitHub repo (https://github.com/mtchem/Twitter-Politics/blob/master/Data_Wrangle.ipynb).
In [1]:
# imports
import pandas as pd
import numpy as np
import itertools
# imports for cosine similarity with NMF
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.feature_extraction import text
# imports for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# special matplotlib import for improved in-notebook plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")
In [2]:
# load federal document data from pickle file
fed_reg_data = r'data/fed_reg_data.pickle'
fed_data = pd.read_pickle(fed_reg_data)
# load twitter data from pickle file
twitter_file_path = r'data/twitter_01_20_17_to_3-2-18.pickle'
twitter_data = pd.read_pickle(twitter_file_path)
In [3]:
# Change the index (date), to a column
fed_data['date'] = fed_data.index
twitter_data['date'] = twitter_data.index
In [4]:
# keep text strings and rename columns
fed = fed_data[['str_text', 'date']].rename({'str_text': 'texts'}, axis = 'columns')
tweet = twitter_data[['text', 'date']].rename({'text': 'texts'}, axis = 'columns')
# Add a label for the type of document (Tweet = 0, Fed = 1)
tweet['label'] = 0
fed['label'] = 1
# concatenate the dataframes
comb_text = pd.concat([fed,tweet])
# Re-index so that each doc has a unique ID number
comb_text = comb_text.reset_index()
comb_text['ID'] = range(0,len(comb_text))
# Reorder the columns and look at the dataframe to make sure it worked
comb_text = comb_text[['texts','date','label', 'ID']]
comb_text.head(3)
Out[4]:
Computers cannot understand text the way humans do, so in order to analyze the text data, I first need to make every word a feature (column) in an array, where each document (row) is represented by the weighted* frequency of each word it contains. An example text and array are shown below.
Using scikit-learn to create a word-frequency array:
*Weighting the word frequencies ensures that very frequently used, domain-specific words are considered less important during the analysis.
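As a minimal sketch of this idea, here is a TfidfVectorizer applied to a few made-up example sentences (not the project data); each document becomes a row of weighted word frequencies:

# minimal sketch with made-up example texts, not the project data
from sklearn.feature_extraction.text import TfidfVectorizer

example_texts = ['the border wall is very important',
                 'healthcare reform is very important',
                 'the committee reviewed the healthcare policy']

example_tfidf = TfidfVectorizer(stop_words='english')
example_mat = example_tfidf.fit_transform(example_texts)

# each row is a document, each column a word, each value a weighted word frequency
print(example_tfidf.get_feature_names())
print(example_mat.toarray().round(2))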
In [5]:
# nonsense words, standard words like proclamation, and dates
more_stop = set(['presidential', 'documents', 'therfore','i','donald', 'j', 'trump', 'president', 'order',
'authority', 'vested', 'articles','january','february','march','april','may','june','july','august','september','october',
'november','december','jan','feb','mar','apr','jun','jul','aug','sep','oct','nov','dec',
'2017','2018','act','agencies','agency','wh','rtlwanjjiq','pmgil08opp','blkgzkqemw','qcdljff3wn','erycjgj23r ','fzep1e9mo7','m0hmpbuz6c','rdo6jt2pip','kyv866prde','aql4jlvndh',
'tx5snacaas','t0eigo6lp8','jntoth0mol','8b8aya7v1s', 'x25t9tqani','q7air0bum2','ypfvhtq8te','ejxevz3a1r','1zo6zc2pxt',
'strciewuws','lhos4naagl','djlzvlq6tj', 'theplumlinegs', '3eyf3nir4b','cbewjsq1a3','lvmjz9ax0u',
'dw0zkytyft','sybl47cszn','6sdcyiw4kt','¼ï','yqf6exhm7x','cored8rfl2','6xjxeg1gss','dbvwkddesd',
'ncmsf4fqpr','twunktgbnb','ur0eetseno','ghqbca7yii','cbqrst4ln4','c3zikdtowc','6snvq0dzxn','ekfrktnvuy',
'k2jakipfji','œthe ','p1fh8jmmfa','vhmv7qoutk','mkuhbegzqs','ajic3flnki','mvjbs44atr',
'wakqmkdpxa','e0bup1k83z','ðÿ','ºðÿ','µðÿ','eqmwv1xbim','hlz48rlkif','td0rycwn8c','vs4mnwxtei','75wozgjqop',
'e1q36nkt8g','u8inojtf6d','rmq1a5bdon','5cvnmhnmuh','pdg7vqqv6m','s0s6xqrjsc','5cvnmhnmuh','wlxkoisstg',
'tmndnpbj3m','dnzrzikxhd','4qckkpbtcr','x8psdeb2ur','fejgjt4xp9','evxfqavnfs','aty8r3kns2','pdg7vqqv6m','nqhi7xopmw',
'lhos4naagl','32tfova4ov','zkyoioor62','np7kyhglsv','km0zoaulyh','kwvmqvelri','pirhr7layt',
'v3aoj9ruh4','https','cg4dzhhbrv','qojom54gy8','75wozgjqop','aty8r3kns2','nxrwer1gez','rvxcpafi2a','vb0ao3s18d',
'qggwewuvek','ddi1ywi7yz','r5nxc9ooa4','6lt9mlaj86','1jb53segv4','vhmv7qoutk','i7h4ryin3h',
'aql4jlvndh','yfv0wijgby','nonhjywp4j','zomixteljq','iqum1rfqso','2nl6slwnmh','qejlzzgjdk',
'p3crvve0cy','s0s6xqrjsc','gkockgndtc','2nl6slwnmh','zkyoioor62','clolxte3d4','iqum1rfqso',
'msala9poat','p1f12i9gvt','mit2lj7q90','qejlzzgjdk','pjldxy3hd9','vjzkgtyqb9','b2nqzj53ft',
'tpz7eqjluh','enyxyeqgcp','avlrroxmm4','2kuqfkqbsx','kwvmqvelri','œi','9lxx1iqo7m','vdtiyl0ua7',
'dmhl7xieqv','3jbddn8ymj','gysxxqazbl','ðÿž','tx5snacaas','4igwdl4kia','kqdbvxpekk','1avysamed4',
'cr4i8dvunc','bsp5f3pgbz','rlwst30gud','rlwst30gud','g4elhh9joh', '2017', 'January', 'kuqizdz4ra',
'nvdvrrwls4','ymuqsvvtsb', 'rgdu9plvfk','bk7sdv9phu','b5qbn6llze','xgoqphywrt ','hscs4y9zjk ',
'soamdxxta8','erycjgj23r','ryyp51mxdq','gttk3vjmku','j882zbyvkj','9pfqnrsh1z','ubbsfohmm7',
'xshsynkvup','xwofp9z9ir','1iw7tvvnch','qeeknfuhue','riqeibnwk2','seavqk5zy5','7ef6ac6kec',
'htjhrznqkj','8vsfl9mzxx','xgoqphywrt','zd0fkfvhvx','apvbu2b0jd','mstwl628xe','4hnxkr3ehw','mjij7hg3eu',
'1majwrga3d','x6fuuxxyxe','6eqfmrzrnv','h1zi5xrkeo','kju0moxchk','trux3wzr3u','suanjs6ccz',
'ecf5p4hjfz','m5ur4vv6uh','8j7y900vgk','7ef6ac6kec','d0aowhoh4x','aqqzmt10x7','zauqz4jfwv',
'bmvjz1iv2a','gtowswxinv','1w3lvkpese','8n4abo9ihp','f6jo60i0ul','od7l8vpgjq','odlz2ndrta',
'9tszrcc83j','6ocn9jfmag','qyt4bchvur','wkqhymcya3','tp4bkvtobq','baqzda3s2e','March','April',
'op2xdzxvnc','d7es6ie4fy','proclamation','hcq9kmkc4e','rf9aivvb7g','sutyxbzer9','s0t3ctqc40','aw0av82xde'])
# defines all stop words
my_stop = text.ENGLISH_STOP_WORDS.union(more_stop)
In [6]:
# Instantiate TfidfVectorizer to remove the stop words defined above, and any word used in more than 99% of the documents
tfidf = TfidfVectorizer(stop_words = my_stop , max_df = 0.99)
In [7]:
# create matrix representation of all documents
text_mat = tfidf.fit_transform(comb_text.texts)
In [8]:
# make a list of feature words
words = tfidf.get_feature_names()
In [9]:
# print the stop words excluded from the matrix features
print(tfidf.get_stop_words())
Model the documents with non-negative matrix factorization (NMF):
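NMF approximates the TF-IDF matrix as the product of two smaller non-negative matrices: a document-by-topic matrix (the features returned by transform) and a topic-by-word matrix (components_). A minimal sketch on a small random matrix, just to show the shapes involved (the numbers here are made up, not the project data):

# minimal sketch on a small random non-negative matrix, not the actual TF-IDF matrix
import numpy as np
from sklearn.decomposition import NMF

toy_mat = np.random.rand(6, 10)              # 6 "documents" x 10 "words", non-negative values
toy_model = NMF(n_components=3, init='nndsvd')
toy_features = toy_model.fit_transform(toy_mat)

print(toy_features.shape)                    # (6, 3)  document-by-topic features
print(toy_model.components_.shape)           # (3, 10) topic-by-word components
# the product of the two factors approximately reconstructs the original matrix
print(np.linalg.norm(toy_mat - toy_features @ toy_model.components_))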
In [10]:
# instantiate model
NMF_model = NMF(n_components=260 , init = 'nndsvd')
In [11]:
# fit the model
NMF_model.fit(text_mat)
Out[11]:
In [12]:
# transform the text frequency matrix using the fitted NMF model
nmf_features = NMF_model.transform(text_mat)
In [13]:
# create a dataframe with words as columns, NMF components as rows
components_df = pd.DataFrame(NMF_model.components_, columns = words)
Using the components dataframe, create a dictionary with components as keys and their top words as values:
In [14]:
# create dictionary with the key = component, value = top 5 words
topic_dict = {}
for i in range(0, 260):
    component = components_df.iloc[i, :]
    topic_dict[i] = component.nlargest()
In [15]:
# look at a few of the component topics
print(topic_dict[0].index)
print(topic_dict[7].index)
The informal and irregular grammar used in tweets makes a direct word-for-word comparison with documents published by the Executive Office, which use formal vocabulary and grammar, difficult. Therefore, instead of comparing words directly, I will use cosine similarity, which measures how closely two feature vectors point in the same direction. A higher cosine similarity between two documents indicates greater topic similarity.
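For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths; once normalize() scales each feature row to unit length, it reduces to a plain dot product, which is why the cells below only need df_norms.dot(row). A sketch with two hypothetical feature vectors (not actual NMF features):

# sketch with two hypothetical feature vectors, not actual NMF features
import numpy as np

a = np.array([0.2, 0.0, 0.9])
b = np.array([0.1, 0.1, 0.8])

# cosine similarity from the definition
print(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# the same value from the dot product of the unit-length (normalized) vectors
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(a_unit.dot(b_unit))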
Calculating cosine similarities of NMF features:
In [16]:
# normalize the previously found NMF features
norm_features = normalize(nmf_features)
In [17]:
# dataframe of the documents' NMF features, where rows are documents and columns are NMF components
df_norms = pd.DataFrame(norm_features)
In [18]:
# initialize empty dictionary
similarity_dict = {}
# loop through each row of the df_norms dataframe
for i in range(len(norm_features)):
    # isolate one row, by ID number
    row = df_norms.loc[i]
    # calculate the top cosine similarities
    top_sim = (df_norms.dot(row)).nlargest()
    # append results to dictionary
    similarity_dict[i] = (top_sim.index, top_sim)
Using the results of part 2C, find which types of documents are the most similar, then sum their labels (0 = tweet, 1 = federal document). If the similar documents are a mix of tweets and federal documents, then the sum of their labels will be 1, 2, 3, or 4.
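For example (with made-up labels, not actual document IDs), if one document's five most similar documents carry the labels below, the sum flags the group as a mix:

# illustrative example with made-up labels (0 = tweet, 1 = federal document)
toy_labels = [0, 1, 1, 0, 0]   # labels of one document's five most similar documents
print(sum(toy_labels))         # 2 -> a mix of tweets and federal documents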
In [19]:
# dataframe with document ID and labels
doc_label_df = comb_text[['label', 'ID']].copy().set_index('ID')
In [20]:
# initialize lists for the label sums and similarity score sums of the similar documents
label_sums = []
similarity_score_sum = []
# loop through all of the documents
for doc_num in doc_label_df.index:
    # sum the similarity scores
    similarity_sum = similarity_dict[doc_num][1].sum()
    similarity_score_sum.append(similarity_sum)
    # find the list of similar documents
    similar_doc_ID_list = list(similarity_dict[doc_num][0])
    # loop through the labels of the similar documents
    s_label = 0
    for ID_num in similar_doc_ID_list:
        # sum the label values for each similar document
        s_label = s_label + doc_label_df.loc[ID_num].label
    # append the sum of the labels for ONE document
    label_sums.append(s_label)
In [21]:
# add the similarity score sum to dataframe as separate column
doc_label_df['similarity_score_sum'] = similarity_score_sum
# add the similar document's summed label value to the dataframe as a separate column
doc_label_df['sum_of_labels'] = label_sums
Isolate documents with mixed types of similar documents and high similarity scores
In [22]:
# Filter dataframe for federal documents with similar tweets, and vice versa
df_filtered = doc_label_df[(doc_label_df['sum_of_labels'] != 0) & (doc_label_df['sum_of_labels'] != 5)].copy().reset_index()
# Make sure it worked
print(df_filtered.head())
print(len(df_filtered))
In [23]:
# Look at the documents where all of the top 5 similar documents have a cosine similarity score of 0.9 or above.
# Since the top 5 includes the document itself (score 1.0), the sum of the scores needs to be 4.6 or higher
similar_score_min = 4.6
highly_similar = df_filtered[df_filtered.similarity_score_sum >= similar_score_min]
Remove duplicate highly similar groups
In [24]:
# create a list of all the group lists
doc_groups = []
for doc_id in highly_similar.ID:
    doc_groups.append(sorted(list(similarity_dict[doc_id][0])))
# make the interior lists tuples, then make a set of them
unique_groups = set([tuple(x) for x in doc_groups])
In [25]:
unique_groups
Out[25]:
Four of the five unique groups are essentially the same:
In [26]:
print(comb_text.texts.loc[1892])
print(comb_text.texts.loc[27])
There do seem to be some general similarities between President Trump's tweets and official federal action. However, the topics are quite vague; for example, tweets about specific White House officials are grouped with the federal documents that define who sits on different committees in the Executive Office.