Comparing President Trump's Tweets and Executive Office Activity using NLP


This notebook compares documents published by the Executive Office of the President (of the United States of America) from January 20, 2017 to December 8, 2017 with President Trump's tweets from the same time period. The data wrangling steps can be found in this GitHub repo: https://github.com/mtchem/Twitter-Politics/blob/master/Data_Wrangle.ipynb



In [1]:
# imports
import pandas as pd
import numpy as np
import itertools
# imports for cosine similarity with NMF
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.feature_extraction import text 
# imports for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# special matplotlib settings for improved in-notebook plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")

Part 1: Data Wrangle

Load and transform the data for analysis



In [2]:
# load federal document data from pickle file
fed_reg_data = r'data/fed_reg_data.pickle'
fed_data = pd.read_pickle(fed_reg_data)
# load twitter data from pickle file
twitter_file_path = r'data/twitter_01_20_17_to_3-2-18.pickle'
twitter_data = pd.read_pickle(twitter_file_path)



In [3]:
# Change the index (date) to a column
fed_data['date'] = fed_data.index
twitter_data['date'] = twitter_data.index

Combine data for analysis

Create a dataframe that contains:

  • Each document, from both data sets, as a string
  • The date the text was published
  • A label for the type of document (0 = Twitter doc, 1 = federal doc)


In [4]:
# keep text strings and rename columns
fed = fed_data[['str_text', 'date']].rename({'str_text': 'texts'}, axis = 'columns')
tweet = twitter_data[['text', 'date']].rename({'text': 'texts'}, axis = 'columns')

# Add a label for the type of document (Tweet = 0, Fed = 1)
tweet['label'] = 0
fed['label'] = 1

# concatenate the dataframes
comb_text = pd.concat([fed,tweet])

# Re-index so that each doc has a unique ID number
comb_text = comb_text.reset_index()
comb_text['ID'] = range(0,len(comb_text))

# Reorder the columns and look at the dataframe to make sure it worked
comb_text = comb_text[['texts','date','label', 'ID']]
comb_text.head(3)


Out[4]:
texts date label ID
0 Federal Register / Vol. 82, No. 161 / Tues... 2017-08-22 1 0
1 Federal Register / Vol. 82, No. 188 / Frid... 2017-09-29 1 1
2 42706 \n\nFederal Register / Vol. 82, No. ... 2017-09-11 1 2

Transform text data into a word-frequency array

Computers cannot understand text the way humans do, so in order to analyze the text data I first need to make every word a feature (column) in an array, where each document (row) is represented by the weighted* frequency of each word it contains. A toy example text and the resulting array are shown below.

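As a toy illustration (not part of the original notebook; the two example texts are made up), here is how a pair of short documents becomes a weighted word-frequency array:

# Toy example: two made-up texts become a tf-idf weighted word-frequency array,
# with one row per document and one column per word.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

toy_texts = ['the border wall is big', 'the tax cut is big']
toy_tfidf = TfidfVectorizer(stop_words='english')
toy_mat = toy_tfidf.fit_transform(toy_texts)
print(pd.DataFrame(toy_mat.toarray(), columns=toy_tfidf.get_feature_names()))
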
Using Scikit Learn to create a word-frequency array:

  • Define a list of stop words (nonsense or non-meaningful words, such as 'the', 'a', 'of', 'q34fqwer3')
  • Instantiate a tf-idf object (term frequency-inverse document frequency reweighting) that removes the stop words and filters out any word that appears in more than 99% of the documents
  • Create a matrix representation of the documents
  • Create a list of the words each feature (column) represents
  • Print a list of the excluded words

*The tf-idf weighting lowers the importance of words that appear in a large fraction of the documents (for example, very frequently used domain-specific words) during the analysis


In [5]:
# nonsense words, and standard words like proclamation and dates
more_stop = set(['presidential', 'documents', 'therfore','i','donald', 'j', 'trump', 'president', 'order', 
                 'authority', 'vested', 'articles','january','february','march','april','may','june','july','august','september','october',
                 'november','december','jan','feb','mar','apr','jun','jul','aug','sep','oct','nov','dec',
                 '2017','2018','act','agencies','agency','wh','rtlwanjjiq','pmgil08opp','blkgzkqemw','qcdljff3wn','erycjgj23r ','fzep1e9mo7','m0hmpbuz6c','rdo6jt2pip','kyv866prde','aql4jlvndh',
             'tx5snacaas','t0eigo6lp8','jntoth0mol','8b8aya7v1s', 'x25t9tqani','q7air0bum2','ypfvhtq8te','ejxevz3a1r','1zo6zc2pxt',
             'strciewuws','lhos4naagl','djlzvlq6tj', 'theplumlinegs', '3eyf3nir4b','cbewjsq1a3','lvmjz9ax0u',
             'dw0zkytyft','sybl47cszn','6sdcyiw4kt','¼ï','yqf6exhm7x','cored8rfl2','6xjxeg1gss','dbvwkddesd',
             'ncmsf4fqpr','twunktgbnb','ur0eetseno','ghqbca7yii','cbqrst4ln4','c3zikdtowc','6snvq0dzxn','ekfrktnvuy',
             'k2jakipfji','œthe ','p1fh8jmmfa','vhmv7qoutk','mkuhbegzqs','ajic3flnki','mvjbs44atr',
             'wakqmkdpxa','e0bup1k83z','ðÿ','ºðÿ','µðÿ','eqmwv1xbim','hlz48rlkif','td0rycwn8c','vs4mnwxtei','75wozgjqop',
             'e1q36nkt8g','u8inojtf6d','rmq1a5bdon','5cvnmhnmuh','pdg7vqqv6m','s0s6xqrjsc','5cvnmhnmuh','wlxkoisstg',
             'tmndnpbj3m','dnzrzikxhd','4qckkpbtcr','x8psdeb2ur','fejgjt4xp9','evxfqavnfs','aty8r3kns2','pdg7vqqv6m','nqhi7xopmw',
             'lhos4naagl','32tfova4ov','zkyoioor62','np7kyhglsv','km0zoaulyh','kwvmqvelri','pirhr7layt',
             'v3aoj9ruh4','https','cg4dzhhbrv','qojom54gy8','75wozgjqop','aty8r3kns2','nxrwer1gez','rvxcpafi2a','vb0ao3s18d',
             'qggwewuvek','ddi1ywi7yz','r5nxc9ooa4','6lt9mlaj86','1jb53segv4','vhmv7qoutk','i7h4ryin3h',
             'aql4jlvndh','yfv0wijgby','nonhjywp4j','zomixteljq','iqum1rfqso','2nl6slwnmh','qejlzzgjdk',
             'p3crvve0cy','s0s6xqrjsc','gkockgndtc','2nl6slwnmh','zkyoioor62','clolxte3d4','iqum1rfqso',
             'msala9poat','p1f12i9gvt','mit2lj7q90','qejlzzgjdk','pjldxy3hd9','vjzkgtyqb9','b2nqzj53ft',
             'tpz7eqjluh','enyxyeqgcp','avlrroxmm4','2kuqfkqbsx','kwvmqvelri','œi','9lxx1iqo7m','vdtiyl0ua7',
             'dmhl7xieqv','3jbddn8ymj','gysxxqazbl','ðÿž','tx5snacaas','4igwdl4kia','kqdbvxpekk','1avysamed4',
             'cr4i8dvunc','bsp5f3pgbz','rlwst30gud','rlwst30gud','g4elhh9joh', '2017', 'January', 'kuqizdz4ra', 
             'nvdvrrwls4','ymuqsvvtsb', 'rgdu9plvfk','bk7sdv9phu','b5qbn6llze','xgoqphywrt ','hscs4y9zjk ',
             'soamdxxta8','erycjgj23r','ryyp51mxdq','gttk3vjmku','j882zbyvkj','9pfqnrsh1z','ubbsfohmm7',
             'xshsynkvup','xwofp9z9ir','1iw7tvvnch','qeeknfuhue','riqeibnwk2','seavqk5zy5','7ef6ac6kec',
             'htjhrznqkj','8vsfl9mzxx','xgoqphywrt','zd0fkfvhvx','apvbu2b0jd','mstwl628xe','4hnxkr3ehw','mjij7hg3eu',
             '1majwrga3d','x6fuuxxyxe','6eqfmrzrnv','h1zi5xrkeo','kju0moxchk','trux3wzr3u','suanjs6ccz',
             'ecf5p4hjfz','m5ur4vv6uh','8j7y900vgk','7ef6ac6kec','d0aowhoh4x','aqqzmt10x7','zauqz4jfwv',
             'bmvjz1iv2a','gtowswxinv','1w3lvkpese','8n4abo9ihp','f6jo60i0ul','od7l8vpgjq','odlz2ndrta',
             '9tszrcc83j','6ocn9jfmag','qyt4bchvur','wkqhymcya3','tp4bkvtobq','baqzda3s2e','March','April',
             'op2xdzxvnc','d7es6ie4fy','proclamation','hcq9kmkc4e','rf9aivvb7g','sutyxbzer9','s0t3ctqc40','aw0av82xde'])
# defines all stop words
my_stop = text.ENGLISH_STOP_WORDS.union(more_stop)

In [6]:
# Instantiate TfidfVectorizer to remove the stop words and any word used in more than 99% of the documents
tfidf = TfidfVectorizer(stop_words = my_stop , max_df = 0.99)

In [7]:
# create matrix representation of all documents
text_mat = tfidf.fit_transform(comb_text.texts)

In [8]:
# make a list of feature words
words = tfidf.get_feature_names()

Excluded Words

Below is a printed list of all of the excluded words. I include this because I am not a political scientist or a linguist. What I consider to be nonsense may be important, so you may want to modify this list.


In [9]:
# print excluded words from the matrix features
print(tfidf.get_stop_words())


frozenset({'enough', 'thus', 'a', 'kuqizdz4ra', 'might', 'have', 'bmvjz1iv2a', 'become', 'thereupon', 'thin', 'found', 'nowhere', 'around', 'could', 'kyv866prde', 'ddi1ywi7yz', 'move', 'you', 'can', 'nxrwer1gez', 'm0hmpbuz6c', 'ypfvhtq8te', 'he', 'four', 'eg', '2018', 'amoungst', 'no', 's0s6xqrjsc', 'find', 'to', 'ur0eetseno', 'whether', 'x25t9tqani', 'any', 'sometime', 'evxfqavnfs', '2nl6slwnmh', 'cry', 'kqdbvxpekk', 'dbvwkddesd', 'system', 'k2jakipfji', 'htjhrznqkj', 'within', 'even', 'sutyxbzer9', 'riqeibnwk2', 'along', 'sometimes', 'before', 'down', 'top', 'seeming', 'therfore', 'pmgil08opp', 'will', 'wkqhymcya3', 'ten', 'whenever', 'three', 'dec', 'those', 'µðÿ', 'some', 'feb', 'sep', 'cbqrst4ln4', 'b2nqzj53ft', 'eqmwv1xbim', 'u8inojtf6d', 'together', 'co', 'wakqmkdpxa', 'km0zoaulyh', 'xgoqphywrt', 'now', 'them', 'fzep1e9mo7', 'vdtiyl0ua7', 'against', 'much', 'ourselves', 'up', 'april', 'be', 'is', 'third', 'kwvmqvelri', 'name', 'strciewuws', 'per', 'whereafter', 'nvdvrrwls4', 'empty', 'always', 'my', 'who', 'him', 'our', 'trump', 'iqum1rfqso', 'across', 'september', 'np7kyhglsv', 'therein', 'been', 'qojom54gy8', 'aty8r3kns2', 'de', 'fire', 'because', 'since', 'qejlzzgjdk', 'ours', 'avlrroxmm4', 'documents', 'during', 'one', 'odlz2ndrta', 'as', 'below', 'r5nxc9ooa4', 'january', 'though', 'g4elhh9joh', 'e0bup1k83z', 'cg4dzhhbrv', 'x6fuuxxyxe', 'd0aowhoh4x', 'qeeknfuhue', 'formerly', 'order', 'ie', 'dmhl7xieqv', 'twelve', 'its', 'that', 'this', 'eleven', 'wh', 'xshsynkvup', 'very', 'next', 'dnzrzikxhd', 'yfv0wijgby', 'op2xdzxvnc', 'aw0av82xde', 'too', 'yourselves', 'clolxte3d4', 'his', 'november', 'elsewhere', 'on', 'which', 'into', 'January', 'many', 'about', 'bill', 'donald', '¼ï', '4qckkpbtcr', 'fejgjt4xp9', 'amount', 'c3zikdtowc', '6lt9mlaj86', 'whence', 'hereupon', 'rather', 'mit2lj7q90', 'and', 'ryyp51mxdq', 'each', 'b5qbn6llze', 'again', 'are', 'most', 'out', 'jntoth0mol', 'august', 'presidential', 'agency', '6sdcyiw4kt', 'f6jo60i0ul', 'six', 'the', 'yourself', 'noone', 'lvmjz9ax0u', 'himself', 'couldnt', 'except', 'go', 'off', 'see', 'was', 'when', '8n4abo9ihp', 'herself', 'hence', 'of', 'had', 'were', 'mostly', 'why', 'vs4mnwxtei', 'tpz7eqjluh', '8j7y900vgk', 'least', 'nov', 'zd0fkfvhvx', '2kuqfkqbsx', 'vjzkgtyqb9', 'whereby', 'xwofp9z9ir', 'proclamation', 'has', 'so', 'onto', 'themselves', 'namely', 'after', 'anything', 'became', 'full', 'moreover', 'thru', 'also', 'enyxyeqgcp', 'cr4i8dvunc', 'od7l8vpgjq', 'perhaps', 'then', 'ejxevz3a1r', 'nobody', 'give', 'it', 'latter', 'describe', 'over', 'there', 'herein', 'without', 'not', 're', 'i7h4ryin3h', 'meanwhile', 'ncmsf4fqpr', 'becoming', 'if', 'therefore', 'further', '9lxx1iqo7m', 'baqzda3s2e', '9pfqnrsh1z', 'beforehand', 'may', 'where', 'yqf6exhm7x', 'hscs4y9zjk ', 'apvbu2b0jd', 'qyt4bchvur', 'yours', 'alone', 'wherein', 'march', 'agencies', 'among', 'former', 'in', 'back', 'anyone', '6ocn9jfmag', 'none', 'erycjgj23r', 'beside', 'qcdljff3wn', 'mvjbs44atr', 'vhmv7qoutk', 'gtowswxinv', 'please', 'toward', 'indeed', 'aug', '3jbddn8ymj', 'between', 'œthe ', 'besides', 'often', 'several', 'authority', 'rtlwanjjiq', 'trux3wzr3u', 'nevertheless', 'aql4jlvndh', '7ef6ac6kec', 'rdo6jt2pip', 'theplumlinegs', 'jul', 'erycjgj23r ', 'both', 'apr', 'nonhjywp4j', 'am', 'act', 'p1f12i9gvt', 'hcq9kmkc4e', 'yet', 'few', 'more', '1jb53segv4', 'five', 'part', 'what', 'ymuqsvvtsb', 'everyone', 'their', 'mkuhbegzqs', 'own', 'done', 'front', 'your', 'how', 'us', 'same', 'gttk3vjmku', 'we', 'jun', 'ðÿ', 'mstwl628xe', 'itself', '1w3lvkpese', 'seems', 
'december', 'eight', 'tx5snacaas', 'wlxkoisstg', 'i', 'wherever', 'p1fh8jmmfa', 'whither', 'would', '75wozgjqop', 'an', 'still', 'must', 'vb0ao3s18d', 'these', 'zauqz4jfwv', 's0t3ctqc40', 'pdg7vqqv6m', 'for', 'july', 'nine', 'ghqbca7yii', 'rlwst30gud', 'beyond', 'keep', 'they', 'q7air0bum2', 'ltd', 'oct', 'somewhere', 'all', 'behind', 'anyhow', 'but', 'hundred', 'sixty', 'seem', 'whole', 'forty', 'j882zbyvkj', 'sybl47cszn', 'president', '6xjxeg1gss', 'ajic3flnki', 'call', 'mine', 'e1q36nkt8g', '1majwrga3d', 'h1zi5xrkeo', 'tp4bkvtobq', 'hasnt', 'nothing', '8vsfl9mzxx', 'here', 'zkyoioor62', 'others', 'take', 'ecf5p4hjfz', 'although', 'she', 'bottom', 'sincere', 'twenty', 'than', 'td0rycwn8c', 'however', 'myself', 'dw0zkytyft', 'd7es6ie4fy', 'something', 'afterwards', 'almost', 'œi', 'should', 'p3crvve0cy', 'blkgzkqemw', 'zomixteljq', 'from', 'only', 'another', 'well', 'twunktgbnb', '6snvq0dzxn', 'seavqk5zy5', 'put', 'ekfrktnvuy', 'gysxxqazbl', 'm5ur4vv6uh', 'aqqzmt10x7', 'soamdxxta8', 'hers', 'either', 'february', 'msala9poat', 'con', 'with', 'somehow', 'amongst', 'due', 'mill', 'someone', 'two', 'ðÿž', 'rgdu9plvfk', 'hereafter', 'me', 'seemed', 'nor', 'throughout', 'March', 'jan', 'inc', 'every', 'such', '4hnxkr3ehw', 'anywhere', 'fifty', 'whoever', '1avysamed4', 'djlzvlq6tj', 'gkockgndtc', 'while', 'do', '32tfova4ov', 'thereby', 'made', '4igwdl4kia', 'never', 'cannot', 'at', 'ubbsfohmm7', 'her', 'j', 'first', 'under', 'nqhi7xopmw', 'whereas', '9tszrcc83j', 'suanjs6ccz', 't0eigo6lp8', 'already', 'whose', 'by', 'pjldxy3hd9', 'fifteen', 'being', '1zo6zc2pxt', 'April', 'kju0moxchk', 'via', 'until', 'october', 'whereupon', 'otherwise', '5cvnmhnmuh', 'mjij7hg3eu', 'whatever', 'side', 'thereafter', 'less', 'once', 'ºðÿ', 'etc', 'hlz48rlkif', 'else', 'ever', 'qggwewuvek', 'fill', '3eyf3nir4b', '2017', 'anyway', 'serious', 'everything', 'tmndnpbj3m', 'neither', 'xgoqphywrt ', 'lhos4naagl', 'through', 'or', '6eqfmrzrnv', 'cbewjsq1a3', 'v3aoj9ruh4', 'mar', 'everywhere', 'get', 'vested', 'rmq1a5bdon', 'last', 'latterly', 'rf9aivvb7g', 'towards', 'whom', '8b8aya7v1s', 'upon', 'https', 'above', 'hereby', 'cored8rfl2', 'thence', '1iw7tvvnch', 'interest', 'other', 'cant', 'articles', 'un', 'thick', 'june', 'show', 'x8psdeb2ur', 'detail', 'rvxcpafi2a', 'pirhr7layt', 'bsp5f3pgbz', 'becomes', 'bk7sdv9phu'})

Part 2: Analysis

Use unsupervised machine learning to analyze both President Trump's tweets and official presidential actions, and explore any correlation between the two


Part 2A: Determine the documents' topics

Model the documents with non-negative matrix factorization (NMF):

  • Instantiate an NMF model with 260 components (1/10th the number of documents), initialized with Nonnegative Double Singular Value Decomposition (NNDSVD, which works better for sparse data)
  • Fit the model (learn the NMF decomposition of the tf-idf matrix)
  • Transform the tf-idf matrix with the fitted model to obtain the NMF features for each document
  • Make a dataframe with the NMF components for each word


In [10]:
# instantiate model
NMF_model = NMF(n_components=260 , init = 'nndsvd')

In [11]:
# fit the model
NMF_model.fit(text_mat)


Out[11]:
NMF(alpha=0.0, beta_loss='frobenius', init='nndsvd', l1_ratio=0.0,
  max_iter=200, n_components=260, random_state=None, shuffle=False,
  solver='cd', tol=0.0001, verbose=0)

In [12]:
# transform the text frequency matrix using the fitted NMF model
nmf_features = NMF_model.transform(text_mat)

In [13]:
# create a dataframe with words as a columns, NMF components as rows
components_df = pd.DataFrame(NMF_model.components_, columns = words)
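
As an optional sanity check (not in the original analysis): NMF factors the tf-idf matrix into document features (the rows of nmf_features) and word components (NMF_model.components_), so their product should roughly reconstruct the original matrix.

# Optional sanity check: the product of the document features (W) and the
# word components (H) approximates the original tf-idf matrix.
approx = nmf_features @ NMF_model.components_
print('approximation shape:', approx.shape)                    # (n_documents, n_words)
print('reconstruction error from fit:', NMF_model.reconstruction_err_)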

Part 2B: Find the top 5 topic words for each component

Using the components dataframe, create a dictionary with components as keys and their top topic words as values:

  • Make an empty dictionary and loop through each row of NMF components
  • Add an entry where the key is the NMF component and the value is that component's topic words (the column names with the largest component values)


In [14]:
# create dictionary with the key = component, value = top 5 words
topic_dict = {}
for i in range(0,260):
    component = components_df.iloc[i, :]
    topic_dict[i] = component.nlargest()

In [15]:
# look at a few of the component topics
print(topic_dict[0].index)
print(topic_dict[7].index)


Index(['states', 'united', '11', 'fr', '4790'], dtype='object')
Index(['shall', 'law', 'sec', 'federal', 'section'], dtype='object')

Part 2C: Cosine Similarity

The informal and irregular grammar used in tweets makes a direct word-for-word comparison with documents published by the Executive Office, which use formal vocabulary and grammar, difficult. Therefore, I will use cosine similarity, a metric that compares the orientation of feature vectors rather than the words themselves. A higher cosine similarity between two documents indicates greater topic similarity.

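For intuition, here is a minimal sketch (with made-up vectors) showing that, once the feature vectors are L2-normalized, the cosine similarity between two documents is just their dot product:

# Sketch: cosine similarity of two made-up feature vectors.
# After L2-normalization, the dot product equals cos(theta) between the vectors.
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 3.0, 1.0])
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
print(cosine, np.dot(a_norm, b_norm))   # the two values are identical
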
Calculating cosine similarities of the NMF features:

  • Normalize the NMF features (calculated in Part 2A)
  • Create a dataframe where each row contains a document's normalized NMF features and its ID number
  • Take each row (decomposed document) and calculate its cosine similarity to every other document's normalized NMF features
  • Create a dictionary where the key is the document ID and the value is a pandas Series of the 5 most similar documents (including itself)


In [16]:
# normalize the previously found NMF features
norm_features = normalize(nmf_features)

In [17]:
# dataframe of normalized NMF features, where rows are documents and columns are NMF components
df_norms = pd.DataFrame(norm_features)

In [18]:
# initialize empty dictionary
similarity_dict= {}
# loop through each row of the df_norms dataframe
for i in range(len(norm_features)):
    # isolate one row, by ID number
    row = df_norms.loc[i]
    # calculate the top cosine similarities
    top_sim = (df_norms.dot(row)).nlargest()
    # append results to dictionary
    similarity_dict[i] = (top_sim.index, top_sim)

Part 3: Use the cosine similarity results to explore how (or if) President Trump's tweets and official actions correlate


Part 3A: Find Twitter documents that have at least one federal document in their top 5 cosine similarity scores (and vice versa)

Using the results of Part 2C, find which documents are most similar to each document, then sum their labels (0 = Twitter, 1 = federal document). If the similar documents are a mix of tweets and federal documents, the sum of their labels will be 1, 2, 3, or 4, as illustrated by the short sketch after this list.

  • Create a dataframe with the document ID number as the index and the document type label (tweet = 0, fed_doc = 1) as a column
  • Loop through each document in the dataframe and use the similarity dictionary to find the list of the most similar document ID numbers and the sum of their similarity scores
  • For each list of similar documents, sum the document type labels. A sum of 1, 2, 3, or 4 means the group contains both tweets and federal documents

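A small made-up illustration of the label-sum idea (the labels below are invented, not taken from the data):

# Toy illustration: labels of a group of 5 similar documents (0 = tweet, 1 = federal doc).
# A sum of 0 or 5 means the group is all one type; 1-4 means the group is mixed.
toy_group_labels = [0, 1, 1, 0, 0]
toy_sum = sum(toy_group_labels)
print('mixed group' if 0 < toy_sum < 5 else 'single-type group')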

In [19]:
# dataframe with document ID and labels
doc_label_df = comb_text[['label', 'ID']].copy().set_index('ID')

In [20]:
# initialize lists for the summed labels and summed similarity scores of each document's similar documents
label_sums =[]
similarity_score_sum = []
# loop through all of the documents
for doc_num in doc_label_df.index:
    # sum the similarity scores
    similarity_sum = similarity_dict[doc_num][1].sum()
    similarity_score_sum.append(similarity_sum)
    
 
    #find the list of similar documents
    similar_doc_ID_list = list(similarity_dict[doc_num][0])    
    # loop through labels
    s_label = 0
    for ID_num in similar_doc_ID_list:
        # sum the label values for each similar document
        s_label = s_label + doc_label_df.loc[ID_num].label
        
    # append the sum of the labels for ONE document
    label_sums.append(s_label)

In [21]:
# add the similarity score sum to dataframe as separate column
doc_label_df['similarity_score_sum'] = similarity_score_sum

# add the similar document's summed label value to the dataframe as a separate column
doc_label_df['sum_of_labels'] = label_sums

Part 3B: Look at the topics of tweets that have similar federal documents (and vice versa)

Isolate documents whose similar documents are of mixed type and have high similarity scores:

  • Filter the dataframe to include only rows with a sum_of_labels value of 1, 2, 3, or 4
  • Filter again to include only groups with a high combined similarity score
  • Remove any duplicate groups


In [22]:
# Filter dataframe for federal documents with similar tweets, and vice versa
df_filtered = doc_label_df[(doc_label_df['sum_of_labels'] != 0) & (doc_label_df['sum_of_labels'] != 5)].copy().reset_index()

# Make sure it worked
print(df_filtered.head())
print(len(df_filtered))


   ID  label  similarity_score_sum  sum_of_labels
0   0      1              3.819105              2
1   1      1              3.324981              3
2   9      1              4.859847              1
3  18      1              3.339872              1
4  24      1              4.563310              4
293

In [23]:
# Keep groups where the other four similar documents each have a cosine similarity score of 0.9 or above.
# A document's similarity to itself is 1.0, so the sum of the scores needs to be at least 1.0 + 4*0.9 = 4.6
similar_score_min = 4.6
highly_similar = df_filtered[df_filtered.similarity_score_sum >= similar_score_min]

Remove duplicate highly similar groups


In [24]:
# create a list of all the group lists
doc_groups = []
for doc_id in highly_similar.ID:
    doc_groups.append(sorted(list(similarity_dict[doc_id][0])))

# make the interior lists tuples, then make a set of them
unique_groups = set([tuple(x) for x in doc_groups])

In [25]:
unique_groups


Out[25]:
{(4, 5, 20, 157, 2090),
 (4, 20, 2040, 2069, 2084),
 (9, 221, 227, 240, 2256),
 (9, 221, 1179, 1690, 2256),
 (9, 694, 820, 1690, 2256),
 (9, 694, 1690, 2256, 2578),
 (25, 127, 174, 863, 1696),
 (25, 174, 863, 1696, 1845),
 (28, 71, 229, 248, 2576),
 (28, 204, 229, 248, 2576),
 (28, 229, 233, 248, 2576),
 (28, 229, 248, 2571, 2576),
 (47, 84, 130, 1806, 2070),
 (49, 205, 428, 1578, 2312),
 (49, 205, 1578, 1917, 2312),
 (49, 205, 1672, 1917, 2312),
 (49, 363, 428, 1578, 2463),
 (71, 95, 229, 248, 2576),
 (74, 694, 820, 1838, 2508),
 (82, 1545, 1682, 1785, 2532),
 (84, 102, 131, 170, 2070),
 (102, 131, 170, 478, 2380),
 (131, 170, 478, 479, 2380),
 (131, 328, 478, 479, 2380),
 (131, 478, 479, 555, 2380),
 (170, 478, 479, 1806, 2070),
 (229, 248, 1526, 2571, 2576),
 (229, 248, 1743, 2020, 2576),
 (251, 497, 1689, 1778, 2120),
 (251, 1653, 1689, 1916, 2120),
 (260, 414, 922, 1135, 1180),
 (260, 414, 1093, 1135, 1180),
 (260, 773, 1093, 1118, 1135),
 (260, 1093, 1095, 1118, 1135),
 (260, 1093, 1118, 1135, 2111)}

Part 3C: Manually look at the documents. Are they similar?

Components = 100, highly similar cutoff score = 4.9

Four of the 5 unique groups are basically the same

    {(58, 80, 105, 149, 1139), (58, 80, 126, 149, 1139), (58, 80, 126, 185, 1139), (58, 80, 149, 185, 1139), (131, 170, 478, 479, 2044)}
Those documents (58, 80, 105, 126, 149, 185, 1139) are all about national emergencies. The fifth group is about national security and national emergencies.

Components = 260, highly similar cutoff score = 4.6

Six unique groups can be further distilled into one set (27, 28, 229, 248, 196, 203, 2576, 2546, 204, 1151, 1892)


In [26]:
print(comb_text.texts.loc[1892])
print(comb_text.texts.loc[27])


RT @VP: Our President is choosing to put American jobs, American consumers, American energy, and American industry first. https://t.co/y2Op…
Federal  Register 

Vol.  82,  No.  84 

Wednesday,  May  3,  2017 

Title  3— 

The  President 

Presidential Documents

20795 

Proclamation  9595  of  April  28,  2017 

Asian  American  and  Pacific  Islander  Heritage  Month,  2017 

By  the  President  of  the  United  States  of  America 

A  Proclamation 
This  month,  we  celebrate  Asian  American  and  Pacific  Islander  Heritage 
Month, and we recognize the achievements and contributions of Asian Ameri-
cans and Pacific Islanders that enrich our Nation. 
Asian  Americans  and  Pacific  Islanders  have  distinguished  themselves  in 
the  arts,  literature,  and  sports.  They  are  leading  researchers  in  science, 
medicine, and technology; dedicated teachers to our Nation’s children; inno-
vative  farmers  and  ranchers;  and  distinguished  lawyers  and  government 
leaders. 
Dr. Sammy Lee, a Korean American who passed away last December, exem-
plified  the  spirit  of  this  month.  Dr.  Lee  was  the  first  Asian  American  man 
to  win  an  Olympic  gold  medal,  becoming  a  platform  diving  champion  at 
the 1948 London Olympics only 1 year after graduating from medical school. 
To  fulfill  his  dreams,  Dr.  Lee  overcame  several  obstacles,  including  his 
local  childhood  pool’s  policy  of  opening  to  minorities  only  once  per  week. 
Later  in  life  he  was  subject  to  housing  discrimination  (even  after  8  years 
of  military  service).  Dr.  Lee  nevertheless  tirelessly  served  his  country  and 
community,  including  by  representing  the  United  States  at  the  Olympic 
Games, on behalf of several Presidents. 
Katherine  Sui  Fun  Cheung  also  embodied  the  spirit  of  this  month.  In  1932, 
she  became  the  first  Chinese  American  woman  to  earn  a  pilot  license. 
At  the  time,  only  about  1  percent  of  pilots  in  the  United  States  were 
women. As a member of The Ninety-Nines, an organization of women pilots, 
she paved the way for thousands of women to take to the skies. 
There  are  more  than  20  million  Asian  Americans  and  Pacific  Islanders 
in  the  United  States.  Each  day,  through  their  actions,  they  make  America 
more  vibrant,  more  prosperous,  and  more  secure.  Our  Nation  is  particularly 
grateful to the many Asian Americans and Pacific Islanders who have served 
and  are  currently  serving  in  our  Armed  Forces,  protecting  the  Nation,  and 
promoting freedom and peace around the world. 
NOW,  THEREFORE,  I,  DONALD  J.  TRUMP,  President  of  the  United  States 
of  America,  by  virtue  of  the  authority  vested  in  me  by  the  Constitution 
and  the  laws  of  the  United  States,  do  hereby  proclaim  May  2017  as  Asian 
American  and  Pacific  Islander  Heritage  Month.  The  Congress,  by  Public 
Law  102–450,  as  amended,  has  also  designated  the  month  of  May  each 
year as ‘‘Asian/Pacific American Heritage Month.’’ I encourage all Americans 
to  learn  more  about  our  Asian  American,  Native  Hawaiian,  and  Pacific 
Islander  heritage,  and  to  observe  this  month  with  appropriate  programs 
and activities. 

20796 

Federal  Register / Vol.  82,  No.  84 / Wednesday,  May  3,  2017 / Presidential  Documents 

IN  WITNESS  WHEREOF,  I  have  hereunto  set  my  hand  this  twenty-eighth 
day  of  April,  in  the  year  of  our  Lord  two  thousand  seventeen,  and  of 
the  Independence  of  the  United  States  of  America  the  two  hundred  and 
forty-first. 

[FR  Doc.  2017–09073 
Filed  5–2–17;  11:15  am] 
Billing  code  3295–F7–P 


Conclusion

There do seem to be some general similarities between President Trump's tweets and official federal actions. However, the topics are quite vague. For example, tweets about specific White House officials are grouped with the federal documents that define who sits on the different committees in the Executive Office.


In [ ]: