Is Donald Trump Kanye West?

I was listening to NPR a few weeks back, and they were talking about David Robinson's analysis of Donald Trump's Twitter. At the end of the interview, someone made a passing joke that Donald Trump tweets more like a celebrity, namely Kanye West. It inspired me to do an analysis of my own. This notebook discusses how to download a user's Twitter feed, including some minor post-processing, and how to compute various statistics that I more or less pull out of thin air. An appendix discusses EDA on Twitter data. First, you have to register as a Twitter developer, which for small projects is free. You'll get four keys: consumer, consumer secret, access, and access secret. Don't share them with anyone, not even your doctor!

Installing Tweepy, and downloading Tweets

We start by installing the Tweepy package, which acts as a wrapper for Twitter's REST API. There are a few other choices, but Tweepy seemed to be the most popular after a cursory search. (I do my programming in Spyder, but this notebook seems to be a good way to add commentary to my code)


In [ ]:
!pip install tweepy
import tweepy

consumer_key = "XXXXXX"
consumer_secret = "YYYYY"

access_token = "ZZZZZZZ"
access_token_secret = "AAAAAAA"

In practice, it's best not to put your access keys in plain text here. You can, for example, put them in a text file and read them in from there. I got a list of the top 100 users by copy/pasting the info from http://twittercounter.com/pages/100 into Notepad++, and recording/running a macro that cleaned up the formatting.
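A minimal sketch of the keys-in-a-file idea (the filename `twitter_keys.txt` and the one-key-per-line layout are my own conventions, not anything Twitter requires):

```python
# Read the four keys from a plain-text file kept out of version control,
# one key per line, in the same order as the variables above.
def load_keys(path="twitter_keys.txt"):
    with open(path) as f:
        keys = [line.strip() for line in f if line.strip()]
    if len(keys) != 4:
        raise ValueError("expected 4 keys, found %d" % len(keys))
    return keys

# consumer_key, consumer_secret, access_token, access_token_secret = load_keys()
```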


In [ ]:
userList = ['realdonaldtrump','kanyewest','katyperry','justinbieber','taylorswift13','BarackObama','rihanna','YouTube','ladygaga','TheEllenShow','twitter','jtimberlake','KimKardashian','britneyspears','Cristiano','selenagomez','jimmyfallon','cnnbrk','ArianaGrande','instagram','shakira','ddlovato','JLo','Drake','Oprah','KingJames','KevinHart4real','MileyCyrus','BillGates','onedirection','nytimes','SportsCenter','espn','Harry_Styles','Pink','LilTunechi','CNN','wizkhalifa','Adele','NiallOfficial','BrunoMars','KAKA','ActuallyNPH','kanyewest','BBCBreaking','danieltosh','neymarjr','aliciakeys','LiamPayne','Louis_Tomlinson','NBA','EmWatson','pitbull','narendramodi','SrBachchan','khloekardashian','ConanOBrien','iamsrk','Eminem','kourtneykardash','NICKIMINAJ','realmadrid','davidguetta','AvrilLavigne','NFL','zaynmalik','KendallJenner','BeingSalmanKhan','FCBarcelona','aamir_khan','NASA','blakeshelton','KylieJenner','aplusk','coldplay','vine','chrisbrown','edsheeran','MariahCarey','xtina','LeoDiCaprio','agnezmo','TwitterEspanol','deepikapadukone','MohamadAlarefe','BBCWorld','google','TheEconomist','JimCarrey','shugairi','KDTrey5','ivetesangalo','priyankachopra','RyanSeacrest','iHrithik','Beyonce','TwitterSports','SnoopDogg','Reuters','AlejandroSanz','ricky_martin','radityadika']

The next step is to download data from Twitter. yanofsky has a nice little for loop that will take a user list and output each user's most recent 3,240 tweets as a .csv file.


In [ ]:
import csv
from collections import namedtuple


def get_all_tweets(screen_name):
	#Twitter only allows access to a user's most recent 3240 tweets with this method
	
	#authorize twitter, initialize tweepy
	auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
	auth.set_access_token(access_token, access_token_secret)
	api = tweepy.API(auth)
	
	#initialize a list to hold all the tweepy Tweets
	alltweets = []
	
	#make initial request for most recent tweets (200 is the maximum allowed count)
	new_tweets = api.user_timeline(screen_name=screen_name, count=200)
	
	#save most recent tweets
	alltweets.extend(new_tweets)
	
	#save the id of the oldest tweet less one
	oldest = alltweets[-1].id - 1
	
	#keep grabbing tweets until there are no tweets left to grab
	while len(new_tweets) > 0:
		#all subsequent requests use the max_id param to prevent duplicates
		new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
		#save most recent tweets
		alltweets.extend(new_tweets)
		#update the id of the oldest tweet less one
		oldest = alltweets[-1].id - 1
	
	#transform the tweepy tweets into a 2D list that will populate the csv
	outtweets = [[tweet.id_str, tweet.created_at, tweet.text] for tweet in alltweets]
	
	#write the csv
	with open('%s_tweets.csv' % screen_name, 'w', newline='', encoding='utf-8') as f:
		writer = csv.writer(f)
		writer.writerow(["id", "created_at", "text"])
		writer.writerows(outtweets)

for user in userList:
    get_all_tweets(user)

This completes step one.

Step two involves post-processing to clean up the data and make it usable.

First, let's define some helper functions to keep our main code clean


In [ ]:
import datetime
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


#compare the bin heights of two histograms
#this particular implementation uses the intersection as a similarity metric
def compHist(first, second):
    r = 0
    for a, b in zip(first, second):
        r = r + min(a, b)
    return r

#get tweets per day
def getTPD(df):
    #translate each created_at string into a date object
    dates = [datetime.datetime(int(x[0:4]), int(x[5:7]), int(x[8:10])) for x in df['created_at'].values]
    datesU = pd.unique(dates)
    cc = np.array([])
    for D in datesU:
        c = 0
        for D2 in dates:
            if D2 == D:
                c = c + 1
        cc = np.append(cc, c)
    return [cc.mean(), cc.std()]

#a "narcissism" index.  Basically counting up occurrences of "I" and "me"
def getNarcInd(df):
    P = re.compile(r"I[\'m]*[\s]|\sme$|\sme[\s\.]")
    A = np.array([len(P.findall(x)) for x in df['text'].values])
    return A.mean()
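Before running these on real data, here is a quick sanity check on toy inputs (the histograms and the sample tweet below are made up; `hist_intersection` inlines the same logic as `compHist` so the snippet runs on its own):

```python
import re

# same intersection logic as compHist above, inlined so this snippet is standalone
def hist_intersection(first, second):
    return sum(min(a, b) for a, b in zip(first, second))

h1 = [0.5, 0.3, 0.2]   # two toy normalized histograms
h2 = [0.4, 0.4, 0.2]
print(hist_intersection(h1, h1))   # identical histograms give 1.0
print(hist_intersection(h1, h2))   # partial overlap gives ~0.9

# the regex from getNarcInd, applied to a made-up tweet
P = re.compile(r"I[\'m]*[\s]|\sme$|\sme[\s\.]")
sample = "I'm sure everyone agrees with me"
print(len(P.findall(sample)))      # two matches: "I'm " and " me"
```

Identical normalized histograms overlap completely, so their intersection is 1.0; the further two tweeting-hour profiles diverge, the closer the score drops toward 0.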

We then create a few useful variables.
tgt is a cutoff date 90 days in the past. This is necessary because we don't know how far back 3,240 tweets reach for each user; restricting to the last 90 days puts every account on the same footing.
celebStruct is a namedtuple to hold various (hopefully self-explanatory!) statistics for each Twitter account.
rr will be a list containing individual instances of celebStruct.


In [ ]:
tgt = datetime.datetime(2016,8,26) - datetime.timedelta(90)
celebStruct = namedtuple("celebStruct", "user numTweets histA tweetsPerDay narcInd")
rr = []

for user in userList:
    df = pd.read_csv(user + '_tweets.csv')

    #get most recent 3 months
    recentTgt = [datetime.date(int(x[0:4]), int(x[5:7]), int(x[8:10])) > tgt.date() for x in df['created_at']]
    recentTweets = df[recentTgt]

    numTweets = len(recentTweets)  #tweets per 3 months

    #create a histogram for each celebrity indicating tweets per hour
    bin1 = range(25)
    plt.figure()
    hourInformation = [int(x[11:13]) for x in df['created_at']]  #strip off the date, leaving only the hour of day
    [A, B, C] = plt.hist(hourInformation, density=True, bins=bin1)
    #the next two lines plot the histogram with consistent axes and save it, allowing for visual inspection
    plt.ylim([0, .1])
    plt.savefig(user + '_hist.png')
    plt.close()

    #two random measures I made up.  Not sure how important they are!
    narcissistIndex = getNarcInd(recentTweets)
    tweetsPerDay = getTPD(recentTweets)

    #create a new celebStruct and add it to the list of Twitter accounts
    thisCeleb = celebStruct(user, numTweets, A, tweetsPerDay, narcissistIndex)
    rr.append(thisCeleb)

We are now ready to calculate statistics of individual twitter users, and compare them against Trump.


In [ ]:
TRUMP = rr[0]
#this portion of the code compares each user's hour-of-day tweeting histogram
#against the histogram of Donald Trump's daily tweeting frequency
m_FINAL = []
for celeb in rr:
    m_FINAL.append([celeb.user, compHist(TRUMP.histA, celeb.histA)])
print(m_FINAL)
L = pd.DataFrame(m_FINAL, columns=['name','match'])
L.sort_values('match')

Here is the last part of the dataframe L, sorted by match score. Obviously, Trump matches 100% with himself, but the hours at which he tweets are most similar to ladygaga's.

        name     match
30        BillGates  0.801204
8           rihanna  0.814037
18      jimmyfallon  0.836678
19           cnnbrk  0.848119
28   KevinHart4real  0.849917
10         ladygaga  0.850373
0   realdonaldtrump  1.000000

Donald Trump's Tweeting frequency

Katy Perry's Tweeting frequency

I think the feature that causes this high level of similarity is the ramp-up around 8 pm, and also the "lunch break" both of them seem to take.


Now, let's look at this Arbitrary Narcissism Index (or ANI, as I like to call it).


In [ ]:
plt.plot([x.narcInd for x in rr], 'o', color='#3366aa')

ANIs of Top Twitter users


We see here that most of the top users don't rank very high on the narcissism index. Either we're not all doomed, or this index is in fact completely arbitrary!
Looking at the numbers:


In [ ]:
A = pd.DataFrame([[x.user, x.narcInd] for x in rr], columns=['name','ANI'])
A.sort_values('ANI')
        name         ANI
17      selenagomez  0.272727
23         ddlovato  0.286604
0   realdonaldtrump  0.319012
30        BillGates  0.358974
11     TheEllenShow  0.424710
28   KevinHart4real  0.456000

Oddly enough, Kevin Hart ranks as a greater narcissist than Trump, who lands somewhere between Bill Gates and Demi Lovato.
So, Donald Trump isn't Kanye West, after all!

This is my first published independent (as in not ordered/funded by my boss) data science project. Please comment on it, and let me know how I can improve my communication skills!

