I was listening to NPR a few weeks back, and they were talking about David Robinson's analysis of Donald Trump's Twitter feed. At the end of the interview, someone made a passing joke that Donald Trump tweets more like a celebrity, namely Kanye West. It inspired me to do an analysis of my own. This notebook discusses how to download a user's Twitter feed, including some minor post-processing, and how to compute various statistics that I more or less pull out of thin air. An appendix discusses EDA on Twitter data. First, you have to register as a Twitter developer, which for small projects is free. You'll get four keys: consumer, consumer secret, access, and access secret. Don't share them with anyone, not even your doctor!
We start by installing the Tweepy package, which acts as a wrapper for Twitter's REST API. There are a few other choices, but Tweepy seemed to be the most popular after a cursory search. (I do my programming in Spyder, but this notebook seems like a good way to add commentary to my code.)
In [ ]:
!pip install tweepy
import tweepy

#keys from your Twitter developer account (placeholders shown here)
consumer_key = "XXXXXX"
consumer_secret = "YYYYY"
access_token = "ZZZZZZZ"
access_token_secret = "AAAAAAA"
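Before downloading anything, it's worth a quick sanity check that the keys actually authenticate. Here is a minimal sketch (my addition, not part of the original workflow) using Tweepy's verify_credentials() wrapper:
In [ ]:
#sanity check (my addition): confirm the keys authenticate before doing any real work
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
print(api.verify_credentials().screen_name)  #prints your own handle if the keys work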
In practice, it's best not to put your access keys in plain text like this. You can instead put them in a text file and read them in at runtime. I got a list of the top 100 users by copy/pasting the info from http://twittercounter.com/pages/100 into Notepad++, and recording/running a macro that made the formatting pretty.
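Here is a minimal sketch of the text-file approach, assuming a file named twitter_keys.txt (the filename and line order are my inventions for illustration) with one key per line:
In [ ]:
#hypothetical key file: four lines, in the order consumer_key, consumer_secret,
#access_token, access_token_secret; the filename is an assumption for illustration
with open('twitter_keys.txt') as f:
    consumer_key, consumer_secret, access_token, access_token_secret = f.read().splitlines()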
In [ ]:
userList = ['realdonaldtrump','kanyewest','katyperry','justinbieber','taylorswift13','BarackObama','rihanna','YouTube','ladygaga','TheEllenShow','twitter','jtimberlake','KimKardashian','britneyspears','Cristiano','selenagomez','jimmyfallon','cnnbrk','ArianaGrande','instagram','shakira','ddlovato','JLo','Drake','Oprah','KingJames','KevinHart4real','MileyCyrus','BillGates','onedirection','nytimes','SportsCenter','espn','Harry_Styles','Pink','LilTunechi','CNN','wizkhalifa','Adele','NiallOfficial','BrunoMars','KAKA','ActuallyNPH','kanyewest','BBCBreaking','danieltosh','neymarjr','aliciakeys','LiamPayne','Louis_Tomlinson','NBA','EmWatson','pitbull','narendramodi','SrBachchan','khloekardashian','ConanOBrien','iamsrk','Eminem','kourtneykardash','NICKIMINAJ','realmadrid','davidguetta','AvrilLavigne','NFL','zaynmalik','KendallJenner','BeingSalmanKhan','FCBarcelona','aamir_khan','NASA','blakeshelton','KylieJenner','aplusk','coldplay','vine','chrisbrown','edsheeran','MariahCarey','xtina','LeoDiCaprio','agnezmo','TwitterEspanol','deepikapadukone','MohamadAlarefe','BBCWorld','google','TheEconomist','JimCarrey','shugairi','KDTrey5','ivetesangalo','priyankachopra','RyanSeacrest','iHrithik','Beyonce','TwitterSports','SnoopDogg','Reuters','AlejandroSanz','ricky_martin','radityadika']
The next step is to download data from Twitter. yanofsky has a nice little for loop that takes a user list and writes each user's most recent 3240 tweets to a .csv file.
In [ ]:
import csv
from collections import namedtuple

def get_all_tweets(screen_name):
    #Twitter only allows access to a user's most recent 3240 tweets with this method
    #authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    #initialize a list to hold all the tweepy Tweets
    alltweets = []
    #make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    #save most recent tweets
    alltweets.extend(new_tweets)
    #save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1
    #keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        #print("getting tweets before %s" % oldest)
        #all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
        #save most recent tweets
        alltweets.extend(new_tweets)
        #update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1
        #print("...%s tweets downloaded so far" % len(alltweets))
    #transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text] for tweet in alltweets]
    #write the csv
    with open('%s_tweets.csv' % screen_name, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows(outtweets)

for user in userList:
    get_all_tweets(user)
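One caveat: pulling up to 3240 tweets apiece for 100 accounts will run into Twitter's rate limits. Tweepy can sleep through those automatically; the sketch below (my addition, not part of yanofsky's loop) shows the one-line change inside get_all_tweets, using Tweepy's wait_on_rate_limit option:
In [ ]:
#optional tweak (my addition): replace api = tweepy.API(auth) inside get_all_tweets
#so Tweepy pauses automatically when the rate limit is hit instead of raising an error
api = tweepy.API(auth, wait_on_rate_limit=True)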
In [ ]:
import datetime
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#compare the bin heights of two histograms
#this particular implementation uses the intersection as a similarity metric
def compHist(first, second):
    r = 0
    for a, b in zip(first, second):
        r = r + min(a, b)
    return r

#get the mean and standard deviation of tweets per day
def getTPD(df):
    #translate each created_at string into a date object
    dates = [datetime.datetime(int(x[0:4]), int(x[5:7]), int(x[8:10])) for x in df['created_at'].values]
    datesU = pd.unique(dates)
    #count how many tweets fall on each unique date
    cc = np.array([sum(1 for D2 in dates if D2 == D) for D in datesU])
    return [cc.mean(), cc.std()]

#a "narcissism" index. Basically counting up occurrences of "I" and "me" per tweet
def getNarcInd(df):
    P = re.compile(r"I[\'m]*[\s]|\sme$|\sme[\s\.]")
    A = np.array([len(P.findall(x)) for x in df['text'].values])
    return A.mean()
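To make these measures concrete, here is a toy run (made-up data, not from any real feed) of compHist and getNarcInd:
In [ ]:
#toy example (made-up data): histogram intersection of two normalized 3-bin histograms
h1 = [0.5, 0.3, 0.2]
h2 = [0.4, 0.4, 0.2]
print(compHist(h1, h2))  #min(0.5,0.4) + min(0.3,0.4) + min(0.2,0.2) = 0.9

#toy example: the narcissism regex counts occurrences of "I" and "me" in each tweet
demo = pd.DataFrame({'text': ["I'm going to win", "look at me", "great rally tonight"]})
print(getNarcInd(demo))  #(1 + 1 + 0) / 3 = 0.666...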
We then create a few useful variables.
tgt is the collection date (26 August 2016) minus 90 days. This is necessary because we don't know how far back 3240 tweets go for each user; filtering on tgt guarantees every dataset covers the same 90-day window.
celebStruct is a namedtuple that holds various (hopefully self-explanatory!) statistics for each Twitter account.
rr is a list containing individual instances of celebStruct.
In [ ]:
tgt = datetime.datetime(2016, 8, 26) - datetime.timedelta(90)
celebStruct = namedtuple("celebStruct", "user numTweets histA tweetsPerDay narcInd")
rr = []
for user in userList:
    df = pd.read_csv(user + '_tweets.csv')
    #keep only the most recent 3 months
    recentTgt = [datetime.date(int(x[0:4]), int(x[5:7]), int(x[8:10])) > tgt.date() for x in df['created_at']]
    recentTweets = df[recentTgt]
    numTweets = len(recentTweets)  #tweets per 3 months
    #create a histogram for each celebrity indicating tweets per hour
    bin1 = range(25)
    hourInformation = [int(x[11:13]) for x in df['created_at']]  #strip off the date, leaving only the hour of day
    [A, B, C] = plt.hist(hourInformation, density=True, bins=bin1)
    #the next two lines plot the histogram with consistent axes and save it, allowing for visual inspection
    plt.ylim([0, .1])
    plt.savefig(user + '_hist.png')
    plt.close()
    #two random measures I made up. Not sure how important they are!
    narcissistIndex = getNarcInd(recentTweets)
    tweetsPerDay = getTPD(recentTweets)
    #create a new record and add it to the list of Twitter accounts
    thisCeleb = celebStruct(user, numTweets, A, tweetsPerDay, narcissistIndex)
    rr.append(thisCeleb)
We are now ready to compute statistics for individual Twitter users and compare them against Trump.
In [ ]:
TRUMP = rr[0]
#this portion of the code computes and compares each account's hour-of-day histogram
#against the histogram of Donald Trump's daily tweeting frequency
m_FINAL = []
for r in rr:
    m_FINAL.append([r.user, compHist(TRUMP.histA, r.histA)])
print(m_FINAL)
L = pd.DataFrame(m_FINAL, columns=['name', 'match'])
L.sort_values('match')
Here is the last part of the dataframe L. Obviously, Trump matches 100% with himself, but the hours during which he tweets are most similar to ladygaga's:
name match
30 BillGates 0.801204
8 rihanna 0.814037
18 jimmyfallon 0.836678
19 cnnbrk 0.848119
28 KevinHart4real 0.849917
10 ladygaga 0.850373
0 realdonaldtrump 1.000000
I think the features that cause this high level of similarity are the ramp-up around 8 pm and the "lunch break" both of them seem to take.
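If you want to eyeball that claim, one option is to overlay the two hourly profiles. This sketch is my addition, not part of the original analysis; it looks the two accounts up by name so it doesn't depend on their positions in rr:
In [ ]:
#sketch (my addition): overlay Trump's and Lady Gaga's hour-of-day histograms
trump = next(r for r in rr if r.user == 'realdonaldtrump')
gaga = next(r for r in rr if r.user == 'ladygaga')
hours = list(range(24))  #histA has one bin per hour
plt.bar(hours, trump.histA, width=0.4, label='realdonaldtrump')
plt.bar([h + 0.4 for h in hours], gaga.histA, width=0.4, label='ladygaga')
plt.xlabel('hour of day')
plt.ylabel('fraction of tweets')
plt.legend()
plt.show()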
Now, let's look at the Arbitrary Narcissism Index (or ANI, as I like to call it).
In [ ]:
#plot each account's ANI as a point for a quick visual scan
plt.plot([r.narcInd for r in rr], 'o', color='#3366aa')
plt.ylabel('ANI')
plt.show()
In [ ]:
A = [[r.user, r.narcInd] for r in rr]
ANI = pd.DataFrame(A, columns=['name', 'ANI'])
ANI.sort_values('ANI')
name ANI
17 selenagomez 0.272727
23 ddlovato 0.286604
0 realdonaldtrump 0.319012
30 BillGates 0.358974
11 TheEllenShow 0.424710
28 KevinHart4real 0.456000
Oddly enough, Kevin Hart ranks as a greater narcissist than Trump, who is somewhere between Bill Gates and Demi Lovato.
So, Donald Trump isn't Kanye West, after all!
This is my first published independent (as in not ordered/funded by my boss) data science project. Please comment on it, and let me know how I can improve my communication skills!