Twitter - Timeline Analysis 1


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import glob
runlevel = 0   # gates which (slow) processing cells are re-executed below; 0 = mostly use cached csv files

Create timeline

The tweets are captured by a simple cronjob that runs the following command every minute

/usr/local/bin/t timeline -n 10 --csv > /root/twitter/capt/twitter-data-`date +%s`.csv
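
The corresponding crontab entry would look roughly like this (inferred from the description above; note that % is special inside a crontab and must be escaped):

* * * * * /usr/local/bin/t timeline -n 10 --csv > /root/twitter/capt/twitter-data-`date +\%s`.csv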

I haven't figured out the API limit yet, so I am only capturing 10 tweets per call - I don't want to get blacklisted. The capture files are zipped up with the following commands (the first line deletes the zero-length files that occasionally appear, as they cause problems downstream):

find capt -size 0 -print0 | xargs -0 rm
zip archive.zip capt/*.csv

We copy the archive into a local Dropbox folder using

rsync user@my-twitter-server.com:/path/to/twitter/archive.zip /Users/myname/Dropbox

and then use a Dropbox share link for the file to fetch it onto the analysis machine with wget:


In [3]:
if runlevel >= 100:
    # note: Dropbox serves the file directly with dl=1 instead of dl=0
    !wget https://www.dropbox.com/s/xxx/archive.zip?dl=0 -O archive.zip
    !unzip -o archive.zip

We now glob all the capture files we have received and concatenate the dataframes corresponding to the csv files, after renaming the headings (this is the step that falls over if any file has zero length):


In [217]:
if runlevel >= 95:
    filenames = glob.glob('capt/*.csv')
    # read every capture file and normalise the column names
    tweets = pd.concat(
        [pd.read_csv(fn, index_col=0).rename(columns={'Posted at': 'time', 'Screen name': 'from', 'Text': 'text'})
             for fn in filenames
        ])
    tweets.to_csv('tweets_raw.csv')
    tweets.tail()
    print("==%i== tweet files converted -> tweets_raw.csv" % len(filenames))

In [218]:
!ls -l | grep tweets


-rw-r--r-- 1 root root 1025462 Sep  7 08:41 tweets.csv
-rw-r--r-- 1 root root    8104 Sep  7 08:41 tweets_per_handle.csv
-rw-r--r-- 1 root root      89 Sep  7 08:41 tweets_per_utc1.csv
-rw-r--r-- 1 root root  892569 Sep  7 08:41 tweets_raw.csv

In [219]:
tweets = pd.read_csv('tweets_raw.csv', index_col=0)
print("number of tweets: %i" % len(tweets))
tweets.head()


number of tweets: 5297
Out[219]:
time from text
ID
508463857913036802 2014-09-07 03:56:45 +0000 tomgara Is Conrad Hackett a person or an idea?
508463793504931842 2014-09-07 03:56:29 +0000 PaulGambles2 RT @MLutherKingQts: A productive and happy lif...
508463703197765632 2014-09-07 03:56:08 +0000 Sally_Hadidi Sonoma wine tasting picnics in the sun #Califo...
508463524062830592 2014-09-07 03:55:25 +0000 A_Reader_FT RT @SaoSasha: You can accuse the Czechs of man...
508463450109255680 2014-09-07 03:55:07 +0000 RightWingNews Even More IRS Empoyees Have “Lost” Emails http...

5 rows × 3 columns


In [220]:
if runlevel >= 90:
    from datetime import datetime as dt
    def convert_time(t):
        # parse e.g. "2014-09-07 03:56:45 +0000" and bucket the hour into 1/3/6/12-hour bins
        o = dt.strptime(t, "%Y-%m-%d %H:%M:%S %z")
        return o.hour + o.minute/60, o.hour, (o.hour//3)*3, (o.hour//6)*6, (o.hour//12)*12, o.weekday()

    df1 = pd.DataFrame([convert_time(t) for t in tweets['time']],
                           columns=['utc', 'utc1', 'utc3', 'utc6', 'utc12', 'wday'], index=tweets.index)
    for col in df1.columns:
        tweets[col] = df1[col]
    tweets.to_csv('tweets.csv')
    print("reformatted time data (%i tweets)" % len(tweets))

Basic analysis


In [221]:
tweets = pd.read_csv('tweets.csv', index_col=0)
tweets.head()


Out[221]:
time from text utc utc1 utc3 utc6 utc12 wday
ID
508463857913036802 2014-09-07 03:56:45 +0000 tomgara Is Conrad Hackett a person or an idea? 3.933333 3 3 0 0 6
508463793504931842 2014-09-07 03:56:29 +0000 PaulGambles2 RT @MLutherKingQts: A productive and happy lif... 3.933333 3 3 0 0 6
508463703197765632 2014-09-07 03:56:08 +0000 Sally_Hadidi Sonoma wine tasting picnics in the sun #Califo... 3.933333 3 3 0 0 6
508463524062830592 2014-09-07 03:55:25 +0000 A_Reader_FT RT @SaoSasha: You can accuse the Czechs of man... 3.916667 3 3 0 0 6
508463450109255680 2014-09-07 03:55:07 +0000 RightWingNews Even More IRS Empoyees Have “Lost” Emails http... 3.916667 3 3 0 0 6

5 rows × 9 columns

Tweets per hour


In [222]:
utc1s = {x for x in tweets['utc1']}
utc1s


Out[222]:
{0, 1, 2, 3, 4, 5, 6, 7, 8, 20, 21, 22, 23}

In [223]:
if runlevel >= 0:
    mylist = [(utc1, len(tweets[tweets['utc1'] == utc1])) for utc1 in utc1s]
    mylist.sort(key=lambda x: int(x[0]), reverse=False)
    index = [x[0] for x in mylist]
    data = [x[1] for x in mylist]
    tweets_per_utc1 = pd.DataFrame(data, index=index, columns=['count'])
    mylist, index, data = None, None, None
    tweets_per_utc1.to_csv('tweets_per_utc1.csv')

tweets_per_utc1 = pd.read_csv('tweets_per_utc1.csv', index_col=0)
plt.bar(tweets_per_utc1.index, tweets_per_utc1['count'])
plt.title('Tweets per hour (capped; UTC)')


Out[223]:
<matplotlib.text.Text at 0x7ff1da6998d0>

Note: because we only capture 10 tweets per minute, this analysis maxes out at 600 tweets/hour regardless of how many tweets were actually sent; that constraint is binding here, so with the present data this graph is probably not very meaningful
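
Incidentally, pandas can produce the same per-hour table in one line via value_counts (a sketch, not the code that was run above; the same idiom works for the per-handle counts below):

# count tweets per hour bucket and order the rows by hour
tweets_per_utc1 = tweets['utc1'].value_counts().sort_index().to_frame('count')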

Tweets per sender

We first look at the twitter handles that send messages into my timeline (note: for new-style RTs I won't see who retweeted, and I also might not follow the tweep); we compute some basic stats (#tweets, #tweeps, #tweets/tweep)


In [224]:
handles = {x for x in tweets['from']}
#handles

In [225]:
len(tweets), len(handles), round(len(tweets) / len(handles),2)


Out[225]:
(5297, 580, 9.13)

In [226]:
if runlevel > 80:
    mylist = [(handle, len(tweets[tweets['from'] == handle]))
      for handle in handles]
    mylist.sort(key=lambda x: x[1], reverse=True)
    index = [x[0] for x in mylist]
    data = [x[1] for x in mylist]
    tweets_per_handle = pd.DataFrame(data, index=index, columns=['count'])
    mylist, index, data = None, None, None
    tweets_per_handle.to_csv('tweets_per_handle.csv')
    print("calculated number of tweets per handle (%i handles)" % len(handles))

tweets_per_handle = pd.read_csv('tweets_per_handle.csv', index_col=0)
tweets_per_handle.head(20)


Out[226]:
count
ManchurianDevil 186
luvronandez 143
CMCMFIN 120
Ed_Tech_ 107
saserief 86
beccanalia 75
ao_techfreak 71
AnnieSage 69
Arinobe_SME 66
WarrenWhitlock 61
katecaldwell 60
BenAtkinsonPhD 55
dianewitt 55
Timothy_Hughes 52
TheWarRoom_Tom 50
thezhanly 45
ISSMAG 44
OnlineMagazin 43
TheFutureMedia 40
elearningfeeds 39

20 rows × 1 columns


In [227]:
tweets_per_handle['count'].mean(), tweets_per_handle['count'].median()


Out[227]:
(9.1327586206896552, 4.0)

In [239]:
plt.plot(tweets_per_handle)
plt.title('Tweets per handle')


Out[239]:
<matplotlib.text.Text at 0x7ff1da638208>

In [229]:
plt.plot(tweets_per_handle[:50], '+-')


Out[229]:
[<matplotlib.lines.Line2D at 0x7ff1da798780>]

In [230]:
plt.plot(tweets_per_handle[:10], 'o-')


Out[230]:
[<matplotlib.lines.Line2D at 0x7ff1da6f94a8>]

Streaming API


In [231]:
import twitter
import yaml

In [232]:
if False:
    # one-off step: persist the tokens dict (created interactively) to disk
    with open("twitter_tokens.yaml", "w") as f:
        yaml.dump(tok, f, default_flow_style=False)
!ls -l | grep yaml


-rw-r--r-- 1 root root     226 Sep  7 07:10 twitter_tokens.yaml
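
For reference, the dumped dict has the four keys that the OAuth constructor below expects; a placeholder version (the real values come from the Twitter developer console) would look like:

# placeholder credentials - never commit the real ones
tok = {
    'token': '...',
    'token_secret': '...',
    'consumer_key': '...',
    'consumer_secret': '...',
}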

In [233]:
with open("twitter_tokens.yaml", "r") as f:
    tok = yaml.safe_load(f)
#tok

auth_obj=twitter.OAuth(
    token = tok['token'], 
    token_secret = tok['token_secret'], 
    consumer_key = tok['consumer_key'], 
    consumer_secret = tok['consumer_secret']
)
auth_obj


Out[233]:
<twitter.oauth.OAuth at 0x7ff1da8c4358>

In [238]:
#t = twitter.Twitter(auth=auth_obj)

In [235]:
#status = t.statuses.home_timeline(count=5)
#status[0]

In [237]:
#twitter_stream = twitter.TwitterStream(auth=auth_obj, domain='userstream.twitter.com')
twitter_stream = twitter.TwitterStream(auth=auth_obj)
#twitterator = twitter_stream.user()
#twitterator = twitter_stream.statuses.sample()
#twitterator

In [ ]:
if False:
    from pprint import pprint
    for tweet in twitterator:
        #print ('incoming')
        pprint(tweet)
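
A bounded version of this loop (a sketch, assuming the stream iterator yields tweet dicts with 'user' and 'text' keys, as the twitter package's sample stream does) stops after a handful of tweets so it cannot run away with the connection:

from itertools import islice

# take only the first 5 messages from the stream, then stop
twitterator = twitter_stream.statuses.sample()
for tweet in islice(twitterator, 5):
    # the stream also carries control messages (e.g. delete notices) without a 'text' field
    if 'text' in tweet:
        print(tweet['user']['screen_name'], ':', tweet['text'])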