This notebook shows a typical workflow for using Twords to analyze word frequencies in Twitter data.
The basic workflow is this: you search Twitter for a particular term, say "charisma", either with the java search function or with the Twitter API. (Note: for more obscure terms like "charisma", an API search may take upwards of a day or two to gather a reasonably sized data set; the java search code is much faster in these cases.) All the returned tweets are then fed into a bag-of-words frequency analyzer to find the frequency of each word in the collection of tweets. These frequencies are compared with the background frequencies of English words on Twitter (there are several options for computing this background rate, see below), and the results are displayed in order of which words are most disproportionately frequent in tweets matching the search.
E.g., if you search for "charisma", you will find that words like "great", "lacks", and "handsome" are roughly 20 times more likely to occur in tweets containing the word "charisma" than in a random sample of all tweets. Twords creates a list of these disproportionately-used words, ordered by how many times more likely they are to appear in tweets containing the search term than otherwise.
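As a rough illustration of the idea (a minimal sketch with made-up toy counts, not Twords' actual implementation), the relative frequency of a word is just its rate in the search corpus divided by its rate in the background corpus:

from collections import Counter

# toy word counts for the "charisma" tweets and for a background sample (made-up numbers)
corpus_counts = Counter({"great": 400, "lacks": 150, "the": 12000})
background_counts = Counter({"great": 20000, "lacks": 800, "the": 3000000})

corpus_total = sum(corpus_counts.values())
background_total = sum(background_counts.values())

def relative_frequency(word):
    corpus_rate = corpus_counts[word] / float(corpus_total)
    background_rate = background_counts[word] / float(background_total)
    return corpus_rate / background_rate

relative_frequency("great")  # how many times more common "great" is in the charisma tweets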
Once the frequency chart is found with the list of disproportionately used words, the user can search the tweet corpus for tweets containing one of these words to see what types of tweets had them. In the charisma example, "carpenter" was one of the most common words in these tweets, but a search in the tweet corpus revealed that a popular actress is named "Charisma Carpenter", so all tweets about her were included. Since this was not the sense that was intended, Twords lets the user delete tweets from the corpus by search term and redo the frequency analysis. In this case, tweets containing the word "carpenter" were removed. (This would also remove a tweet that said something like "the carpenter I hired had great charisma", which we would probably want to keep in the corpus, but this is a small price to pay to remove the many unwanted "Charisma Carpenter" tweets.) Another common use of this delete function is to remove spam - spam often consists of identical tweets coming simultaneously from several different accounts, which can produce noticeable (and probably undesired) frequency associations in the final bag-of-words analysis.
data_path: Path to twitter data collected with search_terms. Currently the code is meant to read in data collected with the Henrique twitter code.
background_path: path to background twitter data used to compare word frequencies
search_terms: list of terms used in twitter search to collect tweets located at data_path
tweets_df: pandas dataframe that contains all tweets located at data_path, possibly with some tweets dropped (see dropping methods below). If the user needs to re-add tweets that were previously dropped, they must reload all tweet data from the csv.
word_bag: python list of words in all tweets contained in tweets_df
freq_dist: nltk.FreqDist() object computed on word_bag
word_freq_df: pandas dataframe of frequencies of top words computed from freq_dist
The two most important objects are tweets_df and word_freq_df, the dataframes that hold the tweets themselves and the relative frequencies of words in those tweets.
If the search is done by the Twitter API, calculation of background word frequencies is straightforward with the sampling feature of the Twitter API.
If the tweets are gathered using the java search code (which queries Twitter automatically through Twitter's web interface), background frequencies can be trickier, since the java library can specify search dates. A search date set a year or more in the past in principle requires background rates for those past dates, which cannot be obtained with the Twitter API. To approximate these rates, a search was done using ~50 of the most common English words simultaneously.
Another issue is that the java code can return either "top tweets" only or "all tweets" for a given search term, depending on which jar file is used.
A solution is to use all three background possibilities (current API background, Top tweets background with the top 50 English search terms, and All tweets background with the top 50 English search terms) and compare their word rates with each other. In early versions of these tests the difference in background word rates between the Twitter API and Top tweets was typically less than a factor of 2, which makes little difference in the word rate comparison, as words of interest typically appeared ~10 or more times more frequently than the base rate.
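For instance, two background tables could be compared with a few lines of pandas (a minimal sketch; the file paths are hypothetical and the 'word'/'occurrences' column names are assumptions that may not match the real background csv format):

import pandas as pd

# hypothetical paths to two background frequency tables
api_background = pd.read_csv("background_api_sample.csv")
top_background = pd.read_csv("background_top_tweets.csv")

merged = api_background.merge(top_background, on="word", suffixes=("_api", "_top"))
merged["rate_api"] = merged["occurrences_api"] / float(merged["occurrences_api"].sum())
merged["rate_top"] = merged["occurrences_top"] / float(merged["occurrences_top"].sum())
merged["ratio"] = merged["rate_api"] / merged["rate_top"]

# words whose background rate differs by more than a factor of 2 between the two samples
merged[(merged["ratio"] > 2) | (merged["ratio"] < 0.5)].head()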
This example shows how the program can work with some previously collected tweets. These were found by searching for "charisma" and "charismatic" using the java code, i.e. by calling create_java_tweets in Twords. The create_java_tweets function can return ~2000 tweets per minute, which is much faster than the Twitter API when searching for uncommon words like "charisma."
In [1]:
# to reload files that are changed automatically
%load_ext autoreload
%autoreload 2
In [2]:
from twords.twords import Twords
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
import pandas as pd
# this pandas line makes the dataframe display all text in a line; useful for seeing entire tweets
pd.set_option('display.max_colwidth', -1)
First set the path to the desired twitter data location, load the tweets, lowercase all text, and drop tweets where the strings "charisma" or "charismatic" appear in either a username or a mention.
In [3]:
twit = Twords()
twit.data_path = "data/java_collector/charisma_300000"
twit.background_path = 'jar_files_and_background/freq_table_72319443_total_words_twitter_corpus.csv'
twit.create_Background_dict()
twit.set_Search_terms(["charisma"])
twit.create_Stop_words()
In [4]:
twit.get_java_tweets_from_csv_list()
In [5]:
# find how many tweets we have in original dataset
print "Total number of tweets:", len(twit.tweets_df)
In [6]:
twit.tweets_df.head(5)
Out[6]:
In [7]:
# create a column that stores the raw original tweets
twit.keep_column_of_original_tweets()
In [8]:
twit.lower_tweets()
In [9]:
# for this data set this drops about 200 tweets
twit.keep_only_unicode_tweet_text()
In [10]:
twit.remove_urls_from_tweets()
In [11]:
twit.remove_punctuation_from_tweets()
In [12]:
twit.drop_non_ascii_characters_from_tweets()
In [13]:
# these will also commonly be used for collected tweets - DROP DUPLICATES and DROP BY SEARCH IN NAME
twit.drop_duplicate_tweets()
In [14]:
twit.drop_by_search_in_name()
In [15]:
twit.convert_tweet_dates_to_standard()
In [16]:
twit.sort_tweets_by_date()
In [17]:
# apparently not all tweets contain the word "charisma" at this point, so do this
# cleaning has already dropped about half of the tweets we started with
len(twit.tweets_df)
Out[17]:
In [18]:
twit.keep_tweets_with_terms("charisma")
In [19]:
len(twit.tweets_df)
Out[19]:
In [20]:
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()
Ideally word_freq_df would be created with a large number like 10,000, since it is created in order of the most common words in the corpus, and the more interesting relative frequencies may occur among less-common words. It takes about one minute per 1000 words to compute, though, so as a first look at the data it can make sense to create it with a smaller value of n. Here we'll use n=400.
In [21]:
# this creates twit.word_freq_df, a dataframe that stores word frequency values
twit.create_word_freq_df(400)
In [56]:
twit.word_freq_df.sort_values("log relative frequency", ascending = False, inplace = True)
In [57]:
twit.word_freq_df.head(20)
Out[57]:
This gives the natural log of the frequency of each word in the tweets data divided by the frequency of that word in the background data set. This is a logarithmic scale, so remember that a 1 on this graph means the given word is ~2.7 times more common than background, a 2 means ~7.4 times more common, and a 3 means ~20 times more common.
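To translate the log values back into plain multiplicative factors:

import numpy as np

# e^1, e^2, e^3: the factor by which a word is more common than background
np.exp([1, 2, 3])  # approximately array([ 2.72,  7.39,  20.09])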
Instead of using the plotting function provided by Twords (which plots all words from word_freq_df), we'll manipulate the word_freq_df dataframe directly - remember, don't get stuck in API bondage!
In [24]:
num_words_to_plot = 32
twit.word_freq_df.set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
The typical Twords workflow is to collect tweets, load them into tweets_df, do preliminary cleaning, create word_freq_df to look at relative word frequencies, and then, depending on any unwanted effects seen in the data, further filter tweets_df and recreate word_freq_df until the data is clean enough to be of use.
The next section shows how this process can work.
Repetitive, undesired tweets are the number one problem in interpreting results, as large numbers of tweets that use the same phrase cause various non-relevant words to appear high in word_freq_df. Sometimes the problem is spam accounts doing advertising, sometimes it is a pop culture phenomenon that many different users are quoting: e.g. people enjoy quoting the line 'excuse my charisma' from the song "6 Foot 7 Foot" by Lil Wayne on Twitter, which means "excuse" appears much more frequently than background when searching for "charisma," even though this does not reveal anything semantically interesting about the word "charisma." The problem is at its worst when the problem word is rare in the background, which gives it a huge relative frequency.
When looking through the chart of extra-frequent words it is good to check which kinds of tweets those words are appearing in to be sure they are relevant to your search.
A huge dataset of several million tweets is best for averaging over time periods where different phrases become popular. When this isn't possible (or even when it is), the tools below can help to filter out spammy/repetitive tweets from dataset.
It is a good idea to create a large word_freq_df with the top 10,000 or so words once and then investigate further. One problem that can occur is when the relative frequency for a word (below, we'll see this problem for a word like "minho") is huge because the background value is tiny - we can filter out cases like this by restricting word_freq_df to words whose background occurrence count exceeds some cutoff.
The word "uniqueness" seems to make sense, but in looking at tweets that contain it we see uniqueness appears a lot due to the popular phrase "charisma, uniqueness, nerve and talent":
In [59]:
twit.tweets_containing("uniqueness").head(12)
Out[59]:
For the word "iba", we appear to be finding tweets that are not in English. These are good candidates to drop if we only want English words. (There is a python library called langdetect that can classify languages of tweets quite well - the problem is it can only classify ~2500 tweets per minute, which is rather slow for one machine. An example will be given below of how to use it if that is desired.)
In [26]:
twit.tweets_containing(" iba ").head(12)
Out[26]:
Another example we probably wish to drop is tweets containing "carpenter", because these overwhelmingly refer to the actress Charisma Carpenter:
In [27]:
twit.tweets_containing("carpenter").head(12)
Out[27]:
We find a similar thing when looking at tweets containing "minho" and "flaming": the Korean singer "Minho" has the nickname "Flaming Charisma", which is also not what we're interested in.
Thus we see we should drop tweets containing " iba ", "minho", "flaming", and "carpenter" from tweets_df. We do this and then recalculate word statistics:
In [28]:
twit.drop_by_term_in_tweet([" iba ", "minho", "flaming", "carpenter"])
Now recalculate word bag and word frequency object:
In [29]:
# recompute word bag for word statistics
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()
In [30]:
# create word frequency dataframe
twit.create_word_freq_df(400)
In [31]:
# plot new results - again, only showing top 30 here
twit.word_freq_df.sort_values("log relative frequency", ascending = True, inplace = True)
In [32]:
num_words_to_plot = 32
twit.word_freq_df.set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
In [33]:
twit.word_freq_df.sort_values("log relative frequency", ascending=False, inplace=True)
twit.word_freq_df.head(12)
Out[33]:
It looks like we still have a lot of spammy/repetitive uninteresting content, but we can filter through and pick out interesting information. For example, we see the word "skill" is high in relative frequency:
In [34]:
twit.tweets_containing("skill").head(12)
Out[34]:
It looks like people are often talking about charisma as a communication skill itself, as well as a way to hide a lack of skills elsewhere.
Find which users are contributing disproportionately to the results with twit.tweets_df['username'].value_counts(), which returns the users with the highest number of tweets in descending order. These counts can also be plotted; a sketch is shown below.
In [35]:
twit.tweets_df['username'].value_counts().head(12)
Out[35]:
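For example, the most prolific users could be plotted in the same style as the frequency charts above (the cutoff of 20 users here is arbitrary):

num_users_to_plot = 20
twit.tweets_df['username'].value_counts()[:num_users_to_plot].plot.barh(figsize=(20, num_users_to_plot/2.), fontsize=30, color="c");
plt.title("tweets per user", fontsize=30);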
This shows that a few users are responsible for hundreds of tweets in our data set - this can be a problem, as we saw in twit.word_freq_df that many of our overly-frequent terms only had several hundred examples in the data set. To see an example of the problem, we look at all the tweets by the most prolific charisma tweeter, user pcyhrr:
In [36]:
twit.tweets_by("pcyhrr").head(12)
Out[36]:
It looks like this user is repeatedly tweeting at or about charisma_hanbin: we definitely want to drop his tweets.
As another example we look at the second most prolific charisma tweeter, av_momo:
In [37]:
twit.tweets_by("av_momo").head(12)
Out[37]:
This is again repetitive and uninteresting data that we wish to drop.
To facilitate dropping users who tweet repeatedly we can use the function drop_by_username_with_n_tweets, which drops all tweets by users that have more than n tweets in our dataset. To be sure we filter out all offenders, here we will use n=1:
In [38]:
twit.drop_by_username_with_n_tweets(1)
In [39]:
# now recompute word bag and word frequency object and look at results
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()
In [40]:
twit.create_word_freq_df(400)
In [41]:
twit.word_freq_df.sort_values("log relative frequency", ascending = True, inplace = True)
In [42]:
num_words_to_plot = 32
twit.word_freq_df.set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
We see from this example that not removing words that are rare in the background can cause problems: the word "lakas" appears only 19 times in the background, and because of this it is now the number one word in the frequency dataframe. We see a similar pattern with a number of other top words.
In [43]:
twit.word_freq_df.sort_values("log relative frequency", ascending=False).head(10)
Out[43]:
We see that the highest log relative frequency words often have background occurrences smaller than 100 - this can pick up rarer words that we are not interested in. (These background occurrences were calculated from a sample of ~72 million words from Twitter.)
To correct for this we can remove words from consideration that don't appear often enough in background.
We'll do this by first creating a word_freq_df with many top words (10,000 here) and then filtering out words that don't have a frequent enough background rate before plotting.
In [44]:
twit.create_word_freq_df(10000)
Filter out words that occur less than 100 times in background:
In [45]:
twit.word_freq_df[twit.word_freq_df['background occurrences']>100].sort_values("log relative frequency", ascending=False).head(10)
Out[45]:
In [46]:
num_words_to_plot = 32
background_cutoff = 100
twit.word_freq_df[twit.word_freq_df['background occurrences']>background_cutoff].sort_values("log relative frequency", ascending=True).set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
In [47]:
num_words_to_plot = 32
background_cutoff = 500
twit.word_freq_df[twit.word_freq_df['background occurrences']>background_cutoff].sort_values("log relative frequency", ascending=True).set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
In [48]:
num_words_to_plot = 32
background_cutoff = 2000
twit.word_freq_df[twit.word_freq_df['background occurrences']>background_cutoff].sort_values("log relative frequency", ascending=True).set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);