Twords

This notebook shows a typical workflow for using Twords to analyze word frequencies in Twitter data.

The basic workflow is to search Twitter for a particular search term, say "charisma", either with the java search function or with the Twitter API. (Note: for more obscure terms like "charisma", an API search may take upwards of a day or two to collect a reasonably sized data set. The java search code is much faster in these cases.) All the returned tweets are then put into a bag-of-words frequency analyzer to find the frequency of each word in the collection of tweets. These frequencies are compared with the background frequencies of English words on Twitter (there are several options for computing this background rate, see below), and the results are displayed in order of how disproportionately each word is used with that search term.

E.g., if you search for "charisma", you will find that words like "great", "lacks", and "handsome" were roughly 20 times more likely to occur in tweets containing the word "charisma" than in a random sample of all tweets. Twords creates a list of these disproportionately used words, ordered by how many times more likely they are to appear in tweets containing the search term than otherwise.
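The comparison described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Twords implementation: the function name `relative_frequencies`, its `min_background` parameter, and the toy word lists are all made up for this example, and the code is written for Python 3.

```python
from collections import Counter


def relative_frequencies(search_words, background_words, min_background=1):
    """Map each word to (rate in search_words) / (rate in background_words).

    Words seen fewer than min_background times in the background are skipped,
    because no usable background rate exists for them.
    """
    search_counts = Counter(search_words)
    background_counts = Counter(background_words)
    search_total = sum(search_counts.values())
    background_total = sum(background_counts.values())

    ratios = {}
    for word, count in search_counts.items():
        background_count = background_counts.get(word, 0)
        if background_count < min_background:
            continue
        search_rate = count / search_total
        background_rate = background_count / background_total
        ratios[word] = search_rate / background_rate
    return ratios


# Toy data (hypothetical): words from "search" tweets and from a background sample
search = "he lacks charisma she has great charisma".split()
background = "great day he said she has a great idea and he lacks time today ok".split()

ratios = relative_frequencies(search, background)
for word, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(word, round(ratio, 2))
```

Note that "charisma" itself drops out of this toy result because it never occurs in the toy background, which is the same rare-background problem discussed later in this notebook.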

Once the frequency chart with the list of disproportionately used words has been created, the user can search the tweet corpus for tweets containing one of these words to see what types of tweets they appear in. In the charisma example, "carpenter" was one of the most common words in these tweets, and a search of the tweet corpus revealed that a popular actress is named Charisma Carpenter, so all tweets about her were included. Since this was not the intended sense, Twords lets the user delete tweets from the corpus by search term and redo the frequency analysis. In this case, tweets containing the word "carpenter" were removed. (This would also remove a tweet that said something like "the carpenter I hired had great charisma", which we would probably want to keep in the corpus, but this is a small price to pay to remove the many unwanted "charisma carpenter" tweets.) Another common use of this delete function is to remove spam - spam often consists of identical tweets coming simultaneously from several different accounts, which can produce noticeable (and probably undesired) frequency associations in the final bag-of-words analysis.
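As a rough illustration of the spam pattern just described (the same text arriving from several accounts), the sketch below groups tweets by normalized text and flags texts posted by many distinct users. It is not part of Twords; the function name, the `min_accounts` threshold, and the sample tweets are hypothetical.

```python
from collections import defaultdict


def texts_shared_across_accounts(tweets, min_accounts=3):
    """Return {normalized text: set of usernames} for texts posted by at
    least min_accounts distinct accounts.

    tweets is an iterable of (username, text) pairs.
    """
    accounts_by_text = defaultdict(set)
    for username, text in tweets:
        accounts_by_text[text.strip().lower()].add(username)
    return {
        text: users
        for text, users in accounts_by_text.items()
        if len(users) >= min_accounts
    }


# Made-up sample data
tweets = [
    ("user_a", "Boost your charisma today"),
    ("user_b", "boost your charisma today"),
    ("user_c", "Boost your charisma today "),
    ("user_d", "the carpenter I hired had great charisma"),
]

spam = texts_shared_across_accounts(tweets)
clean = [(user, text) for user, text in tweets if text.strip().lower() not in spam]
```

Here the three near-identical tweets are flagged and only the last tweet remains in `clean`.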

Main objects within Twords class:

data_path: Path to twitter data collected with search_terms. Currently the code is meant to read in data collected with Henrique twitter code.

background_path: path to background twitter data used to compare word frequencies

search_terms: list of terms used in twitter search to collect tweets located at data_path

tweets_df: pandas dataframe that contains all tweets located at data_path, possibly with some tweets dropped (see dropping methods below). If user needs to re-add tweets that were previously dropped they must reload all tweet data over again from the csv.

word_bag: python list of words in all tweets contained in tweets_df

freq_dist: nltk.FreqDist() object computed on word_bag

word_freq_df: pandas dataframe of frequencies of top words computed from freq_dist

The two most important objects are tweets_df and word_freq_df, the dataframes that hold the tweets themselves and the relative frequencies of words in those tweets.

Note on background word frequencies

If the search is done by the Twitter API, calculation of background word frequencies is straightforward with the sampling feature of the Twitter API.

If the tweets are gathered using the java search code (which queries twitter automatically through Twitter's web interface), background frequencies can be trickier, since the java library can specify search dates. Search dates set a year or more in the past in principle require background rates for those past dates, which cannot be obtained with the Twitter API. To approximate these rates, a search was done using ~50 of the most common English words simultaneously.

Another issue is that the java code can return either "top tweets" only or "all tweets" for a given search term, depending on which jar file is used.

A solution is to use all three background possibilities (current API background, Top tweets background w/ top 50 English search terms, and All tweets background with top 50 English search terms) and compare their word rates with each other. In early versions of these tests the difference in background word rates between Twitter API and Top tweets was typically less than a factor of 2, which would make little difference in the word rate comparison, as words of interest typically appeared ~10 or more times more frequently than the base rate.
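One way to run the comparison between two backgrounds is sketched below: compute each word's rate in both backgrounds and list the words whose rates differ by more than a chosen factor. The function names and the word counts are invented for illustration and do not come from the actual background files.

```python
def rate_ratio_between_backgrounds(counts_a, counts_b):
    """For words present in both backgrounds, return rate_a / rate_b."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    ratios = {}
    for word in counts_a.keys() & counts_b.keys():
        rate_a = counts_a[word] / total_a
        rate_b = counts_b[word] / total_b
        ratios[word] = rate_a / rate_b
    return ratios


def words_outside_factor(ratios, factor=2.0):
    """Words whose rates differ between backgrounds by more than `factor`."""
    return sorted(
        word for word, ratio in ratios.items()
        if ratio > factor or ratio < 1.0 / factor
    )


# Hypothetical word counts for two background samples
api_background = {"the": 500, "love": 40, "charm": 10, "news": 50}
top_background = {"the": 900, "love": 60, "charm": 15, "news": 225}

ratios = rate_ratio_between_backgrounds(api_background, top_background)
disagreeing = words_outside_factor(ratios, factor=2.0)
```

In this toy data only "news" differs by more than a factor of 2 between the two backgrounds.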

Example using search on word "charisma"

This example shows how the program works with some previously collected tweets. These were found by searching for "charisma" and "charismatic" using java code, i.e. by calling create_java_tweets in Twords. The create_java_tweets function can return ~2000 tweets per minute, which is much faster than the Twitter API when searching for uncommon words like "charisma."


In [1]:
# to reload files that are changed automatically
%load_ext autoreload
%autoreload 2

In [2]:
from twords.twords import Twords 
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
import pandas as pd
# this pandas line makes the dataframe display all text in a line; useful for seeing entire tweets
pd.set_option('display.max_colwidth', -1)

First set the path to the desired Twitter data location, load the tweets, lowercase all letters, and drop tweets where the strings "charisma" or "charismatic" appear in either a username or a mention.

Example with pre-existing data collected with Twords


In [3]:
twit = Twords()
twit.data_path = "data/java_collector/charisma_300000"
twit.background_path = 'jar_files_and_background/freq_table_72319443_total_words_twitter_corpus.csv'
twit.create_Background_dict()
twit.set_Search_terms(["charisma"])
twit.create_Stop_words()

In [4]:
twit.get_java_tweets_from_csv_list()

In [5]:
# find how many tweets we have in original dataset
print "Total number of tweets:", len(twit.tweets_df)


Total number of tweets: 267917

In [6]:
twit.tweets_df.head(5)


Out[6]:
index | username | date | retweets | favorites | text | mentions | hashtags | id | permalink
0 | Zer0Cool6711 | 2016/05/03 | 1 | 1 | Yo @THEVinceRusso u gotta chill on @FightOwensFight he's over rite now ppl don't care how he looks his charisma that good | @THEVinceRusso @FightOwensFight | NaN | 727647453315534848 | https://twitter.com/Zer0Cool6711/status/727647453315534848
1 | CultureSyndrome | 2016/05/03 | 0 | 2 | @shellechii that hitler'chu charisma. pic.twitter.com/lc5Zoi5Rly | @shellechii | NaN | 727647245621956608 | https://twitter.com/CultureSyndrome/status/727647245621956608
2 | Charisma_Elf | 2016/05/03 | 0 | 0 | @VOCAL01D おはありでち^^ノン | @VOCAL01D | NaN | 727646469746917376 | https://twitter.com/Charisma_Elf/status/727646469746917376
3 | Charisma___M | 2016/05/03 | 0 | 0 | @ToriKelly | @ToriKelly | NaN | 727645669213777927 | https://twitter.com/Charisma___M/status/727645669213777927
4 | charisma_news | 2016/05/03 | 0 | 0 | Early Christians got the #Jesus story wrong, author says: http://bit.ly/1NiGeM6 | NaN | #Jesus | 727645183441969154 | https://twitter.com/charisma_news/status/727645183441969154

Do standard cleaning


In [7]:
# create a column that stores the raw original tweets
twit.keep_column_of_original_tweets()

In [8]:
twit.lower_tweets()

In [9]:
# for this data set this drops about 200 tweets
twit.keep_only_unicode_tweet_text()

In [10]:
twit.remove_urls_from_tweets()


Removing urls from tweets...
This may take a minute - cleaning rate is about 400,000 tweets per minute
Time to complete: 0.631 minutes
Tweets cleaned per minute: 424275.6

In [11]:
twit.remove_punctuation_from_tweets()

In [12]:
twit.drop_non_ascii_characters_from_tweets()
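For readers who want to see what the text-cleaning steps above amount to, here is a minimal standalone sketch (lowercasing, URL removal, punctuation removal, dropping non-ASCII characters). It is an approximation written for Python 3, not the Twords code; the name `clean_tweet` and the URL pattern are assumptions made for this example.

```python
import re
import string

# Assumption: only links that start with http:// or https:// are treated as URLs
URL_RE = re.compile(r"https?://\S+")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Lowercase, strip URLs, strip punctuation, drop non-ASCII characters."""
    text = text.lower()
    text = URL_RE.sub("", text)
    text = text.translate(PUNCTUATION_TABLE)
    text = text.encode("ascii", "ignore").decode("ascii")
    return " ".join(text.split())  # collapse leftover whitespace


raw_tweets = [
    "Early Christians got the #Jesus story wrong, author says: http://bit.ly/1NiGeM6",
    "@shellechii that hitler'chu charisma. pic.twitter.com/lc5Zoi5Rly",
    "@VOCAL01D おはありでち^^ノン",
]
cleaned = [clean_tweet(tweet) for tweet in raw_tweets]
```

With this URL pattern, a link without a scheme such as pic.twitter.com/... is not removed, so stripping punctuation fuses it into a single token. Tokens of that shape ("pictwittercom...") are visible in the cleaned tweets shown later in this notebook.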

In [13]:
# these will also commonly be used for collected tweets - DROP DUPLICATES and DROP BY SEARCH IN NAME
twit.drop_duplicate_tweets()

In [14]:
twit.drop_by_search_in_name()

In [15]:
twit.convert_tweet_dates_to_standard()

In [16]:
twit.sort_tweets_by_date()

In [17]:
# apparently not all tweets contain the word "charisma" at this point, so do this
# cleaning has already dropped about half of the tweets we started with 
len(twit.tweets_df)


Out[17]:
131841

In [18]:
twit.keep_tweets_with_terms("charisma")

In [19]:
len(twit.tweets_df)


Out[19]:
124062

Create the word bag and then the nltk object from the tweets in tweets_df


In [20]:
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()


Time to make words_string:  0.001 minutes
Time to tokenize:  0.196 minutes
Time to compute word bag:  0.154 minutes

Create dataframe showing the most common words in the tweets and plot the results

Ideally word_freq_df would be created with a large number like 10,000, since it is built in order of the most common words in the corpus, and the more interesting relative frequencies may occur with less common words. It takes about one minute to compute per 1000 words though, so for a first look at the data it can make sense to use a smaller value of n. Here we'll use n=400.


In [21]:
# this creates twit.word_freq_df, a dataframe that stores word frequency values
twit.create_word_freq_df(400)


Creating word_freq_df...
Takes about 1 minute per 1000 words
Time to create word_freq_df:  0.2357 minutes

In [56]:
twit.word_freq_df.sort_values("log relative frequency", ascending = False, inplace = True)

In [57]:
twit.word_freq_df.head(20)


Out[57]:
index | word | occurrences | frequency | relative frequency | log relative frequency | background occurrences
368 | lakas | 231 | 0.000336 | 1280.618505 | 7.155098 | 19
64 | uniqueness | 748 | 0.001089 | 1212.131215 | 7.100135 | 65
504 | alis | 178 | 0.000259 | 937.457096 | 6.843171 | 20
1126 | spritzer | 80 | 0.000117 | 842.658064 | 6.736561 | 10
294 | oozes | 275 | 0.000401 | 827.610598 | 6.718543 | 35
700 | nicht | 128 | 0.000186 | 674.126451 | 6.513418 | 20
906 | exudes | 100 | 0.000146 | 619.601517 | 6.429077 | 17
937 | niet | 96 | 0.000140 | 561.772042 | 6.331096 | 18
1538 | likability | 58 | 0.000084 | 509.105913 | 6.232656 | 12
1653 | mehr | 53 | 0.000077 | 507.509970 | 6.229516 | 11
2197 | noch | 38 | 0.000055 | 500.328225 | 6.215264 | 8
2138 | viel | 40 | 0.000058 | 421.329032 | 6.043414 | 10
1424 | gravitas | 63 | 0.000092 | 414.745766 | 6.027666 | 16
2386 | likeability | 35 | 0.000051 | 409.625448 | 6.015243 | 9
1404 | dexterity | 65 | 0.000095 | 402.740986 | 5.998294 | 17
2408 | farages | 34 | 0.000050 | 397.921863 | 5.986256 | 9
558 | iba | 162 | 0.000236 | 387.814222 | 5.960526 | 44
1665 | izmir | 53 | 0.000077 | 372.173978 | 5.919361 | 15
166 | charismatic | 420 | 0.000612 | 365.616102 | 5.901584 | 121
507 | oozing | 178 | 0.000259 | 328.932314 | 5.795852 | 57

Plotting log of relative frequency

This gives the natural log of the frequency of a word in the tweet data divided by the frequency of that word in the background data set. This is a logarithmic scale, so remember that a 1 on this graph means the given word is ~2.7 times more common, a 2 means it is ~7 times more common, and a 3 means it is ~20 times more common.
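The scale can be checked with a short calculation. The sketch below converts log values back to ratios and reproduces the "uniqueness" row of the table above. The relative-frequency formula used at the end is an assumption about how the column is computed, and the background total of 72,319,443 words is taken from the background file name. The code is written for Python 3.

```python
import math

# Convert points on the log scale back to "times more common"
for log_value in (1, 2, 3):
    print(log_value, round(math.exp(log_value), 1))

# "uniqueness" row from the table above
relative_frequency = 1212.131215
log_relative_frequency = math.log(relative_frequency)
print(round(log_relative_frequency, 6))

# Assumption: relative frequency = tweet frequency / background frequency,
# with the background total read off the background file name
background_total = 72319443
background_rate = 65 / background_total
approx_relative_frequency = 0.001089 / background_rate
print(round(approx_relative_frequency, 1))
```

The last value lands within a fraction of a percent of the tabulated 1212.13; the small gap is consistent with the displayed frequency 0.001089 being rounded.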

Instead of using plotting function provided by Twords (which plots all words from word_freq_df) we'll manipulate the word_freq_df dataframe directly - remember, don't get stuck in API bondage!


In [24]:
num_words_to_plot = 32
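# sort ascending first so the last rows hold the largest log relative frequencies
sorted_freq = twit.word_freq_df.sort_values("log relative frequency", ascending=True)
sorted_freq.set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
                num_words_to_plot/2.), fontsize=30, color="c"); 
plt.title("log relative frequency", fontsize=30); 
ax = plt.gca();  # reuse the axes created by the pandas plot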
ax.xaxis.grid(linewidth=4);


Further data cleaning

The typical Twords workflow is to collect tweets, load them into tweets_df, do preliminary cleaning, create word_freq_df to look at relative word frequencies, and then, depending on any undesired effects seen in the data, further filter tweets_df and recreate word_freq_df until the data is clean enough to be of use.

The next section shows how this process can work.

Fighting spam and unusually high relative frequencies

Repetitive, undesired tweets are the number one problem in interpreting results, as large numbers of tweets that use the same phrase cause various irrelevant words to appear high in word_freq_df. Sometimes the problem is spam accounts doing advertising, sometimes it is a pop culture phenomenon that many different users are quoting: e.g. people enjoy quoting the line 'excuse my charisma' from the song "6 Foot 7 Foot" by Lil Wayne on Twitter, which means "excuse" appears much more frequently than background when searching for "charisma," even though this does not reveal something semantically interesting about the word "charisma." The problem is at its worst when the problem word is rare in the background, which causes it to have a huge relative frequency.

When looking through the chart of extra-frequent words it is good to check which kinds of tweets those words are appearing in to be sure they are relevant to your search.
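One quick way to spot quoted or repeated phrases like the one above is to count the most common word n-grams in the cleaned tweet text. This is a supplementary sketch rather than a Twords feature; the function name and the sample tweets are made up.

```python
from collections import Counter


def most_common_ngrams(texts, n=3, top=5):
    """Return the `top` most frequent n-word phrases across all texts."""
    counts = Counter()
    for text in texts:
        words = text.split()
        for start in range(len(words) - n + 1):
            counts[" ".join(words[start:start + n])] += 1
    return counts.most_common(top)


# Made-up cleaned tweets
tweets = [
    "excuse my charisma",
    "lol excuse my charisma",
    "excuse my charisma everyone",
    "he has real charisma on stage",
]
top_phrases = most_common_ngrams(tweets, n=3, top=1)
```

A phrase that dominates this list is a good candidate for a closer look before trusting the relative frequencies of the words it contains.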

Best defense against spammy/repetitive tweets: large dataset that spans several years

A huge dataset of several million tweets is best for averaging over time periods where different phrases become popular. When this isn't possible (or even when it is), the tools below can help to filter spammy/repetitive tweets out of the dataset.

Second best defense: choose higher threshold for background occurrences

It is a good idea to create a large word_freq_df with the top 10,000 or so words once and then investigate further. One problem that can occur is when the relative frequency for a word (below, we'll see this problem for a word like "minho") is huge because the background value is tiny - we can filter out cases like this by filtering word_freq_df to only include words with a high cutoff background frequency.

Investigating data

Note: this investigation was done using word_freq_df with 400 words - only top 30 are shown in graphics here so that notebook can be viewed reasonably on Github

The word "uniqueness" seems to make sense, but in looking at tweets that contain it we see uniqueness appears a lot due to the popular phrase "charisma, uniqueness, nerve and talent":


In [59]:
twit.tweets_containing("uniqueness").head(12)


771 tweets contain this term
Out[59]:
username text
271 nicolastian obvio que me refiero a charisma uniqueness nerve talent
356 chiscoyo charisma uniqueness nerve ant talent
361 themaxgregory happy birthday laurennalice hope ya have an amazing 18th slay with ya charisma uniqueness nerve and talent pictwittercomehqbuqpox4
546 danimalcrackers the time has come to put someone with charisma uniqueness nerve and talent on the scotus america needs rupaulpictwittercomvkjduz52n6
556 real_stephanie sabia que a ana ficaria she got the charisma uniqueness nerve and talent thevoicekidsbr
907 kaiocosta eduardocilto pede um pouco de charisma uniqueness nerve and talent tambm
931 andrewboff stopcityairport true all mayors should have charisma uniqueness nerve and talent
1195 emilyrioosss charismauniqueness nerve and talent
1470 _ivb_ how have i only just realised that charisma uniqueness nerve and talent is an acronym of cunt
1583 gretadaniel_ missfamenyc like im a straight biological female who doesnt do drag but i still have charisma uniqueness nerve and talent
1624 thezaprecap erikajayne dont they know it stands for charismauniquenessnervetalent rhobh rupaul bravotv
1756 jessemanke charisma uniqueness nerve and talenti finally get it lol rupaulsdragrace

For the word "iba", we appear to be finding tweets that are not in English. These are good candidates to drop if we only want English words. (There is a python library called langdetect that can classify languages of tweets quite well - the problem is it can only classify ~2500 tweets per minute, which is rather slow for one machine. An example will be given below of how to use it if that is desired.)


In [26]:
twit.tweets_containing(" iba ").head(12)


452 tweets contain this term
Out[26]:
username text
172 aniger_eyaf dude believe me iba charisma ni regina faye educalane diba diba alyzzarobles97 ellsmayo lalalacielo cathlenecanlas mawyeeel
388 rhinazipagan o m g iba tlg ang charisma ni mengs onelove adn votemainefpp kca
849 gianinadlgomez marlou yan eh iba talaga charisma ggvfebibigwins
859 allanpcmln annvalenciaa thank you baby iba kasi talaga charisma mo hmp i love you
1093 francesdyan haha iba charisma ng aso may baby clarky na ang jadine then may baby monkey na ang lizquen dolceamore on ratedk
1211 thirdy333333 justjunius iba talaga ang charisma leche ka
1878 _jungjaehyuns kallaseu hahaha im a ten biased pero si taeyong iba ang charisma
1916 pqmaricar26 tintwirl hahaha iba talaha charisma ni quen hindi pahihindian dolceamoresweetbeginning
1943 auregondola_ votemainefpp kca iba ang charisma mo maine
1956 mylzperhour mrkimpson super iba yung charisma ni jojeol iu
2028 bebethbajada votemainefpp kca even the mid 50s me kilig pdn pala hahaha iba talaga ang charisma ng aldub kasama ang aldubnation wow
2614 pedrosokay asucido zari97262439 oo bhe iba ang charisma nila sa mga fans full of energy samarsummerwith tomiho
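As a sketch of how the langdetect library mentioned above could be used to keep only English tweets: `langdetect.detect` returns a language code such as "en" for a string. The helper name `keep_english` and its `detect_fn` parameter are made up for this example; the parameter exists so the function can be tried with a stand-in detector when langdetect is not installed.

```python
def keep_english(texts, detect_fn=None):
    """Return the texts whose detected language code is 'en'.

    detect_fn is any callable mapping a string to a language code. When it is
    None, langdetect.detect is imported and used (requires the third-party
    langdetect package).
    """
    if detect_fn is None:
        from langdetect import DetectorFactory, detect
        DetectorFactory.seed = 0  # make langdetect's results repeatable
        detect_fn = detect

    english = []
    for text in texts:
        try:
            if detect_fn(text) == "en":
                english.append(text)
        except Exception:
            # langdetect raises LangDetectException for text with no usable
            # features (e.g. an empty string); such tweets are skipped here
            continue
    return english
```

With langdetect installed, `keep_english(list_of_tweet_texts)` returns the tweets classified as English. Given the ~2500 tweets per minute quoted above, it may be worth running this only after the cheaper cleaning and dropping steps have shrunk the dataset.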

Another example we probably wish to drop is tweets containing "carpenter", because these overwhelmingly refer to the actress Charisma Carpenter:


In [27]:
twit.tweets_containing("carpenter").head(12)


1185 tweets contain this term
Out[27]:
username text
12 shirodesudesu could i just have a whole series based off that witch couple in supernatural with james marsters and charisma carpenter alliveeverwanted
35 corralantonello buffy the vampire slayer angel slave cordelia 6 figure new charisma carpenter ff32toolid10039campid5337597384item331777331777vectorid229466lgeo1pictwittercommfjrl8xsbo
104 dogbr0ther larsenbryce ryanhigginsryan you have slowly won me over you are the king of charisma unfortunately not charisma carpenter
532 emwendorf what joss whedon did to charisma carpentercordy 20022003
720 nissemus socialsoprano is that charisma carpenter
822 proudlovatjc certo che charisma carpenter proprio un vampiro oh
882 alineteston honestly im kinda happy that charisma carpenters character will show some interest in jay at this point ill take anything chicagopd
933 elektrashearts charisma carpenter is in this episode too wow pictwittercomwxbtrtzovd
1006 ndeddiemac didnt statham have charisma carpenter in ex2 live your life buddy eddiedrunjflix
1112 goodfellazmag chicago pd charisma carpenter offers jay a kushy job watch video goodfellazmag mpcm
1118 feeds4u chicago pd charisma carpenter offers jay a kushy job watch video tv news
1122 bushsofferfr spoiler rappel le personnage de charisma carpenter sera rcurrent sa relation avec jay pourrait sorienter vers qql chose de romantique

We find a similar thing when looking at tweets containing "minho" and "flaming": the Korean singer "Minho" has the nickname "Flaming Charisma", which is also not what we're interested in.

Thus we see we should drop tweets containing " iba ", "minho", "flaming", and "carpenter" from tweets_df. We do this and then recalculate word statistics:


In [28]:
twit.drop_by_term_in_tweet([" iba ", "minho", "flaming", "carpenter"])
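A note on the space-padded term " iba ": assuming terms are matched as plain substrings (an assumption about Twords, suggested by the padding used here), the padding keeps "iba" from matching inside longer words, but it also misses tweets where the word is first or last in the text. The standalone sketch below contrasts the two behaviors with a regex word-boundary test; the function names and example strings are hypothetical.

```python
import re


def contains_term_padded(text, term):
    """Plain substring test, as with a space-padded term such as ' iba '."""
    return term in text


def contains_word(text, word):
    """Whole-word test using regex word boundaries."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None


examples = [
    "thank you baby iba kasi talaga charisma mo",  # word in the middle
    "iba talaga ang charisma",                     # word at the very start
    "tribal leader charisma",                      # 'iba' inside another word
]

padded_hits = [contains_term_padded(text, " iba ") for text in examples]
unpadded_hits = [contains_term_padded(text, "iba") for text in examples]
word_hits = [contains_word(text, "iba") for text in examples]
```

The padded term misses the second example, the unpadded term wrongly matches "tribal", and the word-boundary test handles all three as intended.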

Now recalculate word bag and word frequency object:


In [29]:
# recompute word bag for word statistics
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()


Time to make words_string:  0.001 minutes
Time to tokenize:  0.19 minutes
Time to compute word bag:  0.151 minutes

In [30]:
# create word frequency dataframe
twit.create_word_freq_df(400)


Creating word_freq_df...
Takes about 1 minute per 1000 words
Time to create word_freq_df:  0.2323 minutes

In [31]:
# plot new results - again, only showing top 30 here
twit.word_freq_df.sort_values("log relative frequency", ascending = True, inplace = True)

In [32]:
num_words_to_plot = 32
twit.word_freq_df.set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
                num_words_to_plot/2.), fontsize=30, color="c"); 
plt.title("log relative frequency", fontsize=30); 
ax = plt.gca();  # reuse the axes created by the pandas plot
ax.xaxis.grid(linewidth=4);



In [33]:
twit.word_freq_df.sort_values("log relative frequency", ascending=False, inplace=True)
twit.word_freq_df.head(12)


Out[33]:
index | word | occurrences | frequency | relative frequency | log relative frequency | background occurrences
105 | uniqueness | 920 | 0.000802 | 892.236129 | 6.793731 | 65
322 | oozes | 418 | 0.000364 | 752.858871 | 6.623878 | 35
37 | aura | 1677 | 0.001462 | 419.505690 | 6.039077 | 252
149 | charismatic | 722 | 0.000629 | 376.146618 | 5.929979 | 121
76 | lacks | 1112 | 0.000969 | 210.506683 | 5.349517 | 333
320 | lingo | 419 | 0.000365 | 140.495207 | 4.945173 | 188
266 | und | 473 | 0.000412 | 103.892591 | 4.643358 | 287
38 | gal | 1627 | 0.001418 | 98.713679 | 4.592224 | 1039
371 | caravan | 379 | 0.000330 | 95.566248 | 4.559820 | 250
14 | charm | 2810 | 0.002449 | 89.917749 | 4.498895 | 1970
323 | talaga | 418 | 0.000364 | 76.599013 | 4.338584 | 344
277 | ikon | 455 | 0.000397 | 67.967967 | 4.219037 | 422

It looks like we still have a lot of spam/repetitive uninteresting content, but we can filter through and pick out interesting information. For example, we see the word "skill" is high in relative frequency:


In [34]:
twit.tweets_containing("skill").head(12)


1848 tweets contain this term
Out[34]:
username text
30 papanemes ill take one order of adequate interpersonal skills please and a side of charisma
94 vrsvno i am all set to witness the charisma and oratory skills of our pm at dav davwelcomesmodi
99 drhousepls im shipping bayley and finn balor cause that would be top tier genetics for their kids skill charisma abs ass
130 tresorgahungu charisma can take you to the door of your destiny however it is integrity that will keep you there leadershipskills
253 iiveinfear still hoping naomi becomes divas champion sometime in the future she has the charisma wrestling skills and she can back up the talk
291 upper_cayce atinyangryman so people do put skills into anything other than charisma
428 salisburryatche developing charisma skills access is register ipzlgsx
448 ikramlely nak sthun dahskill nak usha awek ni aku ade lagi takcharisma aku ni dah drop kehuhutrok siall pangai
630 ben_mathes naval debugging subroutines still useful major charisma to refactor larger program is a hell of a skill to aspire too though
872 simonpennysmith never heard such a smug jargon talking no mark suit all the charisma and skill of a tin of boot polish
889 ethanjarnagin acekillcard fallout msjillyjelly lol i guess his charisma skill wasnt high enough
896 rwdhbib saad hariri 20 such a drastic improvement in speech and charisma and skills all that time away were well invested

It looks like people are often talking about charisma as a communication skill itself, as well as a way to hide a lack of skills elsewhere.

Another anti-spam technique: filter out over-represented users

Find which users are contributing disproportionately to results with twit.tweets_df['username'].value_counts(), which returns a pandas Series of tweet counts per user, sorted in descending order. These results can be plotted using something like twit.tweets_df['username'].value_counts().iloc[:12].plot.barh()


In [35]:
twit.tweets_df['username'].value_counts().head(12)


Out[35]:
pcyhrr             615
av_momo            484
avranki            334
kindlebooks_4u     329
ukcaravancentre    252
dndskillcheck      235
martinlighthous    185
zczzzcz            140
kvhanbin           139
beirutcallgirl     121
abubake19446198    116
realhumanpraise    105
Name: username, dtype: int64

This shows that a few users are responsible for hundreds of tweets in our data set - this can be a problem, as we saw in twit.word_freq_df that many of our overly frequent terms had only several hundred examples in the data set. To see an example of the problem, we look at all the tweets by the most prolific charisma tweeter, user pcyhrr:


In [36]:
twit.tweets_by("pcyhrr").head(12)


Out[36]:
username text
76805 pcyhrr charismahanbin rt triplekimcrown support triple kim pictwittercommi6qa6g82y
76810 pcyhrr charismahanbin rt hanbinth 18 ikoncertinbang
77242 pcyhrr charismahanbin rt bapcha father i press conference will be aired live on daum tvpot kakao tv on this fri pictwittercomyjgjldt11d
77395 pcyhrr charismahanbin rt alnikon yunhyeong and chanwoo at hamdeok beach 160525 nht pictwittercom4le6dg4osb
77396 pcyhrr charismahanbin errrrr95 55555555
77397 pcyhrr charismahanbin ikon 20152016 ikoncert showtime in seoul live dvd preorder 31st may 2016 release 21st june 2016 ikon ikonc
77398 pcyhrr charismahanbin rt ygentofficial ikon 20152016 ikoncert showtime in seoul live dvd 2016 05 31 preorde
77400 pcyhrr charismahanbin ikon 20152016 ikoncert showtime in seoul live dvd ikon
77480 pcyhrr charismahanbin rt nickkiie 160521 bi hanbin ikoncertinnanjing hanbinth charismah pictwittercomsl0ulnhynu
77937 pcyhrr charismahanbin bi hanbin ikon crokaybipictwittercomyqplawvcx7
77938 pcyhrr charismahanbin rt bapcha why is krunk always there cr dc ikon gallery pictwittercom2fba5gsper
77939 pcyhrr charismahanbin rt ygikonic ikonschedule 5 27 11 tvn bobby

It looks like this user is repeatedly tweeting at or about charisma_hanbin: we definitely want to drop his tweets.

As another example we look at second most prolific charisma tweeter, av_momo:


In [37]:
twit.tweets_by("av_momo").head(12)


Out[37]:
username text
1216 av_momo kirakira debut fcupgal charisma gal pictwittercomlx5en7dqvd
1782 av_momo charisma gal get you 15 gal pictwittercomkpdhspfqgv
1793 av_momo charisma gal get you 15 gal pictwittercomahc0imphvr
1809 av_momo charisma gal get you 15 gal pictwittercomqp6xndi2ts
2246 av_momo charisma gal get you 15 gal pictwittercomti0ia9pzxu
2272 av_momo charisma gal get you 15 gal pictwittercomfvdxecpmje
2324 av_momo charisma gal get you 15 gal pictwittercomsion1iy2hg
2559 av_momo charisma gal get you 15 pictwittercom29pi61letk
2642 av_momo kirakira debut fcupgal charisma gal pictwittercomdjictqno7q
2936 av_momo kirakira debut fcupgal charisma gal pictwittercomg5kb4bx9d3
3362 av_momo charisma gal get you 15 gal pictwittercomze9wujgduh
3411 av_momo charisma gal get you 15 gal pictwittercomjzwxa84fyy

This is again repetitive and uninteresting data that we wish to drop.

To facilitate dropping users who tweet repeatedly we can use the function drop_by_username_with_n_tweets, which drops all tweets by users that have more than n tweets in our dataset. To be sure we filter out all offenders, here we will use n=1:


In [38]:
twit.drop_by_username_with_n_tweets(1)


Dropping tweets by repeated users...
Found 13195 users with more than 1 tweets in tweets_df
Finished 0 percent of user drops
Finished 5 percent of user drops
Finished 10 percent of user drops
Finished 15 percent of user drops
Finished 20 percent of user drops
Finished 25 percent of user drops
Finished 30 percent of user drops
Finished 35 percent of user drops
Finished 40 percent of user drops
Finished 45 percent of user drops
Finished 50 percent of user drops
Finished 55 percent of user drops
Finished 60 percent of user drops
Finished 65 percent of user drops
Finished 70 percent of user drops
Finished 75 percent of user drops
Finished 80 percent of user drops
Finished 85 percent of user drops
Finished 90 percent of user drops
Finished 95 percent of user drops
Finished 100 percent of user drops
Took 9.391 minutes to complete
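The drop above took several minutes on this data set. For comparison, the same rule (drop every tweet by a user with more than n tweets) can be written as a single vectorized pandas filter. This is a sketch under the assumption that the dataframe has a "username" column as shown earlier; the function name is made up and this is not how Twords implements it.

```python
import pandas as pd


def drop_users_with_more_than_n_tweets(tweets_df, n):
    """Return only the rows whose user has at most n tweets in tweets_df."""
    # value_counts gives tweets per user; map attaches that count to each row
    tweets_per_user = tweets_df["username"].map(tweets_df["username"].value_counts())
    # rows with a missing username get NaN here and are dropped by the comparison
    return tweets_df[tweets_per_user <= n]


# Small made-up example
example_df = pd.DataFrame({
    "username": ["a", "b", "a", "c", "b", "a"],
    "text": ["t1", "t2", "t3", "t4", "t5", "t6"],
})
single_tweet_users = drop_users_with_more_than_n_tweets(example_df, 1)
up_to_two_tweets = drop_users_with_more_than_n_tweets(example_df, 2)
```

The result is a filtered copy, so it would need to be assigned back (for example to twit.tweets_df) to replace the original data.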

In [39]:
# now recompute word bag and word frequency object and look at results
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()


Time to make words_string:  0.0 minutes
Time to tokenize:  0.106 minutes
Time to compute word bag:  0.092 minutes

In [40]:
twit.create_word_freq_df(400)


Creating word_freq_df...
Takes about 1 minute per 1000 words
Time to create word_freq_df:  0.2069 minutes

In [41]:
twit.word_freq_df.sort_values("log relative frequency", ascending = True, inplace = True)

In [42]:
num_words_to_plot = 32
twit.word_freq_df.set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
                num_words_to_plot/2.), fontsize=30, color="c"); 
plt.title("log relative frequency", fontsize=30); 
ax = plt.gca();  # reuse the axes created by the pandas plot
ax.xaxis.grid(linewidth=4);


Another anti-spam technique: remove words that don't appear frequently enough in background

We see from this example that keeping words that are rare in the background can cause problems: the word "lakas" appears only 19 times in the background, and because of this it is now the number one word in the frequency dataframe. We see a similar pattern with a number of other top words.


In [43]:
twit.word_freq_df.sort_values("log relative frequency", ascending=False).head(10)


Out[43]:
index | word | occurrences | frequency | relative frequency | log relative frequency | background occurrences
368 | lakas | 231 | 0.000336 | 1280.618505 | 7.155098 | 19
64 | uniqueness | 748 | 0.001089 | 1212.131215 | 7.100135 | 65
294 | oozes | 275 | 0.000401 | 827.610598 | 6.718543 | 35
166 | charismatic | 420 | 0.000612 | 365.616102 | 5.901584 | 121
376 | pence | 229 | 0.000334 | 294.159598 | 5.684122 | 82
74 | lacks | 725 | 0.001056 | 229.326988 | 5.435149 | 333
193 | lingo | 392 | 0.000571 | 219.628963 | 5.391940 | 188
242 | und | 318 | 0.000463 | 116.709610 | 4.759689 | 287
334 | een | 253 | 0.000368 | 106.596245 | 4.669048 | 250
260 | talaga | 308 | 0.000449 | 94.309115 | 4.546578 | 344

We see that the words with the highest log relative frequency often have fewer than 100 background occurrences - this may collect rarer words that we are not interested in. (These background occurrences were calculated from a sample of ~72 million words from Twitter.)

To correct for this we can remove words from consideration that don't appear often enough in background.

We'll do this by first creating a word_freq_df with many top words (10,000 here) and then filtering out words that don't have a frequent enough background rate before plotting.


In [44]:
twit.create_word_freq_df(10000)


Creating word_freq_df...
Takes about 1 minute per 1000 words
Time to create word_freq_df:  5.3828 minutes

Keep only words that occur more than 100 times in background:


In [45]:
twit.word_freq_df[twit.word_freq_df['background occurrences']>100].sort_values("log relative frequency", ascending=False).head(10)


Out[45]:
index | word | occurrences | frequency | relative frequency | log relative frequency | background occurrences
166 | charismatic | 420 | 0.000612 | 365.616102 | 5.901584 | 121
74 | lacks | 725 | 0.001056 | 229.326988 | 5.435149 | 333
193 | lingo | 392 | 0.000571 | 219.628963 | 5.391940 | 188
406 | overflowing | 212 | 0.000309 | 189.241006 | 5.243021 | 118
455 | realness | 192 | 0.000280 | 145.494917 | 4.980141 | 139
663 | topgear | 138 | 0.000201 | 128.635855 | 4.856986 | 113
708 | lacked | 126 | 0.000184 | 118.498790 | 4.774903 | 112
242 | und | 318 | 0.000463 | 116.709610 | 4.759689 | 287
440 | het | 196 | 0.000285 | 115.983835 | 4.753451 | 178
334 | een | 253 | 0.000368 | 106.596245 | 4.669048 | 250

Now plot the word frequencies using altered word_freq_df:

More than 100 background occurrences:


In [46]:
num_words_to_plot = 32
background_cutoff = 100
twit.word_freq_df[twit.word_freq_df['background occurrences']>background_cutoff].sort_values("log relative frequency", ascending=True).set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
                num_words_to_plot/2.), fontsize=30, color="c"); 
plt.title("log relative frequency", fontsize=30); 
ax = plt.gca();  # reuse the axes created by the pandas plot
ax.xaxis.grid(linewidth=4);


We can further explore by setting an even higher bound on background words:

More than 500 background occurrences:


In [47]:
num_words_to_plot = 32
background_cutoff = 500
twit.word_freq_df[twit.word_freq_df['background occurrences']>background_cutoff].sort_values("log relative frequency", ascending=True).set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
                num_words_to_plot/2.), fontsize=30, color="c"); 
plt.title("log relative frequency", fontsize=30); 
ax = plt.gca();  # reuse the axes created by the pandas plot
ax.xaxis.grid(linewidth=4);


More than 2000 background occurrences:


In [48]:
num_words_to_plot = 32
background_cutoff = 2000
twit.word_freq_df[twit.word_freq_df['background occurrences']>background_cutoff].sort_values("log relative frequency", ascending=True).set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
                num_words_to_plot/2.), fontsize=30, color="c"); 
plt.title("log relative frequency", fontsize=30); 
ax = plt.gca();  # reuse the axes created by the pandas plot
ax.xaxis.grid(linewidth=4);
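The three plotting cells above differ only in the cutoff value, so the filtering step can be wrapped in a small helper. The sketch below is one possible way to do that; the function name is hypothetical, and the demo dataframe reuses a few rows from the Out[43] table above.

```python
import pandas as pd


def top_words_above_cutoff(word_freq_df, background_cutoff, num_words=32):
    """Log relative frequencies of the top words with more than
    background_cutoff background occurrences, sorted ascending so the
    largest value is last (and drawn at the top of a barh plot)."""
    filtered = word_freq_df[word_freq_df["background occurrences"] > background_cutoff]
    return (
        filtered.sort_values("log relative frequency", ascending=True)
        .set_index("word")["log relative frequency"]
        .tail(num_words)
    )


# A few rows copied from the Out[43] table above
demo_df = pd.DataFrame({
    "word": ["lakas", "uniqueness", "charismatic", "lacks", "talaga"],
    "log relative frequency": [7.155098, 7.100135, 5.901584, 5.435149, 4.546578],
    "background occurrences": [19, 65, 121, 333, 344],
})

above_100 = top_words_above_cutoff(demo_df, 100)
above_340 = top_words_above_cutoff(demo_df, 340)
```

The returned Series can be plotted in the same way as in the cells above, e.g. `top_words_above_cutoff(twit.word_freq_df, 500).plot.barh(figsize=(20, 16), fontsize=30, color="c")`.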