This notebook shows a typical workflow for using Twords to analyze word frequencies in Twitter data.
The basic workflow is this: you search Twitter for a particular term, say "charisma", either with the java search function or with the Twitter API. (Note: for more obscure terms like "charisma", an API search may take upwards of a day or two to gather a reasonably sized data set; the java search code is much faster in these cases.) All the returned tweets are then fed into a bag-of-words frequency analyzer to find the frequency of each word in the collection of tweets. These frequencies are compared with the background frequencies of English words on Twitter (there are several options for computing this background rate, see below), and the results are displayed in order of which words are most disproportionately frequent in tweets matching the search.
E.g., if you search for "charisma", you will find that words like "great", "lacks", and "handsome" are roughly 20 times more likely to occur in tweets containing the word "charisma" than in a random sample of all tweets. Twords creates a list of these disproportionately-used words, ordered by how many times more likely they are to appear in tweets containing the search term than otherwise.
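As a rough illustration of the idea (a minimal sketch with made-up toy counts, not Twords' actual implementation), the relative frequency of a word is just its rate in the search corpus divided by its rate in the background corpus:

from collections import Counter

# toy word counts for the "charisma" tweets and for a background sample (made-up numbers)
corpus_counts = Counter({"great": 400, "lacks": 150, "the": 12000})
background_counts = Counter({"great": 20000, "lacks": 800, "the": 3000000})

corpus_total = sum(corpus_counts.values())
background_total = sum(background_counts.values())

def relative_frequency(word):
    corpus_rate = corpus_counts[word] / float(corpus_total)
    background_rate = background_counts[word] / float(background_total)
    return corpus_rate / background_rate

relative_frequency("great")  # how many times more common "great" is in the charisma tweets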
Once the frequency chart is found with the list of disproportionately used words, the user can search the tweet corpus for tweets containing one of these words to see what types of tweets had them. In the charisma example, "carpenter" was one of the most common words in these tweets, but a search in the tweet corpus revealed that a popular actress is named "Charisma Carpenter", so all tweets about her were included. Since this was not the sense that was intended, Twords lets the user delete tweets from the corpus by search term and redo the frequency analysis. In this case, tweets containing the word "carpenter" were removed. (This would also remove a tweet that said something like "the carpenter I hired had great charisma", which we would probably want to keep in the corpus, but this is a small price to pay to remove the many unwanted "Charisma Carpenter" tweets.) Another common use of this delete function is to remove spam - spam often consists of identical tweets coming simultaneously from several different accounts, which can produce noticeable (and probably undesired) frequency associations in the final bag-of-words analysis.
data_path: Path to twitter data collected with search_terms. Currently the code is meant to read in data collected with the Henrique twitter code.
background_path: path to background twitter data used to compare word frequencies
search_terms: list of terms used in twitter search to collect tweets located at data_path
tweets_df: pandas dataframe that contains all tweets located at data_path, possibly with some tweets dropped (see dropping methods below). If the user needs to re-add tweets that were previously dropped, they must reload all tweet data from the csv.
word_bag: python list of words in all tweets contained in tweets_df
freq_dist: nltk.FreqDist() object computed on word_bag
word_freq_df: pandas dataframe of frequencies of top words computed from freq_dist
The two most important objects are tweets_df and word_freq_df, the dataframes that hold the tweets themselves and the relative frequencies of words in those tweets.
If the search is done by the Twitter API, calculation of background word frequencies is straightforward with the sampling feature of the Twitter API.
If the tweets are gathered using the java search code (which queries Twitter automatically through Twitter's web interface), background frequencies can be trickier, since the java library can specify search dates. A search date set a year or more in the past in principle requires background rates for those past dates, which cannot be obtained with the Twitter API. To approximate these rates, a search was done using ~50 of the most common English words simultaneously.
Another issue is that the java code can return either "top tweets" only or "all tweets" for a given search term, depending on which jar file is used.
A solution is to use all three background possibilities (current API background, Top tweets background with the top 50 English search terms, and All tweets background with the top 50 English search terms) and compare their word rates with each other. In early versions of these tests the difference in background word rates between the Twitter API and Top tweets was typically less than a factor of 2, which makes little difference in the word rate comparison, as words of interest typically appeared ~10 or more times more frequently than the base rate.
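For instance, two background tables could be compared with a few lines of pandas (a minimal sketch; the file paths are hypothetical and the 'word'/'occurrences' column names are assumptions that may not match the real background csv format):

import pandas as pd

# hypothetical paths to two background frequency tables
api_background = pd.read_csv("background_api_sample.csv")
top_background = pd.read_csv("background_top_tweets.csv")

merged = api_background.merge(top_background, on="word", suffixes=("_api", "_top"))
merged["rate_api"] = merged["occurrences_api"] / float(merged["occurrences_api"].sum())
merged["rate_top"] = merged["occurrences_top"] / float(merged["occurrences_top"].sum())
merged["ratio"] = merged["rate_api"] / merged["rate_top"]

# words whose background rate differs by more than a factor of 2 between the two samples
merged[(merged["ratio"] > 2) | (merged["ratio"] < 0.5)].head()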
This example shows how the program can work with some previously collected tweets. These were found by searching for "charisma" and "charismatic" using the java code, i.e. by calling create_java_tweets in Twords. The create_java_tweets function can return ~2000 tweets per minute, which is much faster than the Twitter API when searching for uncommon words like "charisma."
In [1]:
# to reload files that are changed automatically
%load_ext autoreload
%autoreload 2
In [2]:
from twords.twords import Twords
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
import pandas as pd
# this pandas line makes the dataframe display all text in a line; useful for seeing entire tweets
pd.set_option('display.max_colwidth', -1)
First set the path to the desired twitter data location, load the tweets, lowercase all text, and drop tweets where the strings "charisma" or "charismatic" appear in either a username or a mention.
In [3]:
twit = Twords()
twit.data_path = "data/java_collector/charisma_300000"
twit.background_path = 'jar_files_and_background/freq_table_72319443_total_words_twitter_corpus.csv'
twit.create_Background_dict()
twit.set_Search_terms(["charisma"])
twit.create_Stop_words()
In [4]:
twit.get_java_tweets_from_csv_list()
In [5]:
# find how many tweets we have in original dataset
print "Total number of tweets:", len(twit.tweets_df)
In [6]:
twit.tweets_df.head(5)
Out[6]:
In [7]:
# create a column that stores the raw original tweets
twit.keep_column_of_original_tweets()
In [8]:
twit.lower_tweets()
In [9]:
# for this data set this drops about 200 tweets
twit.keep_only_unicode_tweet_text()
In [10]:
twit.remove_urls_from_tweets()
In [11]:
twit.remove_punctuation_from_tweets()
In [12]:
twit.drop_non_ascii_characters_from_tweets()
In [13]:
# these will also commonly be used for collected tweets - DROP DUPLICATES and DROP BY SEARCH IN NAME
twit.drop_duplicate_tweets()
In [14]:
twit.drop_by_search_in_name()
In [15]:
twit.convert_tweet_dates_to_standard()
In [16]:
twit.sort_tweets_by_date()
In [17]:
# apparently not all tweets contain the word "charisma" at this point, so do this
# cleaning has already dropped about half of the tweets we started with
len(twit.tweets_df)
Out[17]:
In [18]:
twit.keep_tweets_with_terms("charisma")
In [19]:
len(twit.tweets_df)
Out[19]:
In [20]:
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()
Ideally word_freq_df would be created with a large number like 10,000, since it is created in order of the most common words in the corpus, and the more interesting relative frequencies may occur among less-common words. It takes about one minute per 1000 words to compute, though, so as a first look at the data it can make sense to create it with a smaller value of n. Here we'll use n=400.
In [21]:
# this creates twit.word_freq_df, a dataframe that stores word frequency values
twit.create_word_freq_df(400)
In [56]:
twit.word_freq_df.sort_values("log relative frequency", ascending = False, inplace = True)
In [57]:
twit.word_freq_df.head(20)
Out[57]:
This gives the natural log of the frequency of each word in the tweets data divided by the frequency of that word in the background data set. This is a logarithmic scale, so remember that a 1 on this graph means the given word is ~2.7 times more common than background, a 2 means ~7.4 times more common, and a 3 means ~20 times more common.
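To translate the log values back into plain multiplicative factors:

import numpy as np

# e^1, e^2, e^3: the factor by which a word is more common than background
np.exp([1, 2, 3])  # approximately array([ 2.72,  7.39,  20.09])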
Instead of using the plotting function provided by Twords (which plots all words from word_freq_df), we'll manipulate the word_freq_df dataframe directly - remember, don't get stuck in API bondage!
In [24]:
num_words_to_plot = 32
twit.word_freq_df.set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
The typical Twords workflow is to collect tweets, load them into tweets_df, do preliminary cleaning, create word_freq_df to look at relative word frequencies, and then, depending on any unwanted effects seen in the data, further filter tweets_df and recreate word_freq_df until the data is clean enough to be of use.
The next section shows how this process can work.
Repetitive, undesired tweets are the number one problem in interpreting results, as large numbers of tweets that use the same phrase cause various non-relevant words to appear high in word_freq_df. Sometimes the problem is spam accounts doing advertising, sometimes it is a pop culture phenomenon that many different users are quoting: e.g. people enjoy quoting the line 'excuse my charisma' from the song "6 Foot 7 Foot" by Lil Wayne on Twitter, which means "excuse" appears much more frequently than background when searching for "charisma," even though this does not reveal anything semantically interesting about the word "charisma." The problem is at its worst when the problem word is rare in the background, which gives it a huge relative frequency.
When looking through the chart of extra-frequent words it is good to check which kinds of tweets those words are appearing in to be sure they are relevant to your search.
A huge dataset of several million tweets is best for averaging over time periods where different phrases become popular. When this isn't possible (or even when it is), the tools below can help to filter out spammy/repetitive tweets from dataset.
It is a good idea to create a large word_freq_df with the top 10,000 or so words once and then investigate further. One problem that can occur is when the relative frequency for a word (below, we'll see this problem for a word like "minho") is huge because the background value is tiny - we can filter out cases like this by restricting word_freq_df to words whose background occurrence count exceeds some cutoff.
The word "uniqueness" seems to make sense, but in looking at tweets that contain it we see uniqueness appears a lot due to the popular phrase "charisma, uniqueness, nerve and talent":
In [59]:
twit.tweets_containing("uniqueness").head(12)
Out[59]:
For the word "iba", we appear to be finding tweets that are not in English. These are good candidates to drop if we only want English words. (There is a python library called langdetect that can classify languages of tweets quite well - the problem is it can only classify ~2500 tweets per minute, which is rather slow for one machine. An example will be given below of how to use it if that is desired.)
In [26]:
twit.tweets_containing(" iba ").head(12)
Out[26]:
Another example we probably wish to drop is tweets containing "carpenter", because these overwhelmingly refer to the actress Charisma Carpenter:
In [27]:
twit.tweets_containing("carpenter").head(12)
Out[27]:
We find a similar thing when looking at tweets containing "minho" and "flaming": the Korean singer "Minho" has the nickname "Flaming Charisma", which is also not what we're interested in.
Thus we see we should drop tweets containing " iba ", "minho", "flaming", and "carpenter" from tweets_df. We do this and then recalculate word statistics:
In [28]:
twit.drop_by_term_in_tweet([" iba ", "minho", "flaming", "carpenter"])
Now recalculate word bag and word frequency object:
In [29]:
# recompute word bag for word statistics
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()
In [30]:
# create word frequency dataframe
twit.create_word_freq_df(400)
In [31]:
# plot new results - again, only showing top 30 here
twit.word_freq_df.sort_values("log relative frequency", ascending = True, inplace = True)
In [32]:
num_words_to_plot = 32
twit.word_freq_df.set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
In [33]:
twit.word_freq_df.sort_values("log relative frequency", ascending=False, inplace=True)
twit.word_freq_df.head(12)
Out[33]:
It looks like we still have a lot of spammy/repetitive uninteresting content, but we can filter through and pick out interesting information. For example, we see the word "skill" is high in relative frequency:
In [34]:
twit.tweets_containing("skill").head(12)
Out[34]:
It looks like people are often talking about charisma as a communication skill itself, as well as a way to hide a lack of skills elsewhere.
Find which users are contributing disproportionately to the results with twit.tweets_df['username'].value_counts(), which returns the users with the highest number of tweets in descending order. These counts can also be plotted; a sketch is shown below.
In [35]:
twit.tweets_df['username'].value_counts().head(12)
Out[35]:
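For example, the most prolific users could be plotted in the same style as the frequency charts above (the cutoff of 20 users here is arbitrary):

num_users_to_plot = 20
twit.tweets_df['username'].value_counts()[:num_users_to_plot].plot.barh(figsize=(20, num_users_to_plot/2.), fontsize=30, color="c");
plt.title("tweets per user", fontsize=30);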
This shows that a few users are responsible for hundreds of tweets in our data set - this can be a problem, as we saw in twit.word_freq_df that many of our overly-frequent terms only had several hundred examples in the data set. To see an example of the problem, we look at all the tweets by the most prolific charisma tweeter, user pcyhrr:
In [36]:
twit.tweets_by("pcyhrr").head(12)
Out[36]:
It looks like this user is repeatedly tweeting at or about charisma_hanbin: we definitely want to drop his tweets.
As another example we look at the second most prolific charisma tweeter, av_momo:
In [37]:
twit.tweets_by("av_momo").head(12)
Out[37]:
This is again repetitive and uninteresting data that we wish to drop.
To facilitate dropping users who tweet repeatedly we can use the function drop_by_username_with_n_tweets, which drops all tweets by users that have more than n tweets in our dataset. To be sure we filter out all offenders, here we will use n=1:
In [38]:
twit.drop_by_username_with_n_tweets(1)
In [39]:
# now recompute word bag and word frequency object and look at results
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()
In [40]:
twit.create_word_freq_df(400)
In [41]:
twit.word_freq_df.sort_values("log relative frequency", ascending = True, inplace = True)
In [42]:
num_words_to_plot = 32
twit.word_freq_df.set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
We see from this example that not removing words that are rare in the background can cause problems: the word "lakas" appears only 19 times in the background, and because of this it is now the number one word in the frequency dataframe. We see a similar pattern with a number of other top words.
In [43]:
twit.word_freq_df.sort_values("log relative frequency", ascending=False).head(10)
Out[43]:
We see that the highest log relative frequency words often have background occurrences smaller than 100 - this can pick up rarer words that we are not interested in. (These background occurrences were calculated from a sample of ~72 million words from Twitter.)
To correct for this we can remove words from consideration that don't appear often enough in background.
We'll do this by first creating a word_freq_df with many top words (10,000 here) and then filtering out words that don't have a frequent enough background rate before plotting.
In [44]:
twit.create_word_freq_df(10000)
Filter out words that occur less than 100 times in background:
In [45]:
twit.word_freq_df[twit.word_freq_df['background occurrences']>100].sort_values("log relative frequency", ascending=False).head(10)
Out[45]:
In [46]:
num_words_to_plot = 32
background_cutoff = 100
twit.word_freq_df[twit.word_freq_df['background occurrences']>background_cutoff].sort_values("log relative frequency", ascending=True).set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
In [47]:
num_words_to_plot = 32
background_cutoff = 500
twit.word_freq_df[twit.word_freq_df['background occurrences']>background_cutoff].sort_values("log relative frequency", ascending=True).set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
In [48]:
num_words_to_plot = 32
background_cutoff = 2000
twit.word_freq_df[twit.word_freq_df['background occurrences']>background_cutoff].sort_values("log relative frequency", ascending=True).set_index("word")["log relative frequency"][-num_words_to_plot:].plot.barh(figsize=(20,
num_words_to_plot/2.), fontsize=30, color="c");
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);