Exploring tweets containing the word "brexit" in the days just before and just after the Brexit vote

June 21 - June 30, 2016

This data set is quite small at only ~10,000 tweets, so it is much harder to find semantically interesting results than with the charisma dataset. The results may still be useful for people interested specifically in Brexit.


In [1]:
import sys
sys.path.append('..')

from twords.twords import Twords 
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
# make pandas display full column contents so entire tweets are visible
# (on pandas >= 1.0, use None instead of -1)
pd.set_option('display.max_colwidth', -1)

In [2]:
twit = Twords()
twit.data_path = "../data/java_collector/brexit"
twit.background_path = '../jar_files_and_background/freq_table_72319443_total_words_twitter_corpus.csv'
twit.create_Background_dict()
twit.set_Search_terms(["brexit"])
twit.create_Stop_words()
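
The background file is a word-frequency table built from a large Twitter corpus; the filename suggests roughly 72.3 million total words. All "relative frequency" numbers later in the notebook are computed against it. To sanity-check it directly, you can read the raw CSV with pandas (the column layout is an assumption here; inspect the header to confirm, and pass header=None if there isn't one):

# Peek at the raw background frequency table. The column layout is an
# assumption; inspect the file's header to confirm.
background = pd.read_csv(twit.background_path)
background.head()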

In [3]:
twit.get_java_tweets_from_csv_list()

In [4]:
# find how many tweets we have in original dataset
print "Total number of tweets:", len(twit.tweets_df)


Total number of tweets: 9970

Standard cleaning


In [5]:
twit.keep_column_of_original_tweets()

In [6]:
twit.lower_tweets()

In [7]:
twit.keep_only_unicode_tweet_text()

In [8]:
twit.remove_urls_from_tweets()


Removing urls from tweets...
This may take a minute - cleaning rate is about 400,000 tweets per minute
Time to complete: 0.052 minutes
Tweets cleaned per minute: 192814.7
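
The URL stripping itself happens inside Twords' remove_urls_from_tweets(). For readers curious what such a pass typically looks like, here is a minimal regex sketch (illustrative only, not the library's actual implementation):

import re

# Illustrative only: Twords' remove_urls_from_tweets() is what runs above.
url_pattern = re.compile(r'http\S+|www\.\S+')

def strip_urls(text):
    return url_pattern.sub('', text).strip()

strip_urls("the brexit result is in http://t.co/abc123")  # -> 'the brexit result is in'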

In [9]:
twit.remove_punctuation_from_tweets()

In [10]:
twit.drop_non_ascii_characters_from_tweets()

In [11]:
twit.drop_duplicate_tweets()

In [12]:
twit.drop_by_search_in_name()

In [13]:
twit.convert_tweet_dates_to_standard()

In [14]:
twit.sort_tweets_by_date()

In [15]:
len(twit.tweets_df)


Out[15]:
8837

In [16]:
twit.keep_tweets_with_terms("brexit")

In [17]:
len(twit.tweets_df)


Out[17]:
8100

Create word_freq_df


In [18]:
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()
twit.create_word_freq_df(1000)


Time to make words_string:  0.001 minutes
Time to tokenize:  0.019 minutes
Time to compute word bag:  0.017 minutes
Creating word_freq_df...
Takes about 1 minute per 1000 words
Time to create word_freq_df:  0.6446 minutes

In [19]:
twit.word_freq_df.sort_values("log relative frequency", ascending=False, inplace=True)
twit.word_freq_df.head(20)


Out[19]:
     word          occurrences  frequency  relative frequency  log relative frequency  background occurrences
98   unido         66           0.000846   6800.222878         8.824711                9
101  reino         65           0.000833   6697.189198         8.809443                9
106  tras          63           0.000808   3651.256034         8.202827                16
61   postbrexit    86           0.001103   3322.836179         8.108574                24
283  brexits       34           0.000436   2252.021862         7.719584                14
618  jatuh         17           0.000218   1970.519129         7.586052                8
296  europea       33           0.000423   1912.562684         7.556199                16
769  entender      14           0.000180   1622.780459         7.391896                8
370  inggris       27           0.000346   1564.824014         7.355529                16
140  uks           53           0.000680   1404.201867         7.247224                35
644  salida        17           0.000218   1313.679420         7.180587                12
441  europes       24           0.000308   1236.404160         7.119963                18
822  futuro        13           0.000167   1205.494056         7.094645                10
994  finanzas      11           0.000141   1133.370480         7.032951                9
163  britains      48           0.000615   1085.623164         6.989909                41
972  votos         11           0.000141   850.027860          6.745269                12
823  setelah       13           0.000167   709.114150          6.564017                17
544  emas          19           0.000244   704.750371          6.557844                25
150  qu            52           0.000667   660.544688          6.493065                73
250  nigelfarage   36           0.000462   629.866270          6.445508                53
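
The column arithmetic is easy to reconstruct from the table itself: relative frequency appears to be the word's frequency in the brexit tweets divided by its frequency in the background corpus, and the log is the natural log (this matches the "unido" row to within rounding; the exact definition lives inside Twords). A minimal sketch, with the tweet-corpus word total backed out of the frequency column:

import math

# Reconstruct the "unido" row by hand to make the columns concrete.
# bg_total comes from the background filename above (72,319,443 words);
# corpus_total is an assumption backed out of the frequency column.
occurrences = 66            # count of "unido" in the brexit tweets
corpus_total = 78000        # approximate total words in the tweet word bag
bg_occurrences = 9          # count of "unido" in the background corpus
bg_total = 72319443

frequency = occurrences / float(corpus_total)                        # ~0.000846
relative_frequency = frequency / (bg_occurrences / float(bg_total))  # ~6800
log_relative_frequency = math.log(relative_frequency)                # ~8.82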

Plot results with varying background cutoffs

At least 100 background occurrences:


In [20]:
num_words_to_plot = 32
background_cutoff = 100
(twit.word_freq_df[twit.word_freq_df['background occurrences'] > background_cutoff]
     .sort_values("log relative frequency", ascending=True)
     .set_index("word")["log relative frequency"][-num_words_to_plot:]
     .plot.barh(figsize=(20, num_words_to_plot/2.), fontsize=30, color="c"));
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);
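
The same plotting cell is repeated below with different cutoffs; a small helper keeps them in sync (the helper name and signature are mine, not part of Twords):

def plot_top_words(twit, background_cutoff, num_words_to_plot=32):
    # Bar-plot the highest log-relative-frequency words whose background
    # occurrences exceed background_cutoff.
    df = twit.word_freq_df
    top = (df[df['background occurrences'] > background_cutoff]
           .sort_values("log relative frequency", ascending=True)
           .set_index("word")["log relative frequency"][-num_words_to_plot:])
    top.plot.barh(figsize=(20, num_words_to_plot/2.), fontsize=30, color="c")
    plt.title("log relative frequency", fontsize=30)
    ax = plt.axes()
    ax.xaxis.grid(linewidth=4)

# usage: plot_top_words(twit, background_cutoff=500)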


At least 500 background occurrences:


In [21]:
num_words_to_plot = 32
background_cutoff = 500
(twit.word_freq_df[twit.word_freq_df['background occurrences'] > background_cutoff]
     .sort_values("log relative frequency", ascending=True)
     .set_index("word")["log relative frequency"][-num_words_to_plot:]
     .plot.barh(figsize=(20, num_words_to_plot/2.), fontsize=30, color="c"));
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);


At least 2000 background occurrences:


In [22]:
num_words_to_plot = 32
background_cutoff = 2000
(twit.word_freq_df[twit.word_freq_df['background occurrences'] > background_cutoff]
     .sort_values("log relative frequency", ascending=True)
     .set_index("word")["log relative frequency"][-num_words_to_plot:]
     .plot.barh(figsize=(20, num_words_to_plot/2.), fontsize=30, color="c"));
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);


At least 10,000 background occurrences:


In [23]:
num_words_to_plot = 32
background_cutoff = 10000
(twit.word_freq_df[twit.word_freq_df['background occurrences'] > background_cutoff]
     .sort_values("log relative frequency", ascending=True)
     .set_index("word")["log relative frequency"][-num_words_to_plot:]
     .plot.barh(figsize=(20, num_words_to_plot/2.), fontsize=30, color="c"));
plt.title("log relative frequency", fontsize=30);
ax = plt.axes();
ax.xaxis.grid(linewidth=4);


