Tweet Count Analysis

This script explores the quantity of tweets we have over time: how many do we have in total? Are there significant periods containing few or even zero tweets? Are there duplicate tweets in the dataset?

Setup

Import the preprocessing functions (in a slightly funky way, because Jupyter won't let me import directly from this notebook's parent directory), and load the tweets. We also have to tell the SparkContext that the compute nodes will need to load the preprocessing module before running any jobs.


In [1]:
import os
import sys

# From https://stackoverflow.com/a/36218558 .
def sparkImport(module_name, module_directory):
    """
    Convenience function. 
    
    Tells the SparkContext sc (must already exist) to load
    module module_name on every computational node before
    executing an RDD. 
    
    Args:
        module_name: the name of the module, without ".py". 
        module_directory: the path, absolute or relative, to
                          the directory containing module
                          module_name.
    
    Returns: none. 
    """
    module_path = os.path.abspath(
        module_directory + "/" + module_name + ".py")
    sc.addPyFile(module_path)

# Add all scripts from repository to local path. 
# From https://stackoverflow.com/a/35273613 .
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

import preprocessing

sparkImport("preprocessing", "..")

tweets = sc.textFile("tweets.csv") \
    .filter(preprocessing.format_is_correct) \
    .map(preprocessing.split_record)

Count 'Em

How many tweets total? How many duplicate tweets?


In [2]:
initial_count = tweets.count()
print("Total number of tweets:    " + str(initial_count))


Total number of tweets:    30950907

In [37]:
tweet_ids = tweets \
    .map(lambda record: record[preprocessing.field_index['id']]) \
    .distinct()
final_count = tweet_ids.count()

print("Number of duplicates:      " + str(initial_count - final_count))
print("Number of distinct tweets: " + str(final_count))


Number of duplicates:      217728
Number of distinct tweets: 30733179

Less than $1\%$ of our tweets are duplicates, so we have roughly as many distinct tweets as we expected.
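If we later want to work with a deduplicated RDD, rather than just count the distinct ids, a minimal sketch is below; it keeps one arbitrary record per id, and the name distinct_tweets is introduced here for illustration only.


In [ ]:
# Sketch: keep one arbitrary record per tweet id.
distinct_tweets = tweets \
    .map(lambda record: (record[preprocessing.field_index['id']], record)) \
    .reduceByKey(lambda first, second: first) \
    .values()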

Plot Counts Over Time

Look at distribution over time: per day, per week, and per month. Look for any large gaps, or frequent small gaps, that might limit how we use the data.

Tweets per Week

First, let's count how many tweets fall in each week, where a week runs from Monday through Sunday. Note that this is independent of month and year, which makes the weeks easier to both count and plot. (In the code below, the .use('Agg') call tells matplotlib to render figures with the Agg backend; the default backend needs a display, which NYU HPC does not have installed/loaded.)


In [1]:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as pyplot

def get_week(unix_timestamp):
    # Add 3 to the day index, because Unix timestamp 0 falls on a Thursday;
    # this puts week boundaries on Mondays. Floor division keeps the week
    # index an integer under Python 2 and 3 alike.
    return (int(unix_timestamp) // secondsPerDay + 3) // 7

secondsPerDay = 24*60*60

weekly_tweet_counts = tweets \
    .map(
        lambda record:
            (get_week(record[preprocessing.field_index['timestamp']]), 1)) \
    .countByKey()

Now we have the tweet counts as a dictionary of (week_index, count) pairs. Before we go further, we should fill in any missing weeks with a 0 value.


In [39]:
for week_index in range(min(weekly_tweet_counts.keys()), max(weekly_tweet_counts.keys())):
    if week_index not in weekly_tweet_counts.keys():
        weekly_tweet_counts[week_index] = 0

After we've filled in missing weeks with a 0 value, we sort the pairs, then repackage them as a tuple of week indexes and a tuple of counts. Then we can pass these week indexes and counts to a bar plot function as x- and y-values, respectively.


In [40]:
weekly_tweet_counts_list = sorted(weekly_tweet_counts.items())
week_indexes, week_counts = zip(*weekly_tweet_counts_list)

currentFigure = pyplot.figure()
pyplot.figure(currentFigure.number)
pyplot.bar(week_indexes, week_counts, width=1.0)
pyplot.title('Tweet Count per Week')
pyplot.xlabel('Week Index')
pyplot.ylabel('Tweet Count')
pyplot.xlim([min(week_indexes), max(week_indexes)])
pyplot.ylim([0, max(week_counts)])
pyplot.savefig("tweet_count_per_week.png")

Unfortunately, we can't automatically display the figure in a Jupyter notebook on NYU's HPC server. So, we saved it to a file, and now we can display it below:
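One way to embed the saved image from a code cell (assuming the PNG sits in the notebook's working directory) is IPython's display utilities:


In [ ]:
# Sketch: embed the saved plot from disk.
from IPython.display import Image
Image(filename="tweet_count_per_week.png")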

What do the Missing Tweets Mean?

We're using LDA to generate topic models over the previous 31 days, so if we have no tweets, we have no topic model. We can't do anything about periods with no tweets at all, but we can be careful about periods with too few: too few tweets gives a bad topic model, a bad topic model gives a bad prediction, so we shouldn't predict on dates whose preceding 31 days contain too few tweets. How few is "too few"? It isn't quite clear yet; eventually, we will have to choose some method for determining this (there are several existing techniques) and run with it.

For now, however, we note that the quality of the LDA model should not vary significantly between any two predictions. One way to enforce some minimum quality is to choose a simple cutoff value $c_{\text{min}}$ for weekly tweet counts. If a week's tweet count falls below $c_{\text{min}}$, we shouldn't try to train an LDA model on any 31 days overlapping that week. How do we choose the value of $c_{\text{min}}$, though?

We will choose the cutoff value based on a histogram of the tweet counts. We should expect a top-heavy distribution of tweet counts (lots of high-range values), with a large gap in the middle (almost no middle values), followed by several smaller frequency bumps of low tweet counts (a few low values). Our cutoff value can be in the upper end of the large gap in the middle.


In [41]:
sorted_week_counts = sorted(week_counts)

currentFigure = pyplot.figure()
pyplot.figure(currentFigure.number)
pyplot.hist(sorted_week_counts, 40)
pyplot.title("Distribution of Weekly Tweet Counts")
pyplot.xlabel("Weekly Tweet Count")
pyplot.ylabel("Frequency")
pyplot.savefig("distribution_of_weekly_counts.png")

Ignoring the weeks containing 0 tweets, the distribution looks roughly like two overlapping normal curves: one centered at approximately $325000$ and another centered at approximately $220000$. As a rough, temporary guess, we can set the cutoff somewhere between $100000$ and $150000$, in the lower tail of the curve with the smaller mean; we will choose $c_{\text{min}} = 150000$, just to err on the side of having more tweets for our LDA model. So the rule is: given a day on which to make a prediction, we only make it if the past 31 days do not overlap any week containing fewer than $c_{\text{min}} = 150000$ tweets.
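As a quick check on how restrictive this cutoff is, we can count the weeks that fall below it; a minimal sketch using the week_indexes and week_counts computed above:


In [ ]:
# Sketch: how many weeks fall below the cutoff?
c_min = 150000
weeks_below_cutoff = [week for week, count in zip(week_indexes, week_counts)
                      if count < c_min]
print("Weeks below cutoff: " + str(len(weeks_below_cutoff)))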

However, this per-week overlap rule is cumbersome to program, and I am a bit lazy. Given that one month is approximately four weeks, we can use a simpler rule instead: only make a prediction if the last 31 days contain at least $4c_{\text{min}} = 600000$ tweets.

Now we have to make sure that we actually have days on which our algorithm is allowed to make predictions. Let's count the number of days that satisfy the rule we've come up with.


In [4]:
c_min = 150000

def get_day(unix_timestamp):
    # Floor division keeps the day index an integer under Python 2 and 3.
    return int(unix_timestamp) // (24*60*60)

tweets_per_day = tweets \
    .map(lambda record:
         (get_day(record[preprocessing.field_index['timestamp']]), 1)) \
    .countByKey()

for day in range(min(tweets_per_day.keys()), max(tweets_per_day.keys())):
    if day not in tweets_per_day.keys():
        tweets_per_day[day] = 0

num_valid_days = 0
for day in range(min(tweets_per_day.keys()), max(tweets_per_day.keys())):
    # Check whether the 31 days preceding this day contain enough tweets.
    window_days = range(day - 31, day)
    window_counts = [tweets_per_day.get(past_day, 0)
                     for past_day in window_days]
    if sum(window_counts) > 4*c_min:
        num_valid_days = num_valid_days + 1

print("Number of days satisfying our rule: " + str(num_valid_days))


Number of days satisfying our rule: 824

Given that we intended to have approximately three years ($= 1095 \text{ days}$) of tweets, this rule only reduces our valid prediction days by roughly $25\%$.

Tweets per Day

When training the LDA model, we group the tweets by geographic location. With this model, we'll compute the topic vector for each document---which corresponds to having a topic vector for each location ("location" here being a square on a grid, or a precinct, or some other continuous region of land; we'll decide exactly what that is at a later time).
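To make that grouping concrete, here is a minimal sketch that builds one document per location using a simple latitude/longitude grid. The field names 'latitude', 'longitude', and 'text', and the grid size, are placeholders; the real field names and the definition of "location" are still to be decided.


In [ ]:
# Sketch only: one document per grid cell. Field names and grid size
# are placeholders, not final choices.
grid_size = 0.01

def get_cell(record):
    lat = float(record[preprocessing.field_index['latitude']])
    lon = float(record[preprocessing.field_index['longitude']])
    return (int(lat / grid_size), int(lon / grid_size))

location_documents = tweets \
    .map(lambda record:
         (get_cell(record), record[preprocessing.field_index['text']])) \
    .reduceByKey(lambda text_a, text_b: text_a + " " + text_b)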

However, something we'd like to know is this: at a given location, how are the topics changing over time? To measure this, we'd like to first select all the tweets in a given location. Then, we'll group these tweets into documents by day, and compute a topic vector for each of these documents using our already-trained LDA model. This presents a problem similar to the one discussed in the previous section: how many tweets are required for the LDA model to produce a "good" topic vector for the document? (The previous section was concerned primarily with how many tweets are required for training the LDA model.)
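A sketch of that second step for a single location: group its tweets into one document per day, then ask the trained model for each document's topic vector. The cell_of_interest value, the lda_model object, and the topic_vector_for_document helper are placeholders here; the actual interface will depend on the LDA library we settle on.


In [ ]:
# Sketch only: per-day documents for one location, then topic vectors.
# cell_of_interest, lda_model, and topic_vector_for_document are placeholders.
cell_of_interest = (4071, -7400)

daily_documents = tweets \
    .filter(lambda record: get_cell(record) == cell_of_interest) \
    .map(lambda record:
         (get_day(record[preprocessing.field_index['timestamp']]),
          record[preprocessing.field_index['text']])) \
    .reduceByKey(lambda text_a, text_b: text_a + " " + text_b)

daily_topic_vectors = daily_documents \
    .mapValues(lambda document: topic_vector_for_document(lda_model, document))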

Similarly to the previous section, we will plot the number of tweets on each day, as well as the distribution of tweets per day.