Descriptive Statistics

For the descriptive statistics part of this assignment, I considered the subset corresponding to 'concernedstudent1950', i.e. job_id=4255. To retrieve the data of interest, the 'tweet' table is queried with the correct job_id and with iso_language set to 'en', restricting the results to tweets in English.


In [1]:
import MySQLdb as mdb
import sys
import time
import csv
import numpy as np
import pandas as pd
%matplotlib inline

try:
    con = mdb.connect('128.206.116.195', 'tg4_ro', '?3stEt7!3hUbRa-R', 'tw4_db')
except mdb.Error:
    try:
        # Fall back to the public mirror if the primary host is unreachable.
        # Note mdb.connect() raises mdb.Error on failure rather than
        # returning a falsy value, so the checks must be try/except.
        con = mdb.connect('opendata.missouri.edu', 'datascience', 'datascience', 'datascience')
    except mdb.Error:
        print('Could not connect to the database. Terminating.')
        sys.exit()
cur = con.cursor()

How many tweets are in the collection?

The total number of tweets found was 232837, where 'job_id' is 4255 and the tweets are in English.


In [2]:
# ORDER BY is pointless on a single aggregate row, so it is omitted here
query = "SELECT count(*) FROM tweet WHERE job_id=4255 AND iso_language='en'"
start = time.time()
cur.execute(query)
final = time.time() - start
print("Query execution time: %i seconds" % final)
result = cur.fetchone()
count = result[0]
print("Tweets in collection: %i" % count)


Query execution time: 461 seconds
Tweets in collection: 232837

When do they start?

A simple way to find the first tweet with job_id=4255 is to query the creation date with the SQL min() function, which returns the earliest created_at among English-language tweets on the 'concernedstudent1950' subject. However, this query still has to scan the whole table to determine the minimum, so it completes in ~500 seconds (~8 min).


In [11]:
# min(created_at) paired with a bare tweet_id_str column is indeterminate in
# MySQL; ordering ascending and taking the first row returns a consistent pair.
query = "SELECT created_at,tweet_id_str FROM tweet WHERE job_id=4255 AND iso_language='en' ORDER BY created_at ASC LIMIT 1"
start = time.time()
cur.execute(query)
final = time.time() - start
result = cur.fetchone()
print("Query execution time: %i seconds" % final)
print("Time of first tweet: %s" % result[0])
print("id of first tweet: %s" % result[1])


Query execution time: 462 seconds
Time of first tweet: 2015-10-28 19:55:28
id of first tweet: 659458481099964416

After determining the first tweet for job_id=4255, we can use its 'tweet_id_str' to determine its row position, which helps all future queries. The SQL OFFSET parameter allows us to skip rows and query only the rows needed; in this case, we can skip all rows preceding tweet id '659458481099964416'. The number of rows before the first occurrence of our job_id of interest can be found with the following query. The resulting count is 134342313, which tells us that the first tweet with job_id=4255 is in row 134342314, i.e. that all of these tweets appear in the second half of the overall dataset. Running time is approximately 3 minutes.


In [12]:
query = "SELECT count(created_at) FROM tweet WHERE tweet_id_str < 659458481099964416"
start = time.time()
cur.execute(query)
final = time.time() - start
result = cur.fetchone()
print("Query execution time: %i seconds" % final)
print("offset = %i" % result[0])


Query execution time: 217 seconds
offset = 134342313
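As a sketch of how this offset could be reused, later queries can page through the subset instead of re-scanning the whole table. The helper below is hypothetical (it is not part of the notebook's cells, and the page size is an arbitrary choice); note also that MySQL only guarantees a stable OFFSET with an explicit ORDER BY, and that skipped rows are still read server-side.

```python
# Hypothetical pagination helper built on the offset found above.
OFFSET = 134342313   # rows preceding the first job_id=4255 tweet
PAGE_SIZE = 10000    # arbitrary page size

def paged_query(page):
    """Return the SQL for the given zero-based page of the subset."""
    skip = OFFSET + page * PAGE_SIZE
    return ("SELECT tweet_id_str, created_at FROM tweet "
            "ORDER BY tweet_id_str LIMIT %d OFFSET %d" % (PAGE_SIZE, skip))

print(paged_query(0))
```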

When do they end?

The last tweet in this set can be found with the 'DESC' keyword and 'LIMIT 1', which lets us search the table starting from the most recent rows for fast retrieval (< 2 seconds). Executing the next cell shows that the last tweet of interest occurred on April 21, 2016 at 09:18:37.


In [6]:
query = "SELECT created_at,tweet_id_str FROM tweet WHERE job_id=4255 AND iso_language='en' ORDER BY created_at DESC LIMIT 1"
start = time.time()
cur.execute(query)
final = time.time() - start
print("Query execution time: %i seconds" % final)
result = cur.fetchone()
print("Time of last tweet in dataset: %s" % result[0])
print("id of last tweet: %s" % result[1])


Query execution time: 2 seconds
Time of last tweet in dataset: 2016-04-21 09:18:37
id of last tweet: 723078475318353920

Retrieving the entire subset table

At this point, it became evident that it was necessary to retrieve the whole subset, store it on disk, and use Python for any further data carpentry. The resulting file is 38 MB and can be found in 'data/4255.csv'. The CSV file is created with the following cell; running time is approximately 13 minutes.


In [ ]:
variables = "tweet_id_str,created_at,REPLACE(text,'\\n',' ')"  # strip embedded newlines so each tweet stays on one CSV row
conditions = "WHERE job_id=4255 AND iso_language='en' ORDER BY created_at DESC"
query = "SELECT %s FROM tweet %s" % (variables,conditions)
start = time.time()
cur.execute(query)
final = time.time() - start
result = cur.fetchall()
print("Query execution time: %i seconds" % final)

with open('data/4255.csv','w') as out:  # same path read back in the next cell
    csv_out = csv.writer(out)
    for i in result:
        csv_out.writerow(i)

The next step is to sample the dataset and plot tweet volume over time.


In [3]:
table = pd.read_csv('data/4255.csv', header=None)
df = pd.DataFrame({'tid': table[0], 'time': table[1], 'text': table[2]})
sample = df.sample(frac=0.5, random_state=12345, weights=None, replace=False)
# Count tweets per day and plot in chronological order; plotting raw
# value_counts() would order the x-axis by frequency, not by time.
t = pd.to_datetime(sample['time'])
counts = t.dt.date.value_counts().sort_index()
counts.plot()


Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4dd26d6828>

What is the trend for tweet volume?

The graph above shows that tweet volume in the sample drops sharply after the first months of the 'concernedstudent1950' events, and remains low through the last date in the collection.
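To quantify the trend rather than eyeball it, the per-tweet timestamps can be resampled to monthly totals. A minimal sketch on synthetic timestamps (stand-ins for the real 'time' column, with heavy volume in November 2015 and a trickle afterwards):

```python
import pandas as pd

# Synthetic stand-in for the real 'time' column.
times = pd.to_datetime(
    ['2015-11-09 10:00'] * 5 + ['2015-11-10 12:00'] * 3 +
    ['2016-01-15 09:00', '2016-03-02 14:00', '2016-04-21 09:18']
)
# One row per tweet, summed into calendar-month buckets
monthly = pd.Series(1, index=times).resample('M').sum()
print(monthly)
```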

What hashtags show up as most prominent in each month of the lifecycle?
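One way to approach this question (a sketch on made-up tweets, not the real data) is to extract hashtags with a regular expression, explode the lists into one row per hashtag, and take the top count within each calendar month:

```python
import pandas as pd

# Made-up tweets standing in for the real 'text' column.
df = pd.DataFrame({
    'time': pd.to_datetime(['2015-11-09', '2015-11-09', '2015-12-01']),
    'text': ['Solidarity #ConcernedStudent1950 #Mizzou',
             'March today #ConcernedStudent1950',
             'Update #Mizzou'],
})
df['tags'] = df['text'].str.findall(r'#\w+')       # list of hashtags per tweet
exploded = df.explode('tags').dropna(subset=['tags'])  # one row per hashtag
top = exploded.groupby(exploded['time'].dt.to_period('M'))['tags'] \
              .apply(lambda s: s.value_counts().head(1))
print(top)
```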

Which twitter users are the most mentioned?

How frequently is each user mentioned during each month of the lifecycle?
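The same pattern would work for both mention questions (again a sketch on invented tweets): extract @-handles with a regex, then count them overall for the most-mentioned users, or group by month for the per-month frequencies:

```python
import pandas as pd

# Invented tweets standing in for the real 'text' column.
df = pd.DataFrame({
    'time': pd.to_datetime(['2015-11-09', '2015-11-10', '2015-12-01']),
    'text': ['.@deray is on the quad', 'RT @deray: update', 'thread by @AP'],
})
df['mentions'] = df['text'].str.findall(r'@\w+')
mentions = df.explode('mentions').dropna(subset=['mentions'])
# Overall ranking of mentioned users
print(mentions['mentions'].value_counts())
# Frequency of each user per month
print(mentions.groupby([mentions['time'].dt.to_period('M'), 'mentions']).size())
```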

What is the relationship between the volume of tweets you selected and the volume of tweets for other collections in the data set?
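One way this could be answered (an untested sketch, assuming the same 'tweet' table) is a single grouped count over all collections, then computing each job's share of the total. The query below was not run against the database; the 4255 count is the real one from above, while the other job_ids and counts are invented placeholders:

```python
import pandas as pd

# Grouped count over every collection in one round trip:
query = "SELECT job_id, count(*) AS n FROM tweet GROUP BY job_id"
# cur.execute(query); rows = cur.fetchall()  # requires the live connection

# Illustrative stand-in for the result set (only the 4255 count is real):
rows = [(4255, 232837), (9999, 1500000), (1234, 450000)]
volumes = pd.DataFrame(rows, columns=['job_id', 'n']).set_index('job_id')
share = volumes['n'] / volumes['n'].sum()   # each collection's fraction
print(share.loc[4255])
```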