For the descriptive statistics part of this assignment, I considered the subset corresponding to 'concernedstudent1950', i.e. job_id=4255. To retrieve the data of interest, it was necessary to query the 'tweet' table with the correct job_id and with iso_language set to 'en', restricting the results to tweets in English.
In [1]:
import MySQLdb as mdb
import sys
import time
import csv
import numpy as np
import pandas as pd
%matplotlib inline

# mdb.connect() raises an exception on failure rather than returning None,
# so fall back to the secondary server inside an except block.
try:
    con = mdb.connect('128.206.116.195', 'tg4_ro', '?3stEt7!3hUbRa-R', 'tw4_db')
except mdb.Error:
    try:
        con = mdb.connect('opendata.missouri.edu', 'datascience', 'datascience', 'datascience')
    except mdb.Error:
        print('Could not connect to the database. Terminating.')
        sys.exit()
cur = con.cursor()
In [2]:
# ORDER BY is redundant for a single-row aggregate, and iso_language='en'
# (equality) is equivalent to LIKE 'en' with no wildcards
query = "SELECT count(*) FROM tweet WHERE job_id=4255 AND iso_language='en'"
start = time.time()
cur.execute(query)
final = time.time() - start
print("Query execution time: %i seconds" % final)
result = cur.fetchone()
count = result[0]
print("Tweets in collection: %i" % count)
A simple way to find the first tweet with job_id=4255 is to query the tweet creation date with the SQL min() function. This returns the earliest created_at value among the English tweets on the 'concernedstudent1950' subject. However, this query still iterates over the whole table to determine the minimum, so it completes in ~500 seconds (~8 min).
In [11]:
# Note: pairing min() with the non-aggregated column tweet_id_str is only
# accepted by MySQL when ONLY_FULL_GROUP_BY is disabled, and the returned
# tweet_id_str is not guaranteed to belong to the earliest row
query = "SELECT min(created_at),tweet_id_str FROM tweet WHERE job_id=4255 AND iso_language='en'"
start = time.time()
cur.execute(query)
final = time.time() - start
result = cur.fetchone()
print("Query execution time: %i seconds" % final)
print("Time of first tweet: %s" % result[0])
print("id of first tweet: %s" % result[1])
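Since min() combined with a bare column does not tie the two result values to the same row, an unambiguous formulation orders by created_at and takes the first row. This is only a sketch of the alternative query string (not run against the database here):

```python
# Hypothetical alternative to the min() query above: ORDER BY + LIMIT 1
# guarantees that created_at and tweet_id_str come from the same row.
conditions = "job_id=4255 AND iso_language='en'"
alt_query = (
    "SELECT created_at, tweet_id_str FROM tweet "
    "WHERE %s ORDER BY created_at ASC LIMIT 1" % conditions
)
print(alt_query)
```

Without an index on created_at this still scans the whole subset, so the running time should be comparable to the min() version.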
After determining the first tweet for job_id=4255, we can use its 'tweet_id_str' to determine its row position, which speeds up all subsequent queries. The SQL 'OFFSET' parameter allows us to skip rows and query only the rows needed; in this case, we can skip all rows preceding the tweet id '659458481099964416'. The number of rows before the first occurrence of our job_id of interest can be found with the following query. The resulting count is 134342313, which tells us that the first tweet with job_id=4255 appears in row 134342314, i.e. that all of these tweets lie in the second half of the overall dataset. Running time is approximately 3 minutes.
In [12]:
# Comparing the string column tweet_id_str to a numeric literal makes MySQL
# cast each value to a number, which is correct for ids of varying length
# but prevents the use of any index on the column
query = "SELECT count(created_at) FROM tweet WHERE tweet_id_str < 659458481099964416"
start = time.time()
cur.execute(query)
final = time.time() - start
result = cur.fetchone()
print("Query execution time: %i seconds" % final)
print("offset = %i" % result[0])
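With the offset known, subsequent scans can skip the leading rows using LIMIT ... OFFSET. Below is a minimal sketch of the batching arithmetic, with illustrative numbers rather than the real row counts; the offset_queries helper is hypothetical and only generates the SQL strings:

```python
def offset_queries(offset, batch_size, total):
    """Yield LIMIT/OFFSET queries covering rows [offset, total)."""
    while offset < total:
        limit = min(batch_size, total - offset)
        yield ("SELECT tweet_id_str, created_at FROM tweet "
               "ORDER BY tweet_id_str LIMIT %d OFFSET %d" % (limit, offset))
        offset += limit

# Illustrative: 80 rows to cover in batches of 30 -> 3 queries
queries = list(offset_queries(100, 30, 180))
for q in queries:
    print(q)
```

A stable ORDER BY is required for OFFSET to skip a consistent set of rows across batches; without it, MySQL is free to return rows in any order.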
In [6]:
query = "SELECT created_at,tweet_id_str FROM tweet WHERE job_id=4255 AND iso_language='en' ORDER BY created_at DESC LIMIT 1"
start = time.time()
cur.execute(query)
final = time.time() - start
print("Query execution time: %i seconds" % final)
result = cur.fetchone()
print("Time of last tweet in dataset: %s" % result[0])
print("id of last tweet: %s" % result[1])
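Given the first and last created_at values printed above, the time span covered by the collection follows directly from datetime arithmetic. The timestamps below are hypothetical stand-ins, not the actual query results:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"  # layout of MySQL DATETIME values like created_at
first = datetime.strptime("2015-10-28 16:01:20", fmt)  # hypothetical first tweet
last = datetime.strptime("2015-11-20 23:59:59", fmt)   # hypothetical last tweet
span = last - first
print("Collection spans %d days" % span.days)
```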
At this point, it became evident that it was necessary to retrieve the whole table, store it on disk, and use Python for any further data carpentry. The resulting file is 38 MB and can be found in 'data/4255.csv'. The CSV file is created by the following cell; running time is approximately 13 minutes.
In [ ]:
# '\\n' reaches MySQL as the two characters \n, which its string parser
# interprets as a newline, so embedded line breaks are replaced by spaces
variables = "tweet_id_str,created_at,REPLACE(text,'\\n',' ')"
conditions = "WHERE job_id=4255 AND iso_language='en' ORDER BY created_at DESC"
query = "SELECT %s FROM tweet %s" % (variables, conditions)
start = time.time()
cur.execute(query)
final = time.time() - start
result = cur.fetchall()
print("Query execution time: %i seconds" % final)
# newline='' prevents the csv module from inserting blank rows on Windows
with open('data/4255.csv', 'w', newline='') as out:
    csv_out = csv.writer(out)
    for row in result:
        csv_out.writerow(row)
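fetchall() holds the entire 38 MB result set in memory at once; for larger jobs, cur.fetchmany() can stream the rows in chunks instead. A sketch of that loop follows, demonstrated with a stand-in cursor class so it runs without the database (FakeCursor and dump_in_chunks are hypothetical names):

```python
import csv

class FakeCursor:
    """Stand-in for a DB-API cursor, used only to exercise the loop."""
    def __init__(self, rows):
        self._rows = list(rows)
    def fetchmany(self, size):
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch

def dump_in_chunks(cursor, path, chunk_size=10000):
    """Write rows to a CSV file chunk by chunk; return the row count."""
    written = 0
    with open(path, 'w', newline='') as out:
        writer = csv.writer(out)
        while True:
            batch = cursor.fetchmany(chunk_size)
            if not batch:  # an empty batch signals the end of the result set
                break
            writer.writerows(batch)
            written += len(batch)
    return written

n = dump_in_chunks(FakeCursor([(i, 'text') for i in range(25)]), 'demo.csv', chunk_size=10)
print("rows written: %d" % n)
```

The same dump_in_chunks call would work unchanged with the real MySQLdb cursor, since fetchmany() is part of the DB-API.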
The next step was sampling the dataset.
In [3]:
table = pd.read_csv('data/4255.csv', header=None)
df = pd.DataFrame({'tid': table[0], 'time': table[1], 'text': table[2]})
# Draw a 50% random sample without replacement, seeded for reproducibility
sample = df.sample(frac=0.5, random_state=12345, weights=None, replace=False)
t = pd.Series(sample.time)
# value_counts() sorts by frequency; sort by timestamp instead so the
# plot reads chronologically left to right
counts = t.value_counts().sort_index()
counts.plot()
Out[3]: [line plot of tweet counts over time]
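Counting tweets per exact second gives a noisy curve; parsing the timestamps and resampling into coarser bins (e.g. hourly) usually yields a more readable trend. A sketch with synthetic timestamps standing in for the sampled 'time' column, assuming the created_at strings parse with pd.to_datetime:

```python
import pandas as pd

# Synthetic timestamps standing in for the sampled 'time' column
times = pd.Series(pd.to_datetime([
    "2015-11-09 10:00:05", "2015-11-09 10:30:00",
    "2015-11-09 11:15:00", "2015-11-09 13:45:00",
]))
# One row per tweet, indexed by time; resample counts tweets per hour,
# filling empty bins with 0
hourly = pd.Series(1, index=times).resample('60min').sum()
print(hourly)
```

The resampled series could then be plotted with hourly.plot() in place of the per-second value counts above.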