In the previous Twitter notebook, we left off exploring the time dimension of tweets and linguistic diversity. Now we can begin to look into the content of the tweets themselves. We will begin small (least complex) and eventually extend our analysis to more complex linguistic analytics. For the time being, we will be pausing our progression on linguistic diversity in order to grasp the basics of analyzing natural language. But don't worry, we will wrap back around to linguistic diversity and how it relates to some of the more advanced computational linguistic components.
We will begin by looking at retweets. For those who are unfamiliar with the functionality of Twitter, a user has the ability to re-post someone else's tweet. This is (cleverly) known as a retweet. The reasons for "retweeting" are many, but some information is certainly encoded in a retweet, be it agreement with the tweet, the popularity of the user being retweeted, or some combination of the two.
Retweets are easy to extract from this database: all retweets begin with "RT" in the text column of the tweet table.
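Before touching the database, the same "begins with RT" test is easy to sanity-check in plain Python. The tweets below are invented for illustration, not rows from the twitter database:

```python
# Hypothetical sample of tweet texts (not from the database)
tweets = [
    "RT @mlb: What a finish!",
    "Heading to the ballpark today",
    "RT @royals: Crowned.",
]

# A tweet counts as a retweet if its text starts with "RT"
retweets = [t for t in tweets if t.startswith("RT")]
print(len(retweets))  # 2 of the 3 sample tweets are retweets
```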
In [ ]:
# RUN THIS CELL FIRST!!!
import time
from pydrill.client import PyDrill
import psycopg2
import pandas as pd
Pydrill
In [ ]:
drill = PyDrill(host='128.206.116.250', port=8048)

if not drill.is_active():
    raise RuntimeError('Please run Drill first')

# Start the Timer
start = time.perf_counter()

rows = drill.query("SELECT * FROM dfs.datasets.`twitter/tweet.json` WHERE text LIKE 'RT%' LIMIT 100000")

end = time.perf_counter()

# How long did this look up take?
print("Time for PyDrill:")
print(end - start)
print('------------------')

# pandas dataframe
df = rows.to_dataframe()
df.head()
Postgres
In [ ]:
statement = """
SELECT *
FROM twitter.tweet
WHERE text LIKE 'RT%'
LIMIT 100000;
"""
start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names
job = {}
for i in range(len(column_names)):
    job[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)
df.head()
Simple enough. Just as a reminder, the % operator in a SQL LIKE pattern matches any sequence of zero or more characters.
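To see the wildcard in isolation, here is a tiny in-memory SQLite session (SQLite shares the same % semantics for LIKE); the rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet (text TEXT)")
conn.executemany(
    "INSERT INTO tweet VALUES (?)",
    [("RT @a: hello",), ("RT",), ("no retweet here",)],
)

# '%' matches zero or more characters, so both "RT" alone and
# "RT ..." satisfy the pattern 'RT%'
rows = conn.execute("SELECT text FROM tweet WHERE text LIKE 'RT%'").fetchall()
print(len(rows))  # 2
```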
But we can also find some simple characteristics about retweets, such as the number of unique retweets...
Pydrill
In [ ]:
statement = """
SELECT COUNT(*)
FROM
(SELECT DISTINCT text
FROM
(SELECT text
FROM dfs.datasets.`twitter/tweet.json`
WHERE text LIKE 'RT%'
LIMIT 10000) AS retweets) AS retweet_count"""
drill = PyDrill(host='128.206.116.250', port=8048)

if not drill.is_active():
    raise RuntimeError('Please run Drill first')

# Start the Timer
start = time.perf_counter()

rows = drill.query(statement)

end = time.perf_counter()

# How long did this look up take?
print("Time for PyDrill:")
print(end - start)
print('------------------')

# pandas dataframe
df = rows.to_dataframe()
df.head()
Postgres
In [ ]:
statement = """
SELECT COUNT(*)
FROM
(SELECT DISTINCT text
FROM
(SELECT text
FROM twitter.tweet
WHERE text LIKE 'RT%'
LIMIT 10000) AS retweets) AS retweet_count"""
start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names
job = {}
for i in range(len(column_names)):
    job[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)
df.head()
Keep in mind that LIMIT works independently of ORDER BY, so the results of these queries may differ slightly each time you run them, since we aren't taking all rows in the database into account.
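If a reproducible sample matters, putting an ORDER BY inside the subquery pins down exactly which rows LIMIT keeps. A minimal SQLite sketch of the idea, with invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet (text TEXT)")
conn.executemany("INSERT INTO tweet VALUES (?)", [("b",), ("a",), ("c",)])

# Without an ORDER BY the engine is free to return any two rows;
# ordering first makes the two-row sample reproducible
sample = conn.execute("SELECT text FROM tweet ORDER BY text LIMIT 2").fetchall()
print([r[0] for r in sample])  # ['a', 'b']
```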
Again, counting the number of unique retweets isn't that interesting. What would be more interesting is to see which retweets are the most popular.
In [ ]:
statement = """
SELECT DISTINCT text, COUNT(*)
FROM
(SELECT text
FROM twitter.tweet
WHERE text LIKE 'RT%'
LIMIT 10000) AS retweets
GROUP BY text
ORDER BY count DESC;"""
start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names
job = {}
for i in range(len(column_names)):
    job[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)
df
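For comparison, the GROUP BY text ... ORDER BY count DESC logic can also be reproduced client-side with collections.Counter once the retweet texts are in Python. The texts here are made up for illustration:

```python
from collections import Counter

# Hypothetical retweet texts already fetched from the database
texts = ["RT @a: go team", "RT @b: wow", "RT @a: go team", "RT @a: go team"]

# Equivalent of GROUP BY text ... ORDER BY count DESC
counts = Counter(texts)
for text, n in counts.most_common():
    print(n, text)  # most-retweeted text first
```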
We can take this even further and circle back to the linguistic component. We can add another attribute to the GROUP BY clause to find the count of the most popular retweets per language.
In [ ]:
statement = """
SELECT DISTINCT text, iso_language, COUNT(*)
FROM
(SELECT text, iso_language
FROM twitter.tweet
WHERE text LIKE 'RT%'
LIMIT 10000) AS retweets
GROUP BY text, iso_language
ORDER BY count DESC;"""
start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names
job = {}
for i in range(len(column_names)):
    job[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)
df
In [ ]:
statement = """
SELECT DISTINCT job_id, text, iso_language, COUNT(*)
FROM
(SELECT job_id, text, iso_language
FROM twitter.tweet
WHERE text LIKE 'RT%' AND job_id >= 255
LIMIT 10000) AS retweets
GROUP BY job_id, text, iso_language
ORDER BY count DESC;"""
start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names
job = {}
for i in range(len(column_names)):
    job[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)
df
We can further limit our query to a certain hashtag. It may be of interest to find the most common retweets on a certain topic, where a topic is defined by a tweet's hashtag. We have been collecting data for the duration of the 2017 MLB season, so if you are a baseball fan, we can extract the most retweeted tweets for a particular baseball team.
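If you ever want to pull the hashtags out of the tweet text itself, rather than pattern-matching on a single known hashtag in SQL, a short regular expression does the job. The sample tweet below is invented:

```python
import re

text = "RT @royals: #Royals win it in the 9th! #MLB"

# A hashtag here is '#' followed by one or more word characters
hashtags = re.findall(r"#\w+", text)
print(hashtags)  # ['#Royals', '#MLB']
```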
In [ ]:
statement = """
SELECT DISTINCT job_id, text, iso_language, COUNT(*)
FROM
(SELECT job_id, text, iso_language
FROM twitter.tweet
WHERE lower(text) LIKE 'rt%#royals%' AND job_id >= 255
LIMIT 100) AS retweets
GROUP BY job_id, text, iso_language
ORDER BY count DESC;"""
start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names
job = {}
for i in range(len(column_names)):
    job[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)
df
You will notice that we used lower(text) in the query. This converts all characters to lowercase, as we don't want to worry about case when matching a simple hashtag.
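The effect of lower-casing before matching is easy to demonstrate in plain Python; note that Postgres also offers ILIKE for case-insensitive pattern matching, which would be an alternative here. The sample tweet is invented:

```python
text = "RT @Royals: #ROYALS take the series!"

# The exact lowercase pattern misses the original mixed-case text...
raw_match = "#royals" in text
# ...but matches once the text is lower-cased first
normalized_match = "#royals" in text.lower()

print(raw_match, normalized_match)  # False True
```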
If you are unfamiliar with the Royals, they are a team from Kansas City, MO, and there seem to be a lot of retweets coming from job_id 269. If you had to guess, where might this tweet come from? In fact, we can find out fairly simply by joining with the job table.
In [ ]:
statement = """
SELECT DISTINCT description, iso_language, COUNT(*)
FROM
(SELECT h.description, t.job_id, t.text, t.iso_language
FROM twitter.tweet t, twitter.job h
WHERE lower(t.text) LIKE 'rt%#royals%' AND h.job_id >= 255 AND h.job_id = t.job_id
LIMIT 100) AS retweets
GROUP BY description, iso_language
ORDER BY count DESC;"""
start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names
job = {}
for i in range(len(column_names)):
    job[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)
df
We aren't collecting on Kansas City proper, but Overland Park is a suburb of Kansas City, and the database is collecting on it, so it makes sense that the majority of the tweets come from there. However, this leads to some interesting questions for which we may want to consider a time dimension: why are there other cities using "#royals"?