Twitter: An Analysis in Linguistic Diversity

Part V

In the previous Twitter notebook, we left off exploring the time dimension of tweets and linguistic diversity. Now we can begin to look into the content of the tweets themselves. We will start small (least complex) and eventually extend our analysis to more complex linguistic analytics. For the time being, we will pause our progression on linguistic diversity in order to grasp the basics of analyzing natural language. But don't worry, we will circle back to linguistic diversity and how it relates to some of the more advanced computational linguistic components.


Retweets

We will begin by looking at retweets. For those unfamiliar with how Twitter works, a user has the ability to share someone else's tweet. This is (cleverly) known as a retweet. The reasons for retweeting are many, but a retweet certainly encodes some information, whether that is agreement with the tweet, the popularity of the original tweeter, or some combination of the two.

Retweets are easy to extract from this database. All retweets begin with "RT" in the text column of the tweet table.
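As a quick illustration of this convention, here is a minimal Python sketch (using made-up example tweets, not rows from the database) of flagging retweets by checking whether the text starts with "RT", which is exactly what the SQL filter text LIKE 'RT%' does:

```python
# Made-up example tweets for illustration; not rows from the database.
tweets = [
    "RT @someuser: Great game tonight!",
    "Heading to the ballpark.",
    "RT @another: What a play!",
]

# The SQL filter text LIKE 'RT%' corresponds to a startswith check.
retweets = [t for t in tweets if t.startswith("RT")]
print(len(retweets))  # 2 of the 3 example tweets are retweets
```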


In [ ]:
# RUN THIS CELL FIRST!!!

import time
from pydrill.client import PyDrill
import psycopg2
import pandas as pd

PyDrill


In [ ]:
drill = PyDrill(host='128.206.116.250', port=8048)

if not drill.is_active():
    raise RuntimeError('Please run Drill first')


# Start the Timer
start = time.perf_counter()

rows = drill.query("SELECT * FROM  dfs.datasets.`twitter/tweet.json` WHERE text LIKE 'RT%' LIMIT 100000")
end = time.perf_counter()

# How long did this look up take?
print("Time for PyDrill:")
print(end - start)
print('------------------')



# pandas dataframe

df = rows.to_dataframe()
df.head()

Postgres


In [ ]:
statement = """
SELECT * 
FROM  twitter.tweet 
WHERE text LIKE 'RT%' 
LIMIT 100000;
"""

start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names   
job = {}
for i, name in enumerate(column_names):
    job[name] = [row[i] for row in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)

df.head()
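As a side note, the dictionary-building step above can be shortened: pandas can construct a DataFrame directly from the row tuples and the column names. A minimal sketch, with hypothetical rows standing in for the cursor's results:

```python
import pandas as pd

# Hypothetical stand-ins for cursor.fetchall() and the names
# pulled from cursor.description.
column_names = ["tweet_id", "text"]
rows = [(1, "RT @a: hello"), (2, "RT @b: world")]

# Build the frame directly from the row tuples; no intermediate dict needed.
df = pd.DataFrame(rows, columns=column_names)
```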

Simple enough. Just as a reminder, the % wildcard in a SQL LIKE pattern matches any sequence of zero or more characters, so 'RT%' matches any text that begins with "RT".
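To make the semantics concrete, here is a rough Python emulation of a simple LIKE pattern (handling only the % wildcard; _ and escape sequences are ignored in this sketch):

```python
import re

def sql_like(text, pattern):
    # Split on % and rejoin with .*, so each % matches any run of
    # characters (including none); everything else matches literally.
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("%")) + "$"
    return re.match(regex, text, re.DOTALL) is not None

print(sql_like("RT @user: hello", "RT%"))                 # True
print(sql_like("Not a retweet", "RT%"))                   # False
print(sql_like("rt @fan: go #royals go", "rt%#royals%"))  # True
```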

But we can also find some simple characteristics about retweets, such as the number of unique retweets...

PyDrill


In [ ]:
statement = """
SELECT COUNT(*) 
FROM 
    (SELECT DISTINCT text
    FROM 
        (SELECT text 
        FROM dfs.datasets.`twitter/tweet.json` 
        WHERE text LIKE 'RT%' 
        LIMIT 10000) AS retweets) AS retweet_count"""

drill = PyDrill(host='128.206.116.250', port=8048)

if not drill.is_active():
    raise RuntimeError('Please run Drill first')


# Start the Timer
start = time.perf_counter()

rows = drill.query(statement)
end = time.perf_counter()

# How long did this look up take?
print("Time for PyDrill:")
print(end - start)
print('------------------')



# pandas dataframe

df = rows.to_dataframe()
df.head()

Postgres


In [ ]:
statement = """
SELECT COUNT(*) 
FROM 
    (SELECT DISTINCT text
    FROM 
        (SELECT text 
        FROM twitter.tweet 
        WHERE text 
        LIKE 'RT%' 
        LIMIT 10000) AS retweets) AS retweet_count"""


start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names   
job = {}
for i, name in enumerate(column_names):
    job[name] = [row[i] for row in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)

df.head()

Keep in mind that without an ORDER BY, LIMIT returns rows in no guaranteed order, so the results of these queries may differ slightly each time you run them since we aren't taking all rows in the database into account.
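The instability can be illustrated in plain Python: taking the first N items of an unordered collection depends on arrival order, while sorting on a key first makes the sample reproducible. (This mirrors adding an ORDER BY on a unique column before the LIMIT.)

```python
# The same three "rows" arriving in two different physical orders.
batch_a = ["tweet3", "tweet1", "tweet2"]
batch_b = ["tweet1", "tweet2", "tweet3"]

first_two_a = batch_a[:2]       # like LIMIT 2 with no ORDER BY
first_two_b = batch_b[:2]       # different result from the same data

stable_a = sorted(batch_a)[:2]  # like ORDER BY text LIMIT 2
stable_b = sorted(batch_b)[:2]  # identical every time
```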

Again, counting the number of unique retweets isn't that interesting. What would be more interesting is to see which retweets are the most popular.


In [ ]:
statement = """
SELECT DISTINCT text, COUNT(*) 
FROM 
    (SELECT text 
    FROM twitter.tweet 
    WHERE text LIKE 'RT%' 
    LIMIT 10000) AS retweets 
GROUP BY text 
ORDER BY count DESC;"""

start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names   
job = {}
for i, name in enumerate(column_names):
    job[name] = [row[i] for row in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)

df

We can take this even further and circle back to the linguistic component. We can add another column to the GROUP BY to find the counts of the most popular retweets per language.


In [ ]:
statement = """
SELECT DISTINCT text, iso_language, COUNT(*) 
FROM 
    (SELECT text, iso_language 
    FROM twitter.tweet 
    WHERE text LIKE 'RT%' 
    LIMIT 10000) AS retweets 
GROUP BY text, iso_language 
ORDER BY count DESC;"""

start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names   
job = {}
for i, name in enumerate(column_names):
    job[name] = [row[i] for row in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)

df

In [ ]:
statement = """
SELECT DISTINCT job_id, text, iso_language, COUNT(*) 
FROM 
    (SELECT job_id, text, iso_language 
    FROM twitter.tweet 
    WHERE text LIKE 'RT%' AND job_id >= 255 
    LIMIT 10000) AS retweets 
GROUP BY job_id, text, iso_language 
ORDER BY count DESC;"""

start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names   
job = {}
for i, name in enumerate(column_names):
    job[name] = [row[i] for row in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)

df

We can further limit our query to a certain hashtag. It may be of interest to find the most common retweets for a certain topic, where a topic is defined by a hashtag. We have been collecting data for the duration of the 2017 MLB season, so if you are a baseball fan, we can extract the most retweeted tweets for a particular baseball team.


In [ ]:
statement = """
SELECT DISTINCT job_id, text, iso_language, COUNT(*) 
FROM 
    (SELECT job_id, text, iso_language 
    FROM twitter.tweet 
    WHERE lower(text) 
    LIKE 'rt%#royals%' AND job_id >= 255 LIMIT 100) AS retweets 
GROUP BY job_id, text, iso_language 
ORDER BY count DESC;"""

start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names   
job = {}
for i, name in enumerate(column_names):
    job[name] = [row[i] for row in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)

df

You will notice that we used lower(text) in the query. This converts all characters to lowercase, making the match case-insensitive, since we don't want capitalization to get in the way when looking for a simple hashtag.
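The same idea in Python, as a rough sketch with made-up example tweets: lowercasing first means the pattern 'rt%#royals%' catches #Royals, #ROYALS, and #royals alike.

```python
# Made-up example tweets; mirrors WHERE lower(text) LIKE 'rt%#royals%'.
tweets = [
    "RT @fan: #Royals walk it off!",
    "RT @fan: #ROYALS win again",
    "rt @fan: nothing hashtagged here",
]

def is_royals_retweet(text):
    t = text.lower()
    # Approximates the LIKE pattern: starts with "rt", contains "#royals".
    return t.startswith("rt") and "#royals" in t

matches = [t for t in tweets if is_royals_retweet(t)]
print(len(matches))  # 2 of the 3 example tweets match
```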

If you are unfamiliar with the Royals, they are a team from Kansas City, MO, and there seem to be a lot of retweets coming from job_id 269. If you had to guess, where might these tweets come from? In fact, we can find out fairly simply by joining with the job table.


In [ ]:
statement = """
SELECT DISTINCT description, iso_language, COUNT(*) 
FROM 
    (SELECT h.description, t.job_id, t.text, t.iso_language 
    FROM twitter.tweet t, twitter.job h
    WHERE lower(t.text) LIKE 'rt%#royals%' AND h.job_id >= 255 and h.job_id = t.job_id LIMIT 100) AS retweets 
GROUP BY  description, iso_language 
ORDER BY count DESC;"""

start = time.perf_counter()

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

end = time.perf_counter()

# How long did this look up take?
print("Time for Postgres:")
print(end - start)
print('------------------')

# create dictionary from the rows and column names   
job = {}
for i, name in enumerate(column_names):
    job[name] = [row[i] for row in rows]

# turn dictionary into a data frame
df = pd.DataFrame(job)

df

We aren't collecting on Kansas City proper, but Overland Park is a suburb of Kansas City and the database is collecting on it, so it makes sense that the majority of the tweets come from there. However, this can lead to some interesting questions that we may want to examine along a time dimension: why are other cities using "#royals"?