Twitter: An Analysis in Linguistic Diversity

Part VI

There are more tables than just the tweet and job tables. Every time a tweet contains a hashtag, the hashtag table gains a new row holding the tweet id, the character position where the hashtag starts, and the hashtag text itself (just the text, sans #). The tweet id maps to the tweet_id_str column of the tweet table, so we can connect back to the original text or other features of the tweet. The character start location marks where in the sequence of characters the hashtag begins. Surprisingly, this is more meaningful than one might think, as hashtags often play different roles depending on where they sit within the tweet.
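To get a feel for this table, we can peek at a few rows. This is a minimal sketch: SELECT * avoids having to guess the exact column name of the start position, which may differ in your schema.


In [ ]:
# a quick peek at the hashtag table; SELECT * avoids guessing column names
import psycopg2
import pandas as pd

conn = psycopg2.connect("dbname='twitter' user='dsa_ro_user' "
                        "host='dbase.dsa.missouri.edu' password='readonly'")
cursor = conn.cursor()
cursor.execute("SELECT * FROM twitter.hashtag LIMIT 5")
colnames = [desc[0] for desc in cursor.description]
rows = cursor.fetchall()
conn.close()
pd.DataFrame(rows, columns=colnames)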

We are going to continue the analysis of Twitter data, but first we need to find some important characteristics of hashtags.

Let's start off by finding the most popular hashtags (remember, we are putting a limit on the query for performance purposes).


In [ ]:
# BE SURE TO RUN THIS CELL BEFORE ANY OF THE OTHER CELLS

import psycopg2
import pandas as pd

In [ ]:
# query database
statement = """
SELECT DISTINCT text, COUNT(*)
FROM 
(SELECT text 
FROM twitter.hashtag
LIMIT 10000) AS hashtag_text
GROUP BY text
ORDER BY count DESC
"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
# create dictionary from the rows and column names
job = {}
for i, name in enumerate(column_names):
    job[name] = [x[i] for x in rows]

# turn dictionary into a data frame
pd.DataFrame(job)
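
Since we will repeat this connect, execute, and build-a-DataFrame pattern in every cell below, you could wrap it in a small helper function. The sketch below is just a convenience (the name query_to_df is our own, not part of the course materials); the remaining cells keep the explicit pattern so that each one stands alone.


In [ ]:
# optional helper: run a query and return the results as a DataFrame
def query_to_df(statement):
    connect_str = ("dbname='twitter' user='dsa_ro_user' "
                   "host='dbase.dsa.missouri.edu' password='readonly'")
    conn = psycopg2.connect(connect_str)
    try:
        cursor = conn.cursor()
        cursor.execute(statement)
        column_names = [desc[0] for desc in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()
    return pd.DataFrame(rows, columns=column_names)

# example usage:
# query_to_df("SELECT text FROM twitter.hashtag LIMIT 5")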

Now, let's find the most popular hashtags for the city of Provo, Utah!

We first need to create a relationship between the job table and the tweet table, and then from the tweet table to the hashtag table.

Then we want to search the job.description column for something like "Provo".


In [ ]:
# query database
statement = """
SELECT DISTINCT lower(text), COUNT(*)
FROM 
(SELECT h.text
FROM twitter.hashtag h, twitter.tweet t, twitter.job j
WHERE h.tweet_id = t.tweet_id_str AND t.job_id = j.job_id AND j.description LIKE 'Provo%'
LIMIT 10000) AS hashtag_text
GROUP BY lower(text)
ORDER BY count DESC
"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
# create dictionary from the rows and column names
job = {}
for i, name in enumerate(column_names):
    job[name] = [x[i] for x in rows]

# turn dictionary into a data frame
pd.DataFrame(job)

Notice that we used lower(text) in our GROUP BY. Hashtags can be typed out in different ways yet still be the same hashtag; for example, #GameOfThrones is the same as #gameofthrones.
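
If you want to confirm what lower() does before leaning on it in a GROUP BY, you can run it on a literal, with no table involved:


In [ ]:
# lower() folds a string to lowercase, so mixed-case hashtags collapse together
conn = psycopg2.connect("dbname='twitter' user='dsa_ro_user' "
                        "host='dbase.dsa.missouri.edu' password='readonly'")
cursor = conn.cursor()
cursor.execute("SELECT lower('#GameOfThrones')")
print(cursor.fetchone())  # ('#gameofthrones',)
conn.close()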

Postgres also has some built-in functions that make preprocessing our data for particular purposes a lot simpler. Imagine that we wanted to count the most common words in the tweet text. We can do that like so:


In [ ]:
statement = '''
SELECT lower(word), COUNT(DISTINCT rn) AS num_rows
FROM(
    SELECT UNNEST(STRING_TO_ARRAY(text, ' ')) AS word,
       ROW_NUMBER() OVER(ORDER BY text) AS rn
    FROM 
        (SELECT text FROM
        twitter.tweet LIMIT 10000) y
        ) x
GROUP BY lower(word)
ORDER BY num_rows DESC'''

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
# create dictionary from the rows and column names
job = {}
for i, name in enumerate(column_names):
    job[name] = [x[i] for x in rows]

# turn dictionary into a data frame
pd.DataFrame(job)

There are a couple of things going on here, so we will break it down piece by piece and then put the components back together again.

SELECT UNNEST(STRING_TO_ARRAY(text, ' ')) AS word,
       ROW_NUMBER() OVER(ORDER BY text) AS rn
FROM 
       (SELECT text FROM
       twitter.tweet LIMIT 10000) y

So line by line:

SELECT UNNEST(STRING_TO_ARRAY(text, ' ')) AS word,

This first turns the text column into an array in which each word is an item. UNNEST then makes each word its own row. We call this new column of words word.

ROW_NUMBER() OVER(ORDER BY text) AS rn

This creates a column of row numbers, one per tweet. Because the number is assigned before the words are unnested, every word from the same tweet shares the same rn, which is what lets COUNT(DISTINCT rn) in the outer query count each word once per tweet.
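
Before looking at real tweets, you can see these functions in isolation by running them on a literal string (a throwaway example of ours, not course data):


In [ ]:
# a self-contained illustration: split a literal sentence into one row per word
conn = psycopg2.connect("dbname='twitter' user='dsa_ro_user' "
                        "host='dbase.dsa.missouri.edu' password='readonly'")
cursor = conn.cursor()
cursor.execute("SELECT UNNEST(STRING_TO_ARRAY('the quick brown fox', ' '))")
print(cursor.fetchall())  # [('the',), ('quick',), ('brown',), ('fox',)]
conn.close()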

Let's take a look at what this produces before the aggregations.


In [ ]:
statement = '''
    SELECT UNNEST(STRING_TO_ARRAY(text, ' ')) AS word,
       ROW_NUMBER() OVER(ORDER BY text) AS rn
    FROM 
        (SELECT text FROM
        twitter.tweet LIMIT 1000) y 
'''

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
# create dictionary from the rows and column names
job = {}
for i, name in enumerate(column_names):
    job[name] = [x[i] for x in rows]

# turn dictionary into a data frame
pd.DataFrame(job)

The problem with just counting words is that some words don't hold a lot of meaning (or they aren't that interesting for analysis). These are words like 'I', 'your', 'is', and so on, known as stop words. We can load a list of stop words from the Natural Language Toolkit (nltk) library. Note that this list assumes tokenized text, and we haven't tokenized the text of our tweets, so some entries won't match, but it will still let us remove a lot of stop words from the scope of this analysis.


In [ ]:
import nltk
from nltk.corpus import stopwords

# fetch the stop word corpus if it isn't already available locally
nltk.download('stopwords')

stops = stopwords.words('english')

print(stops)

We are going to integrate this list into our statement. If the word is in this list, we don't want it returned.


In [ ]:
statement = '''
SELECT lower(word), COUNT(DISTINCT rn) AS num_rows 
FROM
(SELECT * FROM
(SELECT UNNEST(STRING_TO_ARRAY(text, ' ')) AS word,
       ROW_NUMBER() OVER(ORDER BY text) AS rn
    FROM 
        (SELECT text FROM
        twitter.tweet LIMIT 1000) y ) x) z
        WHERE lower(word) NOT IN ('rt',{})
    GROUP BY lower(word)
ORDER BY num_rows DESC; 
'''.format(', '.join("'{}'".format(w.replace("'", "''")) for w in stops))  # escape apostrophes (e.g. "don't")

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
# create dictionary from the rows and column names
job = {}
for i, name in enumerate(column_names):
    job[name] = [x[i] for x in rows]

# turn dictionary into a data frame
pd.DataFrame(job)

So this isn't perfect in terms of cleaning up our results, but it did remove quite a bit of unwanted information. Keep in mind that Twitter is very messy, more so than most natural language documents: it is informal, and users are constrained to write tweets within 140 characters, which lends itself to some creative spelling.

Also, notice that we removed 'rt' (the retweet marker), which is ubiquitous on Twitter but not in nltk's stop word list.
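
As an aside, formatting the stop words directly into the SQL string works because the nltk list is trusted input, but psycopg2 can also pass the list as a bound parameter. That is the safer habit in general, and it sidesteps the quoting problems that contractions such as "don't" would otherwise cause. A sketch, relying on psycopg2's adaptation of Python tuples into SQL IN lists:


In [ ]:
# pass the stop words as a bound parameter instead of formatting them into SQL;
# psycopg2 expands a Python tuple into a parenthesized list for IN (...)
safe_statement = '''
SELECT lower(word), COUNT(DISTINCT rn) AS num_rows
FROM (SELECT UNNEST(STRING_TO_ARRAY(text, ' ')) AS word,
             ROW_NUMBER() OVER(ORDER BY text) AS rn
      FROM (SELECT text FROM twitter.tweet LIMIT 1000) y) x
WHERE lower(word) NOT IN %s
GROUP BY lower(word)
ORDER BY num_rows DESC;
'''

conn = psycopg2.connect("dbname='twitter' user='dsa_ro_user' "
                        "host='dbase.dsa.missouri.edu' password='readonly'")
cursor = conn.cursor()
cursor.execute(safe_statement, (tuple(stops) + ('rt',),))
print(cursor.fetchmany(10))
conn.close()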

Now we can also extract the most common words per hashtag to begin to get a sense of what people might be talking about when using it.


In [ ]:
statement = '''
SELECT lower(word), COUNT(DISTINCT rn) AS num_rows 
FROM
(SELECT * FROM
(SELECT UNNEST(STRING_TO_ARRAY(text, ' ')) AS word,
       ROW_NUMBER() OVER(ORDER BY text) AS rn
    FROM 
        (SELECT t.text FROM
        twitter.tweet t, twitter.hashtag h
        WHERE t.tweet_id_str = h.tweet_id AND lower(h.text) = 'tornado'
        LIMIT 1000) y ) x) z
        WHERE lower(word) NOT IN ('rt',{})
    GROUP BY lower(word)
ORDER BY num_rows DESC; 
'''.format(', '.join("'{}'".format(w.replace("'", "''")) for w in stops))  # escape apostrophes (e.g. "don't")

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
# create dictionary from the rows and column names
job = {}
for i, name in enumerate(column_names):
    job[name] = [x[i] for x in rows]

# turn dictionary into a data frame
pd.DataFrame(job)

We chose to see what words most commonly occur with "#tornado" and the results make sense.

YOUR TURN

Choose a hashtag and find the most common words from 10,000 tweets that use the hashtag. Do the results make some sense?


In [ ]:
# Your code here
# --------------

In this notebook, we created a very simple way of modeling topics on Twitter according to hashtags, using almost entirely Postgres. There are more advanced methods of topic modeling, but this is a good start toward grasping what people are talking about most commonly when using a certain hashtag.