We've explored the basics of natural language processing using Postgres. The steps we took are often a great starting point for the rest of the analysis, but you will rarely stop with those results. You will often pull in data after performing some aggregations, joins, etc., and then continue the analysis with a general purpose programming language like Python or R.
In this notebook, we are going to use Postgres to pull in our data and then perform some more complex data carpentry.
In [ ]:
# BE SURE TO RUN THIS CELL BEFORE ANY OF THE OTHER CELLS
import psycopg2
import pandas as pd
import re
# pull in our stopwords
from nltk.corpus import stopwords
stops = stopwords.words('english')
Our query this time is going to extract both the hashtag and the tweets associated with that hashtag. We are going to create documents full of tweets that are defined by their hashtags, so we need to be able to reference the hashtags per tweet.
...oh, and we are only taking tweets from Chicago.
In [ ]:
# define our query
statement = """
SELECT lower(t.text) as tweet, lower(h.text) as hashtag
FROM twitter.tweet t, twitter.hashtag h
WHERE t.job_id = 273 AND t.text NOT LIKE 'RT%' AND t.iso_language = 'en' AND t.tweet_id_str = h.tweet_id
LIMIT 100000;"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    # execute the statement from above
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    # fetch all of the rows associated with the query
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# build a dictionary of column name -> list of values, then turn it into a data frame
tweet_dict = {}
for i in range(len(column_names)):
    tweet_dict['{}'.format(column_names[i])] = [x[i] for x in rows]

tweets = pd.DataFrame(tweet_dict)
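As an aside, pandas can also build the data frame straight from the query, which skips the dictionary step. A minimal sketch, assuming the same statement and connect_str from above (newer versions of pandas may warn about being handed a raw psycopg2 connection, but the result is the same):
In [ ]:
# sketch: let pandas run the query and build the data frame directly
conn = psycopg2.connect(connect_str)
tweets_alt = pd.read_sql(statement, conn)  # column names come from the SELECT aliases
conn.close()
tweets_alt.head()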
In [ ]:
tweets.head()
We can now use pandas to count how many times each hashtag was used and turn the result into a data frame.
In [ ]:
hashtag_groups = tweets.groupby('hashtag').size().sort_values().reset_index()
...and here are the most popular hashtags for Chicago.
In [ ]:
hashtag_groups.tail()
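The size column comes back unnamed (it shows up as 0). If you would rather have it labeled, one small variation is to name it when resetting the index, for example:
In [ ]:
# same counts, but with the size column named 'count'
hashtag_counts = tweets.groupby('hashtag').size().sort_values().reset_index(name='count')
hashtag_counts.tail()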
Twitter is different from other types of natural language because of the constraint on tweet length. This often makes it difficult to find coherent topics in individual tweets, so we want to create documents of tweets. Each document is a list of tweets that contain a particular hashtag; in other words, we want to create a list of tweets per hashtag.
In [ ]:
docs = tweets.groupby('hashtag')['tweet'].apply(list).reset_index()
Above, we are grouping by hashtag and then collecting the tweets in each group into a list. So this is going to be a data frame where the first attribute is the hashtag and the second is a list of tweets with that hashtag. Let's take a look...
In [ ]:
docs.head()
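If the groupby/apply(list) pattern is unfamiliar, here is a tiny toy example (made-up data, not from the database) showing what it does:
In [ ]:
# toy illustration of groupby(...)['col'].apply(list)
toy = pd.DataFrame({'hashtag': ['cubs', 'cubs', 'bears'],
                    'tweet': ['go cubs go', 'cubs win', 'bears lose']})
toy.groupby('hashtag')['tweet'].apply(list).reset_index()
# hashtag: 'bears' -> ['bears lose'], 'cubs' -> ['go cubs go', 'cubs win']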
We now need a helper function to remove some patterns that we don't want in the tweets. First, we don't want '@' signs or '#'s. We also want to remove urls, and we will create a regular expression to do that.
In [ ]:
def removePatterns(string, replacement, *pats):
    # replace every occurrence of each pattern in the string
    for pattern in pats:
        string = re.sub(pattern, replacement, string)
    return string
The function above takes in a string and replaces each of the patterns in that string with the replacement. Notice that we use *pats. This is a way to accept an unspecified number of arguments. Let's look at an example.
In [ ]:
s = "I have @3 friends named #Arnold"
removePatterns(s,'', '#','@')
This took the string s and replaced @ and # with a blank ''.
Below, we are going to create a regular expression that matches urls. We also want these removed.
In [ ]:
url = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
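To sanity check the pattern, here is a quick made-up example that strips a url, an '@', and a '#' in a single call:
In [ ]:
example = "check out https://example.com/page @someone #chicago"
removePatterns(example, '', '@', '#', url)
# 'check out  someone chicago'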
In natural language processing, you often have to tokenize the text, which means breaking it up into components. These components are usually words, so that each word is a unit called a token. Below, we are going to simultaneously remove the patterns we don't want, tokenize each tweet, and save the result to a list of lists called tokenized_docs.
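As a quick illustration of what the tokenizer does to a single string (the \w+ pattern keeps runs of word characters and drops punctuation and whitespace):
In [ ]:
from nltk.tokenize import RegexpTokenizer
RegexpTokenizer(r'\w+').tokenize("the cubs win, again!")
# ['the', 'cubs', 'win', 'again']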
In [ ]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenized_docs = []
for i in docs['tweet']:
    document = []
    for text in i:
        # strip @, #, and urls, lowercase, then split into word tokens
        document.append(tokenizer.tokenize(removePatterns(text, '', '@', '#', url).lower()))
    tokenized_docs.append(document)
We can now look at the first item of tokenized_docs to see what it looks like. Notice that it is a list of lists.
In [ ]:
tokenized_docs[0]
We then remove the stop words, keeping the result as a list of lists.
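For a single tokenized phrase, the filter is just a list comprehension against the stops list, for example:
In [ ]:
[w for w in ['the', 'cubs', 'win', 'again'] if w not in stops]
# ['cubs', 'win']  ('the' and 'again' are in NLTK's English stop word list)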
In [ ]:
stops_removed = []
for doc in tokenized_docs:
    phrases = []
    for phrase in doc:
        # keep only the tokens that are not stop words
        p = [i for i in phrase if i not in stops]
        phrases.append(p)
    stops_removed.append(phrases)
After tokenization, there is also stemming. This is the process of reducing words to their base form. We are going to do a similar pass here, saving the result to a list of lists called texts.
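To see what the stemmer does to individual words before running the full loop, a quick one-off check:
In [ ]:
from nltk.stem.porter import PorterStemmer
[PorterStemmer().stem(w) for w in ['winning', 'games', 'chicago']]
# ['win', 'game', 'chicago']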
NOTE: This could take a couple of minutes
In [ ]:
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()

texts = []
for doc in stops_removed:
    stemmed = []
    for phrase in doc:
        try:
            stemmed.append([p_stemmer.stem(i) for i in phrase])
        except:
            # skip any phrase the stemmer cannot handle
            pass
    texts.append(stemmed)
And let's look at the first item...
In [ ]:
texts[0]
And now we have our data in a format that can be used for more advanced topic modeling. To reiterate: we pulled our data from the database in a workable format (reducing the lines of Python we had to write), created documents of tweets per hashtag, removed the characters and urls we didn't want, tokenized, removed stop words, and stemmed.