This particular database was collecting tweets from the middle of January through the middle of May, so there is a time dimension to these data. If we glance back at the first tweet notebook, we see that there is an attribute named created_at. This is a timestamp of when the tweet was published for the world to see.
Adding a time component to an analysis gives us the option to follow trends. When are hashtags popular? How quickly do they die off? For our linguistic diversity analysis, it certainly invites a modified approach.
Quick question: would accounting for time when calculating a Shannon index on a city have any effect? Would cities stay stable over time with regard to their linguistic diversity? Are some cities more prone to fluctuations than others? A timestamp allows us to explore these questions and more.
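As a quick reminder of the quantity we keep computing: the Shannon index of a set of language counts is H = -Σ p_i log(p_i), where p_i is the proportion of tweets in language i (the log base is just a scaling convention). Accounting for time simply means computing those proportions within each time bucket rather than over the whole collection period.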
Today we are going to compare two different cities: New York City, New York and Columbia, Missouri. By now you are probably aware that the job id for Columbia is 261. But what is the job id for New York City? That is a simple query of the job table.
In [1]:
# BE SURE TO RUN THIS CELL BEFORE ANY OF THE OTHER CELLS
import psycopg2
import pandas as pd
from skbio.diversity.alpha import shannon
In [2]:
# query database
statement = """
SELECT *
FROM twitter.job
WHERE description LIKE '%New York City%';
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
new_york = {}
for i in range(len(column_names)):
    new_york[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
pd.DataFrame(new_york)
Out[2]:
Above we use the LIKE operator in conjunction with the % sign. LIKE matches a string against a pattern, and % matches any sequence of zero or more characters. If you know the exact value you are looking for, you needn't use the % sign (or LIKE at all; a plain = comparison will do).
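For instance, assuming the stored description is exactly "New York City, New York" (as the practice prompt below suggests), the two statements sketched here would return the same row; they are written in the notebook's usual style purely as an illustration, not as cells you need to run.
# pattern match: any description containing the phrase
pattern_statement = "SELECT job_id FROM twitter.job WHERE description LIKE '%New York City%';"
# exact match: requires knowing the full stored value
exact_statement = "SELECT job_id FROM twitter.job WHERE description = 'New York City, New York';"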
PRACTICE: Just a refresher. Query 10,000 tweets from the tweet table where the job_id corresponds to New York City. Be sure to also select the description column from the job table so that every record returned has a description saying "New York City, New York".
In [3]:
# put your code here
# ------------------
# query database
statement = """
SELECT j.description, t.*
FROM twitter.job j, twitter.tweet t
WHERE j.description LIKE '%New York City%' AND j.job_id = t.job_id
LIMIT 10000;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
new_york = {}
for i in range(len(column_names)):
    new_york[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
pd.DataFrame(new_york)
Out[3]:
A Couple of Things to Think About
When dealing with timestamps, the timestamp itself is often too precise to extract anything meaningful. Therefore, we generally have to bin them into larger time buckets, say weeks, months or even years depending on the amount of data and the type of problem. That is where we find ourselves right now.
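As a sketch of what binning looks like on the SQL side: Postgres's date_part (used in the next cell) extracts one component of a timestamp as a number, while date_trunc rounds a timestamp down to the start of a bucket. Either can serve as a bucket label; the statements below follow the notebook's usual style but are illustrations only, not cells you need to run.
# extract one component: date_part('month', timestamp '2017-03-15 14:22:09') gives 3
part_statement = "SELECT date_part('month', created_at) AS month FROM twitter.tweet LIMIT 5;"
# round down to the start of the bucket: date_trunc('month', ...) gives 2017-03-01 00:00:00
trunc_statement = "SELECT date_trunc('month', created_at) AS month_start FROM twitter.tweet LIMIT 5;"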
To start, we are going to practice using Postgres to create month and year columns so that we can do some aggregations on them.
In [4]:
# query database
statement = """
SELECT t.*,
date_part('month',created_at) as month,
date_part('year', created_at) as year
FROM twitter.job j, twitter.tweet t
WHERE j.description LIKE '%New York City%' AND j.job_id = t.job_id
LIMIT 1000;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
new_york = {}
for i in range(len(column_names)):
    new_york[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
pd.DataFrame(new_york)
Out[4]:
Now we can count languages per month. With the month and year columns available, we just need to add them to our GROUP BY clause like so...
In [5]:
# query database
statement = """
SELECT DISTINCT iso_language, month, year, COUNT(*) FROM
(SELECT t.*,
date_part('month',created_at) as month,
date_part('year', created_at) as year
FROM twitter.job j, twitter.tweet t
WHERE j.description LIKE '%New York City%' AND j.job_id = t.job_id
LIMIT 100000) AS new_york
GROUP BY iso_language, month, year;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
new_york = {}
for i in range(len(column_names)):
    new_york[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
pd.DataFrame(new_york)
Out[5]:
Well, it looks like limiting to 100,000 rows only gives us a single month. That's not very interesting. What if we narrow the time bins a bit? Let's say to week of the year.
Count the number of languages in New York City per week of the year. Turn that into a data frame and call it week_ny. If you need some documentation on how to get the week from a timestamp field, look here (https://www.postgresql.org/docs/8.0/static/functions-datetime.html).
In [6]:
# put your code here
# ------------------
# query database
statement = """
SELECT DISTINCT iso_language, week, year, COUNT(*) FROM
(SELECT t.*,
date_part('week',created_at) as week,
date_part('year', created_at) as year
FROM twitter.job j, twitter.tweet t
WHERE j.description LIKE '%New York City%' AND j.job_id = t.job_id
LIMIT 1000000) AS new_york
GROUP BY iso_language, week, year;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
new_york = {}
for i in range(len(column_names)):
    new_york[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
week_ny = pd.DataFrame(new_york)
In [7]:
# put your code here
# ------------------
# Shannon index of the language counts within each week
week_ny['count'].groupby(week_ny['week']).apply(shannon)
Out[7]:
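To make the apply above concrete, here is a hand-rolled sketch of what the Shannon index computes on one week's vector of language counts (skbio's shannon is what we actually use; the choice of log base only rescales the result):
import numpy as np

def shannon_sketch(counts, base=2):
    # convert counts to proportions, drop zeros (0 * log(0) is taken as 0),
    # and return -sum(p * log(p)) in the requested base
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum() / np.log(base)

# a week dominated by one language scores lower than a more even week
print(shannon_sketch([90, 5, 5]))    # low diversity
print(shannon_sketch([40, 30, 30]))  # higher diversity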
Even weeks are rather few, so let's take a look at days. Keep in mind that this next query could take a few minutes.
In [8]:
# query database
statement = """
SELECT DISTINCT iso_language, day, month, year, COUNT(*) FROM
(SELECT t.*,
date_part('day',created_at) as day,
date_part('month', created_at) as month,
date_part('year', created_at) as year
FROM twitter.job j, twitter.tweet t
WHERE j.description LIKE '%New York City%' AND j.job_id = t.job_id
LIMIT 1000000) AS new_york
GROUP BY iso_language, day, month, year;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
new_york = {}
for i in range(len(column_names)):
    new_york[column_names[i]] = [x[i] for x in rows]

# turn dictionary into a data frame
day_ny = pd.DataFrame(new_york)
We can use the head method to see what this data frame looks like.
In [9]:
day_ny.head()
Out[9]:
And now we can find the Shannon index for each day...
In [10]:
# Shannon index of the language counts within each (day, month, year) group
date_ny = day_ny.groupby(['day','month','year'])['count'].apply(shannon).reset_index()
We also want the day, month, and year columns to become a single date column. We can do that by combining them into a number of the form YYYYMMDD and passing it to the to_datetime method with a matching format string. We will call this new column date.
In [11]:
# build a number of the form YYYYMMDD and parse it as a date
date_ny['date'] = pd.to_datetime(date_ny.year*10000 + date_ny.month*100 + date_ny.day, format='%Y%m%d')
# nicer column name
date_ny['shannon'] = date_ny['count']
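As an aside, pandas can also assemble a datetime directly from a data frame whose columns are named year, month, and day; this is equivalent to the arithmetic above and is shown only as an alternative sketch.
# equivalent construction: to_datetime recognizes year/month/day column names
date_ny['date'] = pd.to_datetime(date_ny[['year', 'month', 'day']])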
Let's take a glimpse at what this gave us...
In [16]:
date_ny.head()
Out[16]:
AND FINALLY...
...we want to plot this relationship between date and shannon.
In [13]:
%matplotlib inline
#import matplotlib
#import numpy as np
#import matplotlib.pyplot as plt
pandas has plotting functionality built on top of matplotlib, so we can plot relationships directly from a data frame. In this case, the date will be the x-axis and shannon will be the y-axis. pandas uses the data frame's index as the x-axis, so we first subset the data to only the columns we want to plot and then set the index to date. After that, we just call the plot method like so...
In [17]:
date_ny[['date','shannon']].set_index('date').plot()
Out[17]:
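If you want more control over the figure (axis labels, a title), the Axes object returned by plot can be customized with matplotlib directly; a minimal sketch, assuming the date_ny data frame from above:
import matplotlib.pyplot as plt

# same plot as before, but with explicit labels and title
ax = date_ny[['date', 'shannon']].set_index('date').plot(legend=False)
ax.set_xlabel('date')
ax.set_ylabel('Shannon index')
ax.set_title('Daily linguistic diversity of New York City tweets')
plt.show()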
In [ ]:
# put your code here
# ------------------