RECAP: In the second Twitter notebook, we left off with a query that provided the language counts for the city of Columbia, MO. But remember, we also had to do some prep work before we could actually find these counts: restricting the data to one user per language and then removing the "und" language rows.
Today we are going to pick up where we left off, except this time we aren't going to limit it to one city. Run a query that returns the language counts for each city. Remember to restrict it to one user per language (per city) and to remove the 'und' iso_language rows. LIMIT ROWS TO 100,000! A city is identified by its job_id, and not all jobs correspond to the geographic location of the tweet, so make sure you query only those jobs with a job_id greater than or equal to 255.
In [1]:
# BE SURE TO RUN THIS CELL BEFORE ANY OF THE OTHER CELLS
import psycopg2
import pandas as pd
In [33]:
# put your code here
# ------------------
statement = """
SELECT DISTINCT iso_language, job_id,COUNT(*)
FROM
(SELECT
DISTINCT ON (from_user, iso_language)
*
FROM (SELECT * FROM twitter.tweet WHERE iso_language != 'und' AND job_id >= 255 LIMIT 100000) as T
ORDER BY from_user, iso_language) as UNIQ
GROUP BY iso_language, job_id;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
cities = {}
for i in range(len(column_names)):
    cities[column_names[i]] = [row[i] for row in rows]

# turn dictionary into a data frame
pd.DataFrame(cities)
Out[33]:
Databases can do a lot, but some things are more easily achieved through the flexibility of a general-purpose programming language like Python. In this notebook, we are going to run a query, return the results to a pandas DataFrame object, and perform some operations over the data frame in order to analyze our data. Let's go ahead and create a data frame.
All the way back in the first Twitter notebook, we covered how to create a dictionary from a returned query and, subsequently, how to turn that dictionary into a data frame using pandas. From the query above, turn the language counts per city into a data frame. Call this data frame df.
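As an aside, pandas can also build the frame straight from the fetched rows, skipping the dictionary step. Here is a minimal sketch, assuming rows and column_names are still populated from the query above:
# Sketch: build the data frame directly from the cursor results.
# Assumes `rows` and `column_names` still hold the query output.
df = pd.DataFrame.from_records(rows, columns=column_names)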
In [34]:
# put your code here
# ------------------
df = pd.DataFrame(cities)
We have made substantial progress in preparing to calculate the linguistic diversity of these different cities. However, we want a single metric that summarizes each city. For this, we're going to borrow from the biological literature and use Shannon's Diversity Index $H$. This is a sort of entropy index that is generally used to measure the biological richness of an ecosystem.
We needn't get into the weeds about this measure, but know that it takes into account both the number of unique languages and how balanced those languages are within a city. The index is therefore maximized when every language in a city has an equal count.
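Concretely, if a city has $S$ languages and $p_i$ is the proportion of its (deduplicated) users tweeting in language $i$, then

$$H = -\sum_{i=1}^{S} p_i \log_2 p_i$$

The base of the logarithm is a convention; scikit-bio defaults to base 2, so the values below are in bits.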
We can go ahead and see how this works by importing the function from the scikit-bio library.
In [12]:
from skbio.diversity.alpha import shannon
In [13]:
pop1 = [10,10,10]
pop2 = [10,20,30]
pop3 = [100,200,300]
print("pop1 ({}) is more diverse that pop2 ({}), which is as diverse as pop3({})".format(
shannon(pop1),shannon(pop2),shannon(pop3)))
In [14]:
pop1 = [10,10,10]
pop2 = [10,10,10,10]
print("""Also, more languages but equal eveness means greater diversity.
For example, pop2 ({}) is more diverse than pop1 ({})""".format(shannon(pop2),shannon(pop1)))
This shannon function operates on counts. Conveniently, we have already pulled out language counts per city, so we should be able to run it over the count column, city by city, to calculate the index for each city.
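To make the computation concrete, here is a minimal hand-rolled version of the index. This is an illustrative sketch, not scikit-bio's implementation; it assumes strictly positive counts and base-2 logs (scikit-bio's default).
import numpy as np

def shannon_by_hand(counts):
    # Illustrative sketch only; use skbio's shannon in practice.
    # Convert raw counts to proportions, then apply H = -sum(p * log2(p)).
    # Assumes every count is strictly positive.
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return -(p * np.log2(p)).sum()

print(shannon_by_hand([10, 10, 10]))  # ~1.585, i.e. log2(3)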
Let's return to Columbia, MO for this example. We can run our final query from the second Twitter notebook, convert the result to a data frame, and then run shannon over the count column. This is what that would look like:
In [9]:
statement = """
SELECT DISTINCT iso_language, COUNT(*)
FROM
(SELECT
DISTINCT ON (from_user, iso_language)
*
FROM (SELECT * FROM twitter.tweet WHERE job_id = 261 AND iso_language != 'und' LIMIT 10000) as T
ORDER BY from_user, iso_language) as UNIQ
GROUP BY iso_language;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
como = {}
for i in range(len(column_names)):
    como[column_names[i]] = [row[i] for row in rows]

# turn dictionary into a data frame
columbia = pd.DataFrame(como)
In [24]:
print("The shannon index for Columbia is {}".format(shannon(columbia['count'])))
Remember, there was a reason we did all of that preparation in the last notebook: some users were high-volume tweeters, which made the language counts less even. What do you think would happen if we ran the shannon function over the language counts without restricting them to one user per language? Is Columbia more or less diverse according to this measure?
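As a quick intuition check with made-up counts (a toy sketch, not real tweet data; values assume scikit-bio's default base-2 log): when one language's count is inflated by a few heavy tweeters, the distribution is less even and the index drops.
dominated = [100, 5, 5]   # one language inflated by high-volume tweeters
balanced = [40, 35, 35]   # same three languages, similar total, far more even
print(shannon(dominated))  # ~0.53
print(shannon(balanced))   # ~1.58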
Run a query for Columbia, MO that skips our preparation steps. Is Columbia more or less diverse before cleanup?
In [22]:
# put your code here
# ------------------
# query database
statement = """
SELECT iso_language, COUNT(*)
FROM (
SELECT iso_language
FROM twitter.tweet
WHERE job_id = 261
LIMIT 10000) AS users
GROUP BY iso_language
ORDER BY count;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
job_261 = {}
for i in range(len(column_names)):
    job_261[column_names[i]] = [row[i] for row in rows]

# turn dictionary into a data frame
not_clean = pd.DataFrame(job_261)
shannon(not_clean['count'])
This isn't all we want, though. One city is boring if we can't compare it to others. This is where pandas comes in handy: we can chain together methods to return the results we are looking for.
Below is a sample data frame that contains language counts for three different cities. We can use the groupby method to find the shannon index for each city on the count column. Let's take a look...
In [26]:
samp = pd.DataFrame({'lang': ['en','es','fr','en','es','fr','en','es','fr','ru'],
'count': [30,20,10,11,10,9,30,1,1,1],
'city': [1,1,1,2,2,2,3,3,3,3]})
samp
Out[26]:
In [27]:
samp['count'].groupby(samp['city']).apply(shannon)
Out[27]:
Here, we start with the count column, as this is the column we want to find shannon on. Then we group by city in order to find shannon per city. Finally, we run the apply method because shannon is a function and not a pandas method; apply will apply the shannon function to every grouping of the count column.
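As a side note, an equivalent and arguably more idiomatic spelling groups the frame first and then selects the column:
# Equivalent chaining: group the frame by city, select the count
# column, then apply shannon to each group's counts.
samp.groupby('city')['count'].apply(shannon)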
Now run shannon over the language counts of df. Be sure to group by city (job_id).
In [35]:
# put your code here
# ------------------
df['count'].groupby(df['job_id']).apply(shannon)
Out[35]:
We can also sort the values:
In [73]:
samp['count'].groupby(samp['city']).apply(shannon).sort_values()
Out[73]:
In [36]:
# put your code here
# ------------------
df['count'].groupby(df['job_id']).apply(shannon).sort_values()
Out[36]: