Over the past century, human mobility has reached its greatest extent since our species' dawn. This mobility begets diverse hotspots across the landscape, though some places are more diverse than others. A couple of questions should come to mind. What do we mean when we call a place diverse? What type of diversity? In this case, we are talking about the topography of languages in particular cities across the United States, but we could just as easily be measuring ethnic or socioeconomic diversity. And how do we measure diversity? These are just a few of the questions that we will explore as we begin to formalize our understanding of big data manipulation.
What is linguistic diversity? For our purposes, linguistic diversity is a measure that accounts for both the number of languages spoken in a particular area and the number of speakers per language. This is where Twitter comes in. Collecting language data on a region, in this case cities, is particularly difficult, and if you were looking for an open source dataset with such linguistic information, you would be hard pressed to find one that counts past five distinct languages. Twitter, however, provides the language of each tweet as an attribute accessible through its API. It also allows a user to collect tweets within a specified region by specifying a geographic radius around a city. This gives us the unique ability to analyze the linguistic landscape of a given region, which can then be compared between locations or evaluated across time splits in the data.
It is also important to note that this may not be representative of any given physical location. In fact, it almost certainly overweights English speakers compared to speakers of other languages: a large proportion of Twitter is written in English, out of proportion to the number of native English speakers. Yet this doesn't mean that there isn't any connection to the physical landscape. All things being equal, the expectation would be that more diverse Twitter cities are also more diverse physical cities and vice versa. However, we won't be making such assumptions throughout these notebooks. To do so would require some sort of validation, which is outside the purview of these lessons.
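To make those two components concrete before we touch the database, here is a minimal, illustrative sketch (the language codes and counts below are made up, not taken from the Twitter data) of pulling both pieces out of a plain Python list of tweet language tags:

```python
from collections import Counter

# hypothetical ISO language codes attached to a handful of tweets
tweet_langs = ["en", "en", "es", "en", "fr", "es", "en", "zh"]

# component 1: how many distinct languages appear
n_languages = len(set(tweet_langs))

# component 2: how many tweets (a rough proxy for speakers) per language
tweets_per_language = Counter(tweet_langs)

print(n_languages)           # 4
print(tweets_per_language)   # Counter({'en': 4, 'es': 2, 'fr': 1, 'zh': 1})
```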
The Twitter data are stored in a Postgres database that contains several tables. The primary table that we will be working with is the `tweet` table; however, there are also `hashtag`, `mention`, `url`, and `job` tables. We will come back to these, but for the time being, let's go over the `tweet` table and its attributes.
attribute | description |
---|---|
`tweet_id_str` | tweet's identifier |
`job_id` | job identifier (pertaining to geographic location) |
`created_at` | when the tweet was written |
`text` | the text of the tweet |
`from_user` | user id who created the tweet |
`from_user_name` | username of tweet creator |
`from_user_created_at` | the date the user was created |
`from_user_followers` | number of followers the user has |
`from_user_following` | number of people the user is following |
`from_user_favorites` | sum of likes of user's tweets |
`to_user` | the id of the person the user tweeted at |
`to_user_name` | the user name of the person tweeted at |
`location_geo` | the coordinates of where the tweet was sent out |
`iso_language` | the language of the tweet |
We can also query for the column names of the table and their data types...
In [1]:
import psycopg2
import pandas as pd
In [2]:
# define our query
statement = """SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'tweet';"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    # execute the statement from above
    cursor.execute(statement)
    # fetch all of the rows associated with the query
    cols = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
cols
Out[2]:
Okay, so we have a good idea about the type of information the `tweet` table contains, so let's return to the other tables in the database. We mentioned these above, but if you were just curious about which tables exist in the database, you could query Postgres for them like so...
In [3]:
statement = """SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'twitter'"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    tables = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
tables
Out[3]:
In [4]:
# put your code here
# ------------------
statement = """SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'hashtag';"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    # execute the statement from above
    cursor.execute(statement)
    # fetch all of the rows associated with the query
    cols = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
cols
Out[4]:
Let's start looking at some of the data now. Most of our analysis revolves around the `tweet` table, so we will pick up from there and look at some of the data it contains. The first 10 rows should suffice...
In [5]:
statement = """SELECT *
FROM twitter.tweet
LIMIT 10;"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    # execute the statement from above
    cursor.execute(statement)
    # fetch all of the rows associated with the query
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
rows
Out[5]:
This is kind of a messy way to view the data. Right now we have an object called `rows`, which stores each row as a tuple, and all of the tuples are stored in a single list. How about we change this into a more readable format...
In [6]:
statement = """SELECT *
FROM twitter.tweet
LIMIT 10;"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    # grab the column names from the cursor description
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
We can easily turn this into a dictionary object, which can subsequently be turned into a pandas `DataFrame` object. For this dictionary, the field name (column name) will be the key and the values will be a list of the corresponding values from the `rows` object.
For example, `rows[<x>][0]` are all values of `tweet_id_str`. Conveniently, `column_names[0]` is `tweet_id_str`. Then all we have to do is create an empty dictionary and begin to fill it with the contents of `column_names` and `rows`.
In [7]:
tweet_dict = {}
for i in list(range(len(column_names))):
    tweet_dict['{}'.format(column_names[i])] = [x[i] for x in rows]
What's happening here? Well, we can assume that the number of values per row should equal the number of column names. Therefore, we iterate over a list ranging from 0 to the length of the `column_names` object and create a key from each column name item (`tweet_dict['{}'.format(column_names[i])]`). The values for each key come from the corresponding index in every row of the `rows` object, built up with a single-line list comprehension. A multiline `for` loop that does the same thing would look like:
list_o_lists = [] # create an empty list to store lists of values
for i in list(range(len(column_names))):
    vals = [] # create empty list to store the values
    for x in rows:
        vals.append(x[i])
    list_o_lists.append(vals)
...and then turning this dict into a data frame is simple. Just run `pandas`' `DataFrame` constructor over the dictionary object you just created.
In [8]:
pd.DataFrame(tweet_dict)
Out[8]:
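As an aside, pandas can also build the same data frame directly from the list of row tuples, skipping the intermediate dictionary; a quick sketch reusing the `rows` and `column_names` objects from above:

```python
# DataFrame accepts the list of row tuples plus the column labels directly
df = pd.DataFrame(rows, columns=column_names)
df.head()
```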
Back to linguistic diversity...
So we know that one of the components of linguistic diversity is the number of unique languages. We can find the unique languages through a simple query of the database. Below we find the unique languages among 10,000 rows.
**A Note about Limits**: We limit the data returned to 10,000 rows because there are approximately 300 million tweets in the total dataset. If everyone queried the whole table without limits at the same time, we would need a much larger server! If you are curious about the results for all the tweets, drop us a note and we will run your analysis when the server load is low (i.e., while you sleep!).
In [9]:
statement = """SELECT DISTINCT iso_language
FROM (SELECT iso_language FROM twitter.tweet LIMIT 10000) AS langs"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    # fetch the distinct languages
    langs = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
Now the languages are in a variable called `langs`. To view that variable, we simply type it in an output cell like below.
In [10]:
langs
Out[10]:
In [11]:
# put your code here
# ------------------
statement = """SELECT COUNT(*) FROM (SELECT DISTINCT iso_language
FROM (SELECT iso_language FROM twitter.tweet LIMIT 10000) AS langs) AS lang_count"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    num_langs = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
print(num_langs)
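If you would rather do the counting in Python instead of nesting another SQL query, the `langs` list fetched earlier already holds one tuple per distinct language, so its length gives the same answer:

```python
# each element of langs is a one-item tuple, e.g. ('en',)
print(len(langs))
```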
The other component of diversity, at least for our measure, is the number of speakers per language. Again, this can be done in SQL.
In [12]:
# use COUNT with a GROUP BY to count the number of speakers per language
statement = """SELECT DISTINCT iso_language, COUNT(*)
FROM (SELECT iso_language FROM twitter.tweet LIMIT 10000) AS langs GROUP BY iso_language;"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    # grab the column names from the cursor description
    column_names = [desc[0] for desc in cursor.description]
    num_langs = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
print(num_langs)
In [13]:
# put your code here
# ------------------
lang_dict = {}
for i in list(range(len(column_names))):
    lang_dict['{}'.format(column_names[i])] = [x[i] for x in num_langs]
pd.DataFrame(lang_dict)
Out[13]:
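Looking ahead, per-language counts like these are typically expressed as proportions of the whole sample before being fed into a diversity measure. A minimal sketch using the data frame we just built (this assumes Postgres returned the `COUNT(*)` column under its default name, `count`):

```python
lang_counts = pd.DataFrame(lang_dict)

# share of the 10,000-tweet sample written in each language
lang_counts['proportion'] = lang_counts['count'] / lang_counts['count'].sum()

lang_counts.sort_values('proportion', ascending=False).head()
```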
These were the basics. There is still a long way to go before we are ready for analysis, but this is a good place to start, as we will use these statements as building blocks for the rest of the Twitter notebooks. In the next notebook, we will cover further data manipulation and preparation in order to get the data into a state that is ready for analysis. This will include removing some rows and performing more aggregations.