Twitter: An Analysis in Linguistic Diversity

In the past century, human mobility has reached its greatest level since our species' dawn. This mobility begets diverse hotspots across the landscape, though some places are more diverse than others. A couple of questions should come to mind. What do we mean when we call a place diverse? What type of diversity? In this case, we are talking about the topography of languages across particular cities in the United States, though we could just as well be measuring ethnic or socioeconomic diversity. And how do we measure diversity? These are just a few of the questions we will explore as we begin to formalize our understanding of big data manipulation.


Linguistic Diversity and Twitter's Role

What is linguistic diversity? For our purposes, linguistic diversity is a measure that accounts for both the number of languages spoken in a particular area and the number of speakers per language. This is where Twitter comes in. Collecting language data on a region, in this case cities, is particularly difficult; if you went looking for an open source dataset with such linguistic information, you would be hard pressed to find one that counts past five distinct languages. Twitter, however, provides the language of each tweet as an attribute accessible through its API. It also allows a user to collect tweets within a specified region by specifying a geographic radius around a city. This gives us the unique ability to analyze the linguistic landscape of a given region, which can then be compared between locations or evaluated across time splits in the data.

It is also important to note that this may not be representative of any given physical location. In fact, it almost certainly overweights English speakers compared to others: a large proportion of Twitter is written in English, more than the share of native English speakers would predict. Yet this doesn't mean there isn't any connection to the physical landscape. All things being equal, the expectation would be that more diverse Twitter cities are also more diverse physical cities, and vice versa. However, we won't be making such assumptions throughout these notebooks; to do so would require some sort of validation, which is outside the purview of these lessons.
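Though we will formalize our measure later, one standard way to capture both components at once (how many languages there are, and how evenly tweets are spread among them) is Shannon entropy from information theory. The sketch below is illustrative only, not necessarily the exact measure these notebooks settle on:

```python
import math

def shannon_diversity(counts):
    """Shannon entropy over per-language tweet counts: higher values mean
    more languages and/or a more even split among them."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# A city tweeting in a single language has (effectively) zero diversity,
# while an even four-way split scores log(4), about 1.386.
print(shannon_diversity([100]))
print(shannon_diversity([25, 25, 25, 25]))
```

Note that the measure rewards evenness as well as raw language count: 9,700 English tweets plus a handful in other languages scores far lower than the same languages evenly represented.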


The Data

The Twitter data are stored in a Postgres database that contains several tables. The primary table we will be working with is the tweet table; however, there are also hashtag, mention, url, and job tables. We will come back to these, but for the time being, let's go over the tweet table and its attributes.

attribute description
tweet_id_str tweet's identifier
job_id job identifier (pertaining to geographic location)
created_at when the tweet was written
text the text of the tweet
from_user user id who created the tweet
from_user_name username of tweet creator
from_user_created_at the date the user was created
from_user_followers number of followers the user has
from_user_following number of people the user is following
from_user_favorites sum of likes of user's tweets
to_user the id of the person the user tweeted at
to_user_name the user name of the person tweeted at
location_geo the coordinates of where the tweet was sent out
iso_language the language of the tweet

We can also query for the column names of the table and their data type...


In [1]:
import psycopg2
import pandas as pd

In [2]:
# define our query
statement = """SELECT column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_name = 'tweet';"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    
    # execute the statement from above
    cursor.execute(statement)
    # fetch all of the rows associated with the query
    cols = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
cols


Out[2]:
[('tweet_id_str', 'character varying', 'NO'),
 ('job_id', 'integer', 'NO'),
 ('created_at', 'timestamp without time zone', 'NO'),
 ('text', 'text', 'NO'),
 ('from_user', 'character varying', 'NO'),
 ('from_user_id_str', 'character varying', 'NO'),
 ('from_user_name', 'character varying', 'NO'),
 ('from_user_fullname', 'text', 'NO'),
 ('from_user_created_at', 'timestamp without time zone', 'NO'),
 ('from_user_followers', 'integer', 'NO'),
 ('from_user_following', 'integer', 'NO'),
 ('from_user_favorites', 'integer', 'NO'),
 ('from_user_tweets', 'integer', 'NO'),
 ('from_user_timezone', 'character varying', 'YES'),
 ('to_user', 'character varying', 'YES'),
 ('to_user_id_str', 'character varying', 'YES'),
 ('to_user_name', 'character varying', 'YES'),
 ('source', 'text', 'YES'),
 ('location_geo', 'text', 'YES'),
 ('location_geo_0', 'numeric', 'YES'),
 ('location_geo_1', 'numeric', 'YES'),
 ('iso_language', 'character varying', 'NO'),
 ('analysis_state', 'integer', 'YES')]

Okay, so now we have a good idea of the type of information the tweet table contains, so let's return to the other tables in the database. We mentioned these above, but if you were simply curious about what tables exist in the database, you could query Postgres for them like so...


In [3]:
statement = """SELECT table_name 
    FROM information_schema.tables
    WHERE table_schema = 'twitter'"""


try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    
    cursor.execute(statement)
    
    tables = cursor.fetchall()

except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
tables


Out[3]:
[('tweet',), ('url',), ('mention',), ('hashtag',), ('job',)]
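Since the connect-execute-fetch pattern above will repeat throughout these notebooks, it can be wrapped in a small helper. This is just a sketch (the notebooks keep the explicit pattern for clarity), with the import deferred so the function can be defined even where the driver isn't installed:

```python
def run_query(statement):
    """Connect to the twitter database, execute one statement,
    and return all fetched rows, closing the connection afterward."""
    import psycopg2  # deferred so the helper parses without the driver installed
    connect_str = ("dbname='twitter' user='dsa_ro_user' "
                   "host='dbase.dsa.missouri.edu' password='readonly'")
    conn = psycopg2.connect(connect_str)
    try:
        with conn.cursor() as cursor:
            cursor.execute(statement)
            return cursor.fetchall()
    finally:
        conn.close()

# Example (requires network access to the database):
# tables = run_query("SELECT table_name FROM information_schema.tables "
#                    "WHERE table_schema = 'twitter';")
```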

YOUR TURN

Above we ran a bit of code to return the column names of the tweet table. Pick one of the tables above (not tweet) and check out its columns.


In [4]:
# put your code here
# ------------------
statement = """SELECT column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_name = 'hashtag';"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    
    # execute the statement from above
    cursor.execute(statement)
    # fetch all of the rows associated with the query
    cols = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
cols


Out[4]:
[('hashtag_id', 'bigint', 'NO'),
 ('tweet_id', 'character varying', 'NO'),
 ('text', 'character varying', 'NO'),
 ('index_start', 'smallint', 'NO'),
 ('index_end', 'smallint', 'NO'),
 ('job_id', 'integer', 'NO'),
 ('analysis_state', 'integer', 'YES')]

Let's start looking at some of the data now. Most of our analysis revolves around the tweet table, so we will pick up from there and look at some of its rows. The first 10 should suffice...


In [5]:
statement = """SELECT * 
FROM twitter.tweet
LIMIT 10;"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    
    # execute the statement from above
    cursor.execute(statement)
    # fetch all of the rows associated with the query
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
rows


Out[5]:
[('852188225141702658',
  284,
  datetime.datetime(2017, 4, 12, 15, 54, 40),
  'RT @true_pundit: WATCH: Dem Congressman Floats Insane Conspiracy Theory On Live TV, Offers No Evidence #TruePundit https://t.co/W4lqS9Fq6f',
  '179987990',
  '179987990',
  'fjisback',
  'Julie G',
  datetime.datetime(2010, 8, 18, 15, 36, 39),
  168,
  155,
  34867,
  9907,
  None,
  None,
  None,
  None,
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  None,
  None,
  None,
  'en',
  0),
 ('852188225145798657',
  290,
  datetime.datetime(2017, 4, 12, 15, 54, 40),
  '@J_Nova_Kane As long as you have sirracha',
  '315966784',
  '315966784',
  'Shoes_n_Natural',
  'AfroditE',
  datetime.datetime(2011, 6, 12, 18, 47, 27),
  572,
  375,
  11211,
  115584,
  'Quito',
  '169741804',
  '169741804',
  'J_Nova_Kane',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  None,
  None,
  None,
  'en',
  0),
 ('852188225233985536',
  273,
  datetime.datetime(2017, 4, 12, 15, 54, 40),
  "I'm all for the prospect but am afraid of altering the skyline bc I really like the way it looks now https://t.co/Z9k4oIh4Je",
  '762789480285757440',
  '762789480285757440',
  'egg_mom2',
  'eggmom',
  datetime.datetime(2016, 8, 8, 23, 15, 59),
  33,
  30,
  1206,
  881,
  None,
  None,
  None,
  None,
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  None,
  None,
  None,
  'en',
  0),
 ('852188225263292421',
  275,
  datetime.datetime(2017, 4, 12, 15, 54, 40),
  'RT @fight4theyouth: Joy is the ultimate rebellion. Happiness is the fastest track to success. Bliss is the best achievement. --@chaninicholâ\x80¦',
  '26175426',
  '26175426',
  'flowerwheel76',
  'Amy Blumenreder',
  datetime.datetime(2009, 3, 24, 4, 12, 41),
  1286,
  2416,
  106969,
  57457,
  'Eastern Time (US & Canada)',
  None,
  None,
  None,
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  None,
  None,
  None,
  'en',
  0),
 ('852188225422503937',
  258,
  datetime.datetime(2017, 4, 12, 15, 54, 40),
  '@tamarapalooza YEP',
  '423986080',
  '423986080',
  '__blve',
  'hali',
  datetime.datetime(2011, 11, 29, 5, 22),
  381,
  352,
  39078,
  18817,
  'Pacific Time (US & Canada)',
  '3317200850',
  '3317200850',
  'tamarapalooza',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  None,
  None,
  None,
  'und',
  0),
 ('852188225422667776',
  284,
  datetime.datetime(2017, 4, 12, 15, 54, 40),
  '@FueledbyLOLZ happy anniversary!',
  '2576924275',
  '2576924275',
  'HappyRunningCo',
  'HappyRunningCo',
  datetime.datetime(2014, 6, 19, 13, 49, 4),
  693,
  485,
  12077,
  5043,
  None,
  '216194230',
  '216194230',
  'FueledbyLOLZ',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  None,
  None,
  None,
  'en',
  0),
 ('852188225443471360',
  269,
  datetime.datetime(2017, 4, 12, 15, 54, 40),
  'RT @JasonKander: We have to cut public transit funding for working people so POTUS, whose wife and son live in NYC, can fly to Florida by hâ\x80¦',
  '1605994344',
  '1605994344',
  'LasEkristen',
  'Elizabeth Kristen',
  datetime.datetime(2013, 7, 19, 14, 12, 36),
  1334,
  4299,
  67,
  18999,
  None,
  None,
  None,
  None,
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  None,
  None,
  None,
  'en',
  0),
 ('852188225447890945',
  273,
  datetime.datetime(2017, 4, 12, 15, 54, 40),
  'RT @johnchartman: Not bad Garcia',
  '2440571473',
  '2440571473',
  'Korab_VII',
  'Korrigan',
  datetime.datetime(2014, 4, 12, 20, 23, 9),
  587,
  639,
  3762,
  726,
  None,
  None,
  None,
  None,
  '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>',
  None,
  None,
  None,
  'en',
  0),
 ('852188225485643780',
  275,
  datetime.datetime(2017, 4, 12, 15, 54, 40),
  '.@PrimeraTech Introduces LX1000 Color Label Printer https://t.co/Wv4nMjMaZP https://t.co/eLZYgYOSBA',
  '32906967',
  '32906967',
  'BevNET',
  'BevNET.com',
  datetime.datetime(2009, 4, 18, 15, 4, 24),
  14266,
  1597,
  983,
  21073,
  'Quito',
  None,
  None,
  None,
  '<a href="http://www.bevnet.com" rel="nofollow">BevNET</a>',
  None,
  None,
  None,
  'es',
  0),
 ('852188225527570433',
  273,
  datetime.datetime(2017, 4, 12, 15, 54, 40),
  'RT @nikkiorion: G://www.smarturl.it/ye',
  '712899113113763840',
  '712899113113763840',
  'lydiamay_2',
  'lydia',
  datetime.datetime(2016, 3, 24, 7, 9, 28),
  68,
  321,
  6785,
  176,
  None,
  None,
  None,
  None,
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  None,
  None,
  None,
  'en',
  0)]

This is kind of a messy way to view the data. Right now we have an object called rows, which stores each row as a tuple, and all of the tuples are stored in a single list. How about we change this into a more readable format...


In [6]:
statement = """SELECT * 
FROM twitter.tweet
LIMIT 10;"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

We can easily turn this into a dictionary object, which can subsequently be turned into a pandas DataFrame object. For this dictionary, the field name (column name) will be the key, and the value will be a list of the corresponding entries from the rows object.

For example, rows[<x>][0] holds the tweet_id_str value of each row, and conveniently, column_names[0] is tweet_id_str. Then all we have to do is create an empty dictionary and fill it with the contents of column_names and rows.


In [7]:
tweet_dict = {}
for i in range(len(column_names)):
    tweet_dict['{}'.format(column_names[i])] = [x[i] for x in rows]

What's happening here? Well, we can assume that the number of values per row equals the number of column names. Therefore, we iterate over a range from zero to the length of the column_names object and create a key from each column name (tweet_dict['{}'.format(column_names[i])]). The values are the entries at the corresponding index in each row of the rows object. We use a single-line list comprehension to build the list of values for each key. A multiline for loop that does the same thing would look like:

list_o_lists = [] # create an empty list to store lists of values

for i in list(range(len(column_names))):
    vals = [] # create empty list to store the values
    for x in rows:
        vals.append(x[i])
    list_o_lists.append(vals)

...and then turning this dict into a data frame is simple: just run pandas' DataFrame constructor over the dictionary object you just created.


In [8]:
pd.DataFrame(tweet_dict)


Out[8]:
analysis_state created_at from_user from_user_created_at from_user_favorites from_user_followers from_user_following from_user_fullname from_user_id_str from_user_name ... job_id location_geo location_geo_0 location_geo_1 source text to_user to_user_id_str to_user_name tweet_id_str
0 0 2017-04-12 17:57:18 293690109 2011-05-05 20:04:11 180 570 711 ♍️ 293690109 _Butter21 ... 210 None None None <a href="http://twitter.com/download/iphone" r... RT @BleacherReport: MLB commish Rob Manfred wa... None None None 852219087375781889
1 0 2017-04-12 17:57:18 4859763987 2016-01-29 10:24:41 129 54 331 Reinardo jose silva 4859763987 JoseVerde28 ... 223 None None None <a href="http://blackberry.com/twitter" rel="n... RT @WarOfParlay: Ayer Lamentablemente fue un d... None None None 852219087405150209
2 0 2017-04-12 17:57:18 3537421936 2015-09-03 20:10:35 1724 330 268 Karen 3537421936 karen_marin02 ... 236 None None None <a href="http://twitter.com/download/android" ... Si no te importa lo que piense la gente, ya di... None None None 852219087442915330
3 0 2017-04-12 17:57:18 3056458174 2015-02-23 16:59:19 50357 6117 5763 PROMO GANG C.E.O. 3056458174 ceep757 ... 284 None None None <a href="http://twitter.com" rel="nofollow">Tw... RT @TUOWLS_WBB: Our Big 5 award winners... hig... None None None 852219087530991624
4 0 2017-04-12 17:57:18 849320016315023362 2017-04-04 17:57:26 1067 179 454 Bob Abooey 849320016315023362 BobAbooey8 ... 275 None None None <a href="http://twitter.com" rel="nofollow">Tw... Watch Shia LaBeouf Text People From A Remote C... None None None 852219087631650816
5 0 2017-04-12 17:57:18 43871279 2009-06-01 11:50:18 71484 2588 2298 absentminded 43871279 mlccm ... 290 None None None <a href="http://twitter.com/download/iphone" r... RT @nattylumpo88: Ok. Alright. We get it. You ... None None None 852219087845511173
6 0 2017-04-12 17:57:18 30308791 2009-04-10 20:46:41 13742 473 596 Shaker Pepper 30308791 shakepepper ... 269 None None None <a href="http://twitter.com/#!/download/ipad" ... RT @cdotharrison: I saw Charlie Murphy at the ... None None None 852219087954493440
7 0 2017-04-12 17:57:18 46049431 2009-06-10 05:56:50 283 6197 1693 Plushbeds.com 46049431 plushbeds ... 284 None None None <a href="http://www.audiense.com" rel="nofollo... @HBQ_1 We've got a question for you: about how... 17570566 17570566 HBQ_1 852219087967145984
8 0 2017-04-12 17:57:18 3431548829 2015-08-19 14:21:23 592 25 106 RRussell 3431548829 RRussell74351 ... 255 None None None <a href="http://www.twitter.com" rel="nofollow... RT @saladinahmed: it's not 'sharia' if you're ... None None None 852219088059412482
9 0 2017-04-12 17:57:18 356713597 2011-08-17 07:44:59 6372 362 714 Leah 356713597 brLeaHkaway ... 275 None None None <a href="http://twitter.com/download/iphone" r... RT @celtics: Get ready for the playoffs with t... None None None 852219088113999882

10 rows × 23 columns
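As an aside, pandas can also consume the list of tuples directly via the DataFrame constructor's columns argument, skipping the intermediate dictionary entirely. A small stand-in example (the rows and column names here are a tiny made-up subset, not the real query result):

```python
import pandas as pd

# A tiny made-up stand-in for the cursor results above.
column_names = ["tweet_id_str", "iso_language"]
rows = [("852188225141702658", "en"), ("852188225145798657", "es")]

# pandas accepts the list of tuples directly; no intermediate dict needed.
df = pd.DataFrame(rows, columns=column_names)
print(df.shape)  # (2, 2)
```

The dictionary route above is still worth knowing, since it makes explicit how columns map to lists of values.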

Back to linguistic diversity...

So we know that one of the components of linguistic diversity is the number of unique languages. We can find the unique languages through a simple query of the database. Below we find the unique languages from 10,000 rows.

**A Note about Limits**: We limit the data returned to 10,000 rows because there are approximately 300 million tweets in the total dataset. If everyone queried the whole table without limits at the same time, we would need a much larger server! If you are curious about the results for all the tweets, drop us a note and we will run your analysis when the server load is low (i.e., while you sleep!).


In [9]:
statement = """SELECT DISTINCT iso_language 
FROM (SELECT iso_language FROM twitter.tweet LIMIT 10000) AS langs"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)

    langs = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

Now the languages are in a variable called langs. To view that variable, we simply type it in an output cell like below.


In [10]:
langs


Out[10]:
[('pt',),
 ('cy',),
 ('ru',),
 ('de',),
 ('fr',),
 ('es',),
 ('nl',),
 ('ro',),
 ('el',),
 ('is',),
 ('eu',),
 ('tr',),
 ('pl',),
 ('ja',),
 ('th',),
 ('da',),
 ('ar',),
 ('lv',),
 ('fi',),
 ('und',),
 ('lt',),
 ('en',),
 ('in',),
 ('tl',),
 ('ko',),
 ('et',),
 ('no',),
 ('sv',),
 ('it',),
 ('iw',),
 ('ht',)]

YOUR TURN

How many unique languages are there in 10,000 rows? Feel free to use SQL or Python.


In [11]:
# put your code here
# ------------------

statement = """SELECT COUNT(*) FROM (SELECT DISTINCT iso_language 
FROM (SELECT iso_language FROM twitter.tweet LIMIT 10000) AS langs) AS lang_count"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    
    cursor.execute(statement)
    num_langs = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
print(num_langs)


[(40,)]
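Since langs (from Out[10] above) is just a Python list of one-element tuples, the Python route is simply len(langs). Illustrated here with a stand-in subset rather than the real query result:

```python
# langs is a list of one-element tuples, so a Python alternative to the
# SQL COUNT is simply len(langs). A stand-in subset for illustration:
langs = [('pt',), ('cy',), ('ru',), ('en',)]
print(len(langs))  # 4
```

Note that LIMIT without an ORDER BY returns an arbitrary subset, so repeated runs can report slightly different counts.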

The other component of diversity is the number of speakers per language (at least for our measure). Again, this can be done in SQL.


In [12]:
# use COUNT with a GROUP BY to count the number of speakers per language
statement = """SELECT DISTINCT iso_language, COUNT(*) 
FROM (SELECT iso_language FROM twitter.tweet LIMIT 10000) AS langs GROUP BY iso_language;"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    num_langs = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
print(num_langs)


[('es', 388), ('th', 12), ('cs', 1), ('tl', 55), ('lt', 4), ('ar', 18), ('sv', 5), ('in', 44), ('nl', 10), ('is', 2), ('fi', 4), ('no', 3), ('ko', 34), ('cy', 5), ('pl', 3), ('da', 4), ('ro', 3), ('eu', 1), ('ht', 16), ('et', 13), ('und', 621), ('pt', 57), ('ru', 4), ('hi', 1), ('fr', 17), ('ja', 79), ('en', 8557), ('de', 17), ('ur', 2), ('tr', 8), ('it', 11), ('zh', 1)]

YOUR TURN

Put the above counts in a data frame object where one column is the language and the other is the number of speakers (really the number of tweets per language).


In [13]:
# put your code here
# ------------------

lang_dict = {}
for i in range(len(column_names)):
    lang_dict['{}'.format(column_names[i])] = [x[i] for x in num_langs]
        
pd.DataFrame(lang_dict)


Out[13]:
count iso_language
0 388 es
1 12 th
2 1 cs
3 55 tl
4 4 lt
5 18 ar
6 5 sv
7 44 in
8 10 nl
9 2 is
10 4 fi
11 3 no
12 34 ko
13 5 cy
14 3 pl
15 4 da
16 3 ro
17 1 eu
18 16 ht
19 13 et
20 621 und
21 57 pt
22 4 ru
23 1 hi
24 17 fr
25 79 ja
26 8557 en
27 17 de
28 2 ur
29 8 tr
30 11 it
31 1 zh
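With counts in hand, one natural next step is converting them to proportions, which makes the English skew discussed at the start explicit. The sketch below uses a hard-coded subset of the counts above so it runs on its own:

```python
import pandas as pd

# Hard-coded subset of the counts from Out[12]; the full result has more languages.
num_langs = [('en', 8557), ('und', 621), ('es', 388), ('ja', 79), ('pt', 57)]

# Build a frame and add each language's share of the total.
df = pd.DataFrame(num_langs, columns=['iso_language', 'count'])
df['share'] = df['count'] / df['count'].sum()
print(df.sort_values('share', ascending=False))
```

Even in this subset, English accounts for well over 80% of tweets, which is the overweighting we flagged earlier.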

These were the basics. There is still a long way to go before we are ready for analysis, but this is a good place to start, as we will use these statements as building blocks for the rest of the Twitter notebooks. In the next notebook, we will cover further data manipulation and preparation to get the data into a state that is ready for analysis, including the removal of some rows and more aggregations.