Twitter: An Analysis of Linguistic Diversity

Part III

RECAP: In the second Twitter notebook, we left off with a query that provided us with the language counts for the city of Columbia, MO. But remember, we also had to do some prep work before we could actually find these counts: restricting the data to one user per language and removing the "und" language rows.

YOUR TURN

Today we are going to pick up where we left off, except this time we aren't going to limit it to one city. Run a query that returns the language counts for each city. Remember to restrict it to one user per language (per city) and to remove the 'und' iso_language rows. LIMIT ROWS TO 100,000! Cities are identified by job_id, and not all jobs correspond to the geographic location of the tweet, so make sure you query only those jobs with a job_id greater than or equal to 255.


In [1]:
# BE SURE TO RUN THIS CELL BEFORE ANY OF THE OTHER CELLS

import psycopg2
import pandas as pd

In [33]:
# put your code here
# ------------------

statement = """
SELECT DISTINCT iso_language, job_id,COUNT(*)
FROM
(SELECT
  DISTINCT ON (from_user, iso_language)
  *
FROM (SELECT * FROM twitter.tweet WHERE iso_language != 'und' AND job_id >=  255 LIMIT 100000) as T
ORDER BY from_user, iso_language) as UNIQ
GROUP BY iso_language, job_id;
"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu'password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
# create dictionary from the rows and column names
cities = {}
for i, name in enumerate(column_names):
    cities[name] = [row[i] for row in rows]

# turn dictionary into a data frame
pd.DataFrame(cities)


Out[33]:
count iso_language job_id
0 10 et 295
1 4 hi 255
2 29 ht 305
3 1 lt 295
4 2 nl 283
5 5 de 264
6 17 fr 301
7 1 ro 293
8 4 fi 304
9 5 tr 305
10 18 fa 296
11 1 no 263
12 2 hu 303
13 16 pt 292
14 17 ro 294
15 10 eu 303
16 6271 en 297
17 22 it 266
18 5 es 298
19 35 it 291
20 27 tr 278
21 9 ro 291
22 14 it 279
23 21 pl 291
24 9 ar 297
25 3 it 274
26 7 ja 301
27 66 ko 275
28 5 tr 276
29 5 uk 262
... ... ... ...
1410 1 bg 296
1411 2 lt 257
1412 2 ar 264
1413 5 lv 290
1414 14 cy 269
1415 1 zh 269
1416 10 ja 285
1417 10 nl 294
1418 1 eu 287
1419 3 it 278
1420 1 vi 279
1421 2 vi 296
1422 5 tr 258
1423 11 ht 263
1424 14 in 301
1425 28 ht 295
1426 4 cs 255
1427 8 de 276
1428 2 no 264
1429 3 eu 292
1430 2 cy 299
1431 3 cy 259
1432 1 nl 278
1433 24 es 302
1434 3 tr 297
1435 3 sl 282
1436 1 ne 266
1437 11865 en 279
1438 2 sl 304
1439 1 bn 294

1440 rows × 3 columns

Databases can do a lot, but some things are more easily achieved through the flexibility of a general-purpose programming language like Python. In this notebook, we are going to run a query, return the results to a pandas DataFrame object, and perform some operations over the data frame in order to analyze our data. Let's go ahead and create a data frame.

YOUR TURN

All the way back in the first Twitter notebook we covered how to create a dictionary from our returned query and subsequently how we can turn that dictionary into a data frame using pandas. From the query above, turn the language counts per city into a data frame. Call this data frame df.


In [34]:
# put your code here
# ------------------


df = pd.DataFrame(cities)
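
As an aside, pandas can also build the frame directly from the cursor results, skipping the dictionary step. A minimal alternative sketch, assuming rows and column_names are still in scope from the query cell above:

In [ ]:
# equivalent to the dictionary route: each tuple in rows becomes a row,
# and column_names supplies the column labels
df = pd.DataFrame(rows, columns=column_names)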

We have made substantial progress in preparing to calculate the linguistic diversity of these different cities. However, we want a single metric that can summarize each city. For this, we're going to borrow from the biological literature and use Shannon's Diversity Index $H$. This is a sort of entropy index that is generally used to measure the biological richness of an ecosystem.

We needn't get into the weeds about this measure, but know that it takes into account both the number of unique languages and how balanced those languages are within a city. Therefore, the maximum Shannon index for a given number of languages occurs when every language has an equal count.
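
For reference, if $p_i$ is the proportion of tweets in language $i$, the index is

$$H = -\sum_{i} p_i \log_2 p_i$$

(scikit-bio's shannon defaults to log base 2, which is why three equally common languages score $\log_2 3 \approx 1.585$ in the examples below).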

We can go ahead and see how this works by importing the function from the scikit-bio library.


In [12]:
from skbio.diversity.alpha import shannon

In [13]:
pop1 = [10,10,10]
pop2 = [10,20,30]
pop3 = [100,200,300]


print("pop1 ({}) is more diverse that pop2 ({}), which is as diverse as pop3({})".format(
    shannon(pop1),shannon(pop2),shannon(pop3)))


pop1 (1.584962500721156) is more diverse than pop2 (1.459147917027245), which is as diverse as pop3 (1.459147917027245)

In [14]:
pop1 = [10,10,10]
pop2 = [10,10,10,10]

print("""Also, more languages but equal eveness means greater diversity. 
For example, pop2 ({}) is more diverse than pop1 ({})""".format(shannon(pop2),shannon(pop1)))


Also, more languages but equal evenness means greater diversity. 
For example, pop2 (2.0) is more diverse than pop1 (1.584962500721156)
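
To demystify what shannon is computing, here is a hand-rolled version. This is a minimal sketch assuming the base-2 default described above; shannon_manual is our own illustrative helper, not part of scikit-bio:

In [ ]:
import numpy as np

def shannon_manual(counts):
    # convert raw counts to proportions, then apply H = -sum(p * log2(p))
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return -(p * np.log2(p)).sum()

print(shannon_manual([10, 10, 10]))  # 1.584962500721156, matching shannon(pop1)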

This shannon function operates on counts. Conveniently, we have already pulled out language counts per city, so we should be able to run it over the count column, city by city, to calculate the index for each city.

Let's return to Columbia, MO for this example. We can run our final query from the second Twitter notebook, convert the result to a data frame, and then run shannon over the count column. This is what that would look like:


In [9]:
statement = """
SELECT DISTINCT iso_language, COUNT(*)
FROM
(SELECT
  DISTINCT ON (from_user, iso_language)
  *
FROM (SELECT * FROM twitter.tweet WHERE job_id = 261 AND iso_language != 'und' LIMIT 10000) as T
ORDER BY from_user, iso_language) as UNIQ
GROUP BY iso_language;
"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu'password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
# create dictionary from the rows and column names
como = {}
for i, name in enumerate(column_names):
    como[name] = [row[i] for row in rows]

# turn dictionary into a data frame
columbia = pd.DataFrame(como)

In [24]:
print("The shannon index for Columbia is {}".format(shannon(columbia['count'])))


The shannon index for Columbia is 0.4826637062516524

YOUR TURN

Remember, there was a reason that we did all of that preparation in the last notebook: some users were high-volume tweeters, making the language counts less even. What do you think would happen if we ran the shannon function over language counts without restricting them to one user per language? Is Columbia more or less diverse according to this measure?

Run the query for Columbia, MO without our preparation steps. Is Columbia more or less diverse before cleanup?


In [22]:
# put your code here
# ------------------
# query database
statement = """
SELECT DISTINCT iso_language, COUNT(*) 
FROM (
    SELECT iso_language 
    FROM twitter.tweet 
    WHERE job_id = 261 
    LIMIT 10000) AS users 
GROUP BY iso_language
ORDER BY count;
"""

try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu'password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
    
# create dictionary from the rows and column names
job_261 = {}
for i, name in enumerate(column_names):
    job_261[name] = [row[i] for row in rows]

# turn dictionary into a data frame
not_clean = pd.DataFrame(job_261)

shannon(not_clean['count'])
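
To make the comparison explicit, you could print both indexes side by side. This sketch assumes the columbia data frame from the earlier cell is still in memory; the raw-count value itself is left for you to run:

In [ ]:
# compare the diversity index before and after the one-user-per-language cleanup
print("Raw counts:    ", shannon(not_clean['count']))
print("Cleaned counts:", shannon(columbia['count']))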

This isn't all we want, though. One city is boring if we can't compare it to others. This is where pandas comes in handy: we can chain together methods to return the results that we are looking for.

Below is a sample data frame that contains language counts for three different cities. We can use the groupby method to find the shannon index for each city on the count column. Let's take a look...


In [26]:
samp = pd.DataFrame({'lang': ['en','es','fr','en','es','fr','en','es','fr','ru'], 
                     'count': [30,20,10,11,10,9,30,1,1,1],
                     'city': [1,1,1,2,2,2,3,3,3,3]})

samp


Out[26]:
city count lang
0 1 30 en
1 1 20 es
2 1 10 fr
3 2 11 en
4 2 10 es
5 2 9 fr
6 3 30 en
7 3 1 es
8 3 1 fr
9 3 1 ru

In [27]:
samp['count'].groupby(samp['city']).apply(shannon)


Out[27]:
city
1    1.459148
2    1.580145
3    0.583584
Name: count, dtype: float64

Here, we start with the count column, as this is the column we want to find shannon on. Then we group by city in order to find shannon per city. Finally, we call the apply method, because shannon is a plain function and not a pandas method; apply runs the shannon function on each city's grouping of the count column.
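
For what it's worth, the same chain is often written by grouping the frame first and then selecting the column; both forms produce the same result here:

In [ ]:
# equivalent idiom: group the frame by city, select count, then apply shannon
samp.groupby('city')['count'].apply(shannon)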

YOUR TURN

Now run shannon over the language counts of df. Be sure to group by city (job_id).


In [35]:
# put your code here
# ------------------

df['count'].groupby(df['job_id']).apply(shannon)


Out[35]:
job_id
255    0.474082
257    0.811743
258    0.461246
259    0.570626
260    0.513110
261    0.219413
262    0.726236
263    0.428360
264    0.448493
265    0.247811
266    0.517386
267    0.482261
268    0.533045
269    0.469562
270    0.324071
271    1.022896
272    0.335922
273    0.442550
274    1.076723
275    0.534188
276    0.530560
277    0.557913
278    1.090108
279    0.607578
280    0.769191
281    0.423214
282    0.628510
283    0.683817
284    0.565751
285    0.501317
286    0.518679
287    0.438962
288    0.995194
289    0.551534
290    0.515006
291    0.637216
292    1.050421
293    0.276684
294    0.401339
295    0.454176
296    0.538168
297    0.486923
298    0.277594
299    0.294267
300    0.464356
301    0.510689
302    0.444084
303    0.647629
304    0.438157
305    0.372750
Name: count, dtype: float64

We can also sort the values:


In [73]:
samp['count'].groupby(samp['city']).apply(shannon).sort_values()


Out[73]:
city
3    0.583584
1    1.459148
2    1.580145
Name: count, dtype: float64

YOUR TURN

Now sort the shannon indexes. What is the least linguistically diverse city according to this dataset? The most?


In [36]:
# put your code here
# ------------------

df['count'].groupby(df['job_id']).apply(shannon).sort_values()


Out[36]:
job_id
261    0.219413
265    0.247811
293    0.276684
298    0.277594
299    0.294267
270    0.324071
272    0.335922
305    0.372750
294    0.401339
281    0.423214
263    0.428360
304    0.438157
287    0.438962
273    0.442550
302    0.444084
264    0.448493
295    0.454176
258    0.461246
300    0.464356
269    0.469562
255    0.474082
267    0.482261
297    0.486923
285    0.501317
301    0.510689
260    0.513110
290    0.515006
266    0.517386
286    0.518679
276    0.530560
268    0.533045
275    0.534188
296    0.538168
289    0.551534
277    0.557913
284    0.565751
259    0.570626
279    0.607578
282    0.628510
291    0.637216
303    0.647629
283    0.683817
262    0.726236
280    0.769191
257    0.811743
288    0.995194
271    1.022896
292    1.050421
274    1.076723
278    1.090108
Name: count, dtype: float64
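
Reading the sorted output, job 261 (Columbia, MO) has the lowest index and job 278 the highest. You can also pull those answers out programmatically with the Series idxmin and idxmax methods:

In [ ]:
shannon_by_city = df['count'].groupby(df['job_id']).apply(shannon)
print("Least diverse job_id:", shannon_by_city.idxmin())  # 261
print("Most diverse job_id: ", shannon_by_city.idxmax())  # 278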