RECAP: In the second Twitter notebook, we left off with a query that provided the language counts for the city of Columbia, MO. But remember, we also had to do some prep work before we could actually find these counts: restricting the data to one user per language and then removing the "und" language rows.
Today we are going to pick up where we left off, except this time we aren't going to limit it to one city. Run a query that returns the language counts for each city. Remember to restrict it to one user per language (per city) and to remove the 'und' iso_language rows. LIMIT ROWS TO 100,000! A city is identified by its job_id, and not all jobs correspond to the geographic location of the tweet, so make sure you query only those jobs with a job_id greater than or equal to 255.
In [1]:
# BE SURE TO RUN THIS CELL BEFORE ANY OF THE OTHER CELLS
import psycopg2
import pandas as pd
In [33]:
# put your code here
# ------------------
statement = """
SELECT DISTINCT iso_language, job_id,COUNT(*)
FROM
(SELECT
DISTINCT ON (from_user, iso_language)
*
FROM (SELECT * FROM twitter.tweet WHERE iso_language != 'und' AND job_id >= 255 LIMIT 100000) as T
ORDER BY from_user, iso_language) as UNIQ
GROUP BY iso_language, job_id;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
cities = {}
for i in range(len(column_names)):
    cities[column_names[i]] = [row[i] for row in rows]

# turn dictionary into a data frame
pd.DataFrame(cities)
Out[33]:
Databases can do a lot, but some things are more easily achieved through the flexibility of a general-purpose programming language like Python. In this notebook, we are going to run a query, return the results to a pandas DataFrame object, and perform some operations over the data frame in order to analyze our data. Let's go ahead and create a data frame.
All the way back in the first Twitter notebook, we covered how to create a dictionary from a returned query and, subsequently, how to turn that dictionary into a data frame using pandas. From the query above, turn the language counts per city into a data frame. Call this data frame df.
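As an aside, pandas can also build the frame straight from the fetched rows, skipping the dictionary step. Here is a minimal sketch, assuming rows and column_names are still populated from the query above:
# Sketch: build the data frame directly from the cursor results.
# Assumes `rows` and `column_names` still hold the query output.
df = pd.DataFrame.from_records(rows, columns=column_names)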
In [34]:
# put your code here
# ------------------
df = pd.DataFrame(cities)
We have made substantial progress in preparing to calculate the linguistic diversity of these different cities. However, we want a single metric that summarizes each city. For this, we're going to borrow from the biological literature and use Shannon's Diversity Index $H$. This is a sort of entropy index that is generally used to measure the biological richness of an ecosystem.
We needn't get into the weeds about this measure, but know that it takes into account both the number of unique languages and how balanced those languages are within a city. The index is therefore maximized when every language in a city has an equal count.
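Concretely, if a city has $S$ languages and $p_i$ is the proportion of its (deduplicated) users tweeting in language $i$, then

$$H = -\sum_{i=1}^{S} p_i \log_2 p_i$$

The base of the logarithm is a convention; scikit-bio defaults to base 2, so the values below are in bits.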
We can go ahead and see how this works by importing the function from the scikit-bio library.
In [12]:
from skbio.diversity.alpha import shannon
In [13]:
pop1 = [10,10,10]
pop2 = [10,20,30]
pop3 = [100,200,300]
print("pop1 ({}) is more diverse that pop2 ({}), which is as diverse as pop3({})".format(
shannon(pop1),shannon(pop2),shannon(pop3)))
In [14]:
pop1 = [10,10,10]
pop2 = [10,10,10,10]
print("""Also, more languages but equal eveness means greater diversity.
For example, pop2 ({}) is more diverse than pop1 ({})""".format(shannon(pop2),shannon(pop1)))
This shannon function operates on counts. Conveniently, we have already pulled out language counts per city, so we should be able to run it over the count column, city by city, to calculate the index for each city.
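To make the computation concrete, here is a minimal hand-rolled version of the index. This is an illustrative sketch, not scikit-bio's implementation; it assumes strictly positive counts and base-2 logs (scikit-bio's default).
import numpy as np

def shannon_by_hand(counts):
    # Illustrative sketch only; use skbio's shannon in practice.
    # Convert raw counts to proportions, then apply H = -sum(p * log2(p)).
    # Assumes every count is strictly positive.
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return -(p * np.log2(p)).sum()

print(shannon_by_hand([10, 10, 10]))  # ~1.585, i.e. log2(3)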
Let's return to Columbia, MO for this example. We can run our final query from the second Twitter notebook, convert the result to a data frame, and then run shannon over the count column. This is what that would look like:
In [9]:
statement = """
SELECT DISTINCT iso_language, COUNT(*)
FROM
(SELECT
DISTINCT ON (from_user, iso_language)
*
FROM (SELECT * FROM twitter.tweet WHERE job_id = 261 AND iso_language != 'und' LIMIT 10000) as T
ORDER BY from_user, iso_language) as UNIQ
GROUP BY iso_language;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
como = {}
for i in range(len(column_names)):
    como[column_names[i]] = [row[i] for row in rows]

# turn dictionary into a data frame
columbia = pd.DataFrame(como)
In [24]:
print("The shannon index for Columbia is {}".format(shannon(columbia['count'])))
Remember, there was a reason we did all of that preparation in the last notebook: some users were high-volume tweeters, which made the language counts less even. What do you think would happen if we ran the shannon function over the language counts without restricting them to one user per language? Is Columbia more or less diverse according to this measure?
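As a quick intuition check with made-up counts (a toy sketch, not real tweet data; values assume scikit-bio's default base-2 log): when one language's count is inflated by a few heavy tweeters, the distribution is less even and the index drops.
dominated = [100, 5, 5]   # one language inflated by high-volume tweeters
balanced = [40, 35, 35]   # same three languages, similar total, far more even
print(shannon(dominated))  # ~0.53
print(shannon(balanced))   # ~1.58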
Run a query for Columbia, MO that skips our preparation steps. Is Columbia more or less diverse before cleanup?
In [22]:
# put your code here
# ------------------
# query database
statement = """
SELECT iso_language, COUNT(*)
FROM (
SELECT iso_language
FROM twitter.tweet
WHERE job_id = 261
LIMIT 10000) AS users
GROUP BY iso_language
ORDER BY count;
"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)

# create dictionary from the rows and column names
job_261 = {}
for i in range(len(column_names)):
    job_261[column_names[i]] = [row[i] for row in rows]

# turn dictionary into a data frame
not_clean = pd.DataFrame(job_261)
shannon(not_clean['count'])
This isn't all we want, though. One city is boring if we can't compare it to others. This is where pandas comes in handy: we can chain together methods to return the results we are looking for.
Below is a sample data frame that contains language counts for three different cities. We can use the groupby method to find the shannon index for each city on the count column. Let's take a look...
In [26]:
samp = pd.DataFrame({'lang': ['en','es','fr','en','es','fr','en','es','fr','ru'],
'count': [30,20,10,11,10,9,30,1,1,1],
'city': [1,1,1,2,2,2,3,3,3,3]})
samp
Out[26]:
In [27]:
samp['count'].groupby(samp['city']).apply(shannon)
Out[27]:
Here, we start with the count column, as this is the column we want to find shannon on. Then we group by city in order to find shannon per city. Finally, we run the apply method because shannon is a function and not a pandas method; apply will apply the shannon function to every grouping of the count column.
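As a side note, an equivalent and arguably more idiomatic spelling groups the frame first and then selects the column:
# Equivalent chaining: group the frame by city, select the count
# column, then apply shannon to each group's counts.
samp.groupby('city')['count'].apply(shannon)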
Now run shannon over the language counts of df. Be sure to group by city (job_id).
In [35]:
# put your code here
# ------------------
df['count'].groupby(df['job_id']).apply(shannon)
Out[35]:
We can also sort the values:
In [73]:
samp['count'].groupby(samp['city']).apply(shannon).sort_values()
Out[73]:
In [36]:
# put your code here
# ------------------
df['count'].groupby(df['job_id']).apply(shannon).sort_values()
Out[36]: