Over the past century, human mobility has reached its greatest extent since our species' dawn. This mobility begets diverse hotspots across the landscape, though some places are more diverse than others. A couple of questions should come to mind. What do we mean when we call a place diverse? What type of diversity? In this case, we are talking about the topography of languages in particular cities across the United States, but we could just as easily be measuring ethnic or socioeconomic diversity. And how do we measure diversity? These are just a few of the questions that we will explore as we begin to formalize our understanding of big data manipulation.
What is linguistic diversity? For our purposes, linguistic diversity is a measure that accounts for both the number of languages spoken in a particular area and the number of speakers per language. This is where Twitter comes in. Collecting language data on a region, in this case cities, is particularly difficult, and if you were looking for an open source dataset with such linguistic information, you would be hard pressed to find one that counts past five distinct languages. Twitter, however, provides the language of each tweet as an attribute accessible through its API. It also allows a user to collect tweets within a specified region by specifying a geographic radius around a city. This gives us the unique ability to analyze the linguistic landscape of a given region, which can then be compared between locations or evaluated across time splits in the data.
It is also important to note that this may not be representative of any given physical location. In fact, it almost certainly overweights English speakers compared to speakers of other languages: a large proportion of Twitter is written in English, out of proportion to the number of native English speakers. Yet this doesn't mean that there isn't any connection to the physical landscape. All things being equal, the expectation would be that more diverse Twitter cities are also more diverse physical cities and vice versa. However, we won't be making such assumptions throughout these notebooks. To do so would require some sort of validation, which is outside the purview of these lessons.
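To make those two components concrete before we touch the database, here is a minimal, illustrative sketch (the language codes and counts below are made up, not taken from the Twitter data) of pulling both pieces out of a plain Python list of tweet language tags:

```python
from collections import Counter

# hypothetical ISO language codes attached to a handful of tweets
tweet_langs = ["en", "en", "es", "en", "fr", "es", "en", "zh"]

# component 1: how many distinct languages appear
n_languages = len(set(tweet_langs))

# component 2: how many tweets (a rough proxy for speakers) per language
tweets_per_language = Counter(tweet_langs)

print(n_languages)           # 4
print(tweets_per_language)   # Counter({'en': 4, 'es': 2, 'fr': 1, 'zh': 1})
```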
The Twitter data are stored in a Postgres database that contains several tables. The primary table that we will be working with is the `tweet` table; however, there are also `hashtag`, `mention`, `url`, and `job` tables. We will come back to these, but for the time being, let's go over the `tweet` table and its attributes.
attribute | description |
---|---|
`tweet_id_str` | tweet's identifier |
`job_id` | job identifier (pertaining to geographic location) |
`created_at` | when the tweet was written |
`text` | the text of the tweet |
`from_user` | user id who created the tweet |
`from_user_name` | username of tweet creator |
`from_user_created_at` | the date the user was created |
`from_user_followers` | number of followers the user has |
`from_user_following` | number of people the user is following |
`from_user_favorites` | sum of likes of user's tweets |
`to_user` | the id of the person the user tweeted at |
`to_user_name` | the user name of the person tweeted at |
`location_geo` | the coordinates of where the tweet was sent out |
`iso_language` | the language of the tweet |
We can also query for the column names of the table and their data types...
In [1]:
import psycopg2
import pandas as pd
In [2]:
# define our query
statement = """SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'tweet';"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    # execute the statement from above
    cursor.execute(statement)
    # fetch all of the rows associated with the query
    cols = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
cols
Out[2]:
Okay, so we have a good idea about the type of information the `tweet` table contains, so let's return to the other tables in the database. We mentioned these above, but if you were just curious about which tables exist in the database, you could query Postgres for them like so...
In [3]:
statement = """SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'twitter'"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    tables = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
tables
Out[3]:
In [4]:
# put your code here
# ------------------
statement = """SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'hashtag';"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    # execute the statement from above
    cursor.execute(statement)
    # fetch all of the rows associated with the query
    cols = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
cols
Out[4]:
Let's start looking at some of the data now. Most of our analysis revolves around the `tweet` table, so we will pick up from there and look at some of the data it contains. The first 10 rows should suffice...
In [5]:
statement = """SELECT *
FROM twitter.tweet
LIMIT 10;"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    # execute the statement from above
    cursor.execute(statement)
    # fetch all of the rows associated with the query
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
rows
Out[5]:
This is kind of a messy way to view the data. Right now we have an object called `rows`, which stores each row as a tuple, and all of the tuples are stored in a single list. How about we change this into a more readable format...
In [6]:
statement = """SELECT *
FROM twitter.tweet
LIMIT 10;"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    # grab the column names from the cursor description
    column_names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
We can easily turn this into a dictionary object, which can subsequently be turned into a pandas `DataFrame` object. For this dictionary, the field name (column name) will be the key and the values will be a list of the corresponding values from the `rows` object.
For example, `rows[<x>][0]` are all values of `tweet_id_str`. Conveniently, `column_names[0]` is `tweet_id_str`. Then all we have to do is create an empty dictionary and begin to fill it with the contents of `column_names` and `rows`.
In [7]:
tweet_dict = {}
for i in list(range(len(column_names))):
    tweet_dict['{}'.format(column_names[i])] = [x[i] for x in rows]
What's happening here? Well, we can assume that the number of values per row should equal the number of column names. Therefore, we iterate over a list ranging from 0 to the length of the `column_names` object and create a key from each column name item (`tweet_dict['{}'.format(column_names[i])]`). The values for each key come from the corresponding index in every row of the `rows` object, built up with a single-line list comprehension. A multiline `for` loop that does the same thing would look like:
list_o_lists = [] # create an empty list to store lists of values
for i in list(range(len(column_names))):
    vals = [] # create empty list to store the values
    for x in rows:
        vals.append(x[i])
    list_o_lists.append(vals)
...and then turning this dict into a data frame is simple. Just run `pandas`' `DataFrame` constructor over the dictionary object you just created.
In [8]:
pd.DataFrame(tweet_dict)
Out[8]:
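As an aside, pandas can also build the same data frame directly from the list of row tuples, skipping the intermediate dictionary; a quick sketch reusing the `rows` and `column_names` objects from above:

```python
# DataFrame accepts the list of row tuples plus the column labels directly
df = pd.DataFrame(rows, columns=column_names)
df.head()
```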
Back to linguistic diversity...
So we know that one of the components of linguistic diversity is the number of unique languages. We can find the unique languages through a simple query of the database. Below we find the unique languages among 10,000 rows.
**A Note about Limits**: We limit the data returned to 10,000 rows because there are approximately 300 million tweets in the total dataset. If everyone queried the whole table without limits at the same time, we would need a much larger server! If you are curious about the results for all the tweets, drop us a note and we will run your analysis when the server load is low (i.e., while you sleep!).
In [9]:
statement = """SELECT DISTINCT iso_language
FROM (SELECT iso_language FROM twitter.tweet LIMIT 10000) AS langs"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    # fetch the distinct languages
    langs = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
Now the languages are in a variable called `langs`. To view that variable, we simply type it in an output cell like below.
In [10]:
langs
Out[10]:
In [11]:
# put your code here
# ------------------
statement = """SELECT COUNT(*) FROM (SELECT DISTINCT iso_language
FROM (SELECT iso_language FROM twitter.tweet LIMIT 10000) AS langs) AS lang_count"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    num_langs = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
print(num_langs)
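If you would rather do the counting in Python instead of nesting another SQL query, the `langs` list fetched earlier already holds one tuple per distinct language, so its length gives the same answer:

```python
# each element of langs is a one-item tuple, e.g. ('en',)
print(len(langs))
```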
The other component of diversity, at least for our measure, is the number of speakers per language. Again, this can be done in SQL.
In [12]:
# use COUNT with a GROUP BY to count the number of speakers per language
statement = """SELECT DISTINCT iso_language, COUNT(*)
FROM (SELECT iso_language FROM twitter.tweet LIMIT 10000) AS langs GROUP BY iso_language;"""
try:
    connect_str = "dbname='twitter' user='dsa_ro_user' host='dbase.dsa.missouri.edu' password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    cursor.execute(statement)
    # grab the column names from the cursor description
    column_names = [desc[0] for desc in cursor.description]
    num_langs = cursor.fetchall()
except Exception as e:
    print("Uh oh, can't connect. Invalid dbname, user or password?")
    print(e)
print(num_langs)
In [13]:
# put your code here
# ------------------
lang_dict = {}
for i in list(range(len(column_names))):
    lang_dict['{}'.format(column_names[i])] = [x[i] for x in num_langs]
pd.DataFrame(lang_dict)
Out[13]:
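Looking ahead, per-language counts like these are typically expressed as proportions of the whole sample before being fed into a diversity measure. A minimal sketch using the data frame we just built (this assumes Postgres returned the `COUNT(*)` column under its default name, `count`):

```python
lang_counts = pd.DataFrame(lang_dict)

# share of the 10,000-tweet sample written in each language
lang_counts['proportion'] = lang_counts['count'] / lang_counts['count'].sum()

lang_counts.sort_values('proportion', ascending=False).head()
```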
These were the basics. There is still a long way to go before we are ready for analysis, but this is a good place to start, as we will use these statements as building blocks for the rest of the Twitter notebooks. In the next notebook, we will cover further data manipulation and preparation in order to get the data into a state that is ready for analysis. This will include removing some rows and performing more aggregations.