In [ ]:
%matplotlib inline
import time
import calendar
import codecs
import datetime
import sys
import gzip
import string
import glob
import os
# For parsing JSON
import json
Much of the data with which we will work comes in the JavaScript Object Notation (JSON) format. JSON is a lightweight text format that describes objects as keys and values without needing a schema to be specified beforehand (in contrast to how XML is often used).
Many "RESTful" APIs available on the web today return data in JSON format, and the data we have stored from Twitter is in JSON as well.
Python's JSON support is relatively robust and is included in the standard library as the json module. This module allows us to read and write JSON to/from a string or file and to convert many of Python's built-in types to and from JSON text.
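As a minimal sketch of the file side of this, the snippet below writes a dictionary with json.dump and reads it back with json.load. It uses io.StringIO as an in-memory stand-in for a real file, and the record contents are made up purely for illustration.

```python
import io
import json

# Hypothetical record, used only for illustration
record = {"name": "example", "count": 3, "tags": ["a", "b"]}

# io.StringIO behaves like a text file opened for reading and writing
buffer = io.StringIO()
json.dump(record, buffer)      # serialize the dict into the "file"

buffer.seek(0)                 # rewind before reading
restored = json.load(buffer)   # parse the JSON text back into a dict

print("Round trip preserved the data:", restored == record)
```

The string-based counterparts, json.dumps and json.loads, work the same way but operate on a str instead of a file object.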
In [ ]:
jsonString = '{"key": "value"}'
# Parse the JSON string
dictFromJson = json.loads(jsonString)
# Python now has a dictionary representing this data
print ("Resulting dictionary object:\n", dictFromJson)
# Will print the value
print ("Data stored in \"key\":\n", dictFromJson["key"])
# This will cause an error (KeyError): "value" is a value here, not a key
print ("Data stored in \"value\":\n", dictFromJson["value"])
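The last lookup above fails because "value" is not a key in the parsed dictionary. As a minimal sketch, assuming we would rather handle a missing key than stop the notebook, we could use dict.get or catch the KeyError ourselves. The fallback string is an arbitrary placeholder.

```python
import json

dictFromJson = json.loads('{"key": "value"}')

# dict.get returns None (or a default we supply) when the key is missing
missing = dictFromJson.get("value")
fallback = dictFromJson.get("value", "not present")

# Alternatively, catch the KeyError explicitly
try:
    dictFromJson["value"]
    lookupFailed = False
except KeyError:
    lookupFailed = True

print(missing, fallback, lookupFailed)
```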
A JSON string/file can have many keys and values, but every key must have a value. Values can appear without keys inside arrays, though this can be awkward to work with.
An example of a JSON string with multiple keys is below:
{
"name": "Cody",
"occupation": "Student",
"goal": "PhD"
}
Note the comma after each of the first two values. These commas are required for valid JSON: they separate each key/value pair from the next, and the last pair is not followed by a comma.
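To see how strict the comma rule is, here is a small sketch that checks three strings with Python's json module. The helper name isValidJson is our own invention for this illustration, not part of the json module.

```python
import json

validJson = '{"name": "Cody", "occupation": "Student"}'
missingComma = '{"name": "Cody" "occupation": "Student"}'
trailingComma = '{"name": "Cody", "occupation": "Student",}'

def isValidJson(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

for candidate in (validJson, missingComma, trailingComma):
    print(isValidJson(candidate), candidate)
```

Only the first string parses. A missing comma and a trailing comma both raise json.JSONDecodeError.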
In [ ]:
jsonString = '{ "name": "Cody", "occupation": "PostDoc", "goal": "Tenure" }'
# Parse the JSON string
dictFromJson = json.loads(jsonString)
# Python now has a dictionary representing this data
print ("Resulting dictionary object:\n", dictFromJson)
The above JSON string describes an object whose name is "Cody". How would we describe a list of similar students? Arrays are useful here and are denoted with "[]" rather than the "{}" object notation. For example:
{
"students": [
{
"name": "Cody",
"occupation": "Student",
"goal": "PhD"
},
{
"name": "Scott",
"occupation": "Student",
"goal": "Masters"
}
]
}
Again, note the comma between the "}" and "{" separating the two student objects and how they are both surrounded by "[]".
In [ ]:
jsonString = '{"students": [{"name": "Cody", "occupation": "PostDoc", "goal": "Tenure"}, {"name": "Scott", "occupation": "Student", "goal": "Masters"}]}'
# Parse the JSON string
dictFromJson = json.loads(jsonString)
# Python now has a dictionary representing this data
print ("Resulting dictionary object:\n", dictFromJson)
print ("Each student:")
for student in dictFromJson["students"]:
    print (student)
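Once parsed, the array under "students" is an ordinary Python list, so indexing and list comprehensions work on it. The sketch below uses a shortened version of the string above, with fewer fields per student, to keep it brief.

```python
import json

jsonString = '{"students": [{"name": "Cody", "goal": "Tenure"}, {"name": "Scott", "goal": "Masters"}]}'
parsed = json.loads(jsonString)

students = parsed["students"]           # a Python list of dictionaries
firstStudent = students[0]              # index into the list
names = [s["name"] for s in students]   # collect one field from every object

print(len(students), firstStudent["name"], names)
```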
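A couple of things to note: the top level of a JSON document can itself be an array rather than an object, and the objects inside an array do not need to share the same keys.
As an example: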
[
{
"name": "Cody",
"occupation": "Student",
"goal": "PhD"
},
{
"name": "Scott",
"occupation": "Student",
"goal": "Masters",
"completed": true
}
]
In [ ]:
jsonString = '[{"name": "Cody","occupation": "PostDoc","goal": "Tenure"},{"name": "Scott","occupation": "Student","goal": "Masters","completed": true}]'
# Parse the JSON string
arrFromJson = json.loads(jsonString)
# Python now has a list (the JSON array) representing this data
print ("Resulting array:\n", arrFromJson)
print ("Each student:")
for student in arrFromJson:
    print (student)
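It can help to see which Python type each JSON value turns into when parsed. The following sketch uses a made-up string covering each JSON value type and reports the resulting Python type names.

```python
import json

# Made-up sample containing one value of each JSON type
sample = '{"text": "hi", "count": 3, "ratio": 7.8, "done": true, "extra": null, "items": [1, 2], "inner": {"k": "v"}}'
parsed = json.loads(sample)

# Map each key to the name of the Python type its value was converted to
typeNames = {key: type(value).__name__ for key, value in parsed.items()}
for key, name in typeNames.items():
    print(key, "->", name)
```

In particular, JSON true/false become Python True/False, null becomes None, arrays become lists, and objects become dictionaries.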
We've shown you can have an array as a value, and you can do the same with objects. In fact, one of JSON's strengths is that nesting can go arbitrarily deep, which makes the format very expressive. You can very easily nest objects within objects, and JSON in the wild relies on this heavily.
An example:
{
"disasters" : [
{
"event": "Nepal Earthquake",
"date": "25 April 2015",
"casualties": 8964,
"magnitude": 7.8,
"affectedAreas": [
{
"country": "Nepal",
"capital": "Kathmandu",
"population": 26494504
},
{
"country": "India",
"capital": "New Delhi",
"population": 1276267000
},
{
"country": "China",
"capital": "Beijing",
"population": 1376049000
},
{
"country": "Bangladesh",
"capital": "Dhaka",
"population": 168957745
}
]
}
]
}
In [ ]:
jsonString = '{"disasters" : [{"event": "Nepal Earthquake","date": "25 April 2015","casualties": 8964,"magnitude": 7.8,"affectedAreas": [{"country": "Nepal","capital": "Kathmandu","population": 26494504},{"country": "India","capital": "New Delhi","population": 1276267000},{"country": "China","capital": "Beijing","population": 1376049000},{"country": "Bangladesh","capital": "Dhaka","population": 168957745}]}]}'
disasters = json.loads(jsonString)
for disaster in disasters["disasters"]:
    print (disaster["event"])
    print (disaster["date"])
    for country in disaster["affectedAreas"]:
        print (country["country"])
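Nested data can also be reached directly by chaining keys and indexes, and aggregated with ordinary Python expressions. This sketch uses a trimmed copy of the string above, keeping two countries and fewer fields, so the numbers are only those already shown.

```python
import json

# Trimmed version of the disaster JSON above
jsonString = '{"disasters": [{"event": "Nepal Earthquake", "affectedAreas": [{"country": "Nepal", "population": 26494504}, {"country": "India", "population": 1276267000}]}]}'
disasters = json.loads(jsonString)

firstDisaster = disasters["disasters"][0]

# Chain keys and indexes to reach a deeply nested value
firstCountry = firstDisaster["affectedAreas"][0]["country"]

# Aggregate one field across the nested list
totalPopulation = sum(area["population"] for area in firstDisaster["affectedAreas"])

print(firstCountry, totalPopulation)
```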
In [ ]:
exObj = {
"event": "Nepal Earthquake",
"date": "25 April 2015",
"casualties": 8964,
"magnitude": 7.8
}
print ("Python Object:", exObj, "\n")
# now we can convert to JSON
print ("Object JSON:")
print (json.dumps(exObj), "\n")
# We can also pretty-print the JSON
print ("Readable JSON:")
print (json.dumps(exObj, indent=4)) # indent=4 puts each key on its own line, indented four spaces
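The cell above goes from a Python dictionary to JSON text. As a small sketch of two related points, json.dumps accepts sort_keys=True for a stable key order, and the resulting text parses back to an equal dictionary. The dictionary here is a cut-down version of exObj.

```python
import json

exObj = {"magnitude": 7.8, "event": "Nepal Earthquake", "casualties": 8964}

# sort_keys=True writes keys in alphabetical order
sortedJson = json.dumps(exObj, sort_keys=True)
print(sortedJson)

# Parsing the text recovers an equal dictionary
roundTripped = json.loads(sortedJson)
print(roundTripped == exObj)
```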
In [ ]:
tweetFilename = "first_BlackLivesMatter.json"
# Use Python's os.path.join to account for Windows, OSX/Linux differences
tweetFilePath = os.path.join("..", "00_data", "ferguson", tweetFilename)
print ("Opening", tweetFilePath)
# We use codecs to ensure we open the file in Unicode format,
# which supports larger character encodings
# Read in the whole file, which contains ONE tweet;
# the with block closes the file for us
with codecs.open(tweetFilePath, "r", "utf8") as tweetFile:
    tweetFileContent = tweetFile.read()
# Print the raw json
print ("Raw Tweet JSON:\n")
print (tweetFileContent)
# Convert the JSON to a Python object
tweet = json.loads(tweetFileContent)
print ("Tweet Object:\n")
print (tweet)
# We could have done this in one step with json.load()
# called on the open file, but our data files have
# a single tweet JSON per line, so this is more consistent
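Since the larger data files hold one tweet JSON per line, a loop that calls json.loads on each line is the natural pattern. Here is a minimal sketch of that loop. The two records are made up, and io.StringIO stands in for an open data file so the example runs without any files on disk.

```python
import io
import json

# Two made-up records, one JSON object per line, standing in for a tweet file
fakeFile = io.StringIO('{"id": 1, "text": "first"}\n{"id": 2, "text": "second"}\n\n')

tweets = []
for line in fakeFile:
    line = line.strip()
    if not line:          # skip blank lines
        continue
    tweets.append(json.loads(line))

print(len(tweets), [t["text"] for t in tweets])
```

Parsing line by line also means a large file never has to be held in memory as a single string.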
In [ ]:
# What fields can we see?
print ("Keys:")
for k in sorted(tweet.keys()):
    print ("\t", k)
print ("Tweet Text:", tweet["text"])
print ("User Name:", tweet["user"]["screen_name"])
print ("Author:", tweet["user"]["name"])
print("Source:", tweet["source"])
print("Retweets:", tweet["retweet_count"])
print("Favorited:", tweet["favorite_count"])
print("Tweet Location:", tweet["place"])
print("Tweet GPS Coordinates:", tweet["coordinates"])
print("Twitter's Guessed Language:", tweet["lang"])
# Tweets have a list of hashtags, mentions, URLs, and other
# attachments in "entities" field
print ("\n", "Entities:")
for eType in tweet["entities"]:
    print ("\t", eType)
    for e in tweet["entities"][eType]:
        print ("\t\t", e)
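Fields such as "place" and "coordinates" may be empty, and each entry under "entities" is a list of small objects. The sketch below pulls out hashtag text and checks for location data. The tweet dictionary is hand-written for illustration, and the "text" key on each hashtag is an assumption about the entity layout that should be checked against the real data.

```python
# Hand-written stand-in for a parsed tweet; real tweets carry many more fields
tweet = {
    "text": "Example #hashtag",
    "place": None,
    "coordinates": None,
    "entities": {
        "hashtags": [{"text": "hashtag", "indices": [8, 16]}],
        "urls": [],
    },
}

# .get with a default avoids a KeyError when a field is absent
hashtags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]

# JSON null is parsed as None, so test for it explicitly
hasLocation = tweet.get("place") is not None or tweet.get("coordinates") is not None

print("Hashtags:", hashtags)
print("Has location:", hasLocation)
```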