This notebook introduces students to common data formats and how Python can read and write them. How one decides to structure data will ultimately shape how it can be stored and analyzed. After a discussion of these formats, an exploration of a Twitter API response will exemplify JSON, and tweets from the March 4th, 2017 #March4Trump rally in Berkeley will serve as the basis for analysis. At the end, students will collect their own data to explore.
Estimated Time: 180 minutes
Topics Covered:
Parts:
Dependencies:
In [ ]:
! pip install tweepy
! pip install textblob
! pip install plotly
! pip install nltk
Most people are familiar with the .xls format of Microsoft Excel spreadsheets, which is great for storing tabular data. However, Microsoft encodes .xls files with a lot of extra information for displaying them in the software environment and for remembering any formulas you may have used, among other things. That extra information is not necessary for simply storing the raw data, and it is not easily read by other software. A "bare-bones" alternative is the .csv, or "comma-separated values", format. You may have encountered it before; it's not any more complicated than the name suggests. Values are separated by commas to delimit columns, while lines represent rows.
The table:
| Name | Age | Department | Hometown |
|---|---|---|---|
| Chris | 27 | German | Plymouth |
| Jarrett | 25 | Physics | Newark |
| Sofia | 22 | Chemistry | Boston |
| Esther | 24 | Economics | Oakland |
would be represented as:
Name, Age, Department, Hometown
Chris, 27, German, Plymouth
Jarrett, 25, Physics, Newark
Sofia, 22, Chemistry, Boston
Esther, 24, Economics, Oakland
Notably, the header is not distinguishable except for being the first row. There is also no way to add metadata or notes unless they fit into a column or row. Nevertheless, .csv is the standard for simple data and is easily read by most software. If you are collaborating with other researchers or moving between different pieces of software, you'll want to use this format.
Python can easily dump data into a .csv file. The most straightforward approach is to write rows from a list of lists, with each sublist being a row in your data.
In [ ]:
import csv
my_data = [['Name', 'Age', 'Department', 'Hometown'],
['Chris', '27', 'German', 'Plymouth'],
['Jarrett', '25', 'Physics', 'Newark',],
['Sofia', '22', 'Chemistry', 'Boston'],
['Esther', '24', 'Economics', 'Oakland']
]
with open("my_data.csv", "w") as f:
writer = csv.writer(f)
writer.writerows(my_data)
Reading a .csv is just as easy:
In [ ]:
with open("my_data.csv", "r") as f:
csv_data = list(csv.reader(f))
print(csv_data)
If you still prefer Excel for analysis, you can go ahead and open this file in Excel!
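Because the header is just the first row, it can be convenient to let Python map it onto each row for you. Here is a minimal sketch using csv.DictReader on the file we just wrote:
In [ ]:
# Read the CSV back as a list of dictionaries keyed by the header row
with open("my_data.csv", "r") as f:
    dict_rows = list(csv.DictReader(f))

print(dict_rows[0]['Name'], dict_rows[0]['Hometown'])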
XML, or "Extensible Markup Language", is, much like HTML, structured by tags. Each element has an opening tag and a closing tag; the closing tag is marked with a / before the tag name. Unlike HTML, XML does not have pre-defined tags with fixed functions, so we have to come up with our own. XML is a great way to structure metadata, and it is commonly used for online data and for annotating corpora. Let's look at an example:
<my-library>
<book>
<title>Harry Potter and the Sorcerer's Stone</title>
<author>J. K. Rowling</author>
<date>1998</date>
<publisher>Scholastic Corporation</publisher>
</book>
<book>
<title>The Hobbit</title>
<author>J. R. R. Tolkien</author>
<date>1937</date>
<publisher>George Allen and Unwin</publisher>
</book>
<book>
<title>To Kill A Mockingbird</title>
<author>Harper Lee</author>
<date>1960</date>
<publisher>J. B. Lippincott and Co.</publisher>
</book>
</my-library>
You could, of course, use a CSV for this data, but when there are several more categories (if I wanted to add films, for example) it can get messy very fast.
In [ ]:
xml_data = '''
<my-library>
<book>
<title>Harry Potter and the Sorcerer's Stone</title>
<author>J. K. Rowling</author>
<date>1998</date>
<publisher>Scholastic Corporation</publisher>
</book>
<book>
<title>The Hobbit</title>
<author>J. R. R. Tolkien</author>
<date>1937</date>
<publisher>George Allen and Unwin</publisher>
</book>
<book>
<title>To Kill A Mockingbird</title>
<author>Harper Lee</author>
<date>1960</date>
<publisher>J. B. Lippincott and Co.</publisher>
</book>
</my-library>
'''
In [ ]:
import xml.etree.ElementTree
e = xml.etree.ElementTree.fromstring(xml_data)
e.findall('book')
In [ ]:
print(e.findall('book')[0][0].text)
print(e.findall('book')[0][1].text)
print(e.findall('book')[0][2].text)
print(e.findall('book')[0][3].text)
print()
print(e.findall('book')[1][0].text)
print(e.findall('book')[1][1].text)
print(e.findall('book')[1][2].text)
print(e.findall('book')[1][3].text)
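Instead of indexing each book by position, we can loop over the elements and read their tags directly. A small sketch using the same parsed tree:
In [ ]:
# Loop over every <book> element and print each child's tag and text
for book in e.findall('book'):
    for field in book:
        print(field.tag + ':', field.text)
    print()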
JSON (JavaScript Object Notation) is a format for structuring and exchanging data. Its syntax is based on JavaScript, but you can use it in any language, including Python. Its format is somewhat similar to that of a Python dictionary in that it consists of a collection of key-value pairs. JSON and XML are the most popular formats for getting data from the internet. Let's look at the same data from the XML example in JSON format.
In [ ]:
json_data = {'my-library': [{'title': "Harry Potter and the Sorcerer's Stone",
'author': 'J. K. Rowling',
'date': '1998',
'publisher': 'Scholastic Corporation'},
{'title': "The Hobbit",
'author': 'J. R. R. Tolkien',
'date': '1937',
'publisher': 'George Allen and Unwin'},
{'title': "To Kill A Mockingbird",
'author': 'Harper Lee',
'date': '1960',
'publisher': 'J. B. Lippincott and Co.'},
]
}
In [ ]:
print(json_data['my-library'][0])
print()
print(json_data['my-library'][1])
print()
print(json_data['my-library'][2])
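Python's built-in json module converts between this dictionary form and JSON text. A minimal sketch writing json_data to a file and reading it back (the filename is just an example):
In [ ]:
import json

# Serialize the dictionary to a JSON file
with open("my_library.json", "w") as f:
    json.dump(json_data, f, indent=2)

# Load it back; json.load returns an ordinary Python dictionary
with open("my_library.json", "r") as f:
    loaded = json.load(f)

print(loaded['my-library'][0]['title'])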
API stands for Application Programming Interface:
- a set of rules and procedures that facilitate interactions between computers and their applications
- it allows users to query a remote database over the internet

APIs take on a variety of formats:
- the majority adhere to a particular style known as Representational State Transfer, or REST
- "RESTful" APIs are convenient because we can use them to query databases using URLs
Consider a simple Google search.
Go ahead and search something.
Ever wonder what all that extra stuff in the address bar was all about?
It looks like Google makes its query by taking the search terms, separating each of them with a "+", and appending them to the link:
https://www.google.com/#q=
So that we have
https://www.google.com/#q=search1+search2
So we can change our Google search by adding terms to the URL.
- GET: requests a representation of a data resource corresponding to a particular URL. The process of executing the GET method is often referred to as a "GET request" and is the main method used for querying RESTful databases.
- HEAD, POST, PUT, DELETE: other common methods, though rarely used for querying databases.

Surfing the web = making a bunch of GET requests.
For instance, I open my web browser and type in http://www.wikipedia.org. Once I hit return, I'd see a webpage.
Several different processes occurred, however, between my hitting "return" and the page finally being rendered.
The raw content the server sends back is the page's HTML, which begins like this:

<!DOCTYPE html>
<html lang="mul" dir="ltr">
<head>
<!-- Sysops: Please do not edit the main template directly; update /temp and synchronise. -->
<meta charset="utf-8">
<title>Wikipedia</title>
<!--[if lt IE 7]><meta http-equiv="imagetoolbar" content="no"><![endif]-->
<meta name="viewport" content="i
...
When querying an API rather than browsing:
- The URL we supply must be constructed so that the resulting request can be interpreted and successfully acted upon by the server.
- The character string will likely need to encode search terms and/or filtering parameters, as well as one or more authentication codes.
- While the terms are often similar across APIs, most are API-specific.
- Unlike web browsing, the content of the server's response is unlikely to be HTML code.
- It will more likely be a raw text response that can be parsed into one of a few file formats commonly used for data storage.
- The usual suspects include .csv, .xml, and .json files.
Let's look at an example of restaurant health inspections in Chicago:
In [ ]:
import requests
response = requests.get('https://data.cityofchicago.org/api/views/4ijn-s7e5/rows.json')
In [ ]:
import json
data = json.loads(response.text)
In [ ]:
data.keys()
In [ ]:
data['meta']['view']
It's structured just like our book library data above, but it definitely has a lot more information!
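As a hedged illustration of digging further into this response, the sketch below assumes the JSON also has a 'data' key holding the rows and that data['meta']['view']['columns'] lists the column metadata; adjust the keys if the structure differs:
In [ ]:
# Hypothetical sketch: assumes 'data' holds the rows and
# data['meta']['view']['columns'] describes the columns
column_names = [c['name'] for c in data['meta']['view']['columns']]
print(column_names)

first_row = data['data'][0]
print(dict(zip(column_names, first_row)))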
The Twitter API is slightly more complicated, but because of this, people have created very useful tools to interact with it easily. First, follow the directions in the Install.md file of this repository to get your API credentials.
In [ ]:
import tweepy
# Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
In [ ]:
# search all twitter
results = tweepy.Cursor(
api.search,
q='Berkeley', # query, any word you want found in a tweet
geocode="37.871853,-122.258423,80km", # lat. and long., radius
).items(10)
In [ ]:
# get tweets from specific handle
handle_results = api.user_timeline(screen_name='UCBerkeley', count=10)
In [ ]:
results
The search returns an iterator, which means the data has not actually been downloaded yet, so we must iterate through it to save the results.
In [ ]:
results_tweets = []
for t in results:
    results_tweets.append(t)

handle_results_tweets = []
for t in handle_results:
    handle_results_tweets.append(t)
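Equivalently, list() will walk the iterator for you; note that a cursor can only be consumed once, so you would run this instead of the loops above, not after them:
In [ ]:
# Collect every tweet the cursor yields in a single call
# (a cursor can only be walked once, so use this in place of the for-loops above)
results_tweets = list(results)
handle_results_tweets = list(handle_results)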
In [ ]:
print(len(results_tweets))
print()
print(results_tweets)
Wow! Look at all that data! Let's look at one tweet:
In [ ]:
print(results_tweets[0])
What could be useful here?
In [ ]:
print(results_tweets[0].created_at)
print()
print(results_tweets[0].text)
print()
print(results_tweets[0].retweet_count)
print()
print(results_tweets[0].entities['user_mentions'])
print()
print(results_tweets[0].entities['hashtags'])
print()
print(results_tweets[0].entities['urls'])
print()
print(results_tweets[0].user.profile_image_url_https)
print()
print(results_tweets[0].geo)
We can also look at our tweets from the Berkeley handle:
In [ ]:
print(len(handle_results_tweets))
print()
print(handle_results_tweets)
In [ ]:
for t in results_tweets:
    print(t.created_at.strftime("%Y-%b-%d %H:%M"))
    print(t.text)
    print()
In [ ]:
for t in handle_results_tweets:
    print(t.created_at.strftime("%Y-%b-%d %H:%M"))
    print(t.text)
    print()
In [ ]:
from collections import Counter
tweet_words = []
for t in results_tweets:
    tweet_words.extend(t.text.split())

Counter(tweet_words).most_common()
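Raw word counts are dominated by very common words like "the" and "RT". As a sketch, we can filter those out with NLTK's English stopword list (this assumes you also download the 'stopwords' corpus):
In [ ]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Lowercase each word and drop common English stopwords before counting
stop = set(stopwords.words('english'))
filtered_words = [w.lower() for w in tweet_words if w.lower() not in stop]
Counter(filtered_words).most_common(20)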
In [ ]:
hashtags = []
for t in results_tweets:
    for h in t.entities["hashtags"]:
        hashtags.append(h['text'])

Counter(hashtags).most_common()
In [ ]:
mentions = []
for t in results_tweets:
    for m in t.entities["user_mentions"]:
        mentions.append(m["screen_name"])

Counter(mentions).most_common()
In [ ]:
import nltk
nltk.download('punkt')
In [ ]:
from textblob import TextBlob
for t in results_tweets:
    blob = TextBlob(t.text)
    for sentence in blob.sentences:
        pol = sentence.sentiment.polarity
        print(pol, sentence)
    print()
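TextBlob also gives a polarity score for a whole blob, so if you want one number per tweet (or across a whole search), here is a small sketch averaging over the search results:
In [ ]:
# Average sentiment polarity across all of the search-result tweets
polarities = [TextBlob(t.text).sentiment.polarity for t in results_tweets]
print(sum(polarities) / len(polarities))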
In [ ]:
import pickle
data = pickle.load(open("tweets.pkl", "rb"))
print(len(data))
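The tweets.pkl file was collected ahead of time from the #March4Trump tweets. If you want to save your own collected tweets the same way, a hedged sketch (the filename is just an example):
In [ ]:
import pickle

# Save the tweets collected earlier so they can be reloaded later
with open("my_tweets.pkl", "wb") as f:
    pickle.dump(results_tweets, f)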
With retweets:
In [ ]:
import re

hashtags = []
user_mentions = []
texts = []

pattern = re.compile(r'http\S+')
pattern2 = re.compile(r'(#|@)[^:\s]+')

for t in data:
    for h in t.entities["hashtags"]:
        hashtags.append(h["text"])
    for m in t.entities["user_mentions"]:
        user_mentions.append(m["screen_name"])

    original_tweet = t.text
    text = re.sub(pattern, "", t.text)
    text = re.sub(pattern2, "", text)
    text = text.strip()
    if text.startswith("RT"):
        text = ' '.join(text.split(":")[1:]).strip()
    texts.append(text)

print(Counter(hashtags).most_common())
print()
print("="*20)
print()
print(Counter(user_mentions).most_common())
Without retweets:
In [ ]:
hashtags = []
user_mentions = []
texts = []

pattern = re.compile(r'http\S+')
pattern2 = re.compile(r'(#|@)[^:\s]+')

for t in data:
    if not t.text.startswith("RT"):
        for h in t.entities["hashtags"]:
            hashtags.append(h["text"])
        for m in t.entities["user_mentions"]:
            user_mentions.append(m["screen_name"])

        original_tweet = t.text
        text = re.sub(pattern, "", t.text)
        text = re.sub(pattern2, "", text)
        text = text.strip()
        if text.startswith("RT"):
            text = ' '.join(text.split(":")[1:]).strip()
        texts.append(text)

print(Counter(hashtags).most_common())
print()
print("="*10)
print()
print(Counter(user_mentions).most_common())
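With URLs, hashtags, mentions, and the "RT" prefixes stripped, the cleaned texts list gives much more informative word counts; a small sketch:
In [ ]:
# Count words in the cleaned tweet texts
cleaned_words = []
for text in texts:
    cleaned_words.extend(text.lower().split())
Counter(cleaned_words).most_common(20)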
First we need to get the coordinates:
In [ ]:
lats = []
longs = []
tweet_texts = []

for t in data:
    if t.geo:
        print(t.geo)
        lats.append(t.geo['coordinates'][0])
        longs.append(t.geo['coordinates'][1])
        tweet_texts.append(t.text.replace('\n', '<br>'))
Now we can use the Mapbox API:
In [ ]:
import plotly.plotly as py
from plotly.graph_objs import *
mapbox_access_token = ''
map_data = Data([
Scattermapbox(
lat=lats,
lon=longs,
mode='markers',
marker=Marker(
size=14
),
text=tweet_texts,
)
])
layout = Layout(
autosize=True,
hovermode='closest',
mapbox=dict(
accesstoken=mapbox_access_token,
bearing=0,
center=dict(
lat=37.85,
lon=-122.28
),
pitch=0,
zoom=7
),
)
fig = dict(data=map_data, layout=layout)
py.iplot(fig, filename='Berkeley Tweet Map', validate=False)
First create the rows:
In [ ]:
tweet_rows = []
for t in data:
    tweet_data = {}
    tweet_data['created'] = t.created_at
    tweet_data['user'] = t.user.screen_name
    tweet_data['text'] = t.text
    tweet_data['retweet_count'] = t.retweet_count
    tweet_data['hashtags'] = '; '.join([d['text'] for d in t.entities['hashtags']])
    tweet_data['user_mentions'] = '; '.join([d['screen_name'] for d in t.entities['user_mentions']])
    tweet_data['urls'] = '; '.join([d['url'] for d in t.entities['urls']])
    if t.geo:
        tweet_data['coordinates'] = t.geo['coordinates']
    else:
        tweet_data['coordinates'] = 'NA'
    tweet_rows.append(tweet_data)
Now save to CSV:
In [ ]:
import csv
keys = tweet_rows[0].keys()
with open('tweet-data.csv', 'w') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(tweet_rows)
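If you have pandas installed (it is not in the dependency list above), a quick way to sanity-check the saved file is to read it straight back in; a minimal sketch:
In [ ]:
import pandas as pd

# Read the CSV back in and preview the first few rows
pd.read_csv('tweet-data.csv').head()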
Now it's your turn! Get tweets by either searching text or from a specific user and explore the results.
In [ ]:
# search all twitter
my_tweets_result = tweepy.Cursor(
api.search,
q='Berkeley', # query, any word you want found in a tweet
geocode="37.871853,-122.258423,80km", # lat. and long., radius
).items(10)
OR
In [ ]:
# get tweets from specific handle
my_tweets_result = api.user_timeline(screen_name='UCBerkeley', count=10)
In [ ]:
my_tweets = []
for t in my_tweets_result:
    my_tweets.append(t)