Data and Twitter Analysis


This notebook introduces students to common data formats and how Python can read and write them. How you decide to structure data will ultimately shape how it is stored and what analyses are possible. After a discussion of data formats, an exploration of a Twitter API response will exemplify JSON, and tweets from the March 4th, 2017 #March4Trump rally in Berkeley will serve as the basis for analysis. At the end, students will collect their own data to explore.

Estimated Time: 180 minutes


Topics Covered:

  • .xls and .csv formats
  • .html and .xml
  • .json
  • Twitter API
  • Twitter analysis and visualization

Parts:

Dependencies:


In [ ]:
! pip install tweepy
! pip install textblob
! pip install plotly
! pip install nltk

Data Formats and Storage

CSV

Most people are familiar with Microsoft Excel's .xls spreadsheet format, which is great for storing tabular data. However, Microsoft encodes .xls files with a lot of extra information for displaying them in the software environment and for remembering any formulas you may have used, among other things. That extra information is not necessary for simply storing the raw data, and it is not easily read by other software. A "bare-bones" alternative to .xls is .csv, or "comma-separated values". You may have encountered this format before, and it's not any more complicated than the name suggests: values are separated by commas to delimit columns, while lines represent rows.

The table:

Name     Age   Department   Hometown
Chris    27    German       Plymouth
Jarrett  25    Physics      Newark
Sofia    22    Chemistry    Boston
Esther   24    Economics    Oakland

would be represented as:

Name,Age,Department,Hometown
Chris,27,German,Plymouth
Jarrett,25,Physics,Newark
Sofia,22,Chemistry,Boston
Esther,24,Economics,Oakland

Notably, the header is not distinguishable except for being the first row. There is also no way to add metadata or notes unless they fit into a column or row. Nevertheless, .csv is the standard for simple data and is easily read by most software. If you are collaborating with researchers or moving between different pieces of software, you'll want to use this format.

Python can easily write data to a .csv file. The most straightforward approach is to write rows from a list of lists, with each sublist being a row of your data.


In [ ]:
import csv

my_data = [['Name', 'Age', 'Department', 'Hometown'],
           ['Chris', '27', 'German', 'Plymouth'],
           ['Jarrett', '25', 'Physics', 'Newark'],
           ['Sofia', '22', 'Chemistry', 'Boston'],
           ['Esther', '24', 'Economics', 'Oakland']
           ]

# newline='' keeps csv.writer from inserting blank rows on some platforms
with open("my_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(my_data)

Reading a .csv is just as easy:


In [ ]:
with open("my_data.csv", "r") as f:
    csv_data = list(csv.reader(f))
    
print(csv_data)

If you still prefer Excel for analysis, you can go ahead and open this file in Excel!
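Because our file has a header row, we can also read each row into a dictionary keyed by the column names. Here is a minimal sketch using the standard library's csv.DictReader on the file we just wrote:

In [ ]:
import csv

# read each row as a dictionary keyed by the header row
with open("my_data.csv", "r") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    print(row["Name"], "is in the", row["Department"], "department")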


XML

XML, or 'Extensible Markup Language', is, much like HTML, structured by tags. Each element has an opening tag and a closing tag; the closing tag is marked with a / before the tag name. Unlike HTML, XML does not have pre-defined tags with fixed functions, so we come up with our own. XML is a great way to structure metadata, and it is commonly used for online data and for annotating corpora. Let's look at an example:


<my-library>
    <book>
        <title>Harry Potter and the Sorcerer's Stone</title>
        <author>J. K. Rowling</author>
        <date>1998</date>
        <publisher>Scholastic Corporation</publisher>
    </book>
    <book>
        <title>The Hobbit</title>
        <author>J. R. R. Tolkien</author>
        <date>1937</date>
        <publisher>George Allen and Unwin</publisher>
    </book>
    <book>
        <title>To Kill A Mockingbird</title>
        <author>Harper Lee</author>
        <date>1960</date>
        <publisher>J. B. Lippincott and Co.</publisher>
    </book>
</my-library>

You could, of course, use a CSV for this data, but when there are several more categories (if I wanted to add films, for example) it can get messy very fast.


In [ ]:
xml_data = '''
<my-library>
    <book>
        <title>Harry Potter and the Sorcerer's Stone</title>
        <author>J. K. Rowling</author>
        <date>1998</date>
        <publisher>Scholastic Corporation</publisher>
    </book>
    <book>
        <title>The Hobbit</title>
        <author>J. R. R. Tolkien</author>
        <date>1937</date>
        <publisher>George Allen and Unwin</publisher>
    </book>
    <book>
        <title>To Kill A Mockingbird</title>
        <author>Harper Lee</author>
        <date>1960</date>
        <publisher>J. B. Lippincott and Co.</publisher>
    </book>
</my-library>
'''

In [ ]:
import xml.etree.ElementTree as ET

e = ET.fromstring(xml_data)
e.findall('book')

In [ ]:
# each <book> element's children are, in order: title, author, date, publisher
print(e.findall('book')[0][0].text)  # first book: title
print(e.findall('book')[0][1].text)  # first book: author
print(e.findall('book')[0][2].text)  # first book: date
print(e.findall('book')[0][3].text)  # first book: publisher
print()
print(e.findall('book')[1][0].text)  # second book: title
print(e.findall('book')[1][1].text)  # second book: author
print(e.findall('book')[1][2].text)  # second book: date
print(e.findall('book')[1][3].text)  # second book: publisher
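Indexing children by position works, but it depends on the tags inside each <book> always appearing in the same order. A more robust sketch looks elements up by tag name with find():

In [ ]:
# look up children by tag name instead of by position
for book in e.findall('book'):
    title = book.find('title').text
    author = book.find('author').text
    date = book.find('date').text
    print("{} ({}) by {}".format(title, date, author))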

JSON

JSON (JavaScript Object Notation) is a format for structuring and exchanging data. Its syntax is based on JavaScript, but it can be used from any language, including Python. Its format is similar to that of a Python dictionary in that it consists of a collection of key-value pairs. JSON and XML are the most popular formats for getting data from the internet. Let's look at the same data from the XML example in JSON format.


In [ ]:
json_data = {'my-library': [{'title': "Harry Potter and the Sorcerer's Stone",
                            'author': 'J. K. Rowling',
                            'date': '1998',
                            'publisher': 'Scholastic Corporation'},
                            
                            {'title': "The Hobbit",
                            'author': 'J. R. R. Tolkien',
                            'date': '1937',
                            'publisher': 'George Allen and Unwin'},
                            
                            {'title': "To Kill A Mockingbird",
                            'author': 'Harper Lee',
                            'date': '1960',
                            'publisher': 'J. B. Lippincott and Co.'},
                            ]
            }

In [ ]:
print(json_data['my-library'][0])
print()
print(json_data['my-library'][1])
print()
print(json_data['my-library'][2])
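Python's built-in json module can serialize this structure to a file and read it back. A minimal sketch (the file name my_library.json is just an example):

In [ ]:
import json

# write the dictionary to disk as JSON
with open("my_library.json", "w") as f:
    json.dump(json_data, f, indent=4)

# read it back into a Python dictionary
with open("my_library.json", "r") as f:
    loaded = json.load(f)

print(loaded['my-library'][0]['title'])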

APIs

What is an API?

  • API stands for Application Programming Interface

  • a set of rules and procedures that facilitate interactions between computers and their applications


Web APIs

  • allow users to query a remote database over the internet

  • take on a variety of formats

  • the majority adhere to a particular style known as Representational State Transfer, or REST

  • "RESTful" APIs are convenient because we can use them to query databases using URLs


RESTful Web APIs are All Around You...

Consider a simple Google search.

Go ahead and search something.

Ever wonder what all that extra stuff in the address bar was all about?


RESTful Web APIs are All Around You...

It looks like Google makes its query by taking the search terms, separating each of them with a "+", and appending them to the link:

https://www.google.com/#q=

So that we have

https://www.google.com/#q=search1+search2

So we can change our Google search by adding terms to the URL.
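We can build this kind of query string ourselves in Python. A small sketch using urllib.parse (the /search?q= endpoint and single q parameter are simplifying assumptions; Google's real search URL carries many more parameters):

In [ ]:
from urllib.parse import urlencode

# join the search terms the same way the browser does: spaces become '+'
search_terms = ["search1", "search2"]
query = urlencode({"q": " ".join(search_terms)})

url = "https://www.google.com/search?" + query
print(url)  # https://www.google.com/search?q=search1+search2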


Some Basic Terminology: URL

  • Uniform Resource Locator
  • a string of characters that, when interpreted via the Hypertext Transfer Protocol (HTTP), points to a data resource
  • notably a file written in Hypertext Markup Language (HTML) or a subset of a database

Some Basic Terminology: HTTP Methods / Verbs

  • GET: requests a representation of a data resource corresponding to a particular URL. The process of executing the GET method is often referred to as a "GET request" and is the main method used for querying RESTful databases.

  • HEAD, POST, PUT, DELETE: other common methods, though rarely used for database querying.


How Do GET Requests Work? A Web Browsing Example

  • Surfing the Web = Making a bunch of GET Requests

  • For instance, I open my web browser and type in http://www.wikipedia.org. Once I hit return, I'd see a webpage.

  • Several different processes occurred, however, between me hitting "return" and the page finally being rendered.


Step 1: The GET Request

  • the web browser took the entered character string
  • constructed a properly formatted HTTP GET request (the same kind of request a command-line tool like curl can make)
  • and submitted it to the server that hosts the Wikipedia homepage

STEP 2: The Response

  • Wikipedia's server receives this request
  • and sends back an HTTP response
  • from which the browser extracts the HTML code for the page, which begins something like:

"<!DOCTYPE html>\n<html lang=\"mul\" dir=\"ltr\">\n<head>\n<!-- Sysops: Please do not edit the main template directly; update /temp and synchronise. -->\n<meta charset=\"utf-8\">\n<title>Wikipedia</title>\n<!--[if lt IE 7]><meta http-equiv=\"imagetoolbar\" content=\"no\"><![endif]-->\n<meta name=\"viewport\" content=\"i"

STEP 3: The Formatting

  • the raw HTML code was parsed and rendered by the web browser
  • producing the page as seen in the window
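We can reproduce the same three steps in Python. A minimal sketch with the requests library (which we'll use again for the Chicago example below):

In [ ]:
import requests

# Step 1: make the GET request
response = requests.get("https://www.wikipedia.org")

# Step 2: inspect the response
print(response.status_code)           # 200 means the request succeeded
print(response.headers["Content-Type"])
print(response.text[:200])            # first 200 characters of the raw HTML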

RESTful Database Querying: The GET Request

  • the URL we supply must be constructed so that the resulting request can be interpreted and successfully acted upon by the server

  • the character string will likely need to encode search terms and/or filtering parameters, as well as one or more authentication codes

  • while the terms are often similar across APIs, most are API-specific


RESTful Database Querying: The Response

  • unlike web browsing, the content of the server's response is unlikely to be HTML code

  • it will more likely be a raw text response that can be parsed into one of a few file formats commonly used for data storage

  • the usual suspects include .csv, .xml, and .json files


RESTful Database Querying: The Formatting

  • web browser parsed the HTML code,
  • but we need R, Python, or other programming languages to parse the server response
  • and convert it into a format for local storage (e.g. matrices, dataframes, databases, lists, etc.).

Let's look at an example of restaurant health inspections in Chicago:


In [ ]:
import requests

response = requests.get('https://data.cityofchicago.org/api/views/4ijn-s7e5/rows.json')

In [ ]:
import json

data = json.loads(response.text)

In [ ]:
data.keys()

In [ ]:
data['meta']['view']

It's structured just like our book library data above, but it definitely has a lot more information!
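As a rough sketch of how you might dig into this response: in Socrata's rows.json layout, the column definitions usually sit under data['meta']['view']['columns'] and the records under data['data'] (treat those key names as assumptions about this particular API rather than a guarantee):

In [ ]:
# column names live in the metadata; records live under 'data'
# (key names assume Socrata's rows.json layout)
columns = [c['name'] for c in data['meta']['view']['columns']]
print(columns)

print(len(data['data']))   # number of inspection records
print(data['data'][0])     # the first record, as a plain list of values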


Twitter API

The Twitter API is slightly more complicated, but because of this, people have created very useful tools, such as tweepy, that make it easy to interact with. First, follow the directions in the Install.md file of this repository to get your API credentials.


In [ ]:
import tweepy


# Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
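Before querying anything, it can help to confirm that the credentials work. A quick, hedged sketch using tweepy's verify_credentials (if this raises an error, double-check the keys above):

In [ ]:
# confirm the credentials are valid before making real queries
me = api.verify_credentials()
print(me.screen_name)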

In [ ]:
# search all twitter
results = tweepy.Cursor(
    api.search,
    q='Berkeley', # query, any word you want found in a tweet
    geocode="37.871853,-122.258423,80km", # lat. and long., radius
    ).items(10)

In [ ]:
# get tweets from specific handle
handle_results = api.user_timeline(screen_name='UCBerkeley', count=10)

In [ ]:
results

The search returns an iterator, which means the data has not actually been downloaded yet, so we must iterate through it to save the results.


In [ ]:
results_tweets = []

for t in results:
    results_tweets.append(t)
    
handle_results_tweets = []

for t in handle_results:
    handle_results_tweets.append(t)

In [ ]:
print(len(results_tweets))
print()
print(results_tweets)

Wow! Look at all that data! Let's look at one tweet:


In [ ]:
print(results_tweets[0])

What could be useful here?


In [ ]:
print(results_tweets[0].created_at)
print()
print(results_tweets[0].text)
print()
print(results_tweets[0].retweet_count)
print()
print(results_tweets[0].entities['user_mentions'])
print()
print(results_tweets[0].entities['hashtags'])
print()
print(results_tweets[0].entities['urls'])
print()
print(results_tweets[0].user.profile_image_url_https)
print()
print(results_tweets[0].geo)

We can also look at our tweets from the Berkeley handle:


In [ ]:
print(len(handle_results_tweets))
print()
print(handle_results_tweets)

In [ ]:
for t in results_tweets:
    print(t.created_at.strftime("%Y-%b-%d %H:%M"))
    print(t.text)
    print()

In [ ]:
for t in handle_results_tweets:
    print(t.created_at.strftime("%Y-%b-%d %H:%M"))
    print(t.text)
    print()

Raw Word Counts


In [ ]:
from collections import Counter

tweet_words = []

for t in results_tweets:
    tweet_words.extend(t.text.split())

Counter(tweet_words).most_common()
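Raw counts are dominated by capitalization differences and very common words. A hedged sketch that lowercases the tokens and drops English stopwords with NLTK (this requires a one-time nltk.download('stopwords')):

In [ ]:
import nltk
from nltk.corpus import stopwords
from collections import Counter

nltk.download('stopwords')  # one-time download of the stopword list
stops = set(stopwords.words('english'))

# lowercase every token and drop stopwords before counting
cleaned = [w.lower() for w in tweet_words if w.lower() not in stops]
Counter(cleaned).most_common(20)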

Hashtag Counts


In [ ]:
hashtags = []

for t in results_tweets:
    for h in t.entities["hashtags"]:
        hashtags.append(h['text'])
        
Counter(hashtags).most_common()

User Mentions


In [ ]:
mentions = []

for t in results_tweets:
    for m in t.entities["user_mentions"]:
        mentions.append(m["screen_name"])
        
Counter(mentions).most_common()

Sentiment Analysis


In [ ]:
import nltk

nltk.download('punkt')

In [ ]:
from textblob import TextBlob

for t in results_tweets:

    blob = TextBlob(t.text)

    for sentence in blob.sentences:
        pol = sentence.sentiment.polarity
        print(pol, sentence)
    print()
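TextBlob also reports a single polarity score for a whole blob, ranging from -1 (negative) to +1 (positive), so we can compute one number per tweet and an overall average. A minimal sketch:

In [ ]:
from textblob import TextBlob

# one polarity score per tweet, then the average across all of them
polarities = [TextBlob(t.text).sentiment.polarity for t in results_tweets]

for pol, t in zip(polarities, results_tweets):
    print(round(pol, 3), t.text[:80])

print()
print("average polarity:", sum(polarities) / len(polarities))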

March4Trump Tweets


In [ ]:
import pickle

data = pickle.load(open("tweets.pkl", "rb"))

print(len(data))

With retweets:


In [ ]:
import re

hashtags = []
user_mentions = []
texts = []

pattern = re.compile(r'http\S+')        # matches URLs
pattern2 = re.compile(r'(#|@)[^:\s]+')  # matches hashtags and @-mentions

for t in data:
    for h in t.entities["hashtags"]:
        hashtags.append(h["text"])

    for m in t.entities["user_mentions"]:
        user_mentions.append(m["screen_name"])


    original_tweet = t.text
    text = re.sub(pattern, "", t.text)
    text = re.sub(pattern2, "", text)
    text = text.strip()
    if text.startswith("RT"):
        text = ' '.join(text.split(":")[1:]).strip()

    texts.append(text)
        
print(Counter(hashtags).most_common())
print()
print("="*20)
print()
print(Counter(user_mentions).most_common())

Without retweets:


In [ ]:
hashtags = []
user_mentions = []
texts = []

pattern = re.compile(r'http\S+')        # matches URLs
pattern2 = re.compile(r'(#|@)[^:\s]+')  # matches hashtags and @-mentions

for t in data:
    if not t.text.startswith("RT"):
        for h in t.entities["hashtags"]:
            hashtags.append(h["text"])

        for m in t.entities["user_mentions"]:
            user_mentions.append(m["screen_name"])


        original_tweet = t.text
        text = re.sub(pattern, "", t.text)
        text = re.sub(pattern2, "", text)
        text = text.strip()
        if text.startswith("RT"):
            text = ' '.join(text.split(":")[1:]).strip()

        texts.append(text)
        
print(Counter(hashtags).most_common())
print()
print("="*10)
print()
print(Counter(user_mentions).most_common())

Mapping

First we need to get the coordinates:


In [ ]:
lats = []
longs = []
tweet_texts = []

for t in data:
    if t.geo:
        print(t.geo)
        lats.append(t.geo['coordinates'][0])
        longs.append(t.geo['coordinates'][1])
        tweet_texts.append(t.text.replace('\n', '<br>'))

Now we can plot the points with Plotly's Mapbox integration (you'll need a Mapbox access token):


In [ ]:
import plotly.plotly as py
from plotly.graph_objs import *

mapbox_access_token = ''

map_data = Data([
    Scattermapbox(
        lat=lats,
        lon=longs,
        mode='markers',
        marker=Marker(
            size=14
        ),
        text=tweet_texts,
    )
])

layout = Layout(
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=37.85,
            lon=-122.28
        ),
        pitch=0,
        zoom=7
    ),
)

fig = dict(data=map_data, layout=layout)
py.iplot(fig, filename='Berkeley Tweet Map', validate=False)

Save to CSV:

First create the rows:


In [ ]:
tweet_rows = []

for t in data:
    tweet_data = {}
    tweet_data['created'] = t.created_at
    tweet_data['user'] = t.user.screen_name
    tweet_data['text'] = t.text
    tweet_data['retweet_count'] = t.retweet_count
    tweet_data['hashtags'] = '; '.join([d['text'] for d in t.entities['hashtags']])
    tweet_data['user_mentions'] = '; '.join([d['screen_name'] for d in t.entities['user_mentions']])
    tweet_data['urls'] = '; '.join([d['url'] for d in t.entities['urls']])
    
    if t.geo:
        tweet_data['coordinates'] = t.geo['coordinates']
    else:
        tweet_data['coordinates'] = 'NA'
    
    tweet_rows.append(tweet_data)

Now save to CSV:


In [ ]:
import csv
keys = tweet_rows[0].keys()

# newline='' keeps csv.DictWriter from inserting blank rows on some platforms
with open('tweet-data.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(tweet_rows)
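If you'd rather explore the saved file as a table, here is a short sketch with pandas (assuming pandas is installed; it isn't in the dependency list above):

In [ ]:
import pandas as pd

# load the CSV we just wrote into a DataFrame
tweets_df = pd.read_csv('tweet-data.csv')
print(tweets_df.shape)
tweets_df.head()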

Challenge

Now it's your turn! Get tweets by either searching text or from a specific user and explore the results.


In [ ]:
# search all twitter
my_tweets_result = tweepy.Cursor(
    api.search,
    q='Berkeley', # query, any word you want found in a tweet
    geocode="37.871853,-122.258423,80km", # lat. and long., radius
    ).items(10)

OR


In [ ]:
# get tweets from specific handle
my_tweets_result = api.user_timeline(screen_name='UCBerkeley', count=10)

In [ ]:
my_tweets = []

for t in my_tweets_result:
    my_tweets.append(t)