On March 9th, Apple revealed its new series of products, including the price of the long-awaited Apple Watch, as well as the new MacBook.
In this assignment you will explore 10k tweets from Twitter's firehose, collected within moments of the announcement.
The goal will be to better understand the Twitter reaction to Apple's announcements.
Your mission will be to complete the Python code in the cells below and execute it until the output looks similar or identical to the output shown. I would recommend using a temporary notebook to work with the dataset and, when the code is ready and producing the expected output, copying and pasting it into this notebook. Once it is done and validated, just copy the file elsewhere, as this notebook will be the only file sent for evaluation. Of course, everything starts by downloading this notebook.
The deadline is March $20^{th}$.
Twitter is an ideal source of data that can help you understand the reaction to newsworthy events, because it has more than 200M active monthly users who tend to use it to frequently share short informal thoughts about anything and everything. Although Twitter offers a Search API that can be used to query for "historical data", tapping into the firehose with the Streaming API is a preferred option because it provides you the ability to acquire much larger volumes of data with keyword filters in real-time.
We will be using the twitter package, which trivializes the process of tapping into Twitter's Streaming API for easily capturing tweets from the firehose.
$ pip install twitter
Or if running Anaconda:
$ conda install twitter
It's a lot easier to tap into Twitter's firehose than you might imagine if you're using the right library. The code below shows you how to create a connection to Twitter's Streaming API and filter the firehose for tweets containing keywords.
The next function searches for the query term query using the Streaming API, and waits until it receives 10 results before ending.
In [1]:
# Nothing to change in this cell
import os
import twitter
# XXX: Go to http://twitter.com/apps/new to create an app and get values
# for these credentials that you'll need to provide in place of these
# empty string values that are defined as placeholders.
#
# See https://vimeo.com/79220146 for a short video that steps you
# through this process
CONSUMER_KEY = os.environ.get("CONSUMER_KEY", "")
CONSUMER_SECRET = os.environ.get("CONSUMER_SECRET", "")
OAUTH_TOKEN = os.environ.get("OAUTH_TOKEN", "")
OAUTH_TOKEN_SECRET = os.environ.get("OAUTH_TOKEN_SECRET", "")
def print_tweets(query, limit=10):
    # Authenticate to Twitter with OAuth
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
    # Create a connection to the Streaming API
    twitter_stream = twitter.TwitterStream(auth=auth)
    print('Filtering the public timeline for "{0}"'.format(query))
    stream = twitter_stream.statuses.filter(track=query)
    # Print the text of each incoming tweet, stopping after `limit` tweets.
    cont = 0
    for tweet in stream:
        print(tweet['text'])
        cont += 1
        if cont >= limit:
            break
In [2]:
# Nothing to change in this cell
print_tweets("Apple Watch, Apple Keynote, Apple MacBook")
Using the function above, we could easily write a program to dump search results into a file by serializing the JSON objects returned by the API. But if you don't feel like collecting all the tweets now, you can always use the dump with 10k tweets that has already been collected. Each line in the file is just a JSON dump of a tweet as returned by the Streaming API. When properly formatted, they look just like this:
{
"in_reply_to_status_id": null,
"created_at": "Mon Mar 09 19:48:05 +0000 2015",
"filter_level": "low",
"geo": null,
"place": null,
"entities": {
"trends": [],
"urls": [{
"expanded_url": "http://strawpoll.me/3832278/r",
"indices": [64, 86],
"url": "http://t.co/ZSdFmZgKi6",
"display_url": "strawpoll.me/3832278/r"
}],
"hashtags": [{
"text": "Apple",
"indices": [87, 93]
}, {
"text": "AppleWatch",
"indices": [94, 105]
}, {
"text": "Keynote",
"indices": [106, 114]
}],
"symbols": [],
"user_mentions": []
},
"contributors": null,
"favorited": false,
"retweet_count": 0,
"timestamp_ms": "1425930485002",
"in_reply_to_status_id_str": null,
"id": 575020247423586305,
"in_reply_to_user_id": null,
"user": {
"time_zone": "Athens",
"profile_image_url": "http://pbs.twimg.com/profile_images/562968082952368128/0u4u4UmE_normal.jpeg",
"created_at": "Mon May 14 16:29:41 +0000 2012",
"profile_banner_url": "https://pbs.twimg.com/profile_banners/580024012/1425304201",
"profile_sidebar_fill_color": "DDFFCC",
"following": null,
"profile_background_tile": false,
"listed_count": 14,
"statuses_count": 19519,
"profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/560091394086158337/LDW-IOcI.png",
"default_profile": false,
"profile_background_color": "9AE4E8",
"profile_link_color": "0084B4",
"protected": false,
"name": "Gared",
"favourites_count": 18424,
"is_translator": false,
"profile_use_background_image": true,
"geo_enabled": true,
"screen_name": "edgard22360",
"id": 580024012,
"profile_sidebar_border_color": "BDDCAD",
"url": "https://takethisgame.wordpress.com",
"lang": "fr",
"notifications": null,
"followers_count": 1556,
"default_profile_image": false,
"verified": false,
"profile_text_color": "333333",
"utc_offset": 7200,
"follow_request_sent": null,
"location": "Testeur JV pour @TakeThisGame_",
"profile_image_url_https": "https://pbs.twimg.com/profile_images/562968082952368128/0u4u4UmE_normal.jpeg",
"description": "GEEK Level 21 // Breton // Nintendo // Sony //Cinéphile // Apple Addict // Cookie // Futur CM !",
"friends_count": 518,
"contributors_enabled": false,
"id_str": "580024012",
"profile_background_image_url": "http://pbs.twimg.com/profile_background_images/560091394086158337/LDW-IOcI.png"
},
"lang": "en",
"in_reply_to_screen_name": null,
"text": "Quelle Apple Watch choisiras tu ? // Which one will you choose?\nhttp://t.co/ZSdFmZgKi6\n#Apple #AppleWatch #Keynote",
"coordinates": null,
"possibly_sensitive": false,
"source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"truncated": false,
"retweeted": false,
"id_str": "575020247423586305",
"in_reply_to_user_id_str": null,
"favorite_count": 0
}
In [3]:
# Nothing to change in this cell
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Set some options
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)
Assuming that you've amassed a collection of tweets from the firehose in a line-delimited format, or that you are using the provided dump, one of the easiest ways to load the data into Pandas for analysis is to build a valid JSON array of the tweets.
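For reference, a minimal sketch of one possible way to build that JSON array (assuming the dump lives at data/twitter_apple.json, as in the exercise cell below) could be:
import pandas as pd

filename = "data/twitter_apple.json"
with open(filename) as tweets_file:
    # One tweet per line; strip newlines and skip empty lines.
    file_line_list = [line.strip() for line in tweets_file if line.strip()]
# Join the individual JSON documents into a single JSON array.
data = "[{0}]".format(",".join(file_line_list))
df = pd.read_json(data, orient="records")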
In [4]:
filename = "data/twitter_apple.json"
file_line_list = ...
data = "[{0}]".format(",".join(file_line_list))
df = pd.read_json(data, orient='records')
df.head(3)
Out[4]:
Let's take a look at the columns.
In [5]:
# Nothing to change in this cell
df.columns
Out[5]:
The null values in this collection of tweets are caused by "limit notices", which Twitter sends to tell you that you're being rate-limited. Let's count how many rows have been rate-limited.
In [6]:
df[...].limit...
Out[6]:
This indicates that we received 15 limit notices and means that there are effectively 15 "rows" in our data frame that have null values for all of the fields we'd have expected to see.
Per the Streaming API guidelines, Twitter will only provide up to 1% of the total volume of the firehose, and anything beyond that is filtered out, with each "limit notice" telling you how many tweets were filtered out. This means that tweets containing "Apple Watch, Apple Keynote, Apple MacBook" accounted for at least 1% of the total tweet volume at the time this data was being collected.
In order to calculate exactly how many tweets were filtered out across the aggregate, we need to "pop" off the column containing the 15 limit notices and sum up the totals across these limit notices. But first, let's capture the limit notices by indexing into the data frame for non-null fields contained in "limit".
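A minimal sketch of one possible way to capture those rows:
# Limit notices are the only rows with a non-null "limit" field.
limit_notices = df[df.limit.notnull()]
limit_notices.limit.head(3)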
In [7]:
limit_notices = df[...]
limit_notices.limit.head(3)
Out[7]:
Now we can remove the limit notices from the DataFrame entirely by filtering out those rows with a null value in the id column.
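One possible sketch of that filtering step:
# Keep only the rows that correspond to actual tweets (non-null id).
df = df[df.id.notnull()]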
In [8]:
df = df[...]
df.limit.mean()
Out[8]:
Something special about our DataFrame is that some cells contain dictionaries instead of regular values. For example, the rows corresponding to rate limitation contain a dictionary with a key track whose value is the number of rate-limited tweets.
In [9]:
# Nothing to change in this cell
limit_notices.limit
Out[9]:
Therefore, to calculate the number of tweets that were rate-limited, we need to sum up all the values of those dictionaries. Another way to do this is to extract the value of the key track into a new column, or even to override the limit column, and then perform a regular aggregate.
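A sketch of that aggregation, assuming each limit notice is a dictionary with a single track key:
rate_limited_tweets = limit_notices.limit.apply(lambda notice: notice["track"]).sum()
total_notices = len(limit_notices)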
In [10]:
rate_limited_tweets = limit_notices.limit.apply(...)...
print("Number of total tweets that were rate-limited", rate_limited_tweets)
total_notices = ...
print("Total number of limit notices", total_notices)
From this output, we can observe that only 158 tweets out of ~10k were not provided; more than 99% of the tweets about "Apple Watch, Apple Keynote, Apple MacBook" were received for the time period during which they were being captured. In order to learn more about the bounds of that time period, let's create a time-based index on the created_at field of each tweet so that we can perform a time-based analysis.
In [11]:
# Nothing to change
df.set_index('created_at', drop=False, inplace=True)
With a time-based index now in place, we can trivially do some useful things like calculate the boundaries, compute histograms, etc.
Let's see the first and last tweet collected.
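A minimal sketch of one way to inspect those boundaries once the index is in place:
# Sort chronologically and read the first and last index entries.
df.sort_index(inplace=True)
print("First tweet timestamp (UTC)", df.index[0])
print("Last tweet timestamp (UTC) ", df.index[-1])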
In [12]:
df.sort(..., inplace=True)
print("First tweet timestamp (UTC)", ...)
print("Last tweet timestamp (UTC) ", ...)
It would probably be better to have a wider range of tweets, but it was such a busy time that in a bit more than half an hour we collected 10k tweets. Let's group by minute, calculate the average number of tweets per minute, and plot the results.
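A sketch of one possible grouping (grouping by hour and minute of the index, since the collection window crosses an hour boundary; counting is only one of several aggregates you could choose):
grouped = df.groupby([df.index.hour, df.index.minute]).aggregate("count")
tweets_minute = grouped[["text"]]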
In [13]:
grouped = df.groupby(...).aggregate(...) # you need to figure out how to group by minute in the index
tweets_minute = grouped[["text"]]
tweets_minute
Out[13]:
In [14]:
tweets_per_minute = ...
print("Average number of tweets per minute:", tweets_per_minute)
print("Average number of tweets per second:", tweets_per_minute / 60)
Every second, on average, around 6,000 tweets are posted on Twitter. That roughly means that during the time we collected our data, the Apple Keynote accounted for ~0.1% of all tweets produced. Seen that way it might not seem like a lot, but 1 message out of every 1,000 produced in the world was about Apple or its products.
Let's plot a histogram of the number of tweets per minute.
In [15]:
ax = ...plot(kind="bar", grid=False, figsize=(12, 4))
ax.set_xticklabels(tweets_minute.index, rotation=0)
ax.set_ylabel(...)
ax.set_xlabel(...)
ax.set_title(...)
Out[15]:
If we had collected the data during a whole day, a time-series analysis would be much more interesting, as we could see whether the information went viral over time. But with only half an hour, there is not much more we can do. Furthermore, processing all the tweets in a 24-hour period is way more demanding than "only" 10k.
Let's now compute the Twitter accounts that authored the most tweets and compare that to the total number of unique accounts that appeared. First we need to extract the column user, which contains a dictionary per cell, and create a DataFrame out of it by converting each cell into a Series.
In [16]:
# Nothing to change in this cell
users = df.user.apply(pd.Series)
users.head(3)
Out[16]:
Now we can count how many tweets there are per author and return the 25 most active (with the most tweets).
In [17]:
authors_counts = users.groupby(...)...aggregate...
authors_counts.rename(columns=..., inplace=True)
authors_counts.sort(..., ..., inplace=True)
authors_counts[:25]
Out[17]:
And also count unique authors.
In [18]:
num_unique_authors = ...
print("There are {0} unique authors out of {1} tweets".format(num_unique_authors, ...))
In [19]:
ax = ...plot(..., grid=False, figsize=(12, 4)) # how can we plot with logarithmic adjustments, google it!
ax.set_ylabel(...)
ax.set_xlabel(...)
ax.set_title(...)
ax.set_xlim([0, 10**4])
ax.set_xscale('log')
ax.set_yscale('log')
In [20]:
ax = ...(log=True, grid=False, bins=10, figsize=(12, 4))
ax.set_ylabel(...)
ax.set_xlabel(...)
ax.set_title(...)
Out[20]:
Although we could filter the DataFrame for coordinates (or locations in user profiles), an even simpler starting point for gaining rudimentary insight about where users might be located is to inspect the language field of the tweets and compute the tallies for each language.
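A sketch of such a tally (one possible approach):
# Count how many tweets were labeled with each language code by Twitter.
df.lang.value_counts()[:25]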
In [21]:
df...[:25] # it returns a Series, think about counting values ;)
Out[21]:
A staggering number of English speakers were talking about "Apple" at the time the data was collected. Bear in mind that it was already Monday evening in America when the Apple Keynote hit the news with the Apple Watch and the new MacBook.
However, there are at least 21 tweets with an undetermined language, und, and we want to know what those languages are. As we know, TextBlob internally uses the Google Translate API to detect the language of a text. Twitter might use its own algorithm, but Google might support more languages. Therefore, by detecting the language of the tweets using TextBlob and then checking which detected languages are not among the languages Twitter recognized in the collected tweets, we will know which languages are missing.
The first step is to detect the language of the tweets and store the result in a Series. Every time you invoke .detect_language() you are not only creating a TextBlob object, which consumes time and space, but also performing an HTTP request to the Google translation service, and that is very expensive. Let's say the average request takes up to 250 milliseconds (that would be a lot); with 10k tweets, that's around 40 minutes. Therefore, before running language detection on all the tweets, it is better to test the code on a small scale by slicing 25 tweets.
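A hedged sketch of such a helper (the name matches the stub in the cell below, but the body, and in particular the fallback to "und" on errors, is only one possible approach):
import textblob as tb

def detect_language(text):
    # detect_language() performs an HTTP request and may fail on very
    # short texts or network errors, so guard it.
    try:
        return tb.TextBlob(text).detect_language()
    except Exception:
        return "und"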
In [22]:
import textblob as tb
def detect_language(text):
...
languages = df.text[:25].apply(detect_language)
languages...
Out[22]:
When the code is ready, just run the next cell. In the meantime, go grab something to eat and get comfortable on the couch until it finishes executing. Relax, it can take a while, but this is how real data analysis works :)
In [23]:
# Are you ready to execute this cell? Go!
languages = df.text.apply(detect_language)
In [24]:
# Let's see those results counting each language
lang_counts = languages...
lang_counts[:25]
Out[24]:
Let's plot a simple pie chart showing the 10 most used languages.
In [25]:
ax = ...plot(
..., figsize=(8,8), autopct='%1.1f%%', shadow=False, startangle=55,
colors=plt.cm.Pastel1(np.linspace(0., 1., len(lang_counts[:10])))
)
ax.set_title(...)
Out[25]:
By comparing the two sets of unique language codes we can finally find out which languages are not available in Twitter.
In [26]:
set(...) ... set(...) # try playing with sets here
Out[26]:
However, there are also languages recognized by Twitter that are not recognized by Google. Surprise, surprise...
In [27]:
...
Out[27]:
Let's now keep just the 140 characters of text from the tweets written in English and learn more about the reaction.
In [28]:
en_text = df...
en_text[:25]
Out[28]:
In [29]:
from nltk.tokenize import word_tokenize
en_text["tokens"] = en_text.text.apply(...).values
words = sum(en_text["tokens"], [])
words[:25]
Out[29]:
Now we can build a Series from the flattened list of words in order to show the 25 most common words.
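For example, a minimal sketch:
pd.Series(words).value_counts()[:25]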
In [30]:
pd.Series(words)...[:25]
Out[30]:
Not surprisingly, "Apple" is the most frequently occurring token, there are lots of retweets (actually, "quoted retweets") as evidenced by "RT", and lots of stopwords (such as "the", "a", etc.) and characters used in the Twitter universe ("@" is used for designating user names, and "#" for hastags) at the top of the list. Let's further remove some of the noise by filtering stopwords out.
In [31]:
import nltk
nltk.download('stopwords')
english_stopwords = nltk...
clean_words = ... # remember that in english_stopwords words are lowercase
pd.Series(clean_words)...[:25]
Out[31]:
Now the data looks a bit cleaner, although we still have those "@" and "#", plus the "http" that comes from all the URLs contained in the tweets. Removing those is tricky, since we can't just ignore them: if we did, we might lose valuable information, like who were the most mentioned users or which was the most shared hashtag. Instead, we are going to tweak NLTK's PunktWordTokenizer by modifying some of the regular expressions it uses internally.
The first one is the regular expression that defines the characters that cannot appear within words. Take a look at the actual regular expression and make the necessary changes to allow "@" and "#" to be part of a word.
In [32]:
# Original (?:[?!)\";}\]\*:@\'\({\[])
re_non_word = r"(?:[?!)\";}\]\*:@\'\({\[])"
The second regular expression excludes some characters from starting word tokens. Try to modify it to allow "@" and "#" to be part of a word too.
In [33]:
# Original [^\(\"\`{\[:;&\#\*@\)}\]\-,]
re_word_start = r"[^\(\"\`{\[:;&\#\*@\)}\]\-,]"
In [34]:
# Nothing to change in this cell
import re
from nltk.tokenize import PunktWordTokenizer
from nltk.tokenize.punkt import PunktLanguageVars
lang_vars = PunktLanguageVars()
lang_vars._re_word_tokenizer = re.compile(lang_vars._word_tokenize_fmt % {
'NonWord': re_non_word,
'MultiChar': lang_vars._re_multi_char_punct,
'WordStart': re_word_start,
}, re.UNICODE | re.VERBOSE)
tokenizer = PunktWordTokenizer(lang_vars=lang_vars)
def twitter_tokenize(value):
return tokenizer.tokenize(value)
Now we can use the defined function twitter_tokenize to tokenize our tweets. Once we are sure that usernames and hashtags are safe, we can extend english_stopwords to also include regular string punctuation characters, plus some other random words that are of no interest or are glitches of tokenization.
In [35]:
import string
word_lists = en_text.text.apply(...)
words = sum(word_lists.values, [])
english_stopwords += list(string.punctuation)
english_stopwords += ["http", "https", "...", "'s", "'t"]
clean_words = ...
pd.Series(clean_words)...[:25]
Out[35]:
What a difference removing a little bit of noise can make! We now see much more meaningful data appear at the top of the list. Actually, it can almost be read as a single sentence: "Apple Watch RT new watch $10,000, MacBook gold. @AppStore New buy, coming 4.24.15 t.co/4iiurTDTt9". The URL, in fact, takes you to the Apple Watch site.
There is, however, an unexpected term: douchebag. We are not sure, but it could be related to the fact that the price of the watch seems to be $10,000.
Let's now treat the problem as one of discovering statistical collocations to get more insight about the phrase contexts.
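A minimal sketch of that step, assuming clean_words holds the cleaned token list built above:
# nltk.Text wraps a list of tokens and can report common collocations.
text = nltk.Text(clean_words)
text.collocations()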
In [36]:
text = nltk.Text
text...
Even without any prior analysis on tokenization, it's pretty clear what the topic is about as evidenced by this list of collocations. But what about the context in which these phrases appear? Toward the bottom of the list of commonly occurring words, the words "watch" and "$10,000" appear. The word "favorite" is interesting, because it is usually the basis of an emotional reaction, and we're interested in examining the reaction. And "thanking" could also mean something. What about the word "douchebag"? What might it mean? The concordance will help us to find out.
In [37]:
# Print concordance for watch, favorite, and $10,000
text...
text...
text...
In [38]:
# Print concordance for thanking, and douchebag
text...
text...
It would appear that there is indeed a common thread of amazement in the data, although it's not evident whether it is positive or negative. @AnnaKendrick47 turns out to be Anna Kendrick, an actress who recently appeared as a presenter at the 86th Academy Awards and who has played a role in The Twilight Saga.
But what else can usernames, hashtags, and URLs tell us? All that information is in our entities column, but remember that entities is a dictionary containing other keys. Let's now extract the value counts for the screen_name's in user_mentions, the text's in hashtags, and the expanded_url's in urls, and then show the 25 most frequent of each.
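As a reference, a sketch of one of the three extractors (the other two follow the same pattern; the helper name matches the stub in the cell below, but the body is only one possible approach):
def list_screen_name_in_user_mentions(entities):
    # Each cell of df.entities is a dictionary; user_mentions is a list
    # of dictionaries, each with a screen_name key.
    return [mention["screen_name"] for mention in entities.get("user_mentions", [])]

screen_name_lists = df.entities.apply(list_screen_name_in_user_mentions)
screen_names = pd.Series(sum(screen_name_lists.values, [])).value_counts()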
In [39]:
# screen_name's in user_mentions
def list_screen_name_in_user_mentions(user_mention):
...
screen_name_lists = df.entities.apply(list_screen_name_in_user_mentions)
screen_names = pd.Series(sum(screen_name_lists.values, []))...
screen_names[:25]
Out[39]:
As we see above, AnnaKendrick47 is mentioned more than half as many times as AppStore. Then come some tech media companies and blogs, including the satirical TheOnion.
In [40]:
# text's in hashtags
def list_text_in_hastags(hashtag):
...
hastag_lists = df.entities.apply(list_text_in_hastags)
hastags = pd.Series(sum(hastag_lists.values, []))...
hastags[:25]
Out[40]:
As for the hashtags, there is nothing really interesting here. The only one worth mentioning is новости; let's see what it means in English.
In [41]:
tb.TextBlob...string
Out[41]:
Well, it means "news".
Finally, let's take a look at the URLs.
In [42]:
# expanded_url's in urls
def list_expanded_url_in_urls(url):
...
url_lists = df.entities.apply(list_expanded_url_in_urls)
urls = pd.Series(sum(url_lists.values, []))...
urls[:25]
Out[42]:
A lot of the same: reviews and links to official pages. An exception is this XKCD comic.
First The Onion, now XKCD... It is starting to seem like a lot of tweets had a comical twist; it might be sarcasm or just jokes. Let's take a look at the most popular tweets to confirm it. We could use the number of retweets, retweet_count, to find those. However, it looks like in our dataset everyone was retweeting by embedding the tweets (using "RT @someone: ..."), and therefore retweet_count is always zero. There is still a way to find the most retweeted status, but I will leave that to you to figure out.
In [43]:
retweets = en_text[...]
grouped_retweets = retweets.groupby...aggregate...
grouped_retweets.sort(..., ...)[:25]
Out[43]:
As you can see, there are lots of interesting tweet entities that give you helpful context for the announcement. One particularly notable observation is the appearance of "comedic accounts" relaying a certain amount of humor. And now we can see why the tweet by @AnnaKendrick47, embedded below, became so popular.
We should be thanking Apple for launching the $10,000 "apple watch" as the new gold standard in douchebag detection.
— Anna Kendrick (@AnnaKendrick47) March 9, 2015
When you take a closer look at some of the news stories that developed, you also see sarcasm, disbelief, and even a bit of spam. Everything has a place on Twitter.
Identifying sarcasm or irony is a very hard task for machines, as it demands a lot of background knowledge. We can, however, take a look at how sentiment evolved over time, and whether, on average, the reaction to the announcement of the new watch and laptop was positive or negative.
Let's start by adding new columns to our DataFrame en_text that capture the sentiment of the tweets. But instead of using the regular polarity analyzer, we will use a NaiveBayesAnalyzer().
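A hedged sketch of what that extraction could look like (reusing a single analyzer instance, since the Naive Bayes model is trained lazily on first use; the column names follow the cell below, but the body is only one possible approach):
from textblob.sentiments import NaiveBayesAnalyzer
import textblob as tb
import pandas as pd

analyzer = NaiveBayesAnalyzer()

def extract_sentiment(value):
    # With a NaiveBayesAnalyzer, .sentiment exposes a classification
    # ('pos'/'neg') plus the probabilities p_pos and p_neg.
    result = tb.TextBlob(value, analyzer=analyzer).sentiment
    return pd.Series([result.classification, result.p_pos, result.p_neg],
                     index=["sentiment", "pos_prob", "neg_prob"])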
In [44]:
from textblob.sentiments import NaiveBayesAnalyzer
analyzer = NaiveBayesAnalyzer()
def extract_sentiment(value):
...
return pd.Series([sentiment, pos_prob, neg_prob])
en_text[...] = en_text.text.apply(extract_sentiment)
en_text[["text", "sentiment", "pos_prob", "neg_prob"]].head(3)
Out[44]:
Now we can see how many positive and negative tweets there are in the dataset.
In [45]:
sentiments = en_text.groupby(...).aggregate({
...,
...,
...,
})
sentiments
Out[45]:
In [46]:
ax = ...plot(..., figsize=(8, 4))
ax.set_title(...)
ax.set_xlabel(...)
Out[46]:
It seems like the perception of the new products has been mostly positive. But is it the same for both the watch and the laptop? Let's add a new column keyword with the value "watch" if the tweet contains the word "watch" (ignoring case), "macbook" if it contains "macbook", "both" if it contains both, and np.NaN otherwise.
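A sketch of one possible get_keyword helper (np.NaN assumes the numpy import from earlier in the notebook):
def get_keyword(value):
    # Case-insensitive membership tests on the tweet text.
    lowered = value.lower()
    has_watch = "watch" in lowered
    has_macbook = "macbook" in lowered
    if has_watch and has_macbook:
        return "both"
    if has_watch:
        return "watch"
    if has_macbook:
        return "macbook"
    return np.NaN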
In [47]:
def get_keyword(value):
...
en_text["keyword"] = en_text.text.apply(get_keyword)
en_text["keyword"][:25]
Out[47]:
Now we can create a pivot table to see the average values of the probability distributions and to count the number of tweets per keyword and sentiment.
In [48]:
pd.pivot_table(en_text, index=..., aggfunc={
...,
...,
...,
})
Out[48]:
We aspired to learn more about the general reaction to Apple's announcement by taking an initial look at the data from Twitter's firehose, and it's fair to say that we learned a few things about the data without too much effort. Lots more could be discovered, but a few of the themes that we were able to glean included...
What about the sentiment? Did the specific product make any difference in the sentiment? And in the number of tweets?
What other impressions did you get after running the analysis?