[Data, the Humanist's New Best Friend](index.ipynb)
*Assignment 2*
Apple Keynote Perception

On March 9th, Apple revealed its new series of products, including the price of the long-awaited Apple Watch as well as the new MacBook.

In this assignment you will explore 10k tweets from Twitter's firehose, collected within moments of the announcement, which makes for interesting data to analyze.


*Fanboys be like...*

The goal will be to better understand the Twitter reaction to Apple's announcements.

Assignment

Your mission will be to complete the Python code in the cells below and execute it until the output looks similar or identical to the output shown. I would recommend using a temporary notebook to work with the dataset and, when the code is ready and producing the expected output, copy-pasting it into this notebook. Once it is done and validated, just copy the file elsewhere, as the notebook will be the only file to be sent for evaluation. Of course, everything starts by downloading this notebook.

*No tests here, but put some work into the assignments!*

Deadline

March $20^{th}$.

Data

Twitter is an ideal source of data that can help you understand the reaction to newsworthy events, because it has more than 200M monthly active users who frequently share short, informal thoughts about anything and everything. Although Twitter offers a Search API that can be used to query for "historical data", tapping into the firehose with the Streaming API is the preferred option because it lets you acquire much larger volumes of data with keyword filters in real time.

We will be using the twitter package, which trivializes the process of tapping into Twitter's Streaming API and makes it easy to capture tweets from the firehose.

$ pip install twitter

Or if running Anaconda:

$ conda install twitter

Tapping Twitter's Firehose

It's a lot easier to tap into Twitter's firehose than you might imagine if you're using the right library. The code below shows you how to create a connection to Twitter's Streaming API and filter the firehose for tweets containing keywords.

The next function searches for the query term query using the Streaming API and stops after receiving 10 results.


In [1]:
# Nothing to change in this cell
import os
import twitter
# XXX: Go to http://twitter.com/apps/new to create an app and get values
# for these credentials that you'll need to provide in place of these
# empty string values that are defined as placeholders.
#
# See https://vimeo.com/79220146 for a short video that steps you
# through this process
CONSUMER_KEY = os.environ.get("CONSUMER_KEY", "")
CONSUMER_SECRET = os.environ.get("CONSUMER_SECRET", "")
OAUTH_TOKEN = os.environ.get("OAUTH_TOKEN", "")
OAUTH_TOKEN_SECRET = os.environ.get("OAUTH_TOKEN_SECRET", "")

def print_tweets(query, limit=10):
    # Authenticate to Twitter with OAuth
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)

    # Create a connection to the Streaming API
    twitter_stream = twitter.TwitterStream(auth=auth)
    print('Filtering the public timeline for "{0}"'.format(query))

    stream = twitter_stream.statuses.filter(track=query)
    # Print each tweet's text as it arrives, up to the limit
    cont = 0
    for tweet in stream:
        print(tweet['text'])
        if cont >= limit:
            break
        else:
            cont += 1

In [2]:
# Nothing to change in this cell
print_tweets("Apple Watch, Apple Keynote, Apple MacBook")


Filtering the public timeline for "Apple Watch, Apple Keynote, Apple MacBook"
HHTV PROMO NEWS Hands On With The Apple Watch -  Apple’s smartwatch was on display again at today’s special Apple ... http://t.co/JKOZ97vhRj
RT @Hoodairmed: Hands On With The Apple Watch http://t.co/GknoVy1zLI http://t.co/bbbZisNvRa
Hands On With The Apple Watch http://t.co/TzmPKPulqd
Will These 2 Stunning Smartwatches Steal Apple Watch's Thunder? - http://t.co/C0IyjitALD.. Related Articles: http://t.co/0yC9xszNjj
Apple представила самый тонкий Macbook с Retina-дисплеем
Obligé j'achète l'apple watch
RT @arden_cho: No thanks apple, last thing I wanna strap to my wrist is an iPhone. Aren't we glued to our phones enough? Real watch: http:/…
Donc en gros Apple a présenté le Macbook et non le macbook air et a haussé les prix de tous les Macbook. Super cool ça
HHTV PROMO NEWS What Your Favorite Apps Look Like On Apple Watch (Plus New Ones!) http://t.co/B3Hl1wGxN8
I actually want to meet the people who will pay $17,000 for an apple watch. I just want to ask why
New post: "Opinion: One thing Android Wear desperately needs to take from Apple Watch" http://t.co/gX6pR9SsRf #breaking #news

Using the function above, we could easily write a program to dump search results into a file by serializing the JSON objects returned by the API (a sketch of such a program follows the example below). But if you don't feel like downloading all the tweets now, you can always use the dump with 10k tweets that has already been collected. Each line in the file is just a JSON dump of a tweet as returned by the Streaming API. When properly formatted, they look like this:

    {
        "in_reply_to_status_id": null,
        "created_at": "Mon Mar 09 19:48:05 +0000 2015",
        "filter_level": "low",
        "geo": null,
        "place": null,
        "entities": {
            "trends": [],
            "urls": [{
                "expanded_url": "http://strawpoll.me/3832278/r",
                "indices": [64, 86],
                "url": "http://t.co/ZSdFmZgKi6",
                "display_url": "strawpoll.me/3832278/r"
            }],
            "hashtags": [{
                "text": "Apple",
                "indices": [87, 93]
            }, {
                "text": "AppleWatch",
                "indices": [94, 105]
            }, {
                "text": "Keynote",
                "indices": [106, 114]
            }],
            "symbols": [],
            "user_mentions": []
        },
        "contributors": null,
        "favorited": false,
        "retweet_count": 0,
        "timestamp_ms": "1425930485002",
        "in_reply_to_status_id_str": null,
        "id": 575020247423586305,
        "in_reply_to_user_id": null,
        "user": {
            "time_zone": "Athens",
            "profile_image_url": "http://pbs.twimg.com/profile_images/562968082952368128/0u4u4UmE_normal.jpeg",
            "created_at": "Mon May 14 16:29:41 +0000 2012",
            "profile_banner_url": "https://pbs.twimg.com/profile_banners/580024012/1425304201",
            "profile_sidebar_fill_color": "DDFFCC",
            "following": null,
            "profile_background_tile": false,
            "listed_count": 14,
            "statuses_count": 19519,
            "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/560091394086158337/LDW-IOcI.png",
            "default_profile": false,
            "profile_background_color": "9AE4E8",
            "profile_link_color": "0084B4",
            "protected": false,
            "name": "Gared",
            "favourites_count": 18424,
            "is_translator": false,
            "profile_use_background_image": true,
            "geo_enabled": true,
            "screen_name": "edgard22360",
            "id": 580024012,
            "profile_sidebar_border_color": "BDDCAD",
            "url": "https://takethisgame.wordpress.com",
            "lang": "fr",
            "notifications": null,
            "followers_count": 1556,
            "default_profile_image": false,
            "verified": false,
            "profile_text_color": "333333",
            "utc_offset": 7200,
            "follow_request_sent": null,
            "location": "Testeur JV pour @TakeThisGame_",
            "profile_image_url_https": "https://pbs.twimg.com/profile_images/562968082952368128/0u4u4UmE_normal.jpeg",
            "description": "GEEK Level 21 // Breton // Nintendo // Sony //Cinéphile // Apple Addict // Cookie // Futur CM !",
            "friends_count": 518,
            "contributors_enabled": false,
            "id_str": "580024012",
            "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/560091394086158337/LDW-IOcI.png"
        },
        "lang": "en",
        "in_reply_to_screen_name": null,
        "text": "Quelle Apple Watch choisiras tu ? // Which one will you choose?\nhttp://t.co/ZSdFmZgKi6\n#Apple #AppleWatch #Keynote",
        "coordinates": null,
        "possibly_sensitive": false,
        "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
        "truncated": false,
        "retweeted": false,
        "id_str": "575020247423586305",
        "in_reply_to_user_id_str": null,
        "favorite_count": 0
    } 
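
As noted above, dumping the stream to such a line-delimited file is just a matter of serializing each tweet. Here is a minimal sketch, reusing the credentials and the twitter package from the cells above; the helper name, the output path, and the 10k cap are just assumptions for illustration:

    import json

    def save_tweets(query, filename, limit=10000):
        # Same connection pattern as in print_tweets above
        auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                                   CONSUMER_KEY, CONSUMER_SECRET)
        twitter_stream = twitter.TwitterStream(auth=auth)
        stream = twitter_stream.statuses.filter(track=query)
        with open(filename, "w") as output:
            for count, tweet in enumerate(stream, start=1):
                # Serialize each tweet as one JSON document per line
                output.write(json.dumps(tweet) + "\n")
                if count >= limit:
                    break

    # save_tweets("Apple Watch, Apple Keynote, Apple MacBook", "data/twitter_apple.json")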

Preparation

*For everyone!*

In [3]:
# Nothing to change in this cell
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set some options
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)

Assuming that you've amassed a collection of tweets from the firehose in a line-delimited format, or that you are using the provided dump, one of the easiest ways to load the data into Pandas for analysis is to build a valid JSON array of the tweets.
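
For reference, one possible way to build such a list of JSON strings, assuming the dump lives at data/twitter_apple.json as in the cell below:

    # Read the dump, one JSON document per line, into a list of strings
    with open("data/twitter_apple.json") as dump:
        file_line_list = [line.strip() for line in dump if line.strip()]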


In [4]:
filename = "data/twitter_apple.json"
file_line_list = ...

data = "[{0}]".format(",".join(file_line_list))
df = pd.read_json(data, orient='records')
df.head(3)


Out[4]:
contributors coordinates created_at entities extended_entities favorite_count favorited filter_level geo id ... place possibly_sensitive retweet_count retweeted retweeted_status source text timestamp_ms truncated user
0 NaN None 2015-03-09 21:01:01 {'trends': [], 'urls': [{'expanded_url': 'http... NaN 0 0 low None 5.750386e+17 ... None 0 0 0 NaN <a href="http://ifttt.com" rel="nofollow">IFTT... 'Gold On My MacBook' is the perfect rap song f... 1.425935e+12 0 {'profile_link_color': '009999', 'contributors...
1 NaN None 2015-03-09 21:08:06 {'urls': [], 'symbols': [], 'media': [{'indice... {'media': [{'indices': [125, 140], 'url': 'htt... 0 0 low None 5.750404e+17 ... None 0 0 0 {'extended_entities': {'media': [{'id_str': '5... <a href="http://twitter.com/download/android" ... RT @MisterC00l: Noyer son Apple Watch en or à ... 1.425935e+12 0 {'profile_link_color': '0084B4', 'contributors...
2 NaN None 2015-03-09 21:21:47 {'urls': [{'expanded_url': 'http://phon.es/ylj... {'media': [{'indices': [104, 126], 'url': 'htt... 0 0 low None 5.750438e+17 ... None 0 0 0 {'extended_entities': {'media': [{'id_str': '5... <a href="http://twitter.com/download/android" ... RT @androidcentral: Apple's new MacBook cable ... 1.425936e+12 0 {'profile_link_color': '0084B4', 'contributors...

3 rows × 28 columns

Let's take a look at the columns.


In [5]:
# Nothing to change in this cell
df.columns


Out[5]:
Index(['contributors', 'coordinates', 'created_at', 'entities', 'extended_entities', 'favorite_count', 'favorited', 'filter_level', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'lang', 'limit', 'place', 'possibly_sensitive', 'retweet_count', 'retweeted', 'retweeted_status', 'source', 'text', 'timestamp_ms', 'truncated', 'user'], dtype='object')

The null values in this collection of tweets are caused by "limit notices", which Twitter sends to tell you that you're being rate-limited. Let's count how many rows correspond to these limit notices.


In [6]:
df[...].limit...


Out[6]:
15

This indicates that we received 15 limit notices and means that there are effectively 15 "rows" in our data frame that have null values for all of the fields we'd have expected to see.

Per the Streaming API guidelines, Twitter will only provide up to 1% of the total volume of the firehose, and anything beyond that is filtered out with each "limit notice" telling you how many tweets were filtered out. This means that tweets containing "Apple Watch, Apple Keynote, Apple MacBook" accounted for at least 1% of the total tweet volume at the time this data was being collected.

Analysis

In order to calculate exactly how many tweets were filtered out across the aggregate, we need to "pop" off the column containing the 15 limit notices and sum up the totals across these limit notices. But first, let's capture the limit notices by indexing into the data frame for non-null fields contained in "limit".
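
A minimal sketch of that kind of boolean indexing (the same pattern, negated, is what removes those rows later on):

    # Keep only the rows whose `limit` field is not null, i.e. the limit notices
    limit_notices = df[df.limit.notnull()]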


In [7]:
limit_notices = df[...]
limit_notices.limit.head(3)


Out[7]:
1284    {'track': 40}
2878     {'track': 1}
2899     {'track': 9}
Name: limit, dtype: object

Now we can remove the limit notices from the DataFrame entirely by filtering out the rows that have a null value in the id column.


In [8]:
df = df[...]
df.limit.mean()


Out[8]:
nan

Something special about our DataFrame is that some cells contain dictionaries instead of regular values. For example, each limit notice stores a dictionary with the key track and, as its value, the number of tweets that were rate-limited.


In [9]:
# Nothing to change in this cell
limit_notices.limit


Out[9]:
1284    {'track': 40}
2878     {'track': 1}
2899     {'track': 9}
4694     {'track': 7}
4714     {'track': 6}
6176    {'track': 10}
6314    {'track': 15}
6493    {'track': 11}
7023     {'track': 5}
7483    {'track': 19}
8000     {'track': 3}
8153     {'track': 6}
8552     {'track': 1}
9457     {'track': 4}
9557    {'track': 21}
Name: limit, dtype: object

Therefore, to calculate the number of tweets that were rate-limited we need to sum up all the values of those dictionaries. Another way to do this is to extract the value of the key track into a new column, or even to overwrite the limit column and perform a regular aggregate.
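
A sketch of that aggregation, assuming limit_notices from the cells above:

    # Each limit notice is a dict such as {'track': 40}; extract and sum the values
    rate_limited_tweets = limit_notices.limit.apply(lambda notice: notice["track"]).sum()
    # The number of notices is simply the number of non-null `limit` cells
    total_notices = limit_notices.limit.count()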


In [10]:
rate_limited_tweets = limit_notices.limit.apply(...)...
print("Number of total tweets that were rate-limited", rate_limited_tweets)

total_notices = ...
print("Total number of limit notices", total_notices)


Number of total tweets that were rate-limited 158
Total number of limit notices 15

From this output, we can observe that 158 tweets out of ~10k were not provided, so more than 99% of the tweets about "Apple Watch, Apple Keynote, Apple MacBook" were received for the time period during which they were being captured. In order to learn more about the bounds of that time period, let's create a time-based index on the created_at field of each tweet so that we can perform a time-based analysis.


In [11]:
# Nothing to change
df.set_index('created_at', drop=False, inplace=True)

With a time-based index now in place, we can trivially do some useful things like calculate the boundaries, compute histograms, etc.

Let's see the first and last tweet collected.
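
Since created_at is now a DatetimeIndex, one way to read off the boundaries is simply:

    # Earliest and latest timestamps in the index
    print("First tweet timestamp (UTC)", df.index.min())
    print("Last tweet timestamp (UTC) ", df.index.max())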


In [12]:
df.sort(..., inplace=True)
print("First tweet timestamp (UTC)", ...)
print("Last tweet timestamp (UTC) ", ...)


First tweet timestamp (UTC) 2015-03-09 21:00:45
Last tweet timestamp (UTC)  2015-03-09 21:34:39

It would probably be better to have a wider range of tweets, but it was such a busy time that we collected 10k tweets in a bit more than half an hour. Let's group by minute, calculate the average number of tweets per minute, and plot the results.
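
One possible way to do the grouping, using the minute component of the datetime index set above (a sketch to compare against your own attempt):

    # Group the tweets by the minute of the hour in which they were created
    # and count how many rows fall in each group
    grouped = df.groupby(df.index.minute).aggregate(len)
    tweets_minute = grouped[["text"]]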


In [13]:
grouped = df.groupby(...).aggregate(...) # you need to figure out how to group by minute in the index
tweets_minute = grouped[["text"]]
tweets_minute


Out[13]:
text
0 127
1 573
2 545
3 556
4 495
5 444
6 471
7 492
8 324
19 199
20 386
21 397
22 414
23 395
24 401
25 355
26 286
27 387
28 385
29 363
30 455
31 397
32 417
33 420
34 301

In [14]:
tweets_per_minute = ...
print("Average number of tweets per minute:", tweets_per_minute)
print("Average number of tweets per second:", tweets_per_minute / 60)


Average number of tweets per minute: 399.4
Average number of tweets per second: 6.65666666667

Every second, on average, around 6,000 tweets are posted on Twitter. That roughly means that during the time we collected our data, the Apple Keynote accounted for ~0.1% of all tweets produced. Seen that way it might not seem like a lot, but 1 message out of every 1,000 produced in the world was about Apple or its products.

Let's plot a histogram of the number of tweets per minute.


In [15]:
ax = ...plot(kind="bar", grid=False, figsize=(12, 4))
ax.set_xticklabels(tweets_minute.index, rotation=0)
ax.set_ylabel(...)
ax.set_xlabel(...)
ax.set_title(...)


Out[15]:
<matplotlib.text.Text at 0x7fcbb6962320>

If we had collected data during a whole day, a time-series analysis would be much more interesting, as we could see whether or not the information went viral over time. But with only half an hour, there is not much more we can do. Furthermore, processing all the tweets from a 24-hour period is far more demanding than processing "only" 10k.

Let's now find the Twitter accounts that authored the most tweets and compare that to the total number of unique accounts that appeared. First we need to extract the column user, which contains a dictionary per cell, and create a DataFrame out of it by converting each cell into a Series.


In [16]:
# Nothing to change in this cell
users = df.user.apply(pd.Series)
users.head(3)


Out[16]:
contributors_enabled created_at default_profile default_profile_image description favourites_count follow_request_sent followers_count following friends_count ... profile_sidebar_fill_color profile_text_color profile_use_background_image protected screen_name statuses_count time_zone url utc_offset verified
created_at
2015-03-09 21:00:45 False Thu Jun 04 20:59:45 +0000 2009 True False Realzalea bihotzez, software librea filosofiaz... 94 None 100 None 81 ... DDEEF6 333333 True False kristiansanz 2619 Madrid None 3600 False
2015-03-09 21:00:45 False Fri Aug 14 17:12:23 +0000 2009 False False Spécialiste dans le développement et le référe... 18 None 184 None 131 ... ACACAC 000000 True False sudmedia66 50127 Paris http://www.sud-media66.fr 3600 False
2015-03-09 21:00:45 False Sat Jan 22 18:47:53 +0000 2011 False False None 2413 None 130 None 168 ... DDEEF6 333333 True False ssantoss93 11073 Madrid None 3600 False

3 rows × 38 columns

Now we can count how many tweets there are per author and return the 25 most active (those with the most tweets).


In [17]:
authors_counts = users.groupby(...)...aggregate...
authors_counts.rename(columns=..., inplace=True)
authors_counts.sort(..., ..., inplace=True)
authors_counts[:25]


Out[17]:
tweets
screen_name
asdplayer55 21
iphone_np 13
lexinerus 11
WhinyAppleWatch 10
niftytech_news 10
world_latest 9
JapanTechFeeds 8
asdplayer42 8
technews_today 8
thinkb4talk07 8
gachinko2 8
smartclinic56 8
anghel30637137 7
sonuise 7
GET2GETHERKIM 7
iStantApple 7
aohhAREEYA 6
ios7italia 6
MetroPcTechs 6
Goem9 6
HerodenMark 6
LinkBuilding7 6
SquireStocks 6
Elia_Hicks 5
matematikogreni 5

And also count unique authors.


In [18]:
num_unique_authors = ...
print("There are {0} unique authors out of {1} tweets".format(num_unique_authors, ...))


There are 8535 unique authors out of 9985 tweets

Visualizations

*Oh, Dawson, only you.*

Now we need to get a better intuition about the underlying distribution of tweets and authors, so let's take a quick look at a frequency plot and a histogram. We'll use logarithmic adjustments in both cases.


In [19]:
ax = ...plot(..., grid=False, figsize=(12, 4))  # how can we plot with logarithmic adjustments, google it!
ax.set_ylabel(...)
ax.set_xlabel(...)
ax.set_title(...)
ax.set_xlim([0, 10**4])
ax.set_xscale('log')
ax.set_yscale('log')



In [20]:
ax = ...(log=True, grid=False, bins=10, figsize=(12, 4))
ax.set_ylabel(...)
ax.set_xlabel(...)
ax.set_title(...)


Out[20]:
<matplotlib.text.Text at 0x7fcbb59a29e8>

Although we could filter the DataFrame for coordinates (or locations in user profiles), an even simpler starting point to gain rudimentary insight about where users might be located is to inspect the language field of the tweets and compute the tallies for each language.
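
Tallying the values of a Series is a one-liner; a sketch:

    # Tally tweets by the language code Twitter assigned
    df.lang.value_counts()[:25]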


In [21]:
df...[:25]  # it returns a Series, think about counting values ;)


Out[21]:
en     6680
es      847
ja      686
ru      480
fr      363
tr      246
de      155
it      104
pt       79
in       70
zh       48
th       46
nl       44
und      21
pl       17
ar       14
el       13
sk       12
sv       10
bg        6
hu        4
et        4
is        4
ko        4
no        4
dtype: int64

A staggering number of English speakers were talking about "Apple" at the time the data was collected, bearing in mind that it was still Monday afternoon in America when the Apple Keynote hit the news with the Apple Watch and the new MacBook.

However, there are at least 21 tweets with an undetermined language, und, and we want to know what those languages are. As we know, TextBlob internally uses the Google Translate API to detect the language of a text, while Twitter probably uses its own algorithm. Since Google might support more languages, by detecting the language of the tweets with TextBlob and then checking which detected languages are not in the set of languages Twitter recognized in the collected tweets, we will know which languages are missing.

The first step is to detect the language of the tweets and store the results in a Series. Every time you invoke .detect_language() you are not only creating a TextBlob object, which consumes time and space, but also performing an HTTP request to the Google translation service, and that is very expensive. If the average request takes, say, 250 milliseconds (which would be a lot), then for 10k tweets that's around 40 minutes. Therefore, before running language detection on all the tweets, it is better to test the code at a small scale by slicing off 25 tweets.
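
A minimal sketch of such a helper, wrapped in a try/except so that a failed request does not abort the whole run (it relies on TextBlob's detect_language(), which performs the HTTP call mentioned above):

    import textblob as tb

    def detect_language(text):
        try:
            return tb.TextBlob(text).detect_language()
        except Exception:
            # Fall back to "und" (undetermined) if the request fails
            return "und"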


In [22]:
import textblob as tb

def detect_language(text):
    ...

languages = df.text[:25].apply(detect_language)
languages...


Out[22]:
en    12
es     5
fr     3
ru     3
ja     2
dtype: int64

When the code is ready, just run the next cell. In the meantime, go grab something to eat and get comfortable on the couch until the execution finishes. Relax, it can take a while, but this is how real data analysis works :)

*Good for you, mate*

In [23]:
# Are you ready to execute this cell? Go!
languages = df.text.apply(detect_language)

In [24]:
# Let's see those results counting each language
lang_counts = languages...
lang_counts[:25]


Out[24]:
en       6795
es        816
ja        676
ru        479
fr        318
tr        243
de        147
it         93
pt         81
id         73
th         51
zh-CN      48
nl         42
ar         16
el         13
pl         12
gl         11
ca         10
sv         10
cs          7
da          5
hu          4
ko          4
no          3
et          2
dtype: int64

Let's plot a simple pie chart showing the 10 most used languages.


In [25]:
ax = ...plot(
    ..., figsize=(8,8), autopct='%1.1f%%', shadow=False, startangle=55,
    colors=plt.cm.Pastel1(np.linspace(0., 1., len(lang_counts[:10])))
)
ax.set_title(...)


Out[25]:
<matplotlib.text.Text at 0x7fcbb60b30b8>

Comparing the two sets of unique language codes, we can finally find out which languages are not available in Twitter's detection.
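
A sketch of that comparison, assuming lang_counts holds the counts of the detected languages as in the cell above:

    # Language codes detected by Google that never appear in Twitter's lang field
    set(lang_counts.index) - set(df.lang.unique())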


In [26]:
set(...) ... set(...)  # try playing with sets here


Out[26]:
{'az',
 'ca',
 'cs',
 'gl',
 'id',
 'ig',
 'la',
 'mk',
 'mt',
 'sl',
 'sq',
 'uz',
 'yo',
 'zh-CN'}

However, there are also languages recognized by Twitter that are not recognized by Google. Surprise, surprise...


In [27]:
...


Out[27]:
{'ht', 'in', 'sk', 'tl', 'uk', 'und', 'zh'}

Let's keep only the 140 characters of text from the tweets written in English and learn more about the reaction.


In [28]:
en_text = df...
en_text[:25]


Out[28]:
text
created_at
2015-03-09 21:00:46 Why I wound up buying a Pebble​ Watch instead....
2015-03-09 21:00:46 RT @BobScottCPA: How is the Apple Watch "state...
2015-03-09 21:00:46 #AppleWatch Can't wait to wear one :) http://t...
2015-03-09 21:00:46 RT @MEMMOSdubai: Should you buy the $10,000 go...
2015-03-09 21:00:46 #setting4success Hands On With The Apple Watch...
2015-03-09 21:00:46 #Apple redefines itself as a luxury brand htt...
2015-03-09 21:00:46 RT @guardiantech: #Apple's new #MacBook comes ...
2015-03-09 21:00:46 RT @Fallonam: You guys we did this. We gave Ap...
2015-03-09 21:00:46 What Your Favorite Apps Look Like On Apple Wat...
2015-03-09 21:00:47 RT @TechCrunch: Apple Watch Will Ship On April...
2015-03-09 21:00:47 RT @selenalarson: For the price of a high-end ...
2015-03-09 21:00:47 RT @petercoffee: What if trusted data, from an...
2015-03-09 21:00:47 Apple Watch and new MacBook announcement: all ...
2015-03-09 21:00:47 RT @JamesTylerESPN: Louis van Gaal: definitely...
2015-03-09 21:00:48 @ZenTriathlon I'll just tow my Apple watch beh...
2015-03-09 21:00:48 $10k for a watch? ABSOLUTELY, APPLE! #nawwwwwt
2015-03-09 21:00:48 Someone give me $10,000 for a gold Apple watch...
2015-03-09 21:00:48 "@AppStore: The Watch is coming. 4.24.15. htt...
2015-03-09 21:00:48 Check Out The Apple Watch [video] http://t.co/...
2015-03-09 21:00:48 RT @edaccessible: Use This Ingenious Trick to ...
2015-03-09 21:00:48 CNET: Apple Watch hands-on: Release date April...
2015-03-09 21:00:48 RT @PRNews: Top #PR Takeaways From the Apple W...
2015-03-09 21:00:49 RT @ClaraJeffery: If you spend $10k on a first...
2015-03-09 21:00:49 RT @AppStore: Mac in its purest form ever. The...
2015-03-09 21:00:49 RT @mashabletech: Should you buy the $10,000 g...

Reception

Let's now tokenize those tweets and count the words to get an initial glance at what is being talked about. The first step is to apply tokenization to each cell and flatten the Series' values into a single list.


In [29]:
from nltk.tokenize import word_tokenize

en_text["tokens"] = en_text.text.apply(...).values
words = sum(en_text["tokens"], [])
words[:25]


Out[29]:
['Why',
 'I',
 'wound',
 'up',
 'buying',
 'a',
 'Pebble\u200b',
 'Watch',
 'instead..',
 'Apple',
 'Watch',
 'vs.',
 'the',
 'competition',
 ':',
 'Where',
 'does',
 'it',
 'stand',
 '?',
 'http',
 ':',
 '//t.co/VyqVd17Y7T',
 'RT',
 '@']

Now we can build a Series from the flattened list of words in order to show the 25 most common words.


In [30]:
pd.Series(words)...[:25]


Out[30]:
:          10773
Apple       6579
http        6011
Watch       4896
the         3534
@           3451
RT          2534
.           2238
#           2072
,           2006
to          1520
The         1437
$           1346
watch       1274
new         1246
a           1177
for         1127
is          1068
MacBook     1007
's           930
in           851
I            835
of           803
?            767
you          728
dtype: int64

Not surprisingly, "Apple" is the most frequently occurring token; there are lots of retweets (actually, "quoted retweets") as evidenced by "RT", and lots of stopwords (such as "the", "a", etc.) and characters used in the Twitter universe ("@" designates user names, "#" hashtags) at the top of the list. Let's further remove some of the noise by filtering out stopwords.
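
A sketch of that filtering with NLTK's stopword list (remember that the words in the list are lowercase):

    import nltk  # requires nltk.download('stopwords') the first time

    english_stopwords = nltk.corpus.stopwords.words("english")
    # Keep only the tokens whose lowercase form is not an English stopword
    clean_words = [word for word in words if word.lower() not in english_stopwords]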


In [31]:
import nltk
nltk.download('stopwords')

english_stopwords = nltk...
clean_words = ...  # remember that in english_stopwords words are lowercase
        
pd.Series(clean_words)...[:25]


[nltk_data] Downloading package stopwords to /home/versae/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[31]:
:           10773
Apple        6579
http         6011
Watch        4896
@            3451
RT           2534
.            2238
#            2072
,            2006
$            1346
watch        1274
new          1246
MacBook      1007
's            930
?             767
...           658
apple         658
''            634
10,000        627
``            583
!             544
gold          536
-             506
AppStore      418
https         416
dtype: int64

Now the data looks a bit cleaner, although we still have those "@" and "#", plus the "http" coming from all the URLs contained in the tweets. Removing those is tricky, since we can't just ignore them: if we did, we might lose valuable information, like who the most mentioned users were or which hashtag was the most shared. Instead, we are going to tweak NLTK's PunktWordTokenizer by modifying some of the regular expressions it uses internally.

The first one is the regular expression that expresses which characters cannot appear within words. Take a look at the actual regular expression and make the necessary changes to allow "@" and "#" to be part of a word.


In [32]:
# Original (?:[?!)\";}\]\*:@\'\({\[])
re_non_word = r"(?:[?!)\";}\]\*:@\'\({\[])"

The second regular expression excludes some characters from starting word tokens. Try to modify it to allow "@" and "#" to be part of a word too.


In [33]:
# Original [^\(\"\`{\[:;&\#\*@\)}\]\-,]
re_word_start = r"[^\(\"\`{\[:;&\#\*@\)}\]\-,]"

In [34]:
# Nothing to change in this cell
import re
from nltk.tokenize import PunktWordTokenizer
from nltk.tokenize.punkt import PunktLanguageVars

lang_vars = PunktLanguageVars()
lang_vars._re_word_tokenizer = re.compile(lang_vars._word_tokenize_fmt % {
    'NonWord':   re_non_word,
    'MultiChar': lang_vars._re_multi_char_punct,
    'WordStart': re_word_start,
}, re.UNICODE | re.VERBOSE)
tokenizer = PunktWordTokenizer(lang_vars=lang_vars)

def twitter_tokenize(value):
    return tokenizer.tokenize(value)

Now we can use the function twitter_tokenize defined above to tokenize our tweets. Once we are sure that usernames and hashtags are safe, we can extend english_stopwords to also include regular string punctuation characters, plus some other random words that are of no interest or are glitches of tokenization.


In [35]:
import string

word_lists = en_text.text.apply(...)
words = sum(word_lists.values, [])

english_stopwords += list(string.punctuation)
english_stopwords += ["http", "https", "...", "'s", "'t"]

clean_words = ...
pd.Series(clean_words)...[:25]


Out[35]:
Apple                6411
Watch                4764
RT                   2533
new                  1243
watch                1155
MacBook               881
$10,000               617
apple                 550
gold                  535
App                   385
@AppStore             378
New                   361
buy                   359
Look                  351
coming.               341
4.24.15.              335
event                 307
//t.co/4iiurTDTt9     289
Apple’s               285
First                 280
look                  255
standard              255
launching             254
douchebag             253
thanking              252
dtype: int64

What a difference removing a little bit of noise can make! We now see much more meaningful data appear at the top of the list. Actually, it can almost be read as a single sentence: "Apple Watch RT new watch $10,000, MacBook gold. @AppStore New buy, coming 4.24.15 t.co/4iiurTDTt9". The URL, in fact, takes you to the Apple Watch site.

There is, however, an unexpected term: douchebag. We are not sure, but it could be related to the fact that the price of the watch seems to be $10,000.

Let's now treat the problem as one of discovering statistical collocations to get more insight about the phrase contexts.
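
NLTK makes this straightforward once the cleaned tokens are wrapped in an nltk.Text object; a sketch:

    import nltk

    # Build a Text object from the cleaned tokens and print its collocations
    text = nltk.Text(clean_words)
    text.collocations()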


In [36]:
text = nltk.Text(...)
text...


Apple Watch; coming. 4.24.15.; apple watch; 4.24.15.
//t.co/4iiurTDTt9; standard douchebag; @AnnaKendrick47 thanking;
douchebag detection.; //t.co/4iiurTDTt9 //t.co/3Bedz37DAy; gold
standard; launching $10,000; Companion App; Favorite Apps; Watch
coming.; $10,000 apple; First Enterprise; new MacBook; internet
reacts; Salesforce First; App Jump; Enterprise App

Even without any prior analysis on tokenization, it's pretty clear what the topic is about as evidenced by this list of collocations. But what about the context in which these phrases appear? Toward the bottom of the list of commonly occurring words, the words "watch" and "$10,000" appear. The word "favorite" is interesting, because it is usually the basis of an emotional reaction, and we're interested in examining the reaction. And "thanking" could also mean something. What about the word "douchebag"? What might it mean? The concordance will help us to find out.


In [37]:
# Print concordance for watch, favorite, and $10,000
text...
text...
text...


Displaying 25 of 5943 matches:
                                     Watch instead .. Apple Watch vs. competiti
                                     Watch vs. competition stand //t.co/VyqVd17
                                     Watch state art even beam time ceiling lik
 @MEMMOSdubai buy $10,000 gold Apple Watch helpful flowchart //t.co/Crn3LPr7Ve 
ai #UAE #setting4success Hands Apple Watch //t.co/VRGA6NaWXP #Mobile #Tablets #
nam guys this. gave Apple power sell watch $10k. blame. #AppleLive #AppleWat… F
leWat… Favorite Apps Look Like Apple Watch Plus New Ones //t.co/XIweD7HZ8f RT @
t.co/XIweD7HZ8f RT @TechCrunch Apple Watch Ship April 24 Cost $349 $10,000+ Dep
T @selenalarson price high-end Apple Watch fund entire Charity Water project tr
 //t.co/Zybh0uto7B @Salesforce Apple Watch new MacBook announcement news Apple’
s van Gaal definitely type buy Apple Watch leave park bench. @ZenTriathlon 'll 
k bench. @ZenTriathlon 'll tow Apple watch behind little waterproof boat $10k w
h behind little waterproof boat $10k watch ABSOLUTELY APPLE #nawwwwwt Someone g
wwwt Someone give $10,000 gold Apple watch @AppStore Watch coming. 4.24.15. //t
e $10,000 gold Apple watch @AppStore Watch coming. 4.24.15. //t.co/DQuKRR3Bur /
mGlLA @VictoriaDemenko 😂 Check Apple Watch video //t.co/8QCihNEuAV RT @edaccess
e Ingenious Trick Choose Right Apple Watch Size //t.co/QMVOCKHRjx @TIME CNET Ap
e //t.co/QMVOCKHRjx @TIME CNET Apple Watch hands-on Release date April 24 price
q RT @PRNews Top #PR Takeaways Apple Watch Preview //t.co/TMNfdPZSc0 RT @ClaraJ
raJeffery spend $10k first-gen Apple watch check every priority life. RT @AppSt
@mashabletech buy $10,000 gold Apple Watch helpful flowchart @MaxKnoblauch //t.
wM //t.co/C… RT @charlesarthur Apple Watch watch way first iPhone phone. people
.co/C… RT @charlesarthur Apple Watch watch way first iPhone phone. people want 
irst iPhone phone. people want apple watch start saving $10,000 //t.co/SEzAGJeK
eKEy Australia among first get Apple Watch //t.co/Syx3GPCuvS RT @AppStore There
Displaying 25 of 145 matches:
$10k. blame. #AppleLive #AppleWat… Favorite Apps Look Like Apple Watch Plus Ne
wristwatch histo //t.co/ZS1WpLr3dF Favorite Apps Look Like Apple Watch Plus Ne
cBook //t.co/OoLf5Wqe5q //t.co/59… Favorite Apps Look Like Apple Watch Plus Ne
nk @tim_cook much apple watch cost Favorite Apps Look Like Apple Watch Plus Ne
ptYjuc March 09 2015 04 55PM #What Favorite Apps Look Like Apple Watch Plus Ne
today Apple news //t.co/ARphq3OmgT Favorite Apps Look Like Apple Watch Plus Ne
pple new MacBook //t.co/dLfNvRFk4n Favorite Apps Look Like Apple Watch Plus Ne
Apple Watch font //t.co/pjz2OJ5zZv Favorite Apps Look Like Apple Watch Plus Ne
pple Watch event //t.co/9OZCPIQZku Favorite Apps Look Like Apple Watch Plus Ne
l @danielas_bot @ThisIsFusion //t… Favorite Apps Look Like Apple Watch Plus Ne
Apple Watch apps //t.co/qbN3jyT8hn Favorite Apps Look Like Apple Watch Plus Ne
/t.co/Wg9I3EprvT //t.co/ye5W7t5d2x Favorite Apps Look Like Apple Watch Plus Ne
pple Watch pushed back month haha. Favorite Apps Look Like Apple Watch Plus Ne
/t.co/n4Y2ZacIwB //t.co/dch3Kh0SUn Favorite Apps Look Like Apple Watch Plus Ne
/t.co/U62E2iyNT6 //t.co/jRtWddN0Wi Favorite Apps Look Like Apple Watch Plus Ne
tible Apple Watch cases size. Swap Favorite Apps Look Like Apple Watch Plus Ne
r person wearing $600 Apple watch. Favorite Apps Look Like Apple Watch Plus Ne
pple new MacBook //t.co/CNA4Q564bf Favorite Apps Look Like Apple Watch Plus Ne
ant @lilyhnewman //t.co/bn3LWYOJsE Favorite Apps Look Like Apple Watch Plus Ne
down PC Magazine //t.co/kCCuRpOLpm Favorite Apps Look Like Apple Watch Plus Ne
 -- Live Blog 2- //t.co/MMwNSRzrrD Favorite Apps Look Like Apple Watch Plus Ne
s start April 10 //t.co/un2pGcE2i4 Favorite Apps Look Like Apple Watch Plus Ne
t.co/T9quScjxGI $17K beauty yours. Favorite Apps Look Like Apple Watch Plus Ne
ch Plus New Ones //t.co/X0uDYzCNTb Favorite Apps Look Like Apple Watch Plus Ne
cial Apple event //t.co/zsyo8mFx1A Favorite Apps Look Like Apple Watch Plus Ne
Displaying 25 of 617 matches:
t.co/7rL7SBLEvy RT @MEMMOSdubai buy $10,000 gold Apple Watch helpful flowchart 
LUTELY APPLE #nawwwwwt Someone give $10,000 gold Apple watch @AppStore Watch co
.co/jRtWddN0Wi RT @mashabletech buy $10,000 gold Apple Watch helpful flowchart 
eople want apple watch start saving $10,000 //t.co/SEzAGJeKEy Australia among f
Kendrick47 thanking Apple launching $10,000 apple watch new gold standard douch
.co/X878RvKjty RT @cjwerleman think $10,000 Apple watch popular amongst muggers
00+ RT @wesleyfenlon Yeah could buy $10,000 Apple Watch could buy 20,000 Reeses
t.co/5Plg8S4rVe RT @MEMMOSdubai buy $10,000 gold Apple Watch helpful flowchart 
Kendrick47 thanking Apple launching $10,000 apple watch new gold standard douch
t.co/ZkpBsfg4zX RT @MEMMOSdubai buy $10,000 gold Apple Watch helpful flowchart 
H9WlF Apple Watch ranges price $349 $10,000 sale April 24 //t.co/8BOiosAnED RT 
t.co/X7RwknBuOG RT @MEMMOSdubai buy $10,000 gold Apple Watch helpful flowchart 
Kendrick47 thanking Apple launching $10,000 apple watch new gold standard douch
t.co/e5TtdeRet4 RT @LATimesGraphics $10,000 Apple Watch Here’s many Apple produ
t.co/fqenQ08jOw RT @MEMMOSdubai buy $10,000 gold Apple Watch helpful flowchart 
WatkT Haha apple watch edition gold $10,000 hay pewwww RT @AppStore Mac purest 
V5DfYA9wOX RT @TheSeanBrewster glad $10,000 Apple watch look like poor person w
keginn sorry Apple one watch 'd pay $10,000 //t.co/TC7PyMUXfY Live Apple reveal
 //t.co/UoP0u2iELL RT @mashable buy $10,000 gold Apple Watch flowchart help //t
 //t.co/ZLrwdxpTvF RT @mashable buy $10,000 gold Apple Watch flowchart help //t
RT @andylassner way hell 'm getting $10,000 Apple watch unless really rich talk
Kendrick47 thanking Apple launching $10,000 apple watch new gold standard douch
Kendrick47 thanking Apple launching $10,000 apple watch new gold standard douch
t.co/GbYeFKRicE RT @MEMMOSdubai buy $10,000 gold Apple Watch helpful flowchart 
Kendrick47 thanking Apple launching $10,000 apple watch new gold standard douch

In [38]:
# Print concordance for thanking, and douchebag
text...
text...


Displaying 25 of 252 matches:
t.co/7ICkQWTHGC RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/3Bedz37DAy RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/wbOQ9uWhI7 RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
c @TheEllenShow RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
 Apple money ?” RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/Mzr5A5Uufw RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
ebag detection. RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/v4Q41A5QGO RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
iJy0pozOa5 THIS&gt @AnnaKendrick47 thanking Apple launching $10,000 apple watc
XyyPf7qf #apple RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
//t.co/5U0xPydGyN “@AnnaKendrick47 thanking Apple launching $10,000 apple watc
ife like iPhon… RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
f2wLGnvMF #tech RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
rice bracket xD RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/UKxSvbWzqN RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/zuCMt8nU4N RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/W94WGibGiV RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
 makes immortal RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
cking Internet. RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/D4ck0CuCHA RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/iugi0MUGuf RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
ng money stuff. RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/jSHGFf4yIU RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
t.co/Te5Pb58msM RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
ebag detection. RT @AnnaKendrick47 thanking Apple launching $10,000 apple watc
Displaying 25 of 253 matches:
,000 apple watch new gold standard douchebag detection. update iOS Apple puts n
,000 apple watch new gold standard douchebag detection. Apple releases new vide
,000 apple watch new gold standard douchebag detection. #telemarketing Apple im
,000 apple watch new gold standard douchebag detection. #telemarketing Apple Wa
,000 apple watch new gold standard douchebag detection. RT @edaccessible Use In
,000 apple watch new gold standard douchebag detection. RT @AnnaKendrick47 than
,000 apple watch new gold standard douchebag detection. RT @luufycom New Apple 
,000 apple watch new gold standard douchebag detection. Apple Watch proof human
,000 apple watch new gold standard douchebag detection. new MacBook shows San F
,000 apple watch new gold standard douchebag detection. Apple Watch sounds like
,000 apple watch new gold standard douchebag detection.” RT @WarrenHolstein For
,000 apple watch new gold standard douchebag detection. need Apple watch last t
,000 apple watch new gold standard douchebag detection. RT @holman Conference s
,000 apple watch new gold standard douchebag detection. #mktg new MacBook shows
,000 apple watch new gold standard douchebag detection. iOS 8.2 come Apple watc
,000 apple watch new gold standard douchebag detection. Get Free Apple Watch //
,000 apple watch new gold standard douchebag detection. RT @TechCrunch Apple Wa
,000 apple watch new gold standard douchebag detection. Apple watch hasn impres
,000 apple watch new gold standard douchebag detection. RT @CityNews 5 things n
,000 apple watch new gold standard douchebag detection. RT @desusnice New life 
,000 apple watch new gold standard douchebag detection. Amazing Apple set buy t
,000 apple watch new gold standard douchebag detection. Hey apple think make ne
,000 apple watch new gold standard douchebag detection. Favorite Apps Look Like
,000 apple watch new gold standard douchebag detection. RT @AnnaKendrick47 than
,000 apple watch new gold standard douchebag detection. RT @Joemon871 @tldtoday
*$10,000 watch*

It would appear that there is indeed a common thread of amazement in the data, although it's not evident whether it is positive or negative. @AnnaKendrick47 turns out to be Anna Kendrick, an actress who recently appeared as a presenter at the Academy Awards and who has played a role in The Twilight Saga.

But what else can usernames, hashtags, and URLs tell us? All that information is in our entities column, but remember that each entities cell is a dictionary containing several keys. Let's now extract the value counts for screen_name in user_mentions, text in hashtags, and expanded_url in urls, and then show the 25 most frequent of each.
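
A possible shape for the first of those helpers (the same pattern, with the keys text and expanded_url, works for hashtags and urls):

    # Collect every screen_name mentioned in a tweet's entities dictionary
    def list_screen_name_in_user_mentions(entities):
        return [mention["screen_name"] for mention in entities.get("user_mentions", [])]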


In [39]:
# screen_name's in user_mentions
def list_screen_name_in_user_mentions(user_mention):
    ...

screen_name_lists = df.entities.apply(list_screen_name_in_user_mentions)
screen_names = pd.Series(sum(screen_name_lists.values, []))...
screen_names[:25]


Out[39]:
AppStore          421
AnnaKendrick47    254
sosyalansmedya    149
verge              81
TechCrunch         66
Chris_Vanouu       64
gizmodojapan       57
YouTube            47
MEMMOSdubai        46
mashable           44
wylsacom           39
TIME               38
sageboggs          38
engadget           37
TheOnion           31
tomstandage        29
waltmossberg       29
ComplexMag         28
edaccessible       27
desusnice          25
androidcentral     24
CNET               23
Recode             22
shutupmikeginn     21
BMWi               20
dtype: int64

As we see above, AnnaKendrick47 is mentioned more than half as often as AppStore. Then come some tech media companies and blogs, including the funny TheOnion.


In [40]:
# text's in hashtags
def list_text_in_hastags(hashtag):
    ...

hastag_lists = df.entities.apply(list_text_in_hastags)
hastags = pd.Series(sum(hastag_lists.values, []))...
hastags[:25]


Out[40]:
Apple              243
AppleWatch         190
apple              147
AppleLive          131
tech               118
MacBook             79
news                75
applewatch          67
Tech                51
technology          49
watch               49
AppleEvent          29
wwwhatsnew          28
News                25
changepenang        24
Dubai               23
Technology          23
UAE                 23
Watch               23
Macbook             21
AppleWatchEvent     21
iphone              18
android             18
tecnologia          16
новости             16
dtype: int64

As for the hashtags, there is nothing really interesting here. The only one worth mentioning is новости; let's see what it means in English.


In [41]:
tb.TextBlob...string


Out[41]:
'news'

Well, it means "news".

Finally, let's take a look at the URLs.


In [42]:
# expanded_url's in urls
def list_expanded_url_in_urls(url):
    ...

url_lists = df.entities.apply(list_expanded_url_in_urls)
urls = pd.Series(sum(url_lists.values, []))...
urls[:25]


Out[42]:
http://apple.com/watch                                          338
https://amp.twimg.com/v/ee08d048-6b15-4544-9dda-46b991a954f7    262
http://ift.tt/1BkhBpV                                            49
http://apple.com/macbook                                         49
http://ift.tt/196oFw0                                            43
http://ift.tt/1NBugtG                                            42
http://ift.tt/1F4vwSD                                            41
http://ift.tt/1AacV0Z                                            41
http://apple-watch-news.com                                      40
http://japan.cnet.com/sp/apple_watch/35061515/?tag=as.rss        34
http://bit.ly/1KMXXck                                            33
http://ift.tt/1MmmwbR                                            33
http://ift.tt/1Aab7VR                                            32
http://ift.tt/1FBpBm6                                            32
http://ift.tt/1wm5sAA                                            32
http://theverge.com/e/7941194                                    32
http://ift.tt/1Mmmvoe                                            31
http://apple.com/live                                            30
http://ift.tt/1Aab8ZJ                                            29
http://engt.co/1Aad1FU                                           29
http://xkcd.com/1420/                                            29
http://trib.al/OKvbiFO                                           28
http://ln.is/time.com/3737949/app/lHrJg                          27
http://read.bi/1AXKvrd                                           23
http://ift.tt/196fWtT                                            23
dtype: int64

A lot of the same: reviews and links to official pages. An exception is this XKCD comic.

*[Watches](http://xkcd.com/1420/)*

First The Onion, now XKCD... It is starting to seem like a lot of tweets had a comical twist, whether sarcasm or just jokes. Let's take a look at the most popular tweets to confirm it. We could use the number of retweets, retweet_count, to find them. However, it looks like in our dataset everyone was retweeting by embedding the tweets (using "RT @someone: ..."), and therefore retweet_count is always zero. There is still a way to find the most retweeted status, but I will leave that to you to figure out.


In [43]:
retweets = en_text[...]
grouped_retweets = retweets.groupby...aggregate...
grouped_retweets.sort(..., ...)[:25]


Out[43]:
text
text
RT @AnnaKendrick47: We should be thanking Apple for launching the $10,000 "apple watch" as the new gold standard in douchebag detection. 246
RT @AppStore: The Watch is coming. 4.24.15. http://t.co/4iiurTDTt9\nhttps://t.co/3Bedz37DAy 225
RT @sageboggs: Is there a way to sync my Google Glass to my Apple Watch? I want to ensure I won't kiss anyone for the rest of my life 38
RT @AppStore: Mac in its purest form ever. The new MacBook. http://t.co/U62E2iyNT6 http://t.co/jRtWddN0Wi 33
RT @verge: 10 things you can buy instead of a $10,000 Apple Watch http://t.co/skzHFWDCX0 http://t.co/RrZxqxN188 28
RT @tomstandage: Genius from xkcd, as Apple prepares to launch its Watch http://t.co/OzcVuUoYJP http://t.co/nvhMdfw9vy 27
RT @ComplexMag: Everything you need to know about the Apple Watch and the New Gold MacBook: http://t.co/aylpX2MQi4 http://t.co/ORUNZwS0nh 27
RT @edaccessible: Use This Ingenious Trick to Choose the Right Apple Watch Size http://t.co/QMVOCKHRjx @TIME 27
RT @MEMMOSdubai: Should you buy the $10,000 gold Apple Watch? A helpful flowchart http://t.co/Crn3LPr7Ve @MEMMOSDubai #Dubai #UAE 23
RT @shutupmikeginn: sorry Apple, there's only one watch I'd pay $10,000 for: http://t.co/TC7PyMUXfY 21
RT @AppStore: The Watch is coming. 4.24.15. http://t.co/4iiurTDTt9\nhttps://t.co/kkGlBBJIB9 18
RT @WSJD: .@GeoffreyFowler says today's Apple Watch event was about "quantity over quality": http://t.co/FT8MI4sqlC http://t.co/c9v44NU4V3 18
RT @RANsquawk: Loving my Apple Watch $AAPL http://t.co/c75QFoDSS3 18
RT @TheOnion: From The Archives: Interim Apple Chief Under Fire After Unveiling Grotesque New MacBook http://t.co/OoLf5Wqe5q http://t.co/59… 17
RT @AppStore: The Watch is coming. 4.24.15. http://t.co/4iiurTDTt9\nhttps://t.co/JY0qjXldir 17
RT @desusnice: JIGS UP RT @engadget: These $79 dongles will add more ports to Apple's new MacBook http://t.co/dLfNvRFk4n 16
RT @AppStore: The Watch is coming. 4.24.15. http://t.co/4iiurTDTt9\nhttps://t.co/Z18Ue9rp7X 16
RT @SarcasticRover: I cost 250,000 Apple Watch Editions. 16
RT @waltmossberg: Apple Watch Selection Flowchart (Comic) http://t.co/TNfdhECT9P via @Recode http://t.co/jSHGFf4yIU 16
RT @robfee: Just preordered my $10,000 Apple Watch and I couldn't be more excited! http://t.co/r9qJ1lWIv1 15
RT @andylassner: No way in hell I'm getting a $10,000 Apple watch unless a really rich talk show host buys me one. \n\ncc @TheEllenShow 13
RT @Rachelskirts: And nine—nine Apple Watch Editions were gifted to the race of men, who above all else desire power. 13
RT @AppStore: The Watch is coming. 4.24.15. http://t.co/4iiurTDTt9\nhttps://t.co/hI2YFOSf0f 12
RT @bmw: .@BMWi is excited to provide one of the apps for Apple Watch when it becomes available in April. Stay tuned! http://t.co/CSfKmneOAT 12
RT @TheTweetOfGod: The Apple Watch may become so addictive it keeps people from looking at what's truly important in life, like their iPhon… 12

As you can see, there are lots of interesting tweet entities that give you helpful context for the announcement. One particularly notable observation is the appearance of "comedic accounts" relaying a certain amount of humor. And now we can see why the tweet by @AnnaKendrick47, the first one in the table above, became so popular.

When you take a closer look at some of the stories that developed, you also see sarcasm, disbelief, and even a bit of spam. Everything has a place on Twitter.

Sentiment

Identifying sarcasm or irony is a very hard task for machines, as it demands a lot of background knowledge. We can, however, take a look at how sentiment evolved over time, and whether on average the reaction to the announcement of the new watch and laptop was positive or negative.

Let's start by adding new columns to our DataFrame en_text that capture the sentiment of the tweets. But instead of using the default polarity analyzer, we will use a NaiveBayesAnalyzer().
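
For reference, TextBlob's NaiveBayesAnalyzer returns a named tuple (classification, p_pos, p_neg); a self-contained sketch of such a helper:

    import pandas as pd
    import textblob as tb
    from textblob.sentiments import NaiveBayesAnalyzer

    analyzer = NaiveBayesAnalyzer()

    def extract_sentiment(value):
        # sentiment is Sentiment(classification, p_pos, p_neg)
        sentiment = tb.TextBlob(value, analyzer=analyzer).sentiment
        return pd.Series([sentiment.classification, sentiment.p_pos, sentiment.p_neg])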


In [44]:
from textblob.sentiments import NaiveBayesAnalyzer
analyzer = NaiveBayesAnalyzer()

def extract_sentiment(value):
    ...
    return pd.Series([sentiment, pos_prob, neg_prob])

en_text[...] = en_text.text.apply(extract_sentiment)
en_text[["text", "sentiment", "pos_prob", "neg_prob"]].head(3)


Out[44]:
text sentiment pos_prob neg_prob
created_at
2015-03-09 21:00:46 Why I wound up buying a Pebble​ Watch instead.... pos 0.949290 0.050710
2015-03-09 21:00:46 RT @BobScottCPA: How is the Apple Watch "state... neg 0.260906 0.739094
2015-03-09 21:00:46 #AppleWatch Can't wait to wear one :) http://t... neg 0.443928 0.556072

Now we can see how many positive and negative tweets there are in the dataset.


In [45]:
sentiments = en_text.groupby(...).aggregate({
    ...,
    ...,
    ...,
})
sentiments


Out[45]:
pos_prob text neg_prob
sentiment
neg 0.379798 1057 0.620202
pos 0.799055 5623 0.200945

In [46]:
ax = ...plot(..., figsize=(8, 4))
ax.set_title(...)
ax.set_xlabel(...)


Out[46]:
<matplotlib.text.Text at 0x7fcb9d5b3400>

It seems like the perception of the new products has been mostly positive. But is it the same for both the watch and the laptop? Let's add a new column keyword with the value "watch" if the tweet contains the word "watch" (ignoring case), "macbook" if it contains "macbook", "both" if it contains both, and np.NaN otherwise.
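
One possible implementation of that rule, matching on the lowercased text (np is NumPy, as imported at the top of the notebook):

    import numpy as np

    def get_keyword(value):
        text = value.lower()
        has_watch = "watch" in text
        has_macbook = "macbook" in text
        if has_watch and has_macbook:
            return "both"
        if has_watch:
            return "watch"
        if has_macbook:
            return "macbook"
        return np.NaN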


In [47]:
def get_keyword(value):
    ...

en_text["keyword"] = en_text.text.apply(get_keyword)
en_text["keyword"][:25]


Out[47]:
created_at
2015-03-09 21:00:46      watch
2015-03-09 21:00:46      watch
2015-03-09 21:00:46      watch
2015-03-09 21:00:46      watch
2015-03-09 21:00:46      watch
2015-03-09 21:00:46        NaN
2015-03-09 21:00:46    macbook
2015-03-09 21:00:46      watch
2015-03-09 21:00:46      watch
2015-03-09 21:00:47      watch
2015-03-09 21:00:47      watch
2015-03-09 21:00:47        NaN
2015-03-09 21:00:47       both
2015-03-09 21:00:47      watch
2015-03-09 21:00:48      watch
2015-03-09 21:00:48      watch
2015-03-09 21:00:48      watch
2015-03-09 21:00:48      watch
2015-03-09 21:00:48      watch
2015-03-09 21:00:48      watch
2015-03-09 21:00:48      watch
2015-03-09 21:00:48      watch
2015-03-09 21:00:49      watch
2015-03-09 21:00:49    macbook
2015-03-09 21:00:49      watch
Name: keyword, dtype: object

Now we can create a pivot table to see the average probability values and the number of tweets per keyword and sentiment.


In [48]:
pd.pivot_table(en_text, index=..., aggfunc={
    ...,
    ...,
    ...,
})


Out[48]:
neg_prob pos_prob text
keyword sentiment
both neg 0.734580 0.265420 9
pos 0.140977 0.859023 235
macbook neg 0.624145 0.375855 140
pos 0.183667 0.816333 724
watch neg 0.616332 0.383668 872
pos 0.206179 0.793821 4592

Conclusions

We aspired to learn more about the general reaction to Apple's announcement by taking an initial look at the data from Twitter's firehose, and it's fair to say that we learned a few things about the data without too much effort. Lots more could be discovered, but a few of the themes that we were able to glean included...

What about the sentiment? Did the specific product make any difference in the sentiment? And in the number of tweets?

What other impressions did you get after running the analysis?


*The real apple watch*