Data Exploration

Goals

  1. Introduction to Sentiment Analysis use case
  2. How to quickly understand a real world data set

Introduction

A useful application of machine learning is "sentiment analysis": trying to determine whether a person feels positively or negatively about the thing they are writing about. One important use of sentiment analysis is helping marketing departments understand what people are saying about them on social media. Nearly every medium or large company with any sort of social media presence does some kind of sentiment analysis like the task we are about to do.

Here we have a collection of tweets from the tech conference SXSW talking about Apple and Google products. These tweets were hand-labeled by humans using a tool I built called CrowdFlower. Our goal is to build a classifier that can generalize the human labels to more tweets.

The labels are what's known as training data, and we're going to use them to teach our classifier which text is positive sentiment and which is negative sentiment.

Let's take a look at our data. Machine learning classes tend to talk mostly about algorithms, but in practice, machine learning practitioners usually spend most of their time looking at their data.

This is a real data set, not a toy one, and I've left it uncleaned so you will have to work through a few of the messy issues that almost always come up in the real world.


In [2]:
# Our data file is ../scikit/tweets.csv
# in a Comma Separated Values format
# this command uses the shell to print out the first ten lines
!head ../scikit/tweets.csv


tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.",iPhone,Negative emotion
"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",iPad or iPhone App,Positive emotion
@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,iPad,Positive emotion
@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,iPad or iPhone App,Negative emotion
"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) & Matt Mullenweg (Wordpress)",Google,Positive emotion
@teachntech00 New iPad Apps For #SpeechTherapy And Communication Are Showcased At The #SXSW Conference http://ht.ly/49n4M #iear #edchat #asd,,No emotion toward brand or product
,,No emotion toward brand or product
"#SXSW is just starting, #CTIA is around the corner and #googleio is only a hop skip and a jump from there, good time to be an #android fan",Android,Positive emotion
Beautifully smart and simple idea RT @madebymany @thenextweb wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaVOB,iPad or iPhone App,Positive emotion

Ok, that looks good - if a little messy. Let's open the file with some Python.

Loading Data


In [2]:
import pandas as pd    # this loads the pandas library, a very useful data exploration library
import numpy as np     # this loads numpy, a very useful numerical computing library

# Puts tweets into a data frame
df = pd.read_csv('../scikit/tweets.csv') # read the file into a pandas data frame
print(df.head())       # print the first few rows of the data frame


                                          tweet_text  \
0  .@wesley83 I have a 3G iPhone. After 3 hrs twe...   
1  @jessedee Know about @fludapp ? Awesome iPad/i...   
2  @swonderlin Can not wait for #iPad 2 also. The...   
3  @sxsw I hope this year's festival isn't as cra...   
4  @sxtxstate great stuff on Fri #SXSW: Marissa M...   

  emotion_in_tweet_is_directed_at  \
0                          iPhone   
1              iPad or iPhone App   
2                            iPad   
3              iPad or iPhone App   
4                          Google   

  is_there_an_emotion_directed_at_a_brand_or_product  
0                                   Negative emotion  
1                                   Positive emotion  
2                                   Positive emotion  
3                                   Negative emotion  
4                                   Positive emotion  

Data frames are pretty cool; for example, I can index a column by name.


In [3]:
tweets = df['tweet_text'] # sets tweets to be the first column, titled 'tweet_text'
print(tweets.head())


0    .@wesley83 I have a 3G iPhone. After 3 hrs twe...
1    @jessedee Know about @fludapp ? Awesome iPad/i...
2    @swonderlin Can not wait for #iPad 2 also. The...
3    @sxsw I hope this year's festival isn't as cra...
4    @sxtxstate great stuff on Fri #SXSW: Marissa M...
Name: tweet_text, dtype: object

Check for understanding

Some questions that I immediately asked myself (and you should too):

  1. How many rows are in our data set?
  2. How many different types of labels are there? What are they?
  3. What year was this data collected?

If you were my student and you were sitting in front of me, I would make you actually do this. Unfortunately I can't force you to answer these questions yourself, but you will have more fun and learn more if you do.

You will probably need to google around a little to figure out how to use the dataframe to answer these questions. You can check out the cool pandas tutorial at https://pandas.pydata.org/pandas-docs/stable/10min.html - it will be useful for many things besides this tutorial!

Question 1: How many rows are in the dataset?


In [19]:
print(tweets.shape) # print the shape of the variable tweets


(9093,)

Looks like there are 9093 rows in our dataset.
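
If you want to double-check against the whole data frame rather than just the tweet column, here is a minimal sketch (it assumes the df variable from the loading cell above):


In [ ]:
# Sketch: the shape of the whole data frame, not just the tweet column
print(df.shape)    # prints (number of rows, number of columns)
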

Question 2: How many different types of labels are there? What are they?


In [12]:
# we make target the list of labels from the third column
target = df['is_there_an_emotion_directed_at_a_brand_or_product']

# describe is a cool function for quick data exploration
target.describe()


Out[12]:
count                                   9093
unique                                     4
top       No emotion toward brand or product
freq                                    5389
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: object

Hmmm... looks like there are 4 values for the sentiment of the tweets, with "No emotion toward brand or product" being the most common.


In [26]:
target.value_counts()


Out[26]:
No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

Interesting - there is an "I can't tell" label along with "Positive emotion", "Negative emotion", and "No emotion toward brand or product".
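
Since we are going to train a classifier on these labels, it is worth noticing how unbalanced they are. Here is a quick sketch that shows each label's share of the data (it assumes the target variable defined above):


In [ ]:
# Sketch: the fraction of tweets carrying each label, to gauge class imbalance
print(target.value_counts(normalize=True))
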

Question 3: What year was this data collected?


In [27]:
tweets[0]


Out[27]:
'.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.'

Hm, it's a 3G iPhone - when was that? 2010?


In [42]:
tweets[200]


Out[42]:
"rt ' It's 4 p.m. and the #iPad2 line at the Apple store is longer and wider – about 250 people! Only one more hour. ' #sxsw"

Ok - the iPad 2 was released in 2011, so these tweets must be from 2011.
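
If you want a rough sanity check on that guess, you can count how many tweets mention the iPad 2 by name. A minimal sketch (the regex matches "iPad 2" or "iPad2", and na=False skips the empty tweets we will deal with below):


In [ ]:
# Sketch: how many tweets mention "iPad 2" or "iPad2" (case-insensitive);
# na=False makes str.contains skip missing tweets
print(tweets.str.contains('ipad ?2', case=False, na=False).sum())
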

Data Cleanup

If we dig into the data set, one thing we'll notice is that some of the tweets are actually empty.


In [5]:
print(tweets[6])


nan
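
Before dropping anything, it is worth knowing how many tweets are missing in total. A minimal sketch (it assumes the tweets series from above):


In [ ]:
# Sketch: count how many entries in the tweet column are missing
print(pd.isnull(tweets).sum())
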

It is best practice not to change the input data file; it's better to clearly show in your code how you've modified the data. In this case, we can use pandas to easily filter out the rows where the tweets are empty. Here we are indexing into the tweets series with the result of the pd.notnull function - this notation is really convenient.


In [8]:
fixed_tweets = tweets[pd.notnull(tweets)]

We also need to remove the same rows of labels so that our "tweets" and "target" lists have the same length.


In [14]:
fixed_target = target[pd.notnull(tweets)]

Take a second to think about why I wrote fixed_target = target[pd.notnull(tweets)] instead of fixed_target = target[pd.notnull(target)].
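
One quick way to convince yourself that the filtering worked is to check that the cleaned tweets and the cleaned labels still have the same length. A minimal sketch (it assumes fixed_tweets and fixed_target from the cells above):


In [ ]:
# Sketch: the cleaned tweets and the cleaned labels should have the same number of rows
print(len(fixed_tweets), len(fixed_target))
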

Key Takeaways

  1. The most important thing to do when building a machine learning model is to actually look at your data.
  2. Clean up your data in code, not in the original file

Questions

  1. How messy is this data? It was labeled by humans - how many mislabels?
  2. Why is there an "I can't tell" label - what kind of tweets get that?
  3. Are all the tweets in English?

In [ ]: