A useful application of machine learning is "sentiment analysis". Here we are trying to determine whether a person feels positively or negatively about the thing they are writing about. One important application of sentiment analysis is for marketing departments to understand what people are saying about them on social media. Nearly every medium or large company with any sort of social media presence does some kind of sentiment analysis like the task we are about to do.
Here we have a collection of tweets from the tech conference SXSW talking about Apple brands. These tweets were hand-labeled by humans using CrowdFlower, a tool I built. Our goal is to build a classifier that can generalize the human labels to more tweets.
The labels are what's known as training data, and we're going to use it to teach our classifier what text is positive sentiment and what text is negative sentiment.
Let's take a look at our data. Machine learning classes tend to talk mostly about algorithms, but in practice, machine learning practitioners usually spend most of their time looking at their data.
This is a real data set, not a toy one, and I've left it uncleaned, so you will have to work through a few of the messy issues that almost always come up in the real world.
In [2]:
# Our data file is in ../scikit/tweets.csv
# in a Comma Separated Values format
# this command uses the shell to print out the first ten lines
!head ../scikit/tweets.csv
In [2]:
import pandas as pd # this loads the pandas library, a very useful data exploration library
import numpy as np # this loads numpy, a very useful numerical computing library
# Puts tweets into a data frame
df = pd.read_csv('../scikit/tweets.csv') # read the file into a pandas data frame
print(df.head()) # print the first few rows of the data frame
Data frames are pretty cool, for example I can index the column by name.
In [3]:
tweets = df['tweet_text'] # sets tweets to be the first column, titled 'tweet_text'
print(tweets.head())
Some questions that I immediately asked myself (and you should too)
If you were my student and you were sitting in front of me, I would make you actually do this. Unfortunately I can't force you to answer these questions yourself, but you will have more fun and learn more if you do.
You will probably need to google around a little to figure out how to use the dataframe to answer these questions. You can check out the cool pandas tutorial at https://pandas.pydata.org/pandas-docs/stable/10min.html - it will be useful for many things besides this tutorial!
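To get you started, here is a minimal sketch of a few commands that answer the kind of first questions you should be asking of any data frame. The real file has the two columns you have already seen (tweet_text and the long label column); the toy frame below is an assumption standing in for it so the snippet is self-contained.

```python
import pandas as pd

# Toy stand-in for tweets.csv: same two columns, just three rows
# (the real file has thousands of rows)
toy = pd.DataFrame({
    'tweet_text': ['I love my iPad', None, 'Long lines at #sxsw'],
    'is_there_an_emotion_directed_at_a_brand_or_product': [
        'Positive emotion',
        'No emotion toward brand or product',
        "I can't tell",
    ],
})

print(toy.shape)           # (rows, columns)
print(list(toy.columns))   # the column names
print(toy.dtypes)          # the type of each column
print(toy.isnull().sum())  # how many missing values per column
```

The same four commands run unchanged on the real df, and together they tell you how big the data is, what's in it, and whether anything is missing.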
In [19]:
print(tweets.shape) # print the shape of the variable tweets
Looks like there are 9093 rows in our dataset
In [12]:
# set target to the column of labels (the third column)
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
# describe is a cool function for quick data exploration
target.describe()
Out[12]:
Hmmm... looks like there are 4 values for the sentiment of the tweets with "No emotion toward brand or product" being the most common.
In [26]:
target.value_counts()
Out[26]:
Interesting, there is a label "I can't tell" along with "Positive emotion", "Negative emotion" and "No emotion toward brand or product"
In [27]:
tweets[0]
Out[27]:
Hm, it's a 3G iPhone, when was that? 2010?
In [42]:
tweets[200]
Out[42]:
In [5]:
print(tweets[6])
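Before dropping anything, it's worth counting how many tweets are missing. Here is a sketch on a toy Series (an assumption standing in for the real tweets column, which appears to have at least one missing entry, as tweets[6] suggests):

```python
import pandas as pd

# Toy stand-in for the real tweets column, with one missing entry
tweets = pd.Series(['Great keynote!', None, 'Long lines at #sxsw'])

# pd.isnull gives a boolean Series; summing it counts the True values
n_missing = pd.isnull(tweets).sum()
print(n_missing)  # number of empty tweets
```

Run the same two lines on the real tweets Series to see how many rows you are about to lose.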
It is best practice to not change the input data. It's better to clearly show the ways that you've modified your data in your code. In this case, we can use pandas to easily pull out the rows where the tweets are empty. Here we are indexing into our data frame with the results of a pd.notnull function - this notation is really convenient.
In [8]:
fixed_tweets = tweets[pd.notnull(tweets)]
We also need to remove the same rows of labels so that our "tweets" and "target" lists have the same length.
In [14]:
fixed_target = target[pd.notnull(tweets)]
Take a second to think about why I wrote fixed_target = target[pd.notnull(tweets)] instead of fixed_target = target[pd.notnull(target)]
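The key is that a row with a missing tweet may still carry a perfectly valid label, so filtering target by its own nulls would remove nothing and leave the two series out of sync. A small sketch on toy data (an assumption, mirroring the real columns) makes this concrete:

```python
import pandas as pd

# Toy data: row 1 has a missing tweet but a valid label
tweets = pd.Series(['Love the iPad', None, 'Battery died'])
target = pd.Series(['Positive emotion',
                    'No emotion toward brand or product',
                    'Negative emotion'])

# Filtering target by its OWN nulls drops nothing -- no labels are missing
print(len(target[pd.notnull(target)]))  # 3: still out of sync with tweets

# Filtering both by the tweets mask keeps the two series aligned
mask = pd.notnull(tweets)
fixed_tweets = tweets[mask]
fixed_target = target[mask]
print(len(fixed_tweets), len(fixed_target))  # 2 2
```

Because both series share the same index, applying the same boolean mask to each guarantees that row i of fixed_tweets still lines up with row i of fixed_target.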
In [ ]: