Exercise 17

Analyze how travelers expressed their feelings on Twitter

The data comes from a sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped in February 2015, and contributors were asked first to classify each tweet as positive, negative, or neutral, and then to categorize the reason for each negative tweet (such as "late flight" or "rude service").



In [1]:

    
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

# read the data and use the first column (tweet_id) as the index
tweets = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Tweets.zip', index_col=0)

tweets.head()









    Out[1]:







  
    
      
	airline_sentiment	airline_sentiment_confidence	negativereason	negativereason_confidence	airline	airline_sentiment_gold	name	negativereason_gold	retweet_count	text	tweet_coord	tweet_created	tweet_location	user_timezone
tweet_id
570306133677760513	neutral	1.0000	NaN	NaN	Virgin America	NaN	cairdin	NaN	0	@VirginAmerica What @dhepburn said.	NaN	2015-02-24 11:35:52 -0800	NaN	Eastern Time (US & Canada)
570301130888122368	positive	0.3486	NaN	0.0000	Virgin America	NaN	jnardino	NaN	0	@VirginAmerica plus you've added commercials t...	NaN	2015-02-24 11:15:59 -0800	NaN	Pacific Time (US & Canada)
570301083672813571	neutral	0.6837	NaN	NaN	Virgin America	NaN	yvonnalynn	NaN	0	@VirginAmerica I didn't today... Must mean I n...	NaN	2015-02-24 11:15:48 -0800	Lets Play	Central Time (US & Canada)
570301031407624196	negative	1.0000	Bad Flight	0.7033	Virgin America	NaN	jnardino	NaN	0	@VirginAmerica it's really aggressive to blast...	NaN	2015-02-24 11:15:36 -0800	NaN	Pacific Time (US & Canada)
570300817074462722	negative	1.0000	Can't Tell	1.0000	Virgin America	NaN	jnardino	NaN	0	@VirginAmerica and it's a really big bad thing...	NaN	2015-02-24 11:14:45 -0800	NaN	Pacific Time (US & Canada)



In [5]:

    
tweets.shape









    Out[5]:





(14640, 14)

Number of tweets with each sentiment



In [6]:

    
tweets['airline_sentiment'].value_counts()









    Out[6]:





negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
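The counts above can be turned into proportions by dividing by the total. The short sketch below rebuilds the counts from the Out[6] output so that it runs on its own; on the real data, `tweets['airline_sentiment'].value_counts(normalize=True)` should give the same numbers.

```python
import pandas as pd

# Counts copied from the Out[6] output above
counts = pd.Series({"negative": 9178, "neutral": 3099, "positive": 2363})

# Share of each sentiment class in the 14640 tweets
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly 63% of the tweets are negative, so the classes are imbalanced; a model that always predicts "negative" would already reach about that accuracy.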

Number of tweets per airline



In [7]:

    
tweets['airline'].value_counts()









    Out[7]:





United            3822
US Airways        2913
American          2759
Southwest         2420
Delta             2222
Virgin America     504
Name: airline, dtype: int64



In [11]:

    
# tweets["airline"] is already a Series, so no pd.Series wrapper is needed
tweets["airline"].value_counts().plot(kind="bar", figsize=(8, 6), rot=0)









    Out[11]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f395cfa79e8>



In [12]:

    
# stacked bar chart: one bar per airline, split by sentiment
pd.crosstab(index=tweets["airline"], columns=tweets["airline_sentiment"]).plot(
    kind="bar", figsize=(10, 6), alpha=0.5, rot=0, stacked=True,
    title="Sentiment by airline")









    Out[12]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f395dd457f0>
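Because the airlines have very different tweet volumes, the stacked counts are hard to compare. One option, sketched here with a hypothetical helper name, is to normalize each row of the crosstab so that every airline sums to 1.

```python
import pandas as pd


def sentiment_share_by_airline(df):
    """Share of each sentiment within each airline (every row sums to 1).

    Hypothetical helper; it assumes the columns 'airline' and
    'airline_sentiment' shown in the table above.
    """
    return pd.crosstab(
        index=df["airline"],
        columns=df["airline_sentiment"],
        normalize="index",  # divide each row by its row total
    )
```

Calling `sentiment_share_by_airline(tweets).plot(kind='bar', stacked=True, rot=0)` would draw the same chart as above with bars of equal height.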

Exercise 17.1

Predict the sentiment using CountVectorizer

Use a Random Forest classifier.



In [32]:

    
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer



In [18]:

    
X = tweets['text']
y = tweets['airline_sentiment'].map({'negative':-1,'neutral':0,'positive':1})
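One possible starting point for this exercise is sketched below; it is not the reference solution. The helper name `count_rf_accuracy`, the 5-fold cross-validation, `n_estimators=100`, and `random_state=42` are all assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def count_rf_accuracy(texts, labels, cv=5, **vectorizer_kwargs):
    """Mean cross-validated accuracy of CountVectorizer + Random Forest.

    Hypothetical helper: cv, n_estimators and random_state are
    illustrative choices, not part of the exercise statement.
    """
    model = make_pipeline(
        CountVectorizer(**vectorizer_kwargs),
        RandomForestClassifier(n_estimators=100, random_state=42),
    )
    # The pipeline refits the vectorizer inside each fold, so the
    # vocabulary is learned from the training part only.
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    return scores.mean()
```

With the variables defined above, `count_rf_accuracy(X, y)` would return a baseline accuracy to compare against in the following exercises.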

Exercise 17.2

Remove stopwords, then predict the sentiment using CountVectorizer.

Use a Random Forest classifier.



In [ ]:
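A minimal sketch for this exercise, assuming scikit-learn's built-in English stop word list (`stop_words="english"`) is acceptable here. The helper names and model settings are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def vocabulary_size(texts, **vectorizer_kwargs):
    """Number of distinct tokens CountVectorizer keeps for `texts`."""
    vectorizer = CountVectorizer(**vectorizer_kwargs).fit(texts)
    return len(vectorizer.vocabulary_)


def stopword_rf_accuracy(texts, labels, cv=5):
    """Mean CV accuracy with English stop words removed (illustrative settings)."""
    model = make_pipeline(
        CountVectorizer(stop_words="english"),
        RandomForestClassifier(n_estimators=100, random_state=42),
    )
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    return scores.mean()
```

Comparing `vocabulary_size(X)` with `vocabulary_size(X, stop_words="english")` shows how many features the stop list removes. The built-in list is generic, so it is worth checking whether words that matter for sentiment are being dropped.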

Exercise 17.3

Increase the n-gram range (with and without stopwords), then predict the sentiment using CountVectorizer.

Use a Random Forest classifier.



In [ ]:
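One way to organize the comparison is to loop over `ngram_range` and `stop_words` settings and collect the scores. The sketch below assumes n-gram ranges of the form `(1, n)`; the helper name, the default values of `n`, and the model settings are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def compare_ngram_settings(texts, labels, max_n_values=(1, 2, 3), cv=5):
    """Mean CV accuracy for each (max_n, stop_words) combination.

    Hypothetical helper: each setting uses ngram_range=(1, max_n),
    once with all words and once with English stop words removed.
    """
    results = {}
    for max_n in max_n_values:
        for stop_words in (None, "english"):
            model = make_pipeline(
                CountVectorizer(ngram_range=(1, max_n), stop_words=stop_words),
                RandomForestClassifier(n_estimators=100, random_state=42),
            )
            scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
            results[(max_n, stop_words)] = scores.mean()
    return results
```

`compare_ngram_settings(X, y)` returns a dictionary that can be wrapped in `pd.Series(...)` for a quick side-by-side view. Larger n-gram ranges increase the number of features considerably, so each fit takes longer.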

Exercise 17.4

Predict the sentiment using TfidfVectorizer.

Use a Random Forest classifier.



In [ ]:
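A possible sketch for this exercise swaps the raw counts for TF-IDF weights and keeps everything else fixed, so the score is comparable with the CountVectorizer runs. As before, the helper name and the model settings are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def tfidf_rf_accuracy(texts, labels, cv=5, **vectorizer_kwargs):
    """Mean cross-validated accuracy of TfidfVectorizer + Random Forest.

    Hypothetical helper; extra keyword arguments (for example
    stop_words or ngram_range) are passed on to TfidfVectorizer.
    """
    model = make_pipeline(
        TfidfVectorizer(**vectorizer_kwargs),
        RandomForestClassifier(n_estimators=100, random_state=42),
    )
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    return scores.mean()
```

For example, `tfidf_rf_accuracy(X, y, stop_words="english", ngram_range=(1, 2))` would combine the settings explored in the earlier exercises.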
