Bag of Words Meets Bags of Popcorn

A tutorial on text mining and NLP

Please first download the data from here: https://www.kaggle.com/c/word2vec-nlp-tutorial/data

Let's first import all the libraries we will need:


In [1]:
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer sklearn versions
from os.path import join
from bs4 import BeautifulSoup

If you are missing bs4 (BeautifulSoup), lxml, or nltk you can install them via:

pip install beautifulsoup4 lxml
pip install nltk
python -m nltk.downloader stopwords

(python -m nltk.downloader all also works, but we only need the stopwords corpus for this tutorial.)

Set up an I/O directory and put your downloaded data there; we will call this root_dir in what follows.

Let's now load the data:

(make sure you change the root_dir to your own path)

In [2]:
root_dir = '/Users/arman/kaggledata/popcorn'

dfTrain = pd.read_csv(join(root_dir, 'labeledTrainData.tsv'), header=0,
                      delimiter='\t', quoting=3)   # quoting=3 is csv.QUOTE_NONE

dfTest = pd.read_csv(join(root_dir, 'testData.tsv'), header=0,
                     delimiter='\t', quoting=3)

Let's take a quick look at the data:


In [3]:
dfTrain.head(5)


Out[3]:
id sentiment review
0 "5814_8" 1 "With all this stuff going down at the moment ...
1 "2381_9" 1 "\"The Classic War of the Worlds\" by Timothy ...
2 "7759_3" 0 "The film starts with a manager (Nicholas Bell...
3 "3630_4" 0 "It must be assumed that those who praised thi...
4 "9495_8" 1 "Superbly trashy and wondrously unpretentious ...

In [4]:
dfTest.head(5)


Out[4]:
id review
0 "12311_10" "Naturally in a film who's main themes are of ...
1 "8348_2" "This movie is a disaster within a disaster fi...
2 "5828_4" "All in all, this is a movie for kids. We saw ...
3 "7186_2" "Afraid of the Dark left me with the impressio...
4 "12128_7" "A very accurate depiction of small time mob l...

In particular, note that the review column contains some HTML tags:


In [5]:
dfTrain['review'][11]


Out[5]:
'"Although I generally do not like remakes believing that remakes are waste of time; this film is an exception. I didn\'t actually know so far until reading the previous comment that this was a remake, so my opinion is purely about the actual film and not a comparison.<br /><br />The story and the way it is written is no question: it is Capote. There is no need for more words.<br /><br />The play of Anthony Edwards and Eric Roberts is superb. I have seen some movies with them, each in one or the other. I was certain that they are good actors and in case of Eric I always wondered why his sister is the number 1 famous star and not her brother. This time this certainty is raised to fact, no question. His play, just as well as the play of Mr. Edwards is clearly the top of all their profession.<br /><br />I recommend this film to be on your top 50 films to see and keep on your DVD shelves."'

Our target is the sentiment column; we will train on it and then predict sentiment for the test set:


In [6]:
target = dfTrain['sentiment']

Now we need some sort of "cleaning" process: we use the BeautifulSoup library to extract the text content from the HTML and then eliminate all non-alphabetic characters. Let's put everything together in a function:


In [7]:
def review_to_wordlist(review, remove_stopwords=False, split=False):
    """
    Simple text-cleaning function:
    - uses BeautifulSoup to extract the text content from HTML
    - removes all non-alphabetic characters
    - converts to lower case
    - can remove stopwords
    - can perform simple tokenization by splitting on whitespace
    """
    # strip the HTML tags and keep only the text
    review_text = BeautifulSoup(review, 'lxml').get_text()

    # keep letters only
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    words = review_text.lower().split()

    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    if split:
        return words
    else:
        return ' '.join(words)

Before proceeding, let's test what our function does on the review example above:


In [8]:
review_to_wordlist(dfTrain['review'][11])


Out[8]:
'although i generally do not like remakes believing that remakes are waste of time this film is an exception i didn t actually know so far until reading the previous comment that this was a remake so my opinion is purely about the actual film and not a comparison the story and the way it is written is no question it is capote there is no need for more words the play of anthony edwards and eric roberts is superb i have seen some movies with them each in one or the other i was certain that they are good actors and in case of eric i always wondered why his sister is the number famous star and not her brother this time this certainty is raised to fact no question his play just as well as the play of mr edwards is clearly the top of all their profession i recommend this film to be on your top films to see and keep on your dvd shelves'

and with the remove_stopwords flag on, it will give us:


In [9]:
review_to_wordlist(dfTrain['review'][11],remove_stopwords=True)


Out[9]:
'although generally like remakes believing remakes waste time film exception actually know far reading previous comment remake opinion purely actual film comparison story way written question capote need words play anthony edwards eric roberts superb seen movies one certain good actors case eric always wondered sister number famous star brother time certainty raised fact question play well play mr edwards clearly top profession recommend film top films see keep dvd shelves'

and with the split flag on, it actually performs a simple tokenization:


In [10]:
token = review_to_wordlist(dfTrain['review'][11],remove_stopwords=True, split=True)
print(token)


['although', 'generally', 'like', 'remakes', 'believing', 'remakes', 'waste', 'time', 'film', 'exception', 'actually', 'know', 'far', 'reading', 'previous', 'comment', 'remake', 'opinion', 'purely', 'actual', 'film', 'comparison', 'story', 'way', 'written', 'question', 'capote', 'need', 'words', 'play', 'anthony', 'edwards', 'eric', 'roberts', 'superb', 'seen', 'movies', 'one', 'certain', 'good', 'actors', 'case', 'eric', 'always', 'wondered', 'sister', 'number', 'famous', 'star', 'brother', 'time', 'certainty', 'raised', 'fact', 'question', 'play', 'well', 'play', 'mr', 'edwards', 'clearly', 'top', 'profession', 'recommend', 'film', 'top', 'films', 'see', 'keep', 'dvd', 'shelves']

Notice the words

reading, purely, written, raised, films, clearly

which would all benefit from stemming (a quick preview of that is sketched below); but for now let's continue with what we have.
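As an aside, here is a minimal sketch of what stemming could look like using NLTK's Snowball stemmer (one possible choice), applied to the token list from In [10]:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
print([stemmer.stem(w) for w in token][:12])
# suffixes get chopped off, e.g. 'reading' -> 'read' and 'purely' -> 'pure'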

Let's now apply our cleaning process to the review columns:


In [11]:
dfTrain['review'] = dfTrain['review'].map(review_to_wordlist)
dfTest['review'] = dfTest['review'].map(review_to_wordlist)
train_len = len(dfTrain)

Our corpus is all of the reviews:


In [12]:
corpus = list(dfTrain['review']) + list(dfTest['review'])

Now let's use sklearn's tf-idf vectorizer with unigrams and bigrams, and a logarithmic TF scaling (sublinear_tf=True, which replaces a raw term count tf with 1 + log(tf)).
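A tiny sketch of what that sublinear scaling does to raw term counts:

import numpy as np

tf = np.array([1, 2, 10, 100])
print(1 + np.log(tf))   # -> roughly [1.  1.69  3.30  5.61]: a 100x count weighs only ~5.6x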

Note that we can also remove the stopwords here, via the stop_words argument:


In [13]:
tfv = TfidfVectorizer(min_df=3, max_features=None, ngram_range=(1, 2),
                      use_idf=True, smooth_idf=True, sublinear_tf=True,
                      stop_words='english')

tfv.fit(corpus)


Out[13]:
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=True,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

We can now use the tfv object to build the tf-idf vector-space representation of the reviews; the transformation returns a sparse scipy matrix.

Note: the following can take up to a minute.

In [14]:
X_all = tfv.transform(corpus)

Notice the shape of the X_all matrix:


In [15]:
print(X_all.shape)


(50000, 302723)

So it created about 300K numerical features! (the number of unique unigrams plus the number of unique bigrams that occur in at least min_df=3 reviews of the corpus)

It is highly sparse though, which lets us use scipy's sparse matrix representation and keep everything in RAM!
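A quick sketch to check just how sparse it is (nnz is the number of stored non-zero entries of a scipy sparse matrix):

nnz = X_all.nnz
density = nnz / (X_all.shape[0] * X_all.shape[1])
print('density: {:.4%}'.format(density))   # typically well under 1%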

Now let's split the X_all matrix back to our train and test set:


In [16]:
train = X_all[:train_len]
test = X_all[train_len:]

We now fit a Logistic Regression model to the numerical features. (LR is quite safe to use with such a high number of features; to use tree-based models we would definitely need feature selection first.)

Let's perform a simple 5-fold cross-validation using the AUC score, and also tune one of the parameters of the LR model: the inverse regularization strength C (larger C means weaker regularization).


In [17]:
Cs = [1, 3, 10, 30, 100, 300]
for c in Cs:
    # dual=True (efficient when n_features >> n_samples) requires the liblinear solver
    clf = LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=c,
                             fit_intercept=True, intercept_scaling=1.0,
                             class_weight=None, random_state=None,
                             solver='liblinear')

    print("c:", c, "   score:", np.mean(cross_val_score(clf, train, target,
                                                        cv=5, scoring='roc_auc')))


c: 1    score: 0.956977312
c: 3    score: 0.961362944
c: 10    score: 0.962991712
c: 30    score: 0.963238496
c: 100    score: 0.96315872
c: 300    score: 0.96300336

Our CV experiment suggests that C = 30 is the best choice, so let's now fit our best model to the entire train set:


In [18]:
clf = LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=30,
                         fit_intercept=True, intercept_scaling=1.0,
                         class_weight=None, random_state=None,
                         solver='liblinear')

clf.fit(train, target)


Out[18]:
LogisticRegression(C=30, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1.0, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

and finally, predict for the test set and store the results:


In [19]:
preds = clf.predict_proba(test)[:, 1]   # probability of the positive class
dfOut = pd.DataFrame(data={"id": dfTest["id"], "sentiment": preds})
dfOut.to_csv(join(root_dir, 'submission.csv'), index=False, quoting=3)

If you submit the output file, you should get a leaderboard (LB) score of 0.95687, which is far better than the word2vec benchmark's ~0.88.

So what else can be done to improve the score?

  • Stemming
  • Better tokenization ("!" has sentimental value!); see the sketch after this list
  • Dimensionality reduction and building new features, for example finding lists of positive- and negative-sentiment words (see: [1] [2]) and using their cosine similarity to each review
  • Feature selection (to be used for tree-based models); for example, see the recursive feature elimination tools in sklearn: http://scikit-learn.org/stable/modules/feature_selection.html
  • Ensembling these results with other models (random forest, SVM, AdaBoost, xgboost, etc.); see Kaggle's ensembling guide: http://mlwave.com/kaggle-ensembling-guide/
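For the tokenization bullet, here is a minimal sketch of one way to keep '!' as a token, using scikit-learn's token_pattern parameter; it assumes the cleaning step no longer strips punctuation (i.e. you would skip the letters-only re.sub in review_to_wordlist):

# tokens are either words of 2+ characters or a bare '!'
tfv_punct = TfidfVectorizer(min_df=3, ngram_range=(1, 2), sublinear_tf=True,
                            stop_words='english',
                            token_pattern=r"(?u)\b\w\w+\b|!")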
