CS109: IMDB review data set

We gathered our data from Andrew L. Maas of Stanford University: the Large Movie Review Dataset, a collection of 50,000 polarized movie reviews scraped from IMDB.com. The data comes pre-split into train and test sets, with each review stored as a text file in a positive or negative subdirectory. Since the creators of the original data set were not interested in predicting box office scores, they did not save the names of the movies, only the IMDB.com URLs that the reviews were scraped from. We therefore had to go back to all of those URLs and scrape the movie names from the tops of the pages.


In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
from bs4 import BeautifulSoup
import requests
import csv
import os
import random
import sys
import json
sys.path.insert(0, '/aclImdb/')
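
Before scraping anything, we can verify that the aclImdb directory layout described above is in place. This is a minimal, optional check; the paths follow the data set's standard distribution (review text files under aclImdb/{train,test}/{pos,neg}/ and URL lists at aclImdb/{train,test}/urls_{pos,neg}.txt).


In [ ]:
# Confirm the expected directory layout of the aclImdb data set.
for split in ('train', 'test'):
    for label in ('pos', 'neg'):
        review_dir = os.path.join('aclImdb', split, label)
        url_file = os.path.join('aclImdb', split, 'urls_%s.txt' % label)
        print split, label, os.path.isdir(review_dir), os.path.isfile(url_file)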

Below we write a function that scrapes an IMDB URL and returns the corresponding movie name.


In [2]:
# function to get the name of a movie from its IMDB URL
def get_movie(url):
    '''
    Scrapes a given URL from IMDB.com. The URL's page contains many reviews for one particular movie.
    This function returns the name of that movie.
    '''
    # Keep asking for the page until the request succeeds, sleeping between attempts.
    pageText = None
    while pageText is None:
        try:
            pageText = requests.get(url)
        except requests.exceptions.RequestException:
            time.sleep(5)
    soup = BeautifulSoup(pageText.text, "html.parser")
    # Some of our URLs have expired! Return None if the title element can't be found.
    title_div = soup.find("div", attrs={"id": "tn15title"})
    if title_div is None or title_div.find("a") is None:
        return None
    return title_div.find("a").get_text()

Now let's get the list of URLs for each of our data sets: both positive and negative for train and test.


In [3]:
# get all urls for train and test, neg and pos
with open('aclImdb/train/urls_pos.txt','r') as f:
    train_pos_urls = f.readlines()
    
with open('aclImdb/train/urls_neg.txt','r') as f:
    train_neg_urls = f.readlines()

with open('aclImdb/test/urls_pos.txt','r') as f:
    test_pos_urls = f.readlines()
    
with open('aclImdb/test/urls_neg.txt','r') as f:
    test_neg_urls = f.readlines()
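
Before scraping every URL, we can spot-check get_movie on a single one. This is purely illustrative: it uses the first URL from the training-positive list (with the trailing newline from readlines() removed), and that page may have expired, in which case get_movie returns None.


In [ ]:
# Illustrative spot check: scrape the movie name for the first training-positive URL.
print get_movie(train_pos_urls[0].strip())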

Let's see how long each list is.


In [4]:
print len(train_pos_urls), len(train_neg_urls), len(test_pos_urls), len(test_neg_urls)


12500 12500 12500 12500

There are 12,500 reviews in each of these four subsets, and each review has a corresponding URL. However, the URL lists contain duplicates, since two reviews can be for the same movie and thus come from the same IMDB page.
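
We can see the duplication directly by counting the unique URLs (i.e. unique movie pages) in each list:


In [ ]:
# Count the unique URLs in each list of 12,500 review URLs.
print len(set(train_pos_urls)), len(set(train_neg_urls)), len(set(test_pos_urls)), len(set(test_neg_urls))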

We would like to save the URLs and their associated movies into a dictionary for later use. This way we can do all the scraping up front. Let's define a function which does this scraping for a given set of URLs.


In [5]:
def make_url_dict(url_list):
    '''
    Input: list of URLs.
    Output: dictionary mapping each URL to the movie name scraped from it (None if the page has expired).
    '''
    url_dict = dict(zip(url_list, [None]*len(url_list)))
    index = 0
    for url in url_list:
        if url_dict[url] is None:
            url_dict[url] = get_movie(url)
        # Every once in a while, report how many URLs we have digested out of 12,500 total.
        if random.random() < 0.001:
            print index
        index += 1
        time.sleep(0.001)
    return url_dict

Let's make a dictionary of stored movie names for each sub-data set, saving it to a JSON file so we only have to do this scraping once.


In [ ]:
%%time
train_pos_dict = make_url_dict(train_pos_urls)
fp = open("url_movie_train_pos.json","w")
json.dump(train_pos_dict, fp)
fp.close()

If we did this right for the training positives, the number of keys in the dictionary should equal the number of unique URLs in its URL list.


In [ ]:
print len(train_pos_dict.keys()), len(set(train_pos_urls))

In [ ]:
%%time
train_neg_dict = make_url_dict(train_neg_urls)
fp = open("url_movie_train_neg.json","w")
json.dump(train_neg_dict, fp)
fp.close()

In [ ]:
%%time
test_pos_dict = make_url_dict(test_pos_urls)
fp = open("url_movie_test_pos.json","w")
json.dump(test_pos_dict, fp)
fp.close()

In [ ]:
%%time
test_neg_dict = make_url_dict(test_neg_urls)
fp = open("url_movie_test_neg.json","w")
json.dump(test_neg_dict, fp)
fp.close()

In [ ]:
# Reload
with open("url_movie_tr_pos.json", "r") as fd:
    train_pos_dict = json.load(fd)
with open("url_movie_train_neg.json", "r") as fd:
    train_neg_dict = json.load(fd)
with open("url_movie_test_pos.json", "r") as fd:
    test_pos_dict = json.load(fd)
with open("url_movie_test_neg.json", "r") as fd:
    test_neg_dict = json.load(fd)
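
As a quick spot check after reloading (assuming the JSON files above were written successfully), every URL in a list should appear as a key of the corresponding dictionary:


In [ ]:
# The reloaded dictionary should have one key per unique training-positive URL,
# and every URL in the list should be present (its value may be None for an expired page).
print len(train_pos_dict), len(set(train_pos_urls))
print all(url in train_pos_dict for url in train_pos_urls)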

Now that we have saved the movie name associated with each URL, we can finally create our data table of reviews. We will define a function, data_collect, which iterates over one of our review directories and builds a pandas DataFrame out of all the reviews in a particular category (e.g. test set, positive reviews).


In [ ]:
def data_collect(directory, pos, url_dict, url_list):
    '''
    Inputs:
        directory: directory to collect reviews from, e.g. 'aclImdb/train/pos/'
        pos: True or False, depending on whether the reviews are labelled positive or not.
        url_dict: the relevant URL-movie dictionary (created above) for the particular category
        url_list: the list of URLs for that particular category
    '''
    # Column names for the data frame
    review_df = pd.DataFrame(columns=['movie_id', 'stars', 'positive', 'text', 'url', 'movie_name'])
    # Crawl over the directory, collecting the relevant data from each of the .txt review files.
    review_names = list(os.walk(directory))[0][2]
    for review in review_names:
        # Andrew L. Maas's Stanford group encoded the review ID and the number of stars in the file name.
        # For example, "0_10.txt" means review ID 0 received 10 stars. The reviews are in the same order as
        # the URLs, so the review ID is precisely the index of that movie's URL in the respective URL list.
        stars = int(review.split("_")[1].split(".")[0])
        movieID = int(review.split("_")[0])  # everything before the underscore
        with open(os.path.join(directory, review), 'r') as fp:
            text = fp.read()
        url = url_list[movieID]
        movie_name = url_dict[url]
        reviewDict = {'movie_id': [movieID], 'stars': [stars], 'positive': [pos], 'text': [text], 'url': [url], 'movie_name': [movie_name]}
        review_df = review_df.append(pd.DataFrame(reviewDict))
    return review_df

Data Collection

Now we are ready to collect all our data. Let's first collect the training data into a DataFrame.


In [ ]:
# First get the positive reviews for the train_df.
train_df = data_collect('aclImdb/train/pos/', True, train_pos_dict, train_pos_urls)
# Then append the negative reviews
train_df = train_df.append(data_collect('aclImdb/train/neg/', False, train_neg_dict, train_neg_urls))
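
As a sanity check, the assembled training frame should have 25,000 rows (12,500 positive plus 12,500 negative reviews) and the six columns defined in data_collect:


In [ ]:
print train_df.shape
train_df.head()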

Now we'll create a testing data frame.


In [ ]:
# First get the positive reviews for the test_df.
test_df = data_collect('aclImdb/test/pos/', True, test_pos_dict, test_pos_urls)
# Then append the negative reviews
test_df = test_df.append(data_collect('aclImdb/test/neg/', False, test_neg_dict, test_neg_urls))

Let's create a dictionary out of each dataframe so that we can save each in JSON format.


In [ ]:
train_df_dict = {feature: train_df[feature].values.tolist() for feature in train_df.columns.values}
test_df_dict = {feature: test_df[feature].values.tolist() for feature in test_df.columns.values}
# Train
fp = open("train_df_dict.json","w")
json.dump(train_df_dict, fp)
fp.close()
# Test
fp = open("test_df_dict.json","w")
json.dump(test_df_dict, fp)
fp.close()

Let's reopen.


In [ ]:
with open("train_df_dict.json", "r") as fd:
    train_df_dict = json.load(fd)
with open("test_df_dict.json", "r") as fd:
    test_df_dict = json.load(fd)
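
If needed, the data frames can be rebuilt directly from these dictionaries, since each one simply maps a column name to its list of values:


In [ ]:
# Reconstruct the data frames from their column-wise dictionaries.
train_df = pd.DataFrame(train_df_dict)
test_df = pd.DataFrame(test_df_dict)
print train_df.shape, test_df.shape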

Data Cleaning


In [ ]: