Case Study 3: Textual analysis of movie reviews

Due Date: April 6, 2016 5:59PM


TEAM Members:

Helen Hong, Haley Huang, Tom Meagher, Tyler Reese

Desired outcome of the case study.

  • In this case study we will look at movie reviews from the v2.0 polarity dataset, which comes from http://www.cs.cornell.edu/people/pabo/movie-review-data.
    • It contains written reviews of movies divided into positive and negative reviews.
  • As in Case Study 2, the idea is to analyze the data set, make conjectures, support or refute those conjectures with data, and tell a story about the data!

Required Readings:

Case study assumptions:

  • You have access to a python installation

Required Python libraries:

  • Numpy (www.numpy.org) (should already be installed from Case Study 2)
  • Matplotlib (matplotlib.org) (should already be installed from Case Study 2)
  • Scikit-learn (scikit-learn.org) (available from Enthought Canopy)
  • You are also welcome to use the Python Natural Language Processing Toolkit (www.nltk.org) (though it is not required).

NOTE

  • Please don't forget to save the notebook frequently when working in IPython Notebook; otherwise the changes you make may be lost.

Problem 1 (20 points): Complete Exercise 2: Sentiment Analysis on movie reviews from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

  • Assuming that you have downloaded the scikit-learn source code:
    • The data can be downloaded using doc/tutorial/text_analytics/data/movie_reviews/fetch_data.py
    • A skeleton for the solution can be found in doc/tutorial/text_analytics/skeletons/exercise_02_sentiment.py
    • A completed solution can be found in doc/tutorial/text_analytics/solutions/exercise_02_sentiment.py
  • It is ok to use the solution provided in the scikit-learn distribution as a starting place for your work.

Modify the solution to Exercise 2 so that it can run in this IPython notebook

Load the Data from Source


In [20]:
import os
import tarfile
from contextlib import closing
try:
    from urllib import urlopen
except ImportError:
    from urllib.request import urlopen


URL = ("http://www.cs.cornell.edu/people/pabo/"
       "movie-review-data/review_polarity.tar.gz")

ARCHIVE_NAME = URL.rsplit('/', 1)[1]
DATA_FOLDER = "txt_sentoken"


if not os.path.exists(DATA_FOLDER):

    if not os.path.exists(ARCHIVE_NAME):
        print("Downloading dataset from %s (3 MB)" % URL)
        opener = urlopen(URL)
        with open(ARCHIVE_NAME, 'wb') as archive:
            archive.write(opener.read())

    print("Decompressing %s" % ARCHIVE_NAME)
    with closing(tarfile.open(ARCHIVE_NAME, "r:gz")) as archive:
        archive.extractall(path='.')
    os.remove(ARCHIVE_NAME)
else:
    print("Dataset already exists")


Dataset already exists

Global Imports


In [5]:
import numpy as np
import pandas as pd
import matplotlib.pylab as py
import matplotlib.pyplot as plt
import scipy
from time import time

%matplotlib inline

Load data


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import load_files
from sklearn.cross_validation import train_test_split
from sklearn import metrics

dataset = load_files('txt_sentoken', shuffle=False)
print("n_samples: %d" % len(dataset.data))


n_samples: 2000

Split data into training (75%) and testing (25%) sets


In [7]:
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=None)

Build pipeline


In [8]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Vectorizer / classifier pipeline that filters out tokens that are too rare or too frequent
pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
    ('clf', LinearSVC(C=1000)),
])

In [9]:
# Find out whether unigrams or bigrams are more useful.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1)
# Fit pipeline on training set using grid search for the parameters
grid_search.fit(docs_train, y_train)

# Print cross-validated scores for each parameter set explored by the grid search
print(grid_search.grid_scores_)

# Predict outcome on testing set and store it in a variable named y_predicted
y_predicted = grid_search.predict(docs_test)

# Print classification report
print(metrics.classification_report(y_test, y_predicted,
                                    target_names=dataset.target_names))


[mean: 0.82200, std: 0.00748, params: {'vect__ngram_range': (1, 1)}, mean: 0.83533, std: 0.00984, params: {'vect__ngram_range': (1, 2)}]
             precision    recall  f1-score   support

        neg       0.90      0.88      0.89       247
        pos       0.88      0.91      0.89       253

avg / total       0.89      0.89      0.89       500


In [15]:
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)
plt.matshow(cm)
plt.colorbar()
plt.title('Confusion matrix')
plt.ylabel('True')
plt.xlabel('Predicted')
plt.show()


[[217  30]
 [ 24 229]]

Problem 2 (20 points): Explore the scikit-learn TfidfVectorizer class

Read the documentation for the TfidfVectorizer class at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.

  • Define the term frequency–inverse document frequency (TF-IDF) statistic (http://en.wikipedia.org/wiki/Tf%E2%80%93idf will likely help).
  • Run the TfidfVectorizer class on the training data above (docs_train).
  • Explore the min_df and max_df parameters of TfidfVectorizer. What do they mean? How do they change the features you get?
  • Explore the ngram_range parameter of TfidfVectorizer. What does it mean? How does it change the features you get? (Note, large values of ngram_range may take a long time to run!)

Parameters in the tf-idf Vectorizer

  • min_df: ignore terms that appear in fewer documents than this threshold (an absolute document count when given as an integer, a proportion of the documents when given as a float).
  • max_df: ignore terms that appear in more documents than this threshold; commonly used to filter out corpus-specific stop words.
  • ngram_range: if ngram_range = (m, M), build a vocabulary of ALL n-grams of length m through M. (A toy illustration of these three parameters follows below.)
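
For reference, scikit-learn computes tf-idf(t, d) = tf(t, d) * idf(t) where, with the default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n the number of documents and df(t) the number of documents containing term t. The toy sketch below (an illustration added for this writeup, not part of the assignment solution; the corpus is made up) shows how min_df, max_df, and ngram_range change the vocabulary:


In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["the movie was good",
            "the movie was bad",
            "a truly good film"]

# Default settings, then each parameter varied on its own
for params in [{}, {'min_df': 2}, {'max_df': 0.5}, {'ngram_range': (1, 2)}]:
    vect = TfidfVectorizer(**params)
    vect.fit(toy_docs)
    print("%s -> %s" % (params, vect.get_feature_names()))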

Test tf-idf vectorizer object on training set


In [12]:
tfidfv = TfidfVectorizer()
tfidfv = tfidfv.set_params(max_df=0.75, max_features=5000, use_idf=True, smooth_idf=True, sublinear_tf=True)

In [13]:
t0 = time()
vectors = tfidfv.fit_transform(docs_train) 
print("done in %0.3fs" % (time() - t0))


done in 2.181s

Explore how min_df and max_df change the number of features we get


In [15]:
import numpy as np
value_range=np.arange(0.01,0.99,0.01)

# Calculate the number of features in the library for each value of min_df and max_df in the given range
y1=[TfidfVectorizer(min_df=x).fit_transform(docs_train).shape[1] for x in value_range]
y2=[TfidfVectorizer(max_df=x).fit_transform(docs_train).shape[1] for x in value_range]

# Plot min_df and max_df versus the number of tokens in the vocabulary
from ggplot import *
print(qplot(value_range, y=y1, geom='line') + xlab('min_df') + ylab('features'))
print(qplot(value_range, y=y2, geom='line') + xlab('max_df') + ylab('features'))


C:\Users\Tyler\AppData\Local\Enthought\Canopy\User\lib\site-packages\matplotlib\__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
<ggplot: (24655318)>
<ggplot: (23096818)>

Explore how ngram_range changes the number of features we get


In [16]:
x=[1 for i in range(10)]
y=np.arange(10)+1

# We allow ngram_range to range in the form (1 , ngram) for ngram between 1 and 10
parameter=zip(x,y)

# Calculate the number of tokens in the vocabulary
y3=[TfidfVectorizer(ngram_range=i).fit_transform(docs_train).shape[1] for i in parameter]

# Plot the number of features versus the (1, ngram) range
fig=plt.figure(figsize=(8,6))
plt.plot([1,2,3,4,5,6,7,8,9,10],y3,'b--o')
plt.xlabel('ngram')
plt.ylabel('features')


Out[16]:
<matplotlib.text.Text at 0x164a1f28>

Observe how the parameters min_df, max_df, and ngram_range affect the prediction accuracy of classification algorithms.


In [4]:
# Setting max_df and ngram_range to their defaults, we choose min_df in [1, 3, 5, 7] separately,
# and store the corresponding Xtrain and Xtest in the min_df_data array.

min_df_data=[(TfidfVectorizer(min_df=i).fit_transform(docs_train).toarray(),
TfidfVectorizer(min_df=i).fit(docs_train).transform(docs_test).toarray()) for i in [1,3,5,7]]

In [5]:
# Setting min_df and ngram_range to their defaults, we choose max_df in [0.40, 0.50, 0.60, 0.70] separately,
# and store the corresponding Xtrain and Xtest in the max_df_data array.

max_df_data=[(TfidfVectorizer(max_df=i).fit_transform(docs_train).toarray(),
TfidfVectorizer(max_df=i).fit(docs_train).transform(docs_test).toarray()) for i in [0.40,0.5, 0.60, 0.7]]

In [6]:
# Setting min_df and max_df to their defaults, we choose ngram_range in [(1,1), (1,2)] separately,
# and store the corresponding Xtrain and Xtest in the ngram_range_data array.

ngram_range_data=[(TfidfVectorizer(ngram_range=i).fit_transform(docs_train),
TfidfVectorizer(ngram_range=i).fit(docs_train).transform(docs_test)) for i in [(1,1),(1,2)]]

In [17]:
# Explore how the tfidf parameters affect prediction accuracy, first with linear SVC (KNN follows below)
param_grid = [
  {'C': [1]},
   ]
grid_search = GridSearchCV(LinearSVC(), param_grid, n_jobs=1, verbose=1)

# For each Xtrain and Xtest generated above (for the varying parameters), fit a linear SVC on Xtrain and use it to predict
# on Xtest
min_df_fit=[grid_search.fit(i[0],y_train).predict(i[1]) for i in min_df_data ]
max_df_fit=[grid_search.fit(i[0],y_train).predict(i[1]) for i in max_df_data ]
ngram_range_fit=[grid_search.fit(i[0],y_train).predict(i[1]) for i in ngram_range_data]

# Determine the prediction accuracy for each model (separated per-parameter)

min_df_svc_score=[metrics.accuracy_score(min_df_fit[i],y_test) for i in range(4)]
max_df_svc_score=[metrics.accuracy_score(max_df_fit[i],y_test) for i in range(4)]
ngram_range_svc_score=[metrics.accuracy_score(ngram_range_fit[i],y_test) for i in range(2)]


Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.4min finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   30.8s finished
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.8s finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.0s finished
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   47.4s finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   34.3s finished
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 10.2min finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  3.5min finished
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    2.0s finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   13.3s finished


In [18]:
from sklearn.neighbors import KNeighborsClassifier
param_grid = [
  {'n_neighbors': [1,4]},
   ]
grid_search1 = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=1, verbose=1)

# For each Xtrain and Xtest generated above (for the varying parameters), fit KNN on Xtrain and use it to predict
# on Xtest.  We try K = 1 and 4.

min_df_fit1=[grid_search1.fit(i[0],y_train).predict(i[1]) for i in min_df_data ]
max_df_fit1=[grid_search1.fit(i[0],y_train).predict(i[1]) for i in max_df_data ]
ngram_range_fit1=[grid_search1.fit(i[0],y_train).predict(i[1]) for i in ngram_range_data]


Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 13.6min finished
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  8.5min finished
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  3.1min finished
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  3.1min finished
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  9.0min finished
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  8.9min finished
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 11.8min finished
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  9.3min finished
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    5.6s finished
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    6.7s finished


In [19]:
# Determine the prediction accuracy for each model (separated per-parameter)

min_df_knn_score=[metrics.accuracy_score(min_df_fit1[i],y_test) for i in range(4)]
max_df_knn_score=[metrics.accuracy_score(max_df_fit1[i],y_test) for i in range(4)]
ngram_range_knn_score=[metrics.accuracy_score(ngram_range_fit1[i],y_test) for i in range(2)]

In [20]:
import matplotlib.pyplot as plt 

# Plot prediction accuracy of KNN and SVC models versus the min_df value.

fig=plt.figure(figsize=(8,6))
plt.plot([1,3,5,7], min_df_svc_score, 'bo--',label='svm')
plt.plot([1,3,5,7], min_df_knn_score, 'ro--',label='knn')
plt.legend(loc='best')
plt.xlabel('min_df')
plt.ylabel('score')


Out[20]:
<matplotlib.text.Text at 0xe1f1400>

In [21]:
fig=plt.figure(figsize=(8,6))

# Plot prediction accuracy of KNN and SVC models versus the max_df value.

plt.plot([0.40,0.5, 0.60, 0.7], max_df_svc_score, 'bo--',label='svm')
plt.plot([0.40,0.5, 0.60, 0.7], max_df_knn_score, 'ro--',label='knn')
plt.legend(loc='best')
plt.xlabel('max_df')
plt.ylabel('score')


Out[21]:
<matplotlib.text.Text at 0xe25c7b8>

In [22]:
fig=plt.figure(figsize=(8,6))

# Plot prediction accuracy of KNN and SVC models versus the ngram_range.

plt.plot([1,2], ngram_range_svc_score, 'bo--',label='svm')
plt.plot([1,2], ngram_range_knn_score, 'ro--',label='knn')
plt.legend(loc='best')
plt.xlabel('ngram_range = (1,ngram)')
plt.ylabel('score')


Out[22]:
<matplotlib.text.Text at 0xe26edd8>

Problem 3 (20 points): Machine learning algorithms

  • Based upon Problem 2, pick some parameters for TfidfVectorizer
    • "fit" your TfidfVectorizer using docs_train
    • Compute "Xtrain", a Tf-idf-weighted document-term matrix, using the transform function on docs_train
    • Compute "Xtest", a Tf-idf-weighted document-term matrix, using the transform function on docs_test
    • Note, be sure to use the same fitted TfidfVectorizer ("fit" using docs_train) to transform both docs_test and docs_train (a sketch of this pattern follows below)
  • Examine two classifiers provided by scikit-learn
    • LinearSVC
    • KNeighborsClassifier
    • Try a number of different parameter settings for each and judge your performance using a confusion matrix (see Problem 1 for an example).
  • Does one classifier, or one set of parameters, work better?
    • Why do you think it might be working better?
  • For a particular choice of parameters and classifier, look at 2 examples where the prediction was incorrect.
    • Can you conjecture on why the classifier made a mistake for this prediction?
Fit TfidfVectorizer using docs_train, and compute "Xtrain" and "Xtest"


In [25]:
# This is all done in the following line of code.  Note that we never explicitly define Xtrain and Xtest.  Rather, data has the
# form data = [Xtrain, Xtest].  Thus Xtrain = data[0] and Xtest = data[1]
data=[TfidfVectorizer().fit_transform(docs_train).toarray(), TfidfVectorizer().fit(docs_train).transform(docs_test).toarray()]

K-Nearest Neighbors


In [26]:
from sklearn.neighbors import KNeighborsClassifier

# We use K-values ranging from 1-10
k=[1,2,3,4,5,6,7,8,9,10]

# Train a model on the training set and use that model to predict on the testing set
predicted_knn=[KNeighborsClassifier(n_neighbors=i).fit(data[0],y_train).predict(data[1]) for i in k]

#Compute accuracy on the testing set for each value of k
score_knn=[metrics.accuracy_score(predicted_knn[i],y_test) for i in range(10)]

# Plot accuracy on the test set vs. k
fig=plt.figure(figsize=(8,6))
plt.plot([1,2,3,4,5,6,7,8,9,10], score_knn, 'bo--',label='knn')
plt.xlabel('K')
plt.ylabel('score')


Out[26]:
<matplotlib.text.Text at 0x44367a20>

In [30]:
# Make predictions based on the best model above
y_predicted = predicted_knn[0]

# Print and plot a confusion matrix
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()


[[127 132]
 [ 33 208]]

Linear SVC


In [27]:
# We try "penalty" parameters given as follows:
C=[.01,.05,.1,.5,1,2,3,4,10,20]

# Train a model on the training set and use that model to predict on the testing set
predicted_svm=[LinearSVC(C=i).fit(data[0],y_train).predict(data[1]) for i in C]

#Compute accuracy on the testing set for each value of penalty C
score_svm=[metrics.accuracy_score(predicted_svm[i],y_test) for i in range(10)]

# Plot accuracy on the test set vs. C
fig=plt.figure(figsize=(8,6))
plt.plot([.01,.05,.1,.5,1,2,3,4,10,20], score_svm, 'bo--',label='svm')
plt.xlabel('C')
plt.ylabel('score')


Out[27]:
<matplotlib.text.Text at 0x61a323c8>

In [36]:
# Use the best model above to make predictions on the test set
y_predicted = predicted_svm[9]

# Print and plot a confusion matrix
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()


[[212  47]
 [ 27 214]]

Misclassified Reviews


In [28]:
# We choose our most successful SVC model above, and print both the predicted and true classifications on the test set.
print(predicted_svm[9])
print("")
print(y_test)


[0 1 1 1 0 0 0 0 1 0 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 1 1 0 1 1 0 0 1 1 0 0 0
 0 1 1 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 0 0 1 0 0 0 0 1
 1 1 0 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 1 0 0 1 1
 1 0 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 0 0 1 0
 1 1 0 1 1 0 1 0 1 1 0 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0
 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 1 0 1 0 0 1 0 1 1 1 0 1 0 0 0
 1 1 1 1 0 0 1 1 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 1 0 1
 1 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0
 1 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1 0 0 1
 1 0 0 1 1 1 1 1 0 1 1 0 0 0 1 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 0 1 0 0 1 0 1
 0 1 1 1 1 0 0 1 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 1 1 1 1 0 1 1 0 1 0 0 0 1
 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1
 0 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 0 1 0 1 1 0 1 1 1 0 0 0 1 1 0 1 1 1 1 0 0
 1 1 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0]

[0 0 1 1 0 0 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 1 0 0 0
 1 1 1 0 1 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1
 1 1 0 1 0 1 0 1 1 1 1 0 0 1 0 1 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 1 0 0 0 1
 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1
 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0
 0 1 0 0 0 1 0 1 0 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 0 0 1 0 0 0
 1 1 1 1 0 1 1 1 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 0 0 1 1 1 1 1 0 1
 1 0 0 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 0 1 0 0 0 0 1
 1 1 0 0 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 1 1 1 1 0 1 0 0 0 1 1 1 0 0 0 1
 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 1
 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1
 0 1 0 0 0 0 0 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 0 1 1
 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0
 1 1 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0]
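
Rather than comparing the two printed arrays by eye, the mismatching indices can also be pulled out directly; a small convenience sketch (an addition for this writeup, not part of the original notebook):

In [ ]:
import numpy as np

# Indices of test documents where the best SVC model disagrees with the true labels
wrong = np.where(predicted_svm[9] != y_test)[0]
print("Misclassified test indices: %s" % wrong)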

In [31]:
# Here is a false positive (predicted positive, actually a negative review):
print(docs_test[1])


susan granger's review of " session 9 " ( usa films ) 
sometimes you just get more than your bargained for . . . like when boston-based hazmat elimination , run by scottish actor peter mullan and his trusty assistant , david caruso , assures a town engineer ( paul guilifoyle ) that they can remove insidious asbestos fibers from a victorian hospital facility in a week . 
erected in 1871 , deserted and decomposing since 1985 , the danvers mental hospital , is one of the most malevolent " locations " ever chosen for a film . 
the structure is so massive - with its labyrinth of rubble-strewn corridors , collapsing floors , stagnant pools of water , isolation cells , and ominous surgical chambers where experimental pre-frontal lobotomies were performed - that their task seems impossible within that time frame . 
and each member of their inexperienced crew ( stephan gevedon , brandon sexton iii , and josh lucas ) is coping with his own personal demons as , one by one , their minds seem to be affected by the grim areas in which they're working . 
the film's title is derived from salvaged reel-to-reel audio-recorded sessions involving the demonic possession of a young woman who is suffering from multiple personalities . 
by the time session 9 occurs so do dreadful disasters . 
filmmaker brad anderson obviously envisioned this as a gruesome chainsaw-massacre-type ghost story but the script lacks structure and isn't particularly scary . 
the conclusion is more ludicrous than convincing . 
on the granger movie gauge of 1 to 10 , " session 9 " is a dark , gloomy 4 . silly me . . . at 
first , i thought that the original name of the danvers lunatic asylum bore some reference to mrs . danvers , the creepy housekeeper played by judith anderson in alfred hitchcock's truly terrifying " rebecca " that also involved a cavernous mansion called manderley . 


In [32]:
# Here is a false negative (predicted negative, actually a positive review):
print(docs_test[9])


this summer , one of the most racially charged novels in john grisham's series , a time to kill , was made into a major motion picture . 
on january 3 of this year , director rob reiner basically re-released the film under the title of ghosts of mississippi . 
based on the true story of 1963 civil rights leader medgar evars' assassination , ghosts of mississippi revolves around the 25-year legal battle faced by myrlie evars ( whoopi goldberg , sister act ) and her quest to have her husband's obvious assassin and racist byron de la beckwith ( james woods , casino ) jailed . 
so she turns to assistant district attorney and prosecutor bobby delaughter ( alec baldwin , heaven's prisoners ) to imprison the former kkk member . 
ghosts sets its tone with an opening montage of images from african-american history , from slave-ship miseries to life in the racist south of the 1960's . 
but all too soon , the white folks take over , intoning lines like " what's america got to do with anything ? 
this is mississippi ! " 
as beckwith , james woods , with his head larded with latex most of the time as an old man , teeters between portraying evil and its character . 
meanwhile , goldberg turns in a very serious and weepy performance as the wife who wouldn't let her husband's death rest until she got the conviction . 
both deserve serious oscar-consideration . 
this brings us to the dull performance of baldwin . 
let's face it , trying to match matthew mcconaughey's wonderful acting in a time to kill is basically impossible . 
and baldwin is living proof of this , as no emotions could be felt . 
it seemed as if he actually had to struggle to shed a single tear . 
either poor acting or poor directing , but something definitely went wrong . 
another strange mishap was the fact that goldberg's facial features didn't change , as she looked the same in the courtroom as she did holding her husband's dead body 25-years earlier . 
yet woods' was plastered with enough make up to make him look like goldberg's father . 
at least the make-up was realistic . 
with some emotional moments in the poorly written script , ghosts of mississippi lacked in heart , when its predecessor , a time to kill , brought tears to everyone's eyes . 
don't get me wrong , the movie wasn't all that bad , but if you've seen grisham's masterpiece , then don't expect this one to be an excellent film . 
 , 


Problem 4 (20 points): Open Ended Question: Finding the right plot

  • Can you find a two dimensional plot in which the positive and negative reviews are separated?
    • This problem is hard since you will likely have thousands of features for each review, and you will need to transform these thousands of features into just two numbers (so that you can make a 2D plot).
  • Note, I was not able to find such a plot myself!
    • So, this problem is about trying but perhaps not necessarily succeeding!
  • I tried two things, neither of which worked very well.
    • I first plotted the length of the review versus the number of features we compute that are in that review
    • Second, I used Principal Component Analysis on a subset of the features (a sketch of this idea follows below).
  • Can you do better than I did!?
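
For reference, here is a minimal sketch of the PCA idea (an illustration added for this writeup, not the author's original attempt; it assumes the dense tf-idf training matrix data[0] and the labels y_train from Problem 3 are still in scope, and it may be slow on the full matrix):

In [ ]:
from sklearn.decomposition import PCA

# Project the tf-idf training matrix down to its first two principal components
pca = PCA(n_components=2)
X2 = pca.fit_transform(data[0])

# Scatter the projected reviews, colored by their true label
fig = plt.figure(figsize=(8,6))
plt.plot(X2[y_train == 0, 0], X2[y_train == 0, 1], 'ro', label='neg')
plt.plot(X2[y_train == 1, 0], X2[y_train == 1, 1], 'go', label='pos')
plt.legend(loc='best')
plt.xlabel('PC 1')
plt.ylabel('PC 2')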

First, build a new set of predictors based on the text structure of each document.


In [33]:
# Total number of words with more than one letter in the review
total_words = [len([words for words in review.split() if len(words)>1]) for review in dataset.data]

# Total number of sentences
total_sentences = [len(review.split('.'))+1 for review in dataset.data]

# Average number of words per sentence
average_sentence_length = [len([words for words in review.split() if len(words)>1])/ float((len(review.split('\n')))) for review in dataset.data]

# Total number of words ending in n't in the document
number_of_not_contractions = [review.count("'t") for review in dataset.data]

# Total occurrences of the word "not"
number_of_nots = [review.count("not") for review in dataset.data]

# Number of "not" and "n't" occurrences
total_nots = [number_of_not_contractions[i] + number_of_nots[i] for i in range(len(number_of_not_contractions))]

# Total number of apostrophes in the review (a rough proxy for the number of contractions)
number_of_contractions = [review.count("'") for review in dataset.data]


# Determine number of words in the last sentence
last_sentence = [review.split('\n')[len(review.split('\n'))-2] for review in dataset.data]
last_sentence_length = [len([words for words in sen.split( ) if len(words) > 1]) for sen in last_sentence]

# Number of words in the first sentence of each review
first_sentence = [review.split('\n')[0] for review in dataset.data]
first_sentence_length = [len([words for words in sen.split( ) if len(words) > 1]) for sen in first_sentence]

# Number of words in the longest sentence
longest_sentence = [max([len([words for words in sen.split( ) if len(words) > 1]) for sen in [sentences for sentences in review.split('\n') 
                                                                                              if len(sentences) > 3]] 
                             ) for review in dataset.data]

# Number of words in the shortest sentence
shortest_sentence = [min([len([words for words in sen.split( ) if len(words) > 1]) for sen in [sentences for sentences in review.split('\n') 
                                                                                              if len(sentences) > 3]] 
                             ) for review in dataset.data]

# Standard deviation of sentence length (in words)
sent_dev = [np.std([len([words for words in sen.split( ) if len(words) > 1]) for sen in [sentences for sentences in review.split('\n') 
                                                                                              if len(sentences) > 3]] 
                             ) for review in dataset.data]

# Total number of occurrences of ( or . . . or ?

number_of_parenth = [review.count("(") for review in dataset.data]
number_of_elips = [review.count(". . .") for review in dataset.data]
number_of_questions = [review.count("?") for review in dataset.data]
number_of_punc = [number_of_parenth[i]+number_of_elips[i]+number_of_questions[i] for i in range(len(number_of_parenth))]

# Percent of all letters that are vowels
percent_vowels = [(review.count('a')+ review.count('e') + review.count('i') + review.count('o') + review.count('u'))/
                 float(len(review)) for review in dataset.data]

# Number of words starting with a vowel, per character of the review
percent_start_vowels = [(review.count(' a')+ review.count(' e') + review.count(' i') + review.count(' o') + review.count(' u'))/
                 float(len(review)) for review in dataset.data]


# Total occurrences of the word "you"
total_you = [review.count('you') for review in dataset.data]


# Count the number of negative-connotation prefixes which occur.
no_dis = [review.count(' dis')for review in dataset.data]
no_un = [review.count(' un')for review in dataset.data]
no_in = [review.count(' in')for review in dataset.data]
no_il = [review.count(' il')for review in dataset.data]
no_im = [review.count(' im')for review in dataset.data]
no_sub = [review.count(' sub')for review in dataset.data]
no_under = [review.count(' under')for review in dataset.data]
no_non = [review.count(' non')for review in dataset.data]

neg_prefix = [no_dis[i]+ no_un[i] + no_in[i] + no_il[i] + no_im[i] + no_sub[i] + no_under[i] + no_non[i] for i in range(len(no_dis))]

# Given a string st, this function finds the occurrence of substring subst1 or subst2 which occurs closest to the beginning of st.
def first_occ(st,subst1,subst2):
    if st.find(subst1) > 0:
        if st.find(subst2) > 0:
            return min(st.find(subst1),st.find(subst2))
        else:
            return st.find(subst1)
    else:
        return st.find(subst2)

# Locate the first "not" or "n't" in the review
first_not = [first_occ(review,"not","'t")/float(len(review)) for review in dataset.data]

# Locate the last "not" or "n't" in the review
last_not = [first_occ(review[::-1],"ton","t'")/float(len(review)) for review in dataset.data]

# Determine the occurrence of "not" or "n't" which is closest to the beginning or end of the review.
min_not = np.minimum(np.asarray(first_not),np.asarray(last_not))

In [34]:
# Store this new data in a data frame
import pandas as pd

newdata = {'Review Type': dataset.target,'Total Words': total_words, 
           'Total Sentences': total_sentences,'Average Sentence Length': average_sentence_length,
           'Number of not Contractions': number_of_not_contractions,'Total number of Nots': total_nots,'Last Sentence Length':last_sentence_length,
           'First Sentence Length': first_sentence_length,'Longest Sentence':longest_sentence,
           'Shortest Sentence':shortest_sentence, 'Number of Contractions': number_of_contractions, 'Number of () ... or ?': number_of_punc,
            'Sentence Deviation': sent_dev,#'Number of Questions': number_of_questions, 'Number of ...': number_of_elips, 
          'Number of Negative Prefixes': neg_prefix,#'Percent Vowels': percent_vowels, 'Percent Start Vowels': percent_start_vowels,
          'Total You': total_you, 'Closest Not': min_not}
data = pd.DataFrame(newdata, columns = ['Review Type','Total Words', 
           'Total Sentences','Average Sentence Length',
           'Number of not Contractions','Total number of Nots',
            'Last Sentence Length',
           'First Sentence Length','Longest Sentence',
           'Shortest Sentence','Number of Contractions','Number of () ... or ?','Sentence Deviation',#'Number of Questions', 'Number of ...', 
                                        'Number of Negative Prefixes',#'Percent Vowels', 'Percent Start Vowels',
                                        'Total You','Closest Not'])

data


Out[34]:
Review Type Total Words Total Sentences Average Sentence Length Number of not Contractions Total number of Nots Last Sentence Length First Sentence Length Longest Sentence Shortest Sentence Number of Contractions Number of () ... or ? Sentence Deviation Number of Negative Prefixes Total You Closest Not
0 0 676 36 18.777778 6 9 27 12 53 1 16 19 14.163880 20 5 0.127133
1 0 227 16 16.214286 3 4 8 6 48 4 13 3 13.471425 6 3 0.093431
2 0 472 24 19.666667 5 12 6 19 47 6 15 7 9.449498 15 3 0.066362
3 0 456 22 22.800000 2 3 11 26 48 3 25 10 12.013151 9 2 0.138614
4 0 698 39 18.368421 1 12 14 23 42 3 15 10 8.801604 34 2 0.032820
5 0 641 37 17.805556 0 4 11 18 34 3 3 3 6.497346 23 0 0.035796
6 0 525 26 18.750000 3 6 11 11 40 2 17 6 10.945431 30 7 0.039525
7 0 540 25 19.285714 3 9 10 9 79 3 21 15 16.149762 30 4 0.007316
8 0 684 31 20.117647 7 12 9 8 56 3 26 14 11.531965 33 0 0.172497
9 0 739 57 14.780000 5 11 25 15 48 1 22 37 11.446124 17 11 0.090270
10 0 674 55 19.257143 2 8 16 30 40 3 21 14 9.620660 31 2 0.075315
11 0 511 28 18.925926 2 2 7 27 54 4 16 4 12.421886 19 0 0.338516
12 0 448 19 23.578947 2 5 29 21 52 7 16 11 12.319311 14 0 0.023708
13 0 886 54 19.260870 6 9 15 13 55 2 14 22 10.502533 34 4 0.008229
14 0 470 31 13.823529 9 12 13 8 54 1 28 8 11.977328 20 12 0.001295
15 0 572 36 19.724138 3 7 17 4 61 0 18 17 12.231191 19 2 0.031470
16 0 615 33 17.571429 5 9 1 8 44 1 24 8 10.589583 25 6 0.027838
17 0 653 42 18.657143 8 12 32 31 74 1 20 13 17.639827 30 12 0.045921
18 0 420 30 15.555556 2 7 1 19 45 1 17 11 10.829775 7 6 0.046643
19 0 678 38 16.536585 9 11 11 18 40 3 20 7 9.134960 17 6 0.060606
20 0 640 34 18.823529 2 6 14 24 52 1 18 8 13.103179 27 1 0.025503
21 0 505 21 18.703704 5 8 17 31 42 1 18 10 11.059883 29 3 0.004813
22 0 621 44 15.525000 4 10 33 12 33 1 18 4 8.373905 20 0 0.072801
23 0 1036 44 28.000000 5 10 46 20 84 3 37 10 18.395517 36 2 0.022955
24 0 681 32 20.636364 0 5 8 11 38 5 12 6 8.979958 27 1 0.093594
25 0 541 32 16.906250 2 3 7 12 33 7 10 3 6.996952 20 14 0.023072
26 0 469 23 19.541667 2 4 25 3 40 3 8 3 8.406722 14 2 0.125735
27 0 1233 45 28.022727 3 9 29 29 68 2 23 28 14.436387 40 6 0.024522
28 0 494 20 24.700000 2 3 8 13 71 3 19 9 15.894388 22 0 0.078313
29 0 310 23 14.761905 0 2 1 18 33 1 8 5 9.243917 8 0 0.067465
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1970 1 626 27 21.586207 0 1 19 19 43 2 6 10 10.731177 26 3 0.215154
1971 1 579 29 19.965517 3 7 24 42 42 4 21 3 9.878019 20 7 0.051432
1972 1 507 39 13.342105 1 5 36 21 38 3 18 8 8.529329 19 5 0.248134
1973 1 1141 59 19.672414 7 18 8 15 50 2 32 5 11.244478 36 17 0.030540
1974 1 471 22 21.409091 2 7 1 12 52 1 22 4 13.650646 16 1 0.000971
1975 1 918 62 15.559322 6 12 20 7 33 6 15 4 5.249427 26 1 0.077534
1976 1 236 18 13.882353 1 5 5 11 36 5 3 1 7.284401 7 7 0.032258
1977 1 394 28 14.592593 6 10 20 9 28 2 12 7 6.230769 18 0 0.006319
1978 1 760 46 16.888889 8 13 15 10 45 1 19 2 10.382160 20 5 0.028694
1979 1 821 36 22.189189 4 11 17 11 72 4 27 8 12.558351 26 14 0.064870
1980 1 445 26 17.800000 3 7 2 30 38 2 5 3 9.539301 9 3 0.101234
1981 1 676 29 23.310345 3 9 33 18 44 2 23 13 10.466662 22 3 0.035849
1982 1 583 45 13.558140 0 9 11 7 42 5 0 2 7.660046 24 4 0.036169
1983 1 751 39 18.317073 6 9 34 21 40 2 31 5 9.941045 35 2 0.088296
1984 1 525 25 21.000000 2 5 9 25 47 8 10 14 9.107289 25 2 0.030145
1985 1 603 46 13.704545 2 4 13 9 63 1 13 6 10.375482 26 0 0.122060
1986 1 1132 58 16.895522 1 25 40 17 44 1 10 13 10.098368 29 31 0.010112
1987 1 180 17 12.857143 0 3 9 14 22 2 4 3 6.099762 7 2 0.241728
1988 1 847 45 19.250000 6 13 15 24 42 2 22 8 7.840073 30 7 0.025314
1989 1 743 28 27.518519 2 10 31 9 63 9 9 5 11.066836 32 0 0.023743
1990 1 948 63 14.584615 7 13 7 6 44 3 38 10 7.601141 32 5 0.086086
1991 1 577 30 20.607143 4 10 8 5 43 5 12 8 10.165840 18 2 0.001416
1992 1 1346 84 15.651163 13 20 19 23 37 3 19 17 7.079128 35 3 0.016371
1993 1 602 35 23.153846 4 7 21 8 44 4 17 18 10.740279 22 13 0.038695
1994 1 493 36 15.903226 4 6 10 31 47 2 8 13 12.695625 15 7 0.116979
1995 1 732 47 16.636364 5 9 13 3 39 1 16 8 8.105377 18 3 0.085121
1996 1 311 26 12.440000 3 5 5 13 35 4 12 1 6.919051 8 3 0.013048
1997 1 1020 54 19.615385 1 7 60 30 60 5 12 4 8.915727 47 7 0.020898
1998 1 579 40 15.648649 3 6 7 15 37 1 8 3 8.699856 14 1 0.099915
1999 1 959 62 15.467742 6 9 11 14 42 5 29 6 7.400397 28 3 0.098205

2000 rows × 16 columns


In [35]:
# Normalize the data (min-max scale each column to [0, 1]).
Udata = data.drop('Review Type', 1)
Udata_norm =(Udata - Udata.min()) / (Udata.max() - Udata.min())

data_array = Udata_norm.as_matrix(columns = None)

In [36]:
# Train a decision tree on the normalized data.
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth = 6)
clf = clf.fit(Udata_norm,data['Review Type'])

features = list(Udata.columns)


# Split the data into Negative and Positive subsets.
Neg = Udata_norm.ix[Udata.index[data['Review Type']==0]]
Pos = Udata_norm.ix[Udata.index[data['Review Type']==1]]

In [37]:
# The following code was obtained via GitHub.  It prints a description of a classification tree.
def print_decision_tree(tree, feature_names=None, offset_unit='    '):
    '''Prints a textual representation of the rules of a decision tree
    tree: scikit-learn representation of tree
    feature_names: list of feature names. They are set to f1,f2,f3,... if not specified
    offset_unit: a string of offset of the conditional block'''

    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    value = tree.tree_.value
    if feature_names is None:
        features  = ['f%d'%i for i in tree.tree_.feature]
    else:
        features  = feature_names       

    def recurse(left, right, threshold, features, node, depth=0):
            offset = offset_unit*depth
            if (threshold[node] != -2):
                    print(offset+"if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
                    if left[node] != -1:
                            recurse (left, right, threshold, features,left[node],depth+1)
                    print(offset+"} else {")
                    if right[node] != -1:
                            recurse (left, right, threshold, features,right[node],depth+1)
                    print(offset+"}")
            else:
                    print(offset+"return " + str(value[node]))

    recurse(left, right, threshold, features, 0,0)
print_decision_tree(clf, offset_unit = '   ')


if ( f8 <= 0.0833333358169 ) {
   if ( f2 <= 0.397718280554 ) {
      if ( f11 <= 0.349428236485 ) {
         if ( f2 <= 0.305362761021 ) {
            if ( f9 <= 0.201298698783 ) {
               if ( f6 <= 0.170588240027 ) {
                  return [[ 34.  41.]]
               } else {
                  return [[ 44.  24.]]
               }
            } else {
               if ( f0 <= 0.669308662415 ) {
                  return [[ 78.  21.]]
               } else {
                  return [[ 0.  3.]]
               }
            }
         } else {
            if ( f1 <= 0.103351950645 ) {
               if ( f13 <= 0.205128222704 ) {
                  return [[ 5.  0.]]
               } else {
                  return [[ 0.  1.]]
               }
            } else {
               if ( f4 <= 0.202702701092 ) {
                  return [[ 14.  44.]]
               } else {
                  return [[ 23.  23.]]
               }
            }
         }
      } else {
         if ( f7 <= 0.216535434127 ) {
            if ( f5 <= 0.0360824726522 ) {
               if ( f7 <= 0.177165359259 ) {
                  return [[ 1.  5.]]
               } else {
                  return [[ 8.  0.]]
               }
            } else {
               if ( f6 <= 0.335294127464 ) {
                  return [[ 41.   1.]]
               } else {
                  return [[ 3.  1.]]
               }
            }
         } else {
            if ( f8 <= 0.0500000044703 ) {
               if ( f13 <= 0.730769276619 ) {
                  return [[ 98.  26.]]
               } else {
                  return [[ 0.  2.]]
               }
            } else {
               if ( f10 <= 0.132911384106 ) {
                  return [[ 23.   7.]]
               } else {
                  return [[  4.  12.]]
               }
            }
         }
      }
   } else {
      if ( f11 <= 0.37861007452 ) {
         if ( f4 <= 0.148648649454 ) {
            return [[  0.  12.]]
         } else {
            if ( f7 <= 0.208661422133 ) {
               if ( f6 <= 0.123529419303 ) {
                  return [[ 1.  2.]]
               } else {
                  return [[ 4.  0.]]
               }
            } else {
               if ( f2 <= 0.441754043102 ) {
                  return [[  0.  10.]]
               } else {
                  return [[ 4.  6.]]
               }
            }
         }
      } else {
         if ( f6 <= 0.182352945209 ) {
            if ( f5 <= 0.201030924916 ) {
               if ( f14 <= 0.0956551134586 ) {
                  return [[  9.  15.]]
               } else {
                  return [[ 13.   4.]]
               }
            } else {
               if ( f14 <= 0.0033819056116 ) {
                  return [[ 2.  0.]]
               } else {
                  return [[  7.  27.]]
               }
            }
         } else {
            if ( f9 <= 0.240259736776 ) {
               if ( f7 <= 0.216535434127 ) {
                  return [[ 0.  3.]]
               } else {
                  return [[ 38.   8.]]
               }
            } else {
               if ( f6 <= 0.523529410362 ) {
                  return [[ 15.  22.]]
               } else {
                  return [[ 7.  1.]]
               }
            }
         }
      }
   }
} else {
   if ( f0 <= 0.343901365995 ) {
      if ( f3 <= 0.113636367023 ) {
         if ( f11 <= 0.476569116116 ) {
            if ( f2 <= 0.174074590206 ) {
               return [[ 7.  0.]]
            } else {
               if ( f8 <= 0.416666686535 ) {
                  return [[ 212.  344.]]
               } else {
                  return [[ 13.   5.]]
               }
            }
         } else {
            if ( f2 <= 0.616909265518 ) {
               if ( f13 <= 0.166666671634 ) {
                  return [[ 28.   9.]]
               } else {
                  return [[ 0.  2.]]
               }
            } else {
               if ( f5 <= 0.0670103058219 ) {
                  return [[ 2.  0.]]
               } else {
                  return [[ 2.  8.]]
               }
            }
         }
      } else {
         if ( f0 <= 0.216424480081 ) {
            if ( f2 <= 0.193574130535 ) {
               return [[ 0.  3.]]
            } else {
               if ( f7 <= 0.070866137743 ) {
                  return [[ 0.  3.]]
               } else {
                  return [[ 90.  33.]]
               }
            }
         } else {
            if ( f1 <= 0.25977653265 ) {
               if ( f2 <= 0.266240358353 ) {
                  return [[ 8.  0.]]
               } else {
                  return [[ 106.   97.]]
               }
            } else {
               if ( f11 <= 0.222144022584 ) {
                  return [[ 2.  1.]]
               } else {
                  return [[  0.  10.]]
               }
            }
         }
      }
   } else {
      if ( f2 <= 0.252200692892 ) {
         return [[ 5.  0.]]
      } else {
         if ( f5 <= 0.71649479866 ) {
            if ( f6 <= 0.264705896378 ) {
               if ( f0 <= 0.483487457037 ) {
                  return [[ 35.  69.]]
               } else {
                  return [[  5.  37.]]
               }
            } else {
               if ( f7 <= 0.653543293476 ) {
                  return [[  5.  57.]]
               } else {
                  return [[ 2.  1.]]
               }
            }
         } else {
            return [[ 2.  0.]]
         }
      }
   }
}

Now, based on the printed tree, manually construct the desired scatter plots, as described in the report.
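
The cells that follow all repeat the same filter-and-plot pattern along one root-to-node branch of the tree. For readability, the chained np.logical_and calls could be factored into a small helper; a sketch (an illustration only; the cells below keep the original inline form):

In [ ]:
def tree_branch(df, tests):
    # Keep the rows of df satisfying every (column, threshold, keep_below) test,
    # i.e. one root-to-node path of the printed decision tree.
    mask = np.ones(len(df), dtype=bool)
    for col, thr, keep_below in tests:
        mask &= (df[col] <= thr).values if keep_below else (df[col] > thr).values
    return df[mask]

# For example, the first subset of the first plot could be written as:
# Neg1a = tree_branch(Neg, [('Shortest Sentence', .08333, True),
#                           ('Average Sentence Length', .39771, True),
#                           ('Sentence Deviation', .34943, True),
#                           ('Average Sentence Length', .30536, True),
#                           ('Number of Contractions', .2013, True)])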


In [39]:
#one

Neg1a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] <= .39771, 
                        np.logical_and(Neg['Sentence Deviation'] <= .34943,
                        np.logical_and(Neg['Average Sentence Length']<=.30536,
                        Neg['Number of Contractions']<= .2013))))]]
Neg1b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] <= .39771, 
                        np.logical_and(Neg['Sentence Deviation'] <= .34943,
                        np.logical_and(Neg['Average Sentence Length']<=.30536,
                        Neg['Number of Contractions']> .2013))))]]
Pos1a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] <= .39771, 
                        np.logical_and(Pos['Sentence Deviation'] <= .34943,
                        np.logical_and(Pos['Average Sentence Length']<=.30536,
                        Pos['Number of Contractions']<= .2013))))]]
Pos1b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] <= .39771, 
                        np.logical_and(Pos['Sentence Deviation'] <= .34943,
                        np.logical_and(Pos['Average Sentence Length']<=.30536,
                        Pos['Number of Contractions'] > .2013))))]]

py.plot(Neg1a['Number of Contractions'],Neg1a['First Sentence Length'],'ro')
py.plot(Neg1b['Number of Contractions'],-Neg1b['Total Words'],'ro')
py.plot(Pos1a['Number of Contractions'],Pos1a['First Sentence Length'],'go')
py.plot(Pos1b['Number of Contractions'],-Pos1b['Total Words'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=0.2013, ymin=-0.8, ymax = 1, linewidth=1, color='k')


Out[39]:
<matplotlib.lines.Line2D at 0x65986e80>

In [40]:
#Two

Neg2a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] <= .39771, 
                        np.logical_and(Neg['Sentence Deviation'] <= .34943,
                        np.logical_and(Neg['Average Sentence Length']>.30536,
                        Neg['Total Sentences']<= .103352))))]]
Neg2b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] <= .39771, 
                        np.logical_and(Neg['Sentence Deviation'] <= .34943,
                        np.logical_and(Neg['Average Sentence Length']>.30536,
                        Neg['Total Sentences']> .10335))))]]
Pos2a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] <= .39771, 
                        np.logical_and(Pos['Sentence Deviation'] <= .34943,
                        np.logical_and(Pos['Average Sentence Length']>.30536,
                        Pos['Total Sentences']<= .103352))))]]
Pos2b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] <= .39771, 
                        np.logical_and(Pos['Sentence Deviation'] <= .34943,
                        np.logical_and(Pos['Average Sentence Length']>.30536,
                        Pos['Total Sentences'] > .10335))))]]

py.plot(Neg2a['Total Sentences'],Neg2a['Number of not Contractions'],'ro')
py.plot(Neg2b['Total Sentences'],-Neg2b['Total number of Nots'],'ro')
py.plot(Pos2a['Total Sentences'],Pos2a['Number of not Contractions'],'go')
py.plot(Pos2b['Total Sentences'],-Pos2b['Total number of Nots'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.10335, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[40]:
<matplotlib.lines.Line2D at 0x632aa2e8>

In [41]:
#Three

Neg3a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] <= .39771, 
                        np.logical_and(Neg['Sentence Deviation'] > .34943,
                        np.logical_and(Neg['Longest Sentence']<=.216535,
                        Neg['Last Sentence Length']<= .03608))))]]
Neg3b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] <= .39771, 
                        np.logical_and(Neg['Sentence Deviation'] > .34943,
                        np.logical_and(Neg['Longest Sentence']<=.216535,
                        Neg['Last Sentence Length']> .03608))))]]
Pos3a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] <= .39771, 
                        np.logical_and(Pos['Sentence Deviation'] > .34943,
                        np.logical_and(Pos['Longest Sentence']<=.216535,
                        Pos['Last Sentence Length']<= .03608))))]]
Pos3b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] <= .39771, 
                        np.logical_and(Pos['Sentence Deviation'] > .34943,
                        np.logical_and(Pos['Longest Sentence']<=.216535,
                        Pos['Last Sentence Length']> .03608))))]]

py.plot(Neg3a['Last Sentence Length'],Neg3a['Longest Sentence'],'ro')
py.plot(Neg3b['Last Sentence Length'],-Neg3b['First Sentence Length'],'ro')
py.plot(Pos3a['Last Sentence Length'],Pos3a['Longest Sentence'],'go')
py.plot(Pos3b['Last Sentence Length'],-Pos3b['First Sentence Length'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.03608, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[41]:
<matplotlib.lines.Line2D at 0x113a19b0>

In [42]:
#four

Neg4a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] <= .39771, 
                        np.logical_and(Neg['Sentence Deviation'] > .34943,
                        np.logical_and(Neg['Longest Sentence']>.216535,
                        Neg['Shortest Sentence']<= .05))))]]
Neg4b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] <= .39771, 
                        np.logical_and(Neg['Sentence Deviation'] > .34943,
                        np.logical_and(Neg['Longest Sentence']>.216535,
                        Neg['Shortest Sentence']> .05))))]]
Pos4a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] <= .39771, 
                        np.logical_and(Pos['Sentence Deviation'] > .34943,
                        np.logical_and(Pos['Longest Sentence']>.216535,
                        Pos['Shortest Sentence']<= .05))))]]
Pos4b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] <= .39771, 
                        np.logical_and(Pos['Sentence Deviation'] > .34943,
                        np.logical_and(Pos['Longest Sentence']>.216535,
                        Pos['Shortest Sentence']> .05))))]]

py.plot(Neg4a['Shortest Sentence'],Neg4a['Total You'],'ro')
py.plot(Neg4b['Shortest Sentence'],-Neg4b['Number of () ... or ?'],'ro')
py.plot(Pos4a['Shortest Sentence'],Pos4a['Total You'],'go')
py.plot(Pos4b['Shortest Sentence'],-Pos4b['Number of () ... or ?'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.05, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[42]:
<matplotlib.lines.Line2D at 0x62ff44e0>

In [43]:
#five

Neg5a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] > .39771, 
                        np.logical_and(Neg['Sentence Deviation'] <= .3786,
                        np.logical_and(Neg['Total number of Nots']<= .14865,
                        Neg['Total number of Nots']<= .14865))))]]
Neg5b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] > .39771, 
                        np.logical_and(Neg['Sentence Deviation']<= .3786,
                        np.logical_and(Neg['Total number of Nots']<= .14865,
                        Neg['Total number of Nots']<= .14865))))]]
Pos5a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] > .39771, 
                        np.logical_and(Pos['Sentence Deviation'] <= .3786,
                        np.logical_and(Pos['Total number of Nots']<= .14865,
                        Pos['Total number of Nots']<= .14865))))]]
Pos5b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] > .39771, 
                        np.logical_and(Pos['Sentence Deviation'] <= .3786,
                        np.logical_and(Pos['Total number of Nots']<= .14865,
                        Pos['Total number of Nots']<= .14865))))]]

py.plot(Neg5a['Sentence Deviation'],Neg5a['Total number of Nots'],'ro')
#py.plot(Neg5b['Sentence Deviation'],-Neg5b['Total number of Nots'],'ro')
py.plot(Pos5a['Sentence Deviation'],Pos5a['Total number of Nots'],'go')
#py.plot(Pos3b['Shortest Sentence'],-Pos3b['Number of () ... or ?'],'go')
#plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
#plt.axvline(x=.05, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[43]:
[<matplotlib.lines.Line2D at 0x11498fd0>]

In [44]:
#Six

Neg6a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] > .39771, 
                        np.logical_and(Neg['Sentence Deviation'] <= .3786,
                        np.logical_and(Neg['Total number of Nots']> .14865,
                        Neg['Longest Sentence']<= .20866))))]]
Neg6b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] > .39771, 
                        np.logical_and(Neg['Sentence Deviation']<= .3786,
                        np.logical_and(Neg['Total number of Nots']> .14865,
                        Neg['Longest Sentence']> .20866))))]]
Pos6a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] > .39771, 
                        np.logical_and(Pos['Sentence Deviation'] <= .3786,
                        np.logical_and(Pos['Total number of Nots']> .14865,
                        Pos['Longest Sentence']<= .20866))))]]
Pos6b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] > .39771, 
                        np.logical_and(Pos['Sentence Deviation'] <= .3786,
                        np.logical_and(Pos['Total number of Nots']> .14865,
                        Pos['Longest Sentence']> .20866))))]]

py.plot(Neg6a['Longest Sentence'],Neg6a['Number of Negative Prefixes'],'ro')
py.plot(Neg6b['Longest Sentence'],-Neg6b['Average Sentence Length'],'ro')
py.plot(Pos6a['Longest Sentence'],Pos6a['Number of Negative Prefixes'],'go')
py.plot(Pos6b['Longest Sentence'],-Pos6b['Average Sentence Length'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.20866, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[44]:
<matplotlib.lines.Line2D at 0x62ce29e8>

In [45]:
#Seven

Neg7a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] > .39771, 
                        np.logical_and(Neg['Sentence Deviation'] > .3786,
                        np.logical_and(Neg['First Sentence Length']<= .18235,
                        Neg['Last Sentence Length']<= .201))))]]
Neg7b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] > .39771, 
                        np.logical_and(Neg['Sentence Deviation']> .3786,
                        np.logical_and(Neg['First Sentence Length']<= .18235,
                        Neg['Last Sentence Length']> .201))))]]
Pos7a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] > .39771, 
                        np.logical_and(Pos['Sentence Deviation'] > .3786,
                        np.logical_and(Pos['First Sentence Length']<= .18235,
                        Pos['Last Sentence Length']<= .201))))]]
Pos7b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] > .39771, 
                        np.logical_and(Pos['Sentence Deviation'] > .3786,
                        np.logical_and(Pos['First Sentence Length']<= .18235,
                        Pos['Last Sentence Length']> .201))))]]

py.plot(Neg7a['Last Sentence Length'],Neg7a['Closest Not'],'ro')
py.plot(Neg7b['Last Sentence Length'],-Neg7b['Closest Not'],'ro')
py.plot(Pos7a['Last Sentence Length'],Pos7a['Closest Not'],'go')
py.plot(Pos7b['Last Sentence Length'],-Pos7b['Closest Not'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.201, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[45]:
<matplotlib.lines.Line2D at 0x117a5400>

In [46]:
#Eight
# Path: Shortest Sentence <= .08333, Average Sentence Length > .39771,
# Sentence Deviation > .3786, First Sentence Length > .18235;
# split on Number of Contractions at .24026 (a: <=, b: >).

Neg8a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] > .39771, 
                        np.logical_and(Neg['Sentence Deviation'] > .3786,
                        np.logical_and(Neg['First Sentence Length']> .18235,
                        Neg['Number of Contractions']<= .24026))))]]
Neg8b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']<= .08333, 
                        np.logical_and(Neg['Average Sentence Length'] > .39771, 
                        np.logical_and(Neg['Sentence Deviation']> .3786,
                        np.logical_and(Neg['First Sentence Length']> .18235,
                        Neg['Number of Contractions']> .24026))))]]
Pos8a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] > .39771, 
                        np.logical_and(Pos['Sentence Deviation'] > .3786,
                        np.logical_and(Pos['First Sentence Length']> .18235,
                        Pos['Number of Contractions']<= .24026))))]]
Pos8b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']<= .08333, 
                        np.logical_and(Pos['Average Sentence Length'] > .39771, 
                        np.logical_and(Pos['Sentence Deviation'] > .3786,
                        np.logical_and(Pos['First Sentence Length']> .18235,
                        Pos['Number of Contractions']> .24026))))]]

py.plot(Neg8a['Number of Contractions'],Neg8a['Longest Sentence'],'ro')
py.plot(Neg8b['Number of Contractions'],-Neg8b['First Sentence Length'],'ro')
py.plot(Pos8a['Number of Contractions'],Pos8a['Longest Sentence'],'go')
py.plot(Pos8b['Number of Contractions'],-Pos8b['First Sentence Length'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.24026, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[46]:
<matplotlib.lines.Line2D at 0x60704c50>

In [47]:
#Nine
# Path: Shortest Sentence > .08333, Total Words <= .3439, Number of not
# Contractions <= .113636, Sentence Deviation <= .47657;
# split on Average Sentence Length at .17407 (a: <=, b: >).

Neg9a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']> .08333, 
                        np.logical_and(Neg['Total Words'] <= .3439, 
                        np.logical_and(Neg['Number of not Contractions'] <= .113636,
                        np.logical_and(Neg['Sentence Deviation']<= .47657,
                        Neg['Average Sentence Length']<= .17407))))]]
Neg9b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']> .08333, 
                        np.logical_and(Neg['Total Words'] <= .3439, 
                        np.logical_and(Neg['Number of not Contractions'] <= .113636,
                        np.logical_and(Neg['Sentence Deviation']<= .47657,
                        Neg['Average Sentence Length']> .17407))))]]
Pos9a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']> .08333, 
                        np.logical_and(Pos['Total Words'] <= .3439, 
                        np.logical_and(Pos['Number of not Contractions'] <= .113636,
                        np.logical_and(Pos['Sentence Deviation']<= .47657,
                        Pos['Average Sentence Length']<= .17407))))]]
Pos9b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']> .08333, 
                        np.logical_and(Pos['Total Words'] <= .3439, 
                        np.logical_and(Pos['Number of not Contractions'] <= .113636,
                        np.logical_and(Pos['Sentence Deviation']<= .47657,
                        Pos['Average Sentence Length']> .17407))))]]

# Note: the "a" branch plots Average Sentence Length against itself.
py.plot(Neg9a['Average Sentence Length'],Neg9a['Average Sentence Length'],'ro')
py.plot(Neg9b['Average Sentence Length'],-Neg9b['Shortest Sentence'],'ro')
py.plot(Pos9a['Average Sentence Length'],Pos9a['Average Sentence Length'],'go')
py.plot(Pos9b['Average Sentence Length'],-Pos9b['Shortest Sentence'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.17407, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[47]:
<matplotlib.lines.Line2D at 0x61300160>

In [48]:
#Ten
# Path: Shortest Sentence > .08333, Total Words <= .3439, Number of not
# Contractions <= .113636, Sentence Deviation > .47657;
# split on Average Sentence Length at .6169 (a: <=, b: >).

Neg10a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']> .08333, 
                        np.logical_and(Neg['Total Words'] <= .3439, 
                        np.logical_and(Neg['Number of not Contractions'] <= .113636,
                        np.logical_and(Neg['Sentence Deviation']> .47657,
                        Neg['Average Sentence Length']<= .6169))))]]
Neg10b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']> .08333, 
                        np.logical_and(Neg['Total Words'] <= .3439, 
                        np.logical_and(Neg['Number of not Contractions'] <= .113636,
                        np.logical_and(Neg['Sentence Deviation']> .47657,
                        Neg['Average Sentence Length']> .6169))))]]
Pos10a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']> .08333, 
                        np.logical_and(Pos['Total Words'] <= .3439, 
                        np.logical_and(Pos['Number of not Contractions'] <= .113636,
                        np.logical_and(Pos['Sentence Deviation']> .47657,
                        Pos['Average Sentence Length']<= .6169))))]]
Pos10b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']> .08333, 
                        np.logical_and(Pos['Total Words'] <= .3439, 
                        np.logical_and(Pos['Number of not Contractions'] <= .113636,
                        np.logical_and(Pos['Sentence Deviation']> .47657,
                        Pos['Average Sentence Length']> .6169))))]]

py.plot(Neg10a['Average Sentence Length'],Neg10a['Total You'],'ro')
py.plot(Neg10b['Average Sentence Length'],-Neg10b['Last Sentence Length'],'ro')
py.plot(Pos10a['Average Sentence Length'],Pos10a['Total You'],'go')
py.plot(Pos10b['Average Sentence Length'],-Pos10b['Last Sentence Length'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.6169, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[48]:
<matplotlib.lines.Line2D at 0x617c4e80>

In [49]:
#Eleven
# Path: Shortest Sentence > .08333, Total Words <= .3439, Number of not
# Contractions > .113636, Total Words <= .21692;
# split on Average Sentence Length at .19357 (a: <=, b: >).

Neg11a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']> .08333, 
                        np.logical_and(Neg['Total Words'] <= .3439, 
                        np.logical_and(Neg['Number of not Contractions'] > .113636,
                        np.logical_and(Neg['Total Words']<= .21692,
                        Neg['Average Sentence Length']<= .19357))))]]
Neg11b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']> .08333, 
                        np.logical_and(Neg['Total Words'] <= .3439, 
                        np.logical_and(Neg['Number of not Contractions'] > .113636,
                        np.logical_and(Neg['Total Words']<= .21692,
                        Neg['Average Sentence Length']> .19357))))]]
Pos11a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']> .08333, 
                        np.logical_and(Pos['Total Words'] <= .3439, 
                        np.logical_and(Pos['Number of not Contractions'] > .113636,
                        np.logical_and(Pos['Total Words']<= .21692,
                        Pos['Average Sentence Length']<= .19357))))]]
Pos11b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']> .08333, 
                        np.logical_and(Pos['Total Words'] <= .3439, 
                        np.logical_and(Pos['Number of not Contractions'] > .113636,
                        np.logical_and(Pos['Total Words']<= .21692,
                        Pos['Average Sentence Length']> .19357))))]]

# Note: the "a" branch plots Average Sentence Length against itself.
py.plot(Neg11a['Average Sentence Length'],Neg11a['Average Sentence Length'],'ro')
py.plot(Neg11b['Average Sentence Length'],-Neg11b['Longest Sentence'],'ro')
py.plot(Pos11a['Average Sentence Length'],Pos11a['Average Sentence Length'],'go')
py.plot(Pos11b['Average Sentence Length'],-Pos11b['Longest Sentence'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.19357, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[49]:
<matplotlib.lines.Line2D at 0x118cf198>

In [50]:
#Twelve
# Path: Shortest Sentence > .08333, Total Words <= .3439, Number of not
# Contractions > .113636, Total Words > .21692;
# split on Total Sentences at .25978 (a: <=, b: >).

Neg12a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']> .08333, 
                        np.logical_and(Neg['Total Words'] <= .3439, 
                        np.logical_and(Neg['Number of not Contractions'] > .113636,
                        np.logical_and(Neg['Total Words']> .21692,
                        Neg['Total Sentences']<= .25978))))]]
Neg12b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']> .08333, 
                        np.logical_and(Neg['Total Words'] <= .3439, 
                        np.logical_and(Neg['Number of not Contractions'] > .113636,
                        np.logical_and(Neg['Total Words']> .21692,
                        Neg['Total Sentences']> .25978))))]]
Pos12a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']> .08333, 
                        np.logical_and(Pos['Total Words'] <= .3439, 
                        np.logical_and(Pos['Number of not Contractions'] > .113636,
                        np.logical_and(Pos['Total Words']> .21692,
                        Pos['Total Sentences']<= .25978))))]]
Pos12b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']> .08333, 
                        np.logical_and(Pos['Total Words'] <= .3439, 
                        np.logical_and(Pos['Number of not Contractions'] > .113636,
                        np.logical_and(Pos['Total Words']> .21692,
                        Pos['Total Sentences']> .25978))))]]

py.plot(Neg12a['Total Sentences'],Neg12a['Average Sentence Length'],'ro')
py.plot(Neg12b['Total Sentences'],-Neg12b['Closest Not'],'ro')
py.plot(Pos12a['Total Sentences'],Pos12a['Average Sentence Length'],'go')
py.plot(Pos12b['Total Sentences'],-Pos12b['Closest Not'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.25978, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[50]:
<matplotlib.lines.Line2D at 0x60fee5f8>

In [51]:
#Thirteen
# Path: Shortest Sentence > .08333, Total Words > .3439, Average Sentence
# Length <= .2522. The final test repeats the same condition, so it collapses
# to a single test and the "a" and "b" subsets are identical.

Neg13a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence'] > .08333, 
                        np.logical_and(Neg['Total Words'] > .3439, 
                        Neg['Average Sentence Length'] <= .2522))]]
Neg13b = Neg13a
Pos13a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence'] > .08333, 
                        np.logical_and(Pos['Total Words'] > .3439, 
                        Pos['Average Sentence Length'] <= .2522))]]
Pos13b = Pos13a

py.plot(Neg13a['Total Words'],Neg13a['Average Sentence Length'],'ro')
#py.plot(Neg13b['Total Words'],-Neg13b['Average Sentence Length'],'ro')
py.plot(Pos13a['Total Words'],Pos13a['Average Sentence Length'],'go')
#py.plot(Pos13b['Total Words'],-Pos13b['Average Sentence Length'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.2522, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[51]:
<matplotlib.lines.Line2D at 0x61ff3400>

In [52]:
#Fifteen
# Path: Shortest Sentence > .08333, Total Words > .3439, Average Sentence
# Length > .2522, Last Sentence Length <= .7165;
# split on First Sentence Length at .2647 (a: <=, b: >).

Neg15a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']> .08333, 
                        np.logical_and(Neg['Total Words'] > .3439, 
                        np.logical_and(Neg['Average Sentence Length'] > .2522,
                        np.logical_and(Neg['Last Sentence Length'] <= .7165,
                        Neg['First Sentence Length'] <= .2647))))]]
Neg15b = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence']> .08333, 
                        np.logical_and(Neg['Total Words'] > .3439, 
                        np.logical_and(Neg['Average Sentence Length'] > .2522,
                        np.logical_and(Neg['Last Sentence Length'] <= .7165,
                        Neg['First Sentence Length'] > .2647))))]]
Pos15a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']> .08333, 
                        np.logical_and(Pos['Total Words'] > .3439, 
                        np.logical_and(Pos['Average Sentence Length'] > .2522,
                        np.logical_and(Pos['Last Sentence Length'] <= .7165,
                        Pos['First Sentence Length'] <= .2647))))]]
Pos15b = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence']> .08333, 
                        np.logical_and(Pos['Total Words'] > .3439, 
                        np.logical_and(Pos['Average Sentence Length'] > .2522,
                        np.logical_and(Pos['Last Sentence Length'] <= .7165,
                        Pos['First Sentence Length'] > .2647))))]]

py.plot(Neg15a['First Sentence Length'],Neg15a['Total Words'],'ro')
py.plot(Neg15b['First Sentence Length'],-Neg15b['Longest Sentence'],'ro')
py.plot(Pos15a['First Sentence Length'],Pos15a['Total Words'],'go')
py.plot(Pos15b['First Sentence Length'],-Pos15b['Longest Sentence'],'go')

plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.2647, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[52]:
<matplotlib.lines.Line2D at 0x65b46160>

In [53]:
#Sixteen
# Path: Shortest Sentence > .08333, Total Words > .3439, Average Sentence
# Length > .2522, Last Sentence Length > .7165. The final test is repeated,
# so it collapses to a single test and the "a" and "b" subsets are identical.

Neg16a = Neg.ix[Neg.index[np.logical_and(Neg['Shortest Sentence'] > .08333, 
                        np.logical_and(Neg['Total Words'] > .3439, 
                        np.logical_and(Neg['Average Sentence Length'] > .2522,
                        Neg['Last Sentence Length'] > .7165)))]]
Neg16b = Neg16a
Pos16a = Pos.ix[Pos.index[np.logical_and(Pos['Shortest Sentence'] > .08333, 
                        np.logical_and(Pos['Total Words'] > .3439, 
                        np.logical_and(Pos['Average Sentence Length'] > .2522,
                        Pos['Last Sentence Length'] > .7165)))]]
Pos16b = Pos16a

py.plot(Neg16a['Average Sentence Length'],Neg16a['Last Sentence Length'],'ro')
#py.plot(Neg16b['Average Sentence Length'],-Neg16b['Last Sentence Length'],'ro')
py.plot(Pos16a['Average Sentence Length'],Pos16a['Last Sentence Length'],'go')
#py.plot(Pos16b['Average Sentence Length'],-Pos16b['Last Sentence Length'],'go')
plt.axhline(y=0, xmin=0, xmax=1, linewidth=1, color = 'k')
plt.axvline(x=.7165, ymin=-1, ymax = 1, linewidth=1, color='k')


Out[53]:
<matplotlib.lines.Line2D at 0x6579edd8>
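
A note on the cells above: they all repeat the same filter-and-plot pattern, differing only in the feature/threshold pairs along the path. As a minimal sketch (the helper name leaf_subset is hypothetical; it assumes the Neg/Pos data frames and feature columns built earlier in this notebook), the nested np.logical_and chains could instead be generated from a list of tests:

import numpy as np

def leaf_subset(df, tests):
    """Rows of df that pass every (column, op, threshold) test on the path."""
    mask = np.ones(len(df), dtype=bool)
    for column, op, threshold in tests:
        if op == '<=':
            mask &= (df[column] <= threshold).values
        else:  # op == '>'
            mask &= (df[column] > threshold).values
    return df[mask]

# Example: the two branches of cell six for the negative reviews.
path6 = [('Shortest Sentence', '<=', .08333),
         ('Average Sentence Length', '>', .39771),
         ('Sentence Deviation', '<=', .3786),
         ('Total number of Nots', '>', .14865)]
Neg6a = leaf_subset(Neg, path6 + [('Longest Sentence', '<=', .20866)])
Neg6b = leaf_subset(Neg, path6 + [('Longest Sentence', '>', .20866)])

Written this way, each cell reduces to one shared path list plus the two final tests, which also removes the duplicated-condition cells noted above.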

Report: communicate the results (20 points)

(1) (5 points) What data did you collect?

(2) (5 points) Why is this topic interesting or important to you? (Motivation)

(3) (5 points) How did you analyse the data?

(4) (5 points) What did you find in the data? (please include figures or tables in the report, but no source code)

Slides (for a 10-minute presentation) (20 points)

  1. (5 points) Motivation for the data collection and why the topic is interesting to you.

  2. (10 points) Communicating results (figures/tables)

  3. (5 points) Storytelling (how do the data, analysis, and results fit together as a story?)


Done

All set!

What do you need to submit?

  • Notebook File: Save this IPython notebook and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are included in the notebook. If you opened the notebook with "ipython notebook --pylab=inline", all the figures and tables should already appear inline.
  • PPT Slides: please prepare PPT slides (for a 10-minute talk) presenting the case study. We will randomly select two teams to present their case studies in class.

  • Report: please prepare a report (fewer than 10 pages) on what you found in the data.

    • What is the relationship between this topic and Business Intelligence?
    • How did you analyse the data?
    • What did you find in the data?
    • What conjectures did you make and how did you support or disprove them using data?
    • Did you find anything surprising in the data?
    • What business decision do you think this data could help inform? Why?

      (please include figures or tables in the report, but no source code)

Please compress all the files into a single zipped file.
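
For example, a minimal sketch using the Python standard library (the folder name case_study_3 below is hypothetical; substitute the folder holding your notebook, slides, and report):

import shutil

# Writes case_study_3.zip containing everything under the case_study_3 folder.
shutil.make_archive('case_study_3', 'zip', root_dir='case_study_3')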

How to submit:

    Send an email to rcpaffenroth@wpi.edu with the subject: "[DS501] Case study 3".