Supreme Court Oral Argument Analysis

The aim of this notebook is to use NLP and machine learning to test whether a Supreme Court Justice's vote can be predicted from oral arguments, which take place months before a decision. Justices speak during two distinct sections of a given case: first while the petitioner (P) is presenting, and then while the respondent (R) is presenting. The hypothesis is that Justices speak differently to each side because they may already know how they will vote, and the language they use in questioning may reveal that leaning.

Some notes:

  1. This notebook is a small sample of research at the Columbia Law School Programming Lab.
  2. It uses traditional machine learning algorithms (Logistic Regression, Naive Bayes, SVMs) rather than deep learning. It would be interesting to see how LSTMs perform.
  3. It does not use word embeddings such as Word2Vec. Embeddings would almost certainly improve predictive performance.
  4. The dataset contains 215,760 utterances from 1,128 different cases before data cleaning (which removes some bad records).

Notebook author: Zack Nagler
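
The hypothesis above can be made concrete with a very simple feature: how much a Justice says while each side is presenting. The following is a minimal sketch using only the standard library, not necessarily a feature this notebook uses. The rows, docket number, and speaker names are hypothetical placeholders rather than records from the dataset.

```python
from collections import defaultdict

# Hypothetical rows: (docket, speaker, presentation_side, utterance).
rows = [
    ("00-000", "JUSTICE A", "PETITIONER", "Why should we read the statute that way?"),
    ("00-000", "JUSTICE A", "PETITIONER", "But that is not what the text says."),
    ("00-000", "JUSTICE A", "RESPONDENT", "Could you clarify your position?"),
    ("00-000", "JUSTICE B", "RESPONDENT", "I do not see how that follows."),
]


def words_by_side(rows):
    """Count whitespace-separated words per (docket, speaker), split by side."""
    totals = defaultdict(lambda: {"PETITIONER": 0, "RESPONDENT": 0})
    for docket, speaker, side, utterance in rows:
        totals[(docket, speaker)][side] += len(utterance.split())
    return dict(totals)


features = words_by_side(rows)
for key in sorted(features):
    print(key, features[key])
```

A large imbalance between the two counts for one Justice in one case is the kind of signal the hypothesis predicts; whether it actually tracks the vote is an empirical question for the models.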


In [2]:
from __future__ import division  # __future__ imports must come first; a no-op on Python 3
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# their contents live in sklearn.model_selection.
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
import sklearn
import statsmodels.formula.api as smf
from textblob import TextBlob

#viz
import matplotlib
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

sns.set(color_codes=True)

#example
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import SGDClassifier
import pymysql

In [3]:
rows = []
with open('full_conversations.txt') as fp:
    for line in fp:
        row = line.split(" +++$+++ ")
        rows.append(row)
print(len(rows))
cols = ["docket",
        "id",
        "after_prev", 
        "speaker", 
        "is_justice", 
        "justice_vote", #want this (5)
        "presentation_side", #want this (6)
        "utterance", #want this (7)
        ]
pd.set_option('display.max_colwidth', 40)
df = pd.DataFrame(rows)
df.columns = cols
df.head()


215760
Out[3]:
   docket  id  after_prev  speaker          is_justice   justice_vote  presentation_side  utterance
0  03-855  1   FALSE       JUSTICE STEVENS  JUSTICE      RESPONDENT    PETITIONER         We will now hear argument in the cas...
1  03-855  2   TRUE        MR. SACKS        NOT JUSTICE  NA            PETITIONER         Justice Stevens, and may it please t...
2  03-855  3   TRUE        JUSTICE KENNEDY  JUSTICE      PETITIONER    PETITIONER         Well, is it your position that whene...
3  03-855  4   TRUE        MR. SACKS        NOT JUSTICE  NA            PETITIONER         No. It is not our position that that...
4  03-855  5   TRUE        JUSTICE BREYER   JUSTICE      PETITIONER    PETITIONER         Why does not having a possessory rig...
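
The loading cell above splits each line on the `" +++$+++ "` separator and keeps whatever comes back, so every utterance retains its trailing newline and a malformed line would produce a row with the wrong number of fields. A stricter parser might look like the sketch below. The helper name `parse_line` and the sample utterance text are hypothetical; the column names and separator are the ones used above.

```python
SEP = " +++$+++ "
COLS = ["docket", "id", "after_prev", "speaker", "is_justice",
        "justice_vote", "presentation_side", "utterance"]


def parse_line(line):
    """Return a dict for a well-formed line, or None if the field count is off."""
    fields = line.rstrip("\n").split(SEP)
    if len(fields) != len(COLS):
        return None
    # Strip stray whitespace around each field, including the utterance.
    return dict(zip(COLS, (field.strip() for field in fields)))


# Hypothetical sample line; the utterance text is a placeholder.
sample = SEP.join(["03-855", "3", "TRUE", "JUSTICE KENNEDY", "JUSTICE",
                   "PETITIONER", "PETITIONER", "Placeholder question text. "]) + "\n"

record = parse_line(sample)
print(record["speaker"], "|", record["utterance"])
print(parse_line("a line without the separator\n"))
```

Rows for which `parse_line` returns `None` can be counted and inspected separately instead of being silently loaded into the DataFrame.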

In [4]:
counts = df[df.is_justice=="JUSTICE"].speaker.value_counts()
counts


Out[4]:
JUSTICE SCALIA       17485
JUSTICE ROBERTS      13835
JUSTICE BREYER       13790
JUSTICE GINSBURG      9792
JUSTICE KENNEDY       8547
JUSTICE SOTOMAYOR     7573
JUSTICE STEVENS       5090
JUSTICE SOUTER        5052
JUSTICE ALITO         5048
JUSTICE KAGAN         3508
JUSTICE O'CONNOR       968
JUSTICE REHNQUIST      598
JUSTICE THOMAS          11
JUDGE SCALIA             5
JUDGE BREYER             3
JUDGE GINSBURG           3
JUDGE ALITO              3
JUDGE SOTOMAYOR          2
JUSTICE ROBERT           2
JUDGE SOUTER             1
JUTICE SCALIA            1
JUSTINE GINSBURG         1
JUTICE BREYER            1
JUSTICE KENNED           1
JUDGE STEVENS            1
JUST SCALIA              1
JUDGE ROBERTS            1
JUSTCIE BREYER           1
dtype: int64

In [5]:
speakers = counts[counts>100].index.values
print(speakers)
df = df[df.speaker.isin(speakers)]
len(df)


['JUSTICE SCALIA' 'JUSTICE ROBERTS' 'JUSTICE BREYER' 'JUSTICE GINSBURG'
 'JUSTICE KENNEDY' 'JUSTICE SOTOMAYOR' 'JUSTICE STEVENS' 'JUSTICE SOUTER'
 'JUSTICE ALITO' 'JUSTICE KAGAN' "JUSTICE O'CONNOR" 'JUSTICE REHNQUIST']
Out[5]:
91286
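
Filtering on `counts>100` drops the misspelled speaker labels seen in Out[4] (for example `JUTICE SCALIA` or `JUSTICE KENNED`) along with the genuinely rare speakers. One alternative, sketched below under stated assumptions, is to fold those variants back into the canonical names with fuzzy matching on the surname. This is not the cleaning step the notebook performs; `normalize_speaker` and the `0.8` cutoff are illustrative choices, and the function should only be applied to rows where `is_justice == "JUSTICE"` so that advocates with similar surnames are not relabeled.

```python
import difflib

# Surnames of the twelve speakers kept by the counts > 100 filter.
CANONICAL = ["SCALIA", "ROBERTS", "BREYER", "GINSBURG", "KENNEDY", "SOTOMAYOR",
             "STEVENS", "SOUTER", "ALITO", "KAGAN", "O'CONNOR", "REHNQUIST"]


def normalize_speaker(raw, cutoff=0.8):
    """Map a noisy speaker label to 'JUSTICE <SURNAME>', or None if nothing is close."""
    parts = raw.split()
    if not parts:
        return None
    surname = parts[-1]
    match = difflib.get_close_matches(surname, CANONICAL, n=1, cutoff=cutoff)
    return "JUSTICE " + match[0] if match else None


for raw in ["JUDGE SCALIA", "JUTICE BREYER", "JUSTICE ROBERT", "JUSTICE THOMAS"]:
    # THOMAS is not in CANONICAL (only 11 utterances), so it maps to None.
    print(raw, "->", normalize_speaker(raw))
```

Recovering these rows adds only a few dozen utterances, so the effect on the models is likely small, but it avoids discarding valid data because of transcription typos.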

In [6]:
pd.set_option('display.max_colwidth', None)  # None disables truncation (-1 is deprecated since pandas 1.0)
df[df.speaker=="JUSTICE ROBERTS"].utterance


Out[6]:
204       Well hear argument first this morning in Case 09-497, Rent-A-Center West v. Jackson. Mr. Friedman. \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
206       But not to the question of which parties have agreed to arbitrate?\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
208       Not the question of which parties have agreed to arbitrate?\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
236       But I suppose that the substance of the agreement -- maybe this is just the - Subject to Final Review same question as Justice Scalia's. I suppose the substance of the agreement is evidence -- could be evidence on the unconscionability at formation.\n                                                                                                                                                                                                                                                                                                                                                                           
238       And that is for the court.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
240       No, the point is -- it's not that. It would be the -- the provisions are so one-sided that you may assume from that that the formation was not voluntary.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
268       So your position is that the arbitrator gets to decide questions of unconscionability, but the court gets to decide whether the arbitrator can do that?\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
290       I would have thought the answer to your -- the answer to your answer would be, well then, the -- youre more likely to win on that question. Obviously, you are going to lose on the gun to the head, but if its simply the economic inequality or whatever, under the State law youre probably going to prevail, and they will say there is a valid contract. I thought the -- your -- your whole point was simply it's all or nothing. The courts get to decide is there a valid contract or is there not. And once they decide there is, then everything else about unconscionability of particular clauses is for the arbitrator.\n
307       Thank you, counsel. Mr. Silverberg. \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
311       I would have thought the issue would be -- it's odd to say, I think, that if you have 10 provisions, some are unconscionable and some are not. The issue would be whether there is unconscionability in the making of the whole contract. In other words, it's the same question I asked your friend: Why isn't it all or nothing? If it was -- if there was no unconscionability in the making, then the arbitrator decides. If there was unconscionability in the making, then -- then the arbitrator doesn't decide anything. Questions 1 through 10, not simply, you know, - Subject to Final Review 1, 8, and 9.\n               
313       No, my point is that once you get past that gateway question of whether the formation of the contract was not unconscionable, then claims that particular provisions were unconscionable are by definition for the arbitrator to decide.\n                                                                                                                                                                                                                                                                                                                                                                                            
362       I thought your --\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
408       No, that can't be right. The -- how can you say there's no problem agreeing to arbitrate, no imbalance in bargaining authority whatever, but then say, oh, but these procedures are unconscionable? It seems to me that the procedures are there, and the party, the employee, whatever, can look at those. And if he says, well, that's unconscionable, you don't sign the agreement as a whole. But once you are -- in for a penny, in for a pound. If you agree to arbitrate, then it's at least for the arbitrator to decide particular provisions, whether theyre unconscionable.\n                                              
410       I know youre arguing in the alternative. But the one argument that we get to pick out the provisions we don't like and say those are unconscionable, but the agreement as a whole is not -- that seems to me illogical.\n                                                                                                                                                                                                                                                                                                                                                                                                             
412       Well, it's a matter -- it may be a matter of State law, but the open question is who gets to decide it.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
414       Arbitrators decide matters of State law all the time.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
446       Does it make a difference in response to Justice Stevens's hypothetical that there is a provision saying the arbitrator will decide the conscionability of all clauses? The arbitrator may decide that clauses 2 and 8 are unconscionable, but if theres an agreement and it's not unconscionable that the arbitrator will decide, then the arbitrator decides all of them, right?\n                                                                                                                                                                                                                                                  
448       Right.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
450       Can I ask you just a follow-up on Justice Breyer's hypothetical to you where he had the first agreement and then the issue to the second? You said youve got to leave the door open. The door open on the second agreement or on the first agreement?\n                                                                                                                                                                                                                                                                                                                                                                               
467       Thank you, counsel. Mr. Friedman, you have 4 minutes remaining. \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
479       We will hear argument next in Case 13534, North Carolina State Board of Dental Examiners v. The Federal Trade Commission. Mr. Mooppan. \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
527       Why is a statute that says do what you want not clearly articulated?\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
529       It says clearly  do what you want so long as it promotes the dental  monopoly of dentists.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
531       So it's not enough just to say, oh, you're a public agent or you're a public official.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
533       But none of that's responsive to the concern that the State policy is purely to displace competition by promoting the selfinterest of the dentist.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
535       They can do that in open meeting. They can do that with let anyone look at the records but \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
577       Thank you, counsel. Mr. Stewart. \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
593       Presumably they also set judicial ethics rules.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
627       What if the  I'm sorry. Please.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
629       Well, what if one of the members of the board is appointed as a fulltime member of the board for a oneyear term and he's  he's called the board state supervisor, is that good enough?\n                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                    ...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
215442    Thank you, counsel. Mr. Bartolomucci. \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
215444    This is our original jurisdiction. I regard the Special Master as more akin to a law clerk than a district judge. We don't defer to somebody who's an aide that we have assigned to help us gather things here. I think on legal questions of intervention we have to decide de novo.\n                                                                                                                                                                                                                                                                                                                                               
215454    Counsel, let me tell you what I'm very worried about. This is our original jurisdiction, a delicate jurisdiction that allows us to resolve disputes between sovereign States. And I look out and I see all sorts of private parties intervening in a way that would give them party status. And I think that's compromising what our original jurisdiction is supposed to be about.\n                                                                                                                                                                                                                                                 
215460    You're -- all of the intervenors, prospective intervenors, they want to make sure North Carolina doesn't lose water, right?\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
215462    Well, their -- well, they want to reduce South Carolina's claim on the water.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
215464    Well -- to the extent they have differing interests, why aren't those interests fully satisfied by amicus participation?\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
215466    Shape the record, but intervention status would give you the right to appeal, right?\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
215468    Right, and appeal the normal case.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
215470    Well, that's my question. If we grant intervention in this type of case and there is no reason it would be three -- I mean, in the next case, it could be 20 different intervenors, and they are filing exceptions every other week that we have to review and adjudicate because they are not bound by whether or not the State that is on their side wants to file exceptions.\n                                                                                                                                                                                                                                                    
215474    And North Carolina, as a sovereign State, can represent the interests of its constituents as it sees fit. You and your fellow prospective intervenors just have to do what citizens do all the time, which is convince North Carolina, one, and you can help them, to get as much water as they can; and, two, when they get it or if they lose it, whatever they are left it, to give it to you, rather than the other parties.\n                                                                                                                                                                                                    
215476    Well, then thats - then I just wonder why you are here in an original action. What you are saying is they have all sorts of different interests, and it just -- they get to skip district court. They get to skip the court of appeals. They can just come right in here, as if they were a State, and participate in the case.\n                                                                                                                                                                                                                                                                                                     
215486    You dont --\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
215499    Well, what's special about it? I mean, let's say I own a little farm on the banks of the Catawba, and I take water out to -- so the cows have something to drink, why does Charlotte get a special status just because they take a lot? I'm affected by how much water runs through there.\n                                                                                                                                                                                                                                                                                                                                          
215501    Well, and that relief will affect how much water is available for me to draw out and use on my farm. That's a compelling interest. I -- you know, in times of drought, this water barely trickles by, and, if it's cut back, the farm is going to go down. It seems to me that, when you say they have a special interest, you are just saying they have got a big interest.\n                                                                                                                                                                                                                                                        
215503    Well, let's say the interest -- the dispute is really in effect between company ABC in North Carolina and company XYZ in South Carolina. I mean, do we -- we would not accept an original action if they sued each other, right?\n                                                                                                                                                                                                                                                                                                                                                                                                    
215505    Do we let them just use the States as, you know, a façade to get into this Court and have their dispute adjudicated here?\n
215510    Wouldn't it -- would it be surprising if the Special Master recommended that all the issue that she was going to address was the relative equitable apportionment between North Carolina and South Carolina, and even though South Carolina wanted an injunction directed against the City of Charlotte, that's up to North Carolina? North Carolina can divvy up its water however it wants.\n
215539    Thank you, Counsel.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
215541    Mr. Browning. \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
215543    Maryland v. Louisiana involved a specific tax on specific companies, and they were allowed to intervene. This is not that. This is a question of how the equitable apportionment of the water is going to be, and North Carolina can do with the water whatever it will. It strikes me as very different than Maryland v. Louisiana.\n                                                                                                                                                                                                                                                                                                
215547    As -- as the allocation would in this case, presumably.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
215559    Counsel, my basic concern is that -- and I will let you finish if there is more to the answer. I'm sorry. Private parties are going to hijack our original jurisdiction, and it was highlighted for me when I read your motion, the motion of private parties for divided argument. Your proposal was that they be divided 10, 10, and 10. You didn't even want to be here. As they view the case and as you view the case, it's got so little to do with the State that the State didn't even want to come here and argue the case.\n                                                                                                
215561    You thought their participation here before this Court on a question in original jurisdiction was more important than yours, and you represent the State.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
215563    Why can't you represent them?\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
215565    They are your constituents. You are the State. You are coming here directly, not even going to district court, and you seem to be ceding your sovereignty over to them.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                             
215571    What is -- what's the interest of North Carolina?\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
215573    You are standing there telling me why Duke has an interest. What's North Carolina's interest?\n
215575    So oppose their intervention.\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
215593    Well, if it's an attack on -- if it's an attack on Charlotte, I would expect the State to be standing there protecting it and not feel that they can't do that without Charlotte, itself, coming into the case.\n                                                                                                                                                                                                                                                                                                                                                                                                                     
215605    Thank you, counsel. Mr. Frederick, you have 2 minutes. \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
Name: utterance, dtype: object

In [7]:
pd.set_option('display.max_colwidth', 40)
df.shape


Out[7]:
(91286, 8)

In [8]:
df.presentation_side.value_counts()


Out[8]:
PETITIONER    46705
RESPONDENT    30639
NA            13942
dtype: int64

In [9]:
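sides = ["PETITIONER","RESPONDENT"]
# Keep only justice utterances with a known presentation side and a known vote.
# .copy() makes the later column assignments act on an independent frame.
df = df[(df.presentation_side.isin(sides)) & (df.justice_vote.isin(sides)) & (df.is_justice=="JUSTICE") ].copy()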

In [10]:
df.shape


Out[10]:
(76778, 8)

In [11]:
df.justice_vote.value_counts()


Out[11]:
PETITIONER    48206
RESPONDENT    28572
dtype: int64

In [12]:
df.head()


Out[12]:
   docket  id  after_prev  speaker          is_justice  justice_vote  presentation_side  utterance
0  03-855  1   FALSE       JUSTICE STEVENS  JUSTICE     RESPONDENT    PETITIONER         We will now hear argument in the cas...
2  03-855  3   TRUE        JUSTICE KENNEDY  JUSTICE     PETITIONER    PETITIONER         Well, is it your position that whene...
4  03-855  5   TRUE        JUSTICE BREYER   JUSTICE     PETITIONER    PETITIONER         Why does not having a possessory rig...
6  03-855  7   TRUE        JUSTICE BREYER   JUSTICE     PETITIONER    PETITIONER         No, I'm just thinking, that suppose ...
8  03-855  9   TRUE        JUSTICE KENNEDY  JUSTICE     PETITIONER    PETITIONER         All right. So if you were to say the...

In [13]:
df.presentation_side = df.presentation_side.map({"PETITIONER": 1, "RESPONDENT": 0})
df.justice_vote = df.justice_vote.map({"PETITIONER": 1, "RESPONDENT": 0})

In [14]:
df.describe()


Out[14]:
       justice_vote  presentation_side
count  76778.000000  76778.000000
mean   0.627862      0.603350
std    0.483378      0.489205
min    0.000000      0.000000
25%    0.000000      0.000000
50%    1.000000      1.000000
75%    1.000000      1.000000
max    1.000000      1.000000

In [15]:
df.head()
# def polarize(data):
#     return TextBlob(data).polarity

# df["polarity"] = df.utterance.apply(polarize)


Out[15]:
   docket  id  after_prev  speaker          is_justice  justice_vote  presentation_side  utterance
0  03-855  1   FALSE       JUSTICE STEVENS  JUSTICE     0             1                  We will now hear argument in the cas...
2  03-855  3   TRUE        JUSTICE KENNEDY  JUSTICE     1             1                  Well, is it your position that whene...
4  03-855  5   TRUE        JUSTICE BREYER   JUSTICE     1             1                  Why does not having a possessory rig...
6  03-855  7   TRUE        JUSTICE BREYER   JUSTICE     1             1                  No, I'm just thinking, that suppose ...
8  03-855  9   TRUE        JUSTICE KENNEDY  JUSTICE     1             1                  All right. So if you were to say the...

In [16]:
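rows = []
# Build one row per (case, justice): the vote plus everything that justice said
# during each side's presentation, joined into a single string per side.
for docket in df.docket.unique():
    cond_a = (df.docket == docket)
    for speaker in speakers:
        cond_b = (df.speaker == speaker)
        justice_df = df[cond_a & cond_b]
        # Skip justices who did not speak during both presentations.
        if len(justice_df.presentation_side.unique()) != 2:
            continue
        # The vote is assumed constant within a (case, justice) pair; take the first.
        justice_vote = justice_df.justice_vote.head(1).values[0]
        row = [docket, speaker, justice_vote]
        for presentation_side in [0, 1]:
            # 0 = respondent presenting, 1 = petitioner presenting
            cond_c = (justice_df.presentation_side == presentation_side)
            utterances = justice_df[cond_c].utterance
            # Flatten to one string; drop newlines and '--' interruption markers.
            text = " ".join(utterances.tolist()).replace('\n', ' ').replace('--', '')
            row.append(text)
        rows.append(row)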
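
The nested loop above re-filters the full frame for every (docket, speaker) pair. As a hedged alternative, the same aggregation can be sketched with a pandas `groupby` followed by `unstack`. The toy frame and the helper `clean_join` below are made up for illustration, and grouping on `justice_vote` assumes a justice casts a single vote per case.

```python
import pandas as pd

# Toy stand-in for the filtered utterance frame (all values are made up).
toy = pd.DataFrame({
    "docket": ["A", "A", "A", "A", "B"],
    "speaker": ["J1", "J1", "J1", "J2", "J1"],
    "justice_vote": [1, 1, 1, 0, 0],
    "presentation_side": [0, 1, 1, 1, 0],
    "utterance": ["Why? --\n", "I see.\n", "Go on.\n", "No.\n", "Hmm.\n"],
})

def clean_join(utterances):
    # Same cleanup as the loop: join, then drop newlines and '--' markers.
    return " ".join(utterances).replace("\n", " ").replace("--", "")

keys = ["docket", "speaker", "justice_vote"]
wide = (toy.groupby(keys + ["presentation_side"])["utterance"]
           .apply(clean_join)
           .unstack("presentation_side")   # one text column per side
           .dropna()                        # keep justices who spoke on both sides
           .rename(columns={0: "pres0_text", 1: "pres1_text"})
           .reset_index())

print(wide)
```

Only the (A, J1) pair survives here, because J2 and the J1 row in case B each spoke during one presentation only. That mirrors the `!= 2` check in the loop.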

In [17]:
cols = ["docket",
        "speaker",
        "justice_vote",
        "pres0_text",
        "pres1_text", 
        ]

print(len(rows))
df2 = pd.DataFrame(rows, columns=cols)
df2.head()


3353
Out[17]:
   docket  speaker           justice_vote  pres0_text                               pres1_text
0  09-497  JUSTICE SCALIA    1             Is that is that right? Is the arbit...   I guess you could argue that on its ...
1  09-497  JUSTICE ROBERTS   1             Thank you, counsel. Mr. Silverberg. ...  Well hear argument first this mornin...
2  09-497  JUSTICE BREYER    0             What is what I'm not sure about wha...   Yes, that's thats true. The thing I...
3  09-497  JUSTICE GINSBURG  0             Why is that why is - Subject to Fin...   But if if fraud in the inducement, ...
4  09-497  JUSTICE KENNEDY   1             After this After this suit was fil...    Why is it post-formation? Arguably, ...

In [18]:
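###### Naive Bayes ########
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB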

X0 = df2.pres0_text
y0 = df2.justice_vote

X1 = df2.pres1_text
y1 = df2.justice_vote

nb_pipeline = Pipeline([('vect', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
])

nb_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                 'vect__stop_words': ["english",None],
                 'clf__alpha': (1e-2, 1e-3),
}


nb_gs = GridSearchCV(nb_pipeline, nb_parameters, n_jobs=-1)
nb0_gs = nb_gs.fit(X0,y0)


nb0_best_parameters, nb0_score, _ = max(nb0_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(nb0_best_parameters.keys()):
    print("%s: %r" % (param_name, nb0_best_parameters[param_name]))
print("nb0 score: " + str(nb0_score))
    
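# GridSearchCV.fit returns the same object, so nb1_gs and nb0_gs are aliases;
# the nb0 results above were read before this refit overwrote them.
nb1_gs = nb_gs.fit(X1,y1)
nb1_best_parameters, nb1_score, _ = max(nb1_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(nb1_best_parameters.keys()):
    print("%s: %r" % (param_name, nb1_best_parameters[param_name]))
print("nb1 score: " + str(nb1_score))

# Majority-class baseline: share of rows belonging to the most common vote.
print("Dummy score: " + str(y0[y0==y0.mode().values[0]].size/y0.size))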


clf__alpha: 0.01
vect__ngram_range: (1, 3)
vect__stop_words: None
nb0 score: 0.601550849985
clf__alpha: 0.01
vect__ngram_range: (1, 4)
vect__stop_words: None
nb1 score: 0.583954667462
Dummy score: 0.603638532657
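
The `max(... grid_scores_ ...)` idiom above belongs to the older `sklearn.grid_search` module that this notebook imports. As a hedged sketch, assuming a newer scikit-learn in which `GridSearchCV` lives in `sklearn.model_selection`, the same information can be read from the `best_params_` and `best_score_` attributes. The toy texts and labels are made up purely to exercise the API and say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus; the labels are arbitrary.
texts = ["why is that", "thank you counsel", "why not", "thank you",
         "why would that be", "thank you very much", "why so", "thanks counsel"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])
param_grid = {"vect__ngram_range": [(1, 1), (1, 2)],
              "clf__alpha": [1e-2, 1e-3]}

# cv=2 keeps both classes present in each fold of this tiny sample.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

best_params = search.best_params_   # dict of the winning parameter values
best_score = search.best_score_     # mean cross-validated accuracy for them

for param_name in sorted(best_params):
    print("%s: %r" % (param_name, best_params[param_name]))
print("best score: " + str(best_score))
```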

In [19]:
###### Support Vector Machine ########
from sklearn.linear_model import SGDClassifier

X0 = df2.pres0_text
y0 = df2.justice_vote

X1 = df2.pres1_text
y1 = df2.justice_vote

sv_pipeline = Pipeline([('vect', TfidfVectorizer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5, random_state=42)),
])

sv_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
              'vect__stop_words': ["english",None],
}



sv_gs = GridSearchCV(sv_pipeline, sv_parameters, n_jobs=-1)

sv0_gs = sv_gs.fit(X0,y0)


sv0_best_parameters, sv0_score, _ = max(sv0_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(sv0_best_parameters.keys()):
    print("%s: %r" % (param_name, sv0_best_parameters[param_name]))
print("sv0 score: " + str(sv0_score))
    
# As above, sv1_gs is the same object as sv0_gs, refit on the pres1 text.
sv1_gs = sv_gs.fit(X1,y1)
sv1_best_parameters, sv1_score, _ = max(sv1_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(sv1_best_parameters.keys()):
    print("%s: %r" % (param_name, sv1_best_parameters[param_name]))
print("sv1 score: " + str(sv1_score))

# Majority-class baseline: share of rows belonging to the most common vote.
print("Dummy score: " + str(y0[y0==y0.mode().values[0]].size/y0.size))


vect__ngram_range: (1, 2)
vect__stop_words: 'english'
sv0 score: 0.60572621533
vect__ngram_range: (1, 1)
vect__stop_words: None
sv1 score: 0.605427974948
Dummy score: 0.603638532657

In [20]:
###### LOGISTIC REGRESSION ########

X0 = df2.pres0_text
y0 = df2.justice_vote

X1 = df2.pres1_text
y1 = df2.justice_vote

lr_pipeline = Pipeline([('vect', TfidfVectorizer()),
                     ('clf', LogisticRegression()),
])

lr_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
              'vect__stop_words': ["english",None],
}

lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1)

lr0_gs = lr_gs.fit(X0,y0)
lr0_best_parameters, lr0_score, _ = max(lr0_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(lr0_best_parameters.keys()):
    print("%s: %r" % (param_name, lr0_best_parameters[param_name]))
print("lr0 score: " + str(lr0_score))
    
# As above, lr1_gs is the same object as lr0_gs, refit on the pres1 text.
lr1_gs = lr_gs.fit(X1,y1)
lr1_best_parameters, lr1_score, _ = max(lr1_gs.grid_scores_, key=lambda x: x[1])
for param_name in sorted(lr1_best_parameters.keys()):
    print("%s: %r" % (param_name, lr1_best_parameters[param_name]))
print("lr1 score: " + str(lr1_score))

# Majority-class baseline: share of rows belonging to the most common vote.
print("Dummy score: " + str(y0[y0==y0.mode().values[0]].size/y0.size))


vect__ngram_range: (1, 3)
vect__stop_words: None
lr0 score: 0.607217417238
vect__ngram_range: (1, 4)
vect__stop_words: None
lr1 score: 0.614076946018
Dummy score: 0.603638532657
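
To make the pooled results of cells 18 to 20 easier to compare, the sketch below subtracts the majority-class baseline from each recorded score. The numbers are copied from the outputs above; the dictionary and variable names are illustrative.

```python
# Scores copied from the recorded output of cells 18-20.
dummy = 0.603638532657
scores = {
    "nb0": 0.601550849985, "nb1": 0.583954667462,
    "sv0": 0.60572621533,  "sv1": 0.605427974948,
    "lr0": 0.607217417238, "lr1": 0.614076946018,
}

# Margin of each model over the majority-class baseline.
margins = {name: score - dummy for name, score in scores.items()}
best_model = max(margins, key=margins.get)
beats_dummy = sorted(name for name, m in margins.items() if m > 0)

for name in sorted(margins, key=margins.get, reverse=True):
    print("%s: %+.4f" % (name, margins[name]))
```

On these numbers the largest margin is roughly one percentage point (logistic regression on `pres1_text`), and both Naive Bayes runs fall below the baseline.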

In [21]:
for speaker in speakers:
    subframe = df2[df2.speaker==speaker]
    
    # Skip justices with fewer than 10 (case, justice) rows.
    if len(subframe) < 10:
        continue
    print(speaker + ": " + str(len(subframe)))
    X = subframe.pres0_text
    y = subframe.justice_vote

    
    ## Naive Bayes
    nb_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', MultinomialNB()),
    ])

    nb_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
                  'clf__alpha': (1e-2, 1e-3,1e-4),
    }

    # Tried ngrams up to (1,7) and they didn't beat (1,4)
    nb_gs = GridSearchCV(nb_pipeline, nb_parameters, n_jobs=-1)
    nb_gs = nb_gs.fit(X,y)
    nb_best_parameters, nb_score, _ = max(nb_gs.grid_scores_, key=lambda x: x[1])
    
    
    #### Support Vector
    sv_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5, random_state=42)),
    ])
    
    sv_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
    }
    

    sv_gs = GridSearchCV(sv_pipeline, sv_parameters, n_jobs=-1)
    sv_gs = sv_gs.fit(X,y)
    sv_best_parameters, sv_score, _ = max(sv_gs.grid_scores_, key=lambda x: x[1])

    
    #### Logistic Regression
    lr_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', LogisticRegression()),
    ])
    
    lr_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
    }
    

    lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1)
    lr_gs = lr_gs.fit(X,y)
    lr_best_parameters, lr_score, _ = max(lr_gs.grid_scores_, key=lambda x: x[1])
    

    
#     for param_name in sorted(parameters.keys()):
#         print("%s: %r" % (param_name, nb_best_parameters[param_name]))


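    print("Naive Bayes score :" + str(nb_score))
    print("Support Vector score :" + str(sv_score))
    # Reports lr_score. The recorded output below was produced with sv_score
    # on this line, so its Logistic Regression values repeat the SVM ones.
    print("Logistic Regression score :" + str(lr_score))
    # Majority-class baseline for this justice.
    print("Dummy score: " + str(y[y==y.mode().values[0]].size/y.size))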


JUSTICE SCALIA: 462
Naive Bayes score :0.625541125541
Support Vector score :0.623376623377
Logistic Regression score :0.623376623377
Dummy score: 0.614718614719
JUSTICE ROBERTS: 555
Naive Bayes score :0.614414414414
Support Vector score :0.625225225225
Logistic Regression score :0.625225225225
Dummy score: 0.636036036036
JUSTICE BREYER: 404
Naive Bayes score :0.574257425743
Support Vector score :0.589108910891
Logistic Regression score :0.589108910891
Dummy score: 0.569306930693
JUSTICE GINSBURG: 460
Naive Bayes score :0.591304347826
Support Vector score :0.59347826087
Logistic Regression score :0.59347826087
Dummy score: 0.567391304348
JUSTICE KENNEDY: 393
Naive Bayes score :0.671755725191
Support Vector score :0.643765903308
Logistic Regression score :0.643765903308
Dummy score: 0.653944020356
JUSTICE SOTOMAYOR: 246
Naive Bayes score :0.605691056911
Support Vector score :0.569105691057
Logistic Regression score :0.569105691057
Dummy score: 0.59756097561
JUSTICE STEVENS: 207
Naive Bayes score :0.6038647343
Support Vector score :0.574879227053
Logistic Regression score :0.574879227053
Dummy score: 0.589371980676
JUSTICE SOUTER: 161
Naive Bayes score :0.633540372671
Support Vector score :0.652173913043
Logistic Regression score :0.652173913043
Dummy score: 0.627329192547
JUSTICE ALITO: 243
Naive Bayes score :0.547325102881
Support Vector score :0.543209876543
Logistic Regression score :0.543209876543
Dummy score: 0.551440329218
JUSTICE KAGAN: 155
Naive Bayes score :0.593548387097
Support Vector score :0.6
Logistic Regression score :0.6
Dummy score: 0.593548387097
JUSTICE O'CONNOR: 35
Naive Bayes score :0.657142857143
Support Vector score :0.685714285714
Logistic Regression score :0.685714285714
Dummy score: 0.685714285714
JUSTICE REHNQUIST: 32
Naive Bayes score :0.6875
Support Vector score :0.6875
Logistic Regression score :0.6875
Dummy score: 0.59375
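
A quick tally of the output above shows how many justices each model beats the majority-class baseline for on `pres0_text`. The values are copied from that output. The Logistic Regression lines are left out because in the recorded output they are identical to the Support Vector lines. Names such as `results` are illustrative.

```python
# (rows, Naive Bayes, Support Vector, Dummy) copied from the output above.
results = {
    "SCALIA":    (462, 0.625541125541, 0.623376623377, 0.614718614719),
    "ROBERTS":   (555, 0.614414414414, 0.625225225225, 0.636036036036),
    "BREYER":    (404, 0.574257425743, 0.589108910891, 0.569306930693),
    "GINSBURG":  (460, 0.591304347826, 0.59347826087,  0.567391304348),
    "KENNEDY":   (393, 0.671755725191, 0.643765903308, 0.653944020356),
    "SOTOMAYOR": (246, 0.605691056911, 0.569105691057, 0.59756097561),
    "STEVENS":   (207, 0.6038647343,   0.574879227053, 0.589371980676),
    "SOUTER":    (161, 0.633540372671, 0.652173913043, 0.627329192547),
    "ALITO":     (243, 0.547325102881, 0.543209876543, 0.551440329218),
    "KAGAN":     (155, 0.593548387097, 0.6,            0.593548387097),
    "O'CONNOR":  (35,  0.657142857143, 0.685714285714, 0.685714285714),
    "REHNQUIST": (32,  0.6875,         0.6875,         0.59375),
}

# A tie with the baseline does not count as a win.
nb_wins = sorted(j for j, (n, nb, sv, d) in results.items() if nb > d)
sv_wins = sorted(j for j, (n, nb, sv, d) in results.items() if sv > d)

print("Naive Bayes beats the baseline for %d of %d justices" % (len(nb_wins), len(results)))
print("Support Vector beats the baseline for %d of %d justices" % (len(sv_wins), len(results)))
```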

In [22]:
for speaker in speakers:
    subframe = df2[df2.speaker==speaker]
    
    # Skip justices with fewer than 10 (case, justice) rows.
    if len(subframe) < 10:
        continue
    print(speaker + ": " + str(len(subframe)))
    X = subframe.pres1_text
    y = subframe.justice_vote

    
    ## Naive Bayes
    nb_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', MultinomialNB()),
    ])

    nb_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
                  'clf__alpha': (1e-2, 1e-3,1e-4),
    }

    # Tried ngrams up to (1,7) and they didn't beat (1,4)
    nb_gs = GridSearchCV(nb_pipeline, nb_parameters, n_jobs=-1)
    nb_gs = nb_gs.fit(X,y)
    nb_best_parameters, nb_score, _ = max(nb_gs.grid_scores_, key=lambda x: x[1])
    
    
    #### Support Vector
    sv_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5, random_state=42)),
    ])
    
    sv_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
    }
    

    sv_gs = GridSearchCV(sv_pipeline, sv_parameters, n_jobs=-1)
    sv_gs = sv_gs.fit(X,y)
    sv_best_parameters, sv_score, _ = max(sv_gs.grid_scores_, key=lambda x: x[1])

    
    #### Logistic Regression
    lr_pipeline = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', LogisticRegression()),
    ])
    
    lr_parameters = {'vect__ngram_range': [(1, 1),(1,2),(1,3),(1,4)],
                  'vect__stop_words': ["english",None],
    }
    

    lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1)
    lr_gs = lr_gs.fit(X,y)
    lr_best_parameters, lr_score, _ = max(lr_gs.grid_scores_, key=lambda x: x[1])
    

    
#     for param_name in sorted(parameters.keys()):
#         print("%s: %r" % (param_name, nb_best_parameters[param_name]))


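    print("Naive Bayes score :" + str(nb_score))
    print("Support Vector score :" + str(sv_score))
    # Reports lr_score. The recorded output below was produced with sv_score
    # on this line, so its Logistic Regression values repeat the SVM ones.
    print("Logistic Regression score :" + str(lr_score))
    # Majority-class baseline for this justice.
    print("Dummy score: " + str(y[y==y.mode().values[0]].size/y.size))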


JUSTICE SCALIA: 462
Naive Bayes score :0.577922077922
Support Vector score :0.612554112554
Logistic Regression score :0.612554112554
Dummy score: 0.614718614719
JUSTICE ROBERTS: 555
Naive Bayes score :0.637837837838
Support Vector score :0.643243243243
Logistic Regression score :0.643243243243
Dummy score: 0.636036036036
JUSTICE BREYER: 404
Naive Bayes score :0.561881188119
Support Vector score :0.576732673267
Logistic Regression score :0.576732673267
Dummy score: 0.569306930693
JUSTICE GINSBURG: 460
Naive Bayes score :0.571739130435
Support Vector score :0.604347826087
Logistic Regression score :0.604347826087
Dummy score: 0.567391304348
JUSTICE KENNEDY: 393
Naive Bayes score :0.64631043257
Support Vector score :0.656488549618
Logistic Regression score :0.656488549618
Dummy score: 0.653944020356
JUSTICE SOTOMAYOR: 246
Naive Bayes score :0.593495934959
Support Vector score :0.565040650407
Logistic Regression score :0.565040650407
Dummy score: 0.59756097561
JUSTICE STEVENS: 207
Naive Bayes score :0.570048309179
Support Vector score :0.545893719807
Logistic Regression score :0.545893719807
Dummy score: 0.589371980676
JUSTICE SOUTER: 161
Naive Bayes score :0.60248447205
Support Vector score :0.590062111801
Logistic Regression score :0.590062111801
Dummy score: 0.627329192547
JUSTICE ALITO: 243
Naive Bayes score :0.654320987654
Support Vector score :0.61316872428
Logistic Regression score :0.61316872428
Dummy score: 0.551440329218
JUSTICE KAGAN: 155
Naive Bayes score :0.567741935484
Support Vector score :0.593548387097
Logistic Regression score :0.593548387097
Dummy score: 0.593548387097
JUSTICE O'CONNOR: 35
Naive Bayes score :0.685714285714
Support Vector score :0.685714285714
Logistic Regression score :0.685714285714
Dummy score: 0.685714285714
JUSTICE REHNQUIST: 32
Naive Bayes score :0.65625
Support Vector score :0.625
Logistic Regression score :0.625
Dummy score: 0.59375
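
Several of the per-justice samples are small (35 rows for O'Connor, 32 for Rehnquist), so a gap of a few points over the baseline may be noise. One rough gauge, offered here as an assumption rather than as part of the original analysis, is the binomial standard error sqrt(p(1-p)/n) of the baseline accuracy. Grid-searched cross-validation scores are not independent coin flips, so the ratio below is only a sanity check. The helper name `baseline_standard_error` is hypothetical.

```python
import math

def baseline_standard_error(p, n):
    # Binomial standard error of an accuracy p measured on n rows.
    return math.sqrt(p * (1.0 - p) / n)

# (rows, Naive Bayes score, Dummy score) copied from the pres1_text output above.
examples = {
    "ALITO":     (243, 0.654320987654, 0.551440329218),
    "REHNQUIST": (32,  0.65625,        0.59375),
}

ratios = {}
for justice, (n, nb, dummy) in sorted(examples.items()):
    se = baseline_standard_error(dummy, n)
    # Margin over the baseline, expressed in standard errors.
    ratios[justice] = (nb - dummy) / se
    print("%s: margin %+.3f, SE %.3f, ratio %.1f"
          % (justice, nb - dummy, se, ratios[justice]))
```

Under this rough measure, the Alito margin is several standard errors wide, while the Rehnquist margin is smaller than one standard error.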

In [23]:
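from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Sanity check of the text-classification pipeline on the 20 newsgroups dataset.
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']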
twenty_train = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)

twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)

# count_vect = CountVectorizer()
# X_train_counts = count_vect.fit_transform(twenty_train.data)
# X_train_counts.shape

# tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
# X_train_tf = tf_transformer.transform(X_train_counts)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)


Out[23]:
0.83488681757656458
