First, let's analyze some text...
...
Tagging and parsing into trees are different things; a quick contrast below:
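A tagger returns a flat list of (word, tag) pairs, while a parser nests those tags into phrases. A minimal sketch with NLTK's built-in tagger (assumes the standard NLTK tokenizer and tagger data are downloaded):
In [ ]:
import nltk
# tagging: flat (word, tag) pairs, no structure
print nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps"))
# parsing (demoed below) nests these same tags into a tree of phrases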
Other important words:
NLTK is the mother of all NLP toolkits.
There are so many parsers:
In [ ]:
sent = "Each of us is full of shit in our own special way"
# setup display for demo
%matplotlib inline
import os
os.environ['DISPLAY'] = 'localhost:1'
In [ ]:
from stat_parser import Parser
parser = Parser()
tree = parser.parse(sent)  # returns an nltk Tree instance
tree
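Since the result is a plain nltk Tree, the usual Tree methods apply, for example:
In [ ]:
tree.pprint()        # bracketed (S (NP ...) (VP ...)) text form
print tree.height()  # depth of the parse tree
# tree.draw()        # opens a Tk window with the rendered tree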
In [ ]:
from textblob import TextBlob
blob = TextBlob(sent)
blob.parse()
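TextBlob also exposes the tags and noun phrases directly, which is often all you need:
In [ ]:
print blob.tags          # (word, POS) tuples
print blob.noun_phrases  # detected noun phrases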
In [ ]:
import nltk
# assumes malt.jar and the pre-trained engmalt.linear-1.7.mco model
# are available in the current working directory
mp = nltk.parse.malt.MaltParser(os.getcwd(),
                                model_filename="engmalt.linear-1.7.mco")
mp.parse_one(sent.split()).tree()
In [ ]:
from pattern.en import parse, pprint
s = parse(sent,
          tokenize=True,   # tokenize the input, i.e. split punctuation from words
          tags=True,       # find part-of-speech tags
          chunks=True,     # find chunk tags, e.g. "the black cat" = NP = noun phrase
          relations=True,  # find relations between chunks
          lemmata=True,    # find word lemmata
          light=False)
pprint(s)
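pprint() renders a table, but parse() itself returns a slash-delimited tagged string; its split() method yields one annotation list per token:
In [ ]:
print s.split()  # nested lists: sentences -> tokens -> annotations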
In [ ]:
from spacy.en import English
parser = English()
parsedData = parser(unicode(sent))
In [ ]:
for i, token in enumerate(parsedData):
    print "original:", token.orth, token.orth_
    print "lowercased:", token.lower, token.lower_
    print "lemma:", token.lemma, token.lemma_
    print "shape:", token.shape, token.shape_
    print "prefix:", token.prefix, token.prefix_
    print "suffix:", token.suffix, token.suffix_
    print "log probability:", token.prob
    print "Brown cluster id:", token.cluster
    print "----------------------------------------"
    if i > 1:
        break
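The same pipeline also runs a dependency parse; each token carries its relation label and its head (attribute names as in this era's spaCy API):
In [ ]:
for token in parsedData:
    print token.orth_, token.pos_, token.dep_, "<-", token.head.orth_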
In [ ]:
from visualize_word_graph import draw_graph
draw_graph("dog")
In [ ]:
draw_graph("noise", hypernym=True)
In [ ]:
bad_sounds = ['The sound in the place is terrible.',
              'dining with clatter and the occasional smell of BMW exausts',
              'Also, the acoustics are not conducive to having any sort of conversation.']
not_bad_sounds = ["not to sound like a snob",
                  "at your table and you can tune the sound to whichever game you're interested in",
                  "oh god I sound old!"]
In [ ]:
from pattern.en import parse, pprint
def print_parts(sents):
    for sent in sents:
        s = parse(sent,
                  tokenize=True,   # tokenize the input, i.e. split punctuation from words
                  tags=True,       # find part-of-speech tags
                  chunks=True,     # find chunk tags, e.g. "the black cat" = NP = noun phrase
                  relations=True,  # find relations between chunks
                  lemmata=True,    # find word lemmata
                  light=False)
        print sent
        pprint(s)

sents = bad_sounds + not_bad_sounds
print_parts(sents)
Penn Treebank Project tag and chunk guide
Tag | Description | Example |
--- | --- | --- |
CC | conjunction, coordinating | and, or, but |
CD | cardinal number | five, three, 13% |
DT | determiner | the, a, these |
EX | existential there | there were six boys |
FW | foreign word | mais |
IN | conjunction, subordinating or preposition | of, on, before, unless |
JJ | adjective | nice, easy |
JJR | adjective, comparative | nicer, easier |
JJS | adjective, superlative | nicest, easiest |
LS | list item marker | |
MD | verb, modal auxiliary | may, should |
NN | noun, singular or mass | tiger, chair, laughter |
NNS | noun, plural | tigers, chairs, insects |
NNP | noun, proper singular | Germany, God, Alice |
NNPS | noun, proper plural | we met two Christmases ago |
PDT | predeterminer | both his children |
POS | possessive ending | 's |
PRP | pronoun, personal | me, you, it |
PRP$ | pronoun, possessive | my, your, our |
RB | adverb | extremely, loudly, hard |
RBR | adverb, comparative | better |
RBS | adverb, superlative | best |
RP | adverb, particle | about, off, up |
SYM | symbol | % |
TO | infinitival to | what to do? |
UH | interjection | oh, oops, gosh |
VB | verb, base form | think |
VBZ | verb, 3rd person singular present | she thinks |
VBP | verb, non-3rd person singular present | I think |
VBD | verb, past tense | they thought |
VBN | verb, past participle | a sunken ship |
VBG | verb, gerund or present participle | thinking is fun |
WDT | wh-determiner | which, whatever, whichever |
WP | wh-pronoun, personal | what, who, whom |
WP$ | wh-pronoun, possessive | whose, whosever |
WRB | wh-adverb | where, when |
. | punctuation mark, sentence closer | .;?* |
, | punctuation mark, comma | , |
: | punctuation mark, colon | : |
( | contextual separator, left paren | ( |
) | contextual separator, right paren | ) |
Tag | Description | Words | Example | % |
--- | --- | --- | --- | --- |
NP | noun phrase | DT+RB+JJ+NN + PR | the strange bird | 51 |
PP | prepositional phrase | TO+IN | in between | 19 |
VP | verb phrase | RB+MD+VB | was looking | 9 |
ADVP | adverb phrase | RB | also | 6 |
ADJP | adjective phrase | CC+RB+JJ | warm and cosy | 3 |
SBAR | subordinating conjunction | IN | whether or not | 3 |
PRT | particle | RP | up the stairs | 1 |
INTJ | interjection | UH | hello | 0 |
In [ ]:
from pattern.en import parsetree
from pattern.search import search
for sent in sents:
    t = parsetree(sent)
    print
    print sent
    print "Tagged Sent:", t
    print "Verbs:", search('VB*', t)  # all verb forms
    print "ADJP:", search('ADJP', t)  # adjective phrases
    print "Nouns:", search('NN', t)   # nouns
In [ ]:
from nltk.corpus import wordnet as wn
from pattern.en import parsetree
from pattern.search import taxonomy, WordNetClassifier, search
taxonomy.classifiers.append(WordNetClassifier())
def get_parts(word, pos, recursive=False):
    parts = [word, ]
    parts += taxonomy.children(word, pos=pos, recursive=recursive)
    parts += taxonomy.parents(word, pos=pos, recursive=recursive)
    return parts

def word_search(t, word, pos):
    parts = get_parts(word, pos)
    results = search(pos, t)
    for result in results:
        # print result.string, parts
        if any(x in result.string.split() for x in parts):
            return True
    return False

def run_a_rule(sent, word, pos):
    t = parsetree(sent)
    return word_search(t, word, pos)
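A quick sanity check of what the taxonomy expansion returns (the exact contents depend on your local WordNet data):
In [ ]:
print get_parts('noise', 'NN')[:10]  # 'noise' plus its WordNet children and parents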
In [ ]:
print "1. 'sound' is a NN"
print run_a_rule(sents[0], 'noise', 'NN')
print "2. clatter is a NN"
print run_a_rule(sents[1], 'noise', 'NN')
print "3. acoustics is NNS and RB Not"
print run_a_rule(sents[2], 'acoustics', 'NNS') and run_a_rule(sents[2], 'not', 'RB')
print "4. sound is a VB"
print run_a_rule(sents[3], 'noise', 'VB*')
print "5. Sounds is JJ"
print run_a_rule(sents[4], 'sound', 'JJ')
print "6. sound is VBP"
print run_a_rule(sents[5], 'noise', 'VB*')
In [ ]:
def ext_func(tgt):
    return bool(not (run_a_rule(tgt, 'noise', 'VB*') and not run_a_rule(tgt, 'sound', 'JJ'))
                and (run_a_rule(tgt, 'noise', 'NN') or run_a_rule(tgt, 'acoustics', 'NNS') or
                     (run_a_rule(tgt, 'acoustics', 'NNS') and run_a_rule(tgt, 'not', 'RB'))))

print "bad noises in review:"
for sent in bad_sounds:
    print "\t" + sent
    assert ext_func(sent)
print
print "no mention of bad noises:"
for sent in not_bad_sounds:
    print "\t" + sent
    assert not ext_func(sent)
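For readability, the same predicate can be flattened with De Morgan's law; note that NNS(acoustics) or (NNS(acoustics) and RB(not)) reduces to just NNS(acoustics), so an equivalent sketch is:
In [ ]:
def ext_func_simple(tgt):
    # a verb use of a noise word disqualifies, unless 'sound' also appears as an adjective
    sound_as_verb = run_a_rule(tgt, 'noise', 'VB*') and not run_a_rule(tgt, 'sound', 'JJ')
    # a bad-noise mention needs a noise noun or 'acoustics' as a plural noun
    noise_mention = run_a_rule(tgt, 'noise', 'NN') or run_a_rule(tgt, 'acoustics', 'NNS')
    return not sound_as_verb and noise_mention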
In [14]:
import zipfile
import pickle
from lxml import etree
from StringIO import StringIO

zf = zipfile.ZipFile('nhtsa_as_xml.zip', 'r')
nhtsa_injured = zf.read('nhtsa_injured.xml')
nhtsa_not_injured = zf.read('nhtsa_not_injured.xml')
xml_injured = etree.parse(StringIO(nhtsa_injured))
xml_not_injured = etree.parse(StringIO(nhtsa_not_injured))

def injured(l):
    # map the raw flag column to labels: '0' means no injury
    return ['injured' if str(x) != '0' else 'notinjured' for x in l]

def data(x):
    out = [x.xpath("//rows/row/@c1"),
           injured(x.xpath("//rows/row/@c8")),
           x.xpath("//rows/row/@c2")]
    return list(reversed(zip(*out)))

xml_injured_data = data(xml_injured)[:800]
xml_not_injured_data = data(xml_not_injured)[:800]
In [15]:
xml_injured_data[0]
Out[15]:
In [16]:
from visualize_word_graph import draw_graph
draw_graph("injury")
Out[16]:
In [17]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from pattern.search import taxonomy, search

taxonomy.append('dislocated', type='injury')
taxonomy.append('sustained', type='injury')
taxonomy.append('burn', type='injury')
taxonomy.append('injury', type='hurt')

def check_sustained(text):
    # the uppercase 'HURT' constraint matches anything under 'hurt' in the taxonomy
    return len(search('HURT', text)) > 0

def feats(text):
    words = text.replace(".", "").split()
    out = dict([(word, True) for word in words])
    # replace the raw SUSTAINED token with the output of the hand-written rule
    if 'SUSTAINED' in out:
        del out['SUSTAINED']
    out['rule(SUSTAINED)'] = check_sustained(text)
    return out

negfeats = [(feats(f[2]), 'not') for f in xml_not_injured_data]
posfeats = [(feats(f[2]), 'injure') for f in xml_injured_data]

# hold out the last quarter of each class for testing
negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))

classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features(n=100)
classifier.classify(feats("HE SUSTAINED INJURY"))
Out[17]:
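The point of feats() is to fold a hand-written taxonomy rule into an otherwise plain bag-of-words feature set; the injected feature is easy to inspect:
In [ ]:
print feats("HE SUSTAINED INJURY")  # the raw SUSTAINED token is replaced by rule(SUSTAINED)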
posh converts rules like:

return bool(not (run_a_rule(tgt, 'noise', 'VB*') and not run_a_rule(tgt, 'sound', 'JJ'))
            and (run_a_rule(tgt, 'noise', 'NN') or run_a_rule(tgt, 'acoustics', 'NNS') or
                 (run_a_rule(tgt, 'acoustics', 'NNS') and run_a_rule(tgt, 'not', 'RB'))))

to a single pattern:

SENT: (!VB*(noise+3) and !JJ(sound+3)) and (NN(noise+2) | NNS(acoustics) | (NNS(acoustics) & RB(not)))
Coming soon to: https://github.com/brianray/posh
Copy of this presentation found here: https://github.com/brianray/puppy_dec_2015