In [1]:
%matplotlib inline
from __future__ import print_function
import os
from pyspark.sql import SQLContext, Row
import pyspark.sql.functions as sql
import matplotlib.pyplot as plt
import numpy
import math
import seaborn as sns
import nltk
import pyspark.ml.feature as feature
In [2]:
# Load Processed Parquet
sqlContext = SQLContext(sc)
notes = sqlContext.read.parquet("../data/idigbio_notes.parquet")
total_records = notes.count()
print(total_records)
# Small sample of the df
notes = notes.sample(withReplacement=False, fraction=0.001)
notes.cache()
print(notes.count())
In [3]:
for r in notes.head(20):
    print(r['document'] + "\n")
In [4]:
notes_pdf = notes.toPandas()
In [5]:
def tokenize(s):
    '''
    Take a string and return a list of tokens split out from it
    with the nltk library.
    '''
    if s is not None:
        return nltk.tokenize.word_tokenize(s)
    else:
        return []

notes_pdf['tokens'] = notes_pdf['document'].map(tokenize)
In [6]:
print(notes_pdf.head()['tokens'])
In [7]:
def part_of_speech(t):
    '''
    With a list of tokens, mark each token's part of speech and
    return a list of (token, tag) tuples.
    '''
    return nltk.pos_tag(t)

notes_pdf['pos'] = notes_pdf['tokens'].map(part_of_speech)
In [8]:
print(notes_pdf.head()['pos'])
In [9]:
def chunk(p):
    '''
    Group a list of (token, tag) tuples into named-entity chunks
    with nltk's ne_chunk.
    '''
    return nltk.chunk.ne_chunk(p)

notes_pdf['chunks'] = notes_pdf['pos'].map(chunk)
In [10]:
print(notes_pdf.head()['chunks'])
Now, with some chunks, can we find any that match terms from the Darwin Core text? Maybe use word2vec on them. Dude, this is a Hard Problem. We probably need an ontology lookup service, e.g.: http://www.ebi.ac.uk/ols/beta/search?q=puma&groupField=iri&start=0&ontology=envo
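As a rough sketch of what that lookup could look like, here is a minimal query against the OLS search endpoint above using requests. The query parameters come straight from the example URL; the JSON layout ("response" -> "docs") is an assumption about the beta API, not something verified here.

In [ ]:
import requests

def ols_search(term, ontology="envo"):
    '''
    Minimal sketch: search the EBI Ontology Lookup Service for a term.
    Query parameters mirror the example URL above; the response layout
    ("response" -> "docs") is assumed, not verified.
    '''
    resp = requests.get("http://www.ebi.ac.uk/ols/beta/search",
                        params={"q": term, "groupField": "iri",
                                "start": 0, "ontology": ontology})
    resp.raise_for_status()
    return resp.json().get("response", {}).get("docs", [])

# e.g. ols_search("puma") should list candidate ENVO matches for "puma"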
In [11]:
# https://github.com/alvations/pywsd
# This uses its own term definitions
from pywsd.similarity import max_similarity
s = """locality The specific description of the place. Less specific geographic information can be
provided in other geographic terms (higherGeography, continent, country, stateProvince, county,
municipality, waterBody, island, islandGroup). This term may
contain information modified from the original to correct perceived errors or standardize the description."""
In [12]:
print(max_similarity(s, 'town', 'lin'))
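Here max_similarity treats s as the context sentence and disambiguates 'town' against it, scoring candidate WordNet senses with the Lin similarity measure ('lin'); the best-scoring sense is what gets printed.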
In [18]:
def find_triples(s):
    '''
    Find s-v-p triples in a tagged list of tokens, returns a
    list of dicts with the found triples.
    '''
    triples = []
    t = {}
    for node in s:
        # ne_chunk() output mixes (token, tag) leaves with named-entity
        # subtrees, which would break tuple unpacking; skip subtrees
        if isinstance(node, nltk.Tree):
            continue
        (token, tag) = node
        if tag.startswith("NN"):
            t["subject"] = token
        #else:
        #    triples.append(t)
        #    t = {}
    return triples
for s in notes_pdf.head(1)['chunks']:
    print(s)
    print(find_triples(s))
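So far find_triples only ever records a subject, so it always returns an empty list. As a hypothetical sketch of where it could go next, here is one crude way to emit (subject, verb, object) triples from the flat (token, tag) pairs in the pos column; the tag patterns and the name find_triples_sketch are illustrative, not a worked-out method.

In [ ]:
def find_triples_sketch(tagged):
    '''
    Hypothetical sketch: walk flat (token, tag) pairs and emit a
    {"subject", "verb", "object"} dict whenever a noun-verb-noun
    pattern completes. Crude, but enough to eyeball candidates.
    '''
    triples = []
    t = {}
    for (token, tag) in tagged:
        if tag.startswith("NN"):
            if "verb" in t:
                t["object"] = token
                triples.append(t)
                t = {}
            else:
                t["subject"] = token
        elif tag.startswith("VB") and "subject" in t:
            t["verb"] = token
    return triples

print(find_triples_sketch(notes_pdf.head(1)['pos'].iloc[0]))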