NLP Information Extraction


SparkContext and SparkSession


In [1]:
from pyspark import SparkContext
sc = SparkContext(master = 'local')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName("Python Spark SQL basic example") \
          .config("spark.some.config.option", "some-value") \
          .getOrCreate()

Simple NLP pipeline architecture

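The steps in this notebook follow a simple pipeline: sentence segmentation, tokenization, part-of-speech (POS) tagging, and chunking.

Reference: Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.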

Example data

The raw text comes from the Gutenberg corpus in the nltk package. The file ID is milton-paradise.txt.

Get the data

Raw text


In [2]:
import nltk
from nltk.corpus import gutenberg

# Requires the corpus data; if it is missing, run nltk.download('gutenberg') once.
milton_paradise = gutenberg.raw('milton-paradise.txt')

Create a Spark DataFrame to store the raw text

  • Use the nltk.sent_tokenize() function to split the text into sentences, one sentence per row.

In [5]:
import pandas as pd
pdf = pd.DataFrame({
        'sentences': nltk.sent_tokenize(milton_paradise)
    })
df = spark.createDataFrame(pdf)
df.show(n=5)


+--------------------+
|           sentences|
+--------------------+
|[Paradise Lost by...|
|And chiefly thou,...|
|Say first--for He...|
|Who first seduced...|
|Th' infernal Serp...|
+--------------------+
only showing top 5 rows

Tokenization and POS tagging


In [8]:
from pyspark.sql.functions import udf
from pyspark.sql.types import *

# Tokenize a sentence into words and tag each word with its part of speech.
def sent_to_tag_words(sent):
    wordlist = nltk.word_tokenize(sent)
    tagged_words = nltk.pos_tag(wordlist)
    return tagged_words

# Schema of the value returned by the UDF: a list of (word, tag) tuples,
# represented in Spark as an array of two-field structs.
schema = ArrayType(StructType([
    StructField('f1', StringType()),
    StructField('f2', StringType())
]))

# Wrap the Python function as a Spark UDF.
sent_to_tag_words_udf = udf(sent_to_tag_words, schema)

Transform data


In [9]:
df_tagged_words = df.select(sent_to_tag_words_udf(df.sentences).alias('tagged_words'))
df_tagged_words.show(5)


+--------------------+
|        tagged_words|
+--------------------+
|[[[,JJ], [Paradis...|
|[[And,CC], [chief...|
|[[Say,NNP], [firs...|
|[[Who,WP], [first...|
|[[Th,NNP], [',POS...|
+--------------------+
only showing top 5 rows

Chunking

Chunking is the process of segmenting a tagged sentence into multi-token sequences and labeling each one. The following example applies noun phrase (NP) chunking to the tagged-words data frame from the previous step.

First we define a UDF that chunks noun phrases from a list of POS-tagged words.


In [10]:
import nltk
from pyspark.sql.functions import udf
from pyspark.sql.types import *

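# Define a UDF that chunks noun phrases from POS-tagged words.
# Grammar: an NP is an optional determiner (DT), any number of
# adjectives (JJ), then a singular common noun (NN).
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
# The parse tree is converted to a string so it can be stored in a StringType column.
chunk_parser_udf = udf(lambda x: str(chunk_parser.parse(x)), StringType())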
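
To make the grammar NP: {<DT>?<JJ>*<NN>} concrete, the following plain-Python sketch walks a list of (word, tag) pairs and groups every run that matches "optional DT, zero or more JJ, then NN". It is only an illustration of the pattern, not NLTK's implementation; the function name chunk_noun_phrases and the sample input are hypothetical. Because the pattern ends in <NN> exactly, tags such as NNS or NNP are not chunked, which is why "heavens/NNS" stays outside any NP in the output further down.

```python
def chunk_noun_phrases(tagged_words):
    """Group runs matching DT? JJ* NN into lists; leave other tokens as tuples."""
    result = []
    i, n = 0, len(tagged_words)
    while i < n:
        j = i
        # Optional determiner.
        if tagged_words[j][1] == "DT":
            j += 1
        # Zero or more adjectives.
        while j < n and tagged_words[j][1] == "JJ":
            j += 1
        # The run only counts as a noun phrase if it ends in a tag that is exactly NN.
        if j < n and tagged_words[j][1] == "NN":
            result.append(list(tagged_words[i:j + 1]))
            i = j + 1
        else:
            result.append(tagged_words[i])
            i += 1
    return result


# Hypothetical sample, tagged the same way as the data frame rows above.
tagged = [
    ("and", "CC"), ("the", "DT"), ("fruit", "NN"), ("Of", "IN"),
    ("that", "DT"), ("forbidden", "JJ"), ("tree", "NN"),
    ("the", "DT"), ("heavens", "NNS"),
]

for item in chunk_noun_phrases(tagged):
    if isinstance(item, list):  # a noun-phrase chunk
        print(" ".join(word for word, _ in item))
# the fruit
# that forbidden tree
```
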

Transform data


In [15]:
df_NP_chunks = df_tagged_words.select(chunk_parser_udf(df_tagged_words.tagged_words).alias('NP_chunk'))

In [16]:
df_NP_chunks.show(2, truncate=False)


+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|NP_chunk                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(S
  [/JJ
  Paradise/NNP
  Lost/VBN
  by/IN
  John/NNP
  Milton/NNP
  1667/CD
  ]/NNP
  Book/NNP
  I/PRP
  Of/IN
  Man/NNP
  's/POS
  (NP first/JJ disobedience/NN)
  ,/,
  and/CC
  (NP the/DT fruit/NN)
  Of/IN
  (NP that/DT forbidden/JJ tree/NN)
  whose/WP$
  (NP mortal/JJ taste/NN)
  Brought/NNP
  (NP death/NN)
  into/IN
  the/DT
  World/NNP
  ,/,
  and/CC
  all/DT
  our/PRP$
  (NP woe/NN)
  ,/,
  With/IN
  (NP loss/NN)
  of/IN
  Eden/NNP
  ,/,
  till/VB
  one/CD
  greater/JJR
  (NP Man/NN)
  Restore/NNP
  us/PRP
  ,/,
  and/CC
  regain/VB
  (NP the/DT blissful/JJ seat/NN)
  ,/,
  Sing/NNP
  ,/,
  Heavenly/NNP
  Muse/NNP
  ,/,
  that/WDT
  ,/,
  on/IN
  (NP the/DT secret/JJ top/NN)
  Of/IN
  Oreb/NNP
  ,/,
  or/CC
  of/IN
  Sinai/NNP
  ,/,
  (NP didst/NN)
  (NP inspire/NN)
  That/WDT
  (NP shepherd/NN)
  who/WP
  first/RB
  taught/VBD
  (NP the/DT chosen/NN)
  (NP seed/NN)
  In/IN
  (NP the/DT beginning/NN)
  how/WRB
  the/DT
  heavens/NNS
  and/CC
  (NP earth/NN)
  Rose/NNP
  out/IN
  of/IN
  (NP Chaos/NN)
  :/:
  or/CC
  ,/,
  if/IN
  Sion/NNP
  (NP hill/NN)
  Delight/NNP
  (NP thee/NN)
  more/RBR
  ,/,
  and/CC
  Siloa/NNP
  's/POS
  (NP brook/NN)
  that/WDT
  flowed/VBD
  Fast/NNP
  by/IN
  (NP the/DT oracle/NN)
  of/IN
  God/NNP
  ,/,
  I/PRP
  thence/VBP
  Invoke/NNP
  (NP thy/NN)
  (NP aid/NN)
  to/TO
  my/PRP$
  (NP adventurous/JJ song/NN)
  ,/,
  That/IN
  with/IN
  (NP no/DT middle/JJ flight/NN)
  intends/VBZ
  to/TO
  soar/VB
  Above/NNP
  (NP th/NN)
  '/''
  (NP Aonian/JJ mount/NN)
  ,/,
  while/IN
  it/PRP
  pursues/VBZ
  Things/NNP
  unattempted/JJ
  yet/RB
  in/IN
  (NP prose/NN)
  or/CC
  (NP rhyme/NN)
  ./.)|
|(S
  And/CC
  (NP chiefly/NN)
  (NP thou/NN)
  ,/,
  O/NNP
  Spirit/NNP
  ,/,
  that/IN
  (NP dost/NN)
  (NP prefer/NN)
  Before/IN
  all/DT
  temples/NNS
  th/VBP
  '/''
  (NP upright/JJ heart/NN)
  and/CC
  (NP pure/NN)
  ,/,
  Instruct/NNP
  me/PRP
  ,/,
  for/IN
  (NP thou/JJ know'st/NN)
  ;/:
  thou/CC
  from/IN
  the/DT
  first/JJ
  Wast/NNP
  (NP present/NN)
  ,/,
  and/CC
  ,/,
  with/IN
  mighty/JJ
  wings/NNS
  outspread/VBP
  ,/,
  Dove-like/NNP
  (NP sat'st/NN)
  brooding/VBG
  on/IN
  the/DT
  vast/JJ
  Abyss/NNP
  ,/,
  And/CC
  mad'st/VB
  it/PRP
  pregnant/JJ
  :/:
  what/WP
  in/IN
  me/PRP
  is/VBZ
  dark/JJ
  Illumine/NNP
  ,/,
  what/WP
  is/VBZ
  (NP low/JJ raise/NN)
  and/CC
  (NP support/NN)
  ;/:
  That/DT
  ,/,
  to/TO
  (NP the/DT height/NN)
  of/IN
  (NP this/DT great/JJ argument/NN)
  ,/,
  I/PRP
  may/MD
  assert/VB
  Eternal/NNP
  Providence/NNP
  ,/,
  And/CC
  justify/VB
  the/DT
  ways/NNS
  of/IN
  God/NNP
  to/TO
  men/NNS
  ./.)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 2 rows
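
Because the UDF stores each parse tree as a string, the noun phrases have to be read back out of that string before they can be used. The sketch below shows one possible way to do this with a regular expression, assuming the "(NP word/TAG ...)" layout seen in the output above. The names NP_PATTERN and extract_noun_phrases are hypothetical, and the sample is a shortened excerpt of the first row.

```python
import re

# Matches one flat "(NP ...)" group; the body cannot contain parentheses.
NP_PATTERN = re.compile(r"\(NP ([^()]*)\)")


def extract_noun_phrases(tree_string):
    """Return the words of every NP chunk found in a stringified parse tree."""
    phrases = []
    for body in NP_PATTERN.findall(tree_string):
        # Each token looks like "word/TAG"; keep everything before the last slash.
        words = [token.rsplit("/", 1)[0] for token in body.split()]
        phrases.append(" ".join(words))
    return phrases


# Shortened excerpt of the first row of the NP_chunk column.
sample = """(S
  Of/IN
  Man/NNP
  's/POS
  (NP first/JJ disobedience/NN)
  ,/,
  and/CC
  (NP the/DT fruit/NN)
  Of/IN
  (NP that/DT forbidden/JJ tree/NN)
  ./.)"""

print(extract_noun_phrases(sample))
# ['first disobedience', 'the fruit', 'that forbidden tree']
```

The same function could be wrapped in a UDF returning ArrayType(StringType()) to add a column of noun phrases. An alternative design would be to return the chunks in a structured form from the chunking UDF itself instead of parsing its string output.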

