NLP and NLTK Basics


SparkContext and SparkSession


In [1]:
from pyspark import SparkContext
sc = SparkContext(master = 'local')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName("Python Spark SQL basic example") \
          .config("spark.some.config.option", "some-value") \
          .getOrCreate()

Many of the examples in this article are borrowed from the book by Bird et al. (2009). Where possible, I have reimplemented the book's examples with Spark.

Refer to the book for more details: Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.

Basic terminology

  • text: a sequence of words and punctuation.
  • frequency distribution: the frequency of words in a text object.
  • collocation: a sequence of words that occur together unusually often.
  • bigram: a pair of adjacent words. Highly frequent bigrams are often collocations.
  • corpus: a large body of text.
  • WordNet: a lexical database in which English words are grouped into sets of synonyms (also called synsets).
  • text normalization: the process of transforming text into a single canonical form, e.g., converting text to lowercase, removing punctuation and so on.
  • lemmatization: the process of grouping variant forms of the same word so that they can be analyzed as a single item.
  • stemming: the process of reducing inflected words to their word stem.
  • tokenization: the process of breaking text into words, punctuation symbols and other tokens (see the sketch after this list).
  • segmentation: the process of dividing text into sentences or other meaningful units.
  • chunking: the process of grouping tokens into short phrases such as noun phrases, also known as shallow parsing.
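
Several of these terms can be illustrated directly with NLTK. The sketch below is an illustration only, not part of the original notebook; the example sentence is made up, and it assumes the punkt and wordnet NLTK data packages have been downloaded.

# illustrative sketch (assumes nltk.download('punkt') and nltk.download('wordnet') have been run)
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

sentence = "The cats are running quickly."   # made-up example sentence

# tokenization: split the normalized (lowercased) string into tokens
tokens = word_tokenize(sentence.lower())
# ['the', 'cats', 'are', 'running', 'quickly', '.']

# stemming: crude suffix stripping that may produce non-words such as 'quickli'
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# lemmatization: map inflected forms back to dictionary forms
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('cats')           # -> 'cat'
lemmatizer.lemmatize('running', 'v')   # -> 'run' (treated as a verb)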

Texts as lists of words

Create a data frame consisting of text elements.


In [3]:
import pandas as pd
pdf = pd.DataFrame({
        'texts': [['I', 'like', 'playing', 'basketball'],
                 ['I', 'like', 'coding'],
                 ['I', 'like', 'machine', 'learning', 'very', 'much']]
    })
    
df = spark.createDataFrame(pdf)
df.show(truncate=False)


+----------------------------------------+
|texts                                   |
+----------------------------------------+
|[I, like, playing, basketball]          |
|[I, like, coding]                       |
|[I, like, machine, learning, very, much]|
+----------------------------------------+
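
The texts above were typed in as pre-tokenized word lists. If the data starts out as raw strings, Spark's Tokenizer transformer produces the same array-of-words column by lowercasing each string and splitting on whitespace. The sketch below is illustrative only; the raw_df name and its sentences are made up.

from pyspark.ml.feature import Tokenizer

# made-up raw-text data frame
raw_df = spark.createDataFrame([('I like playing basketball',),
                                ('I like coding',)], ['sentence'])

# Tokenizer lowercases each string and splits it on whitespace
tokenizer = Tokenizer(inputCol='sentence', outputCol='texts')
tokenizer.transform(raw_df).show(truncate=False)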

N-grams and collocations

Transform the texts into 2-grams, 3-grams, and 4-grams. (A sketch of scoring actual collocations with NLTK follows the n-gram output below.)


In [4]:
from pyspark.ml.feature import NGram
from pyspark.ml import Pipeline
ngrams = [NGram(n=n, inputCol='texts', outputCol=str(n)+'-grams') for n in [2,3,4]]

# build pipeline model
pipeline = Pipeline(stages=ngrams)

# transform data
texts_ngrams = pipeline.fit(df).transform(df)

In [5]:
# display result
texts_ngrams.select('2-grams').show(truncate=False)
texts_ngrams.select('3-grams').show(truncate=False)
texts_ngrams.select('4-grams').show(truncate=False)


+------------------------------------------------------------------+
|2-grams                                                           |
+------------------------------------------------------------------+
|[I like, like playing, playing basketball]                        |
|[I like, like coding]                                             |
|[I like, like machine, machine learning, learning very, very much]|
+------------------------------------------------------------------+

+----------------------------------------------------------------------------------+
|3-grams                                                                           |
+----------------------------------------------------------------------------------+
|[I like playing, like playing basketball]                                         |
|[I like coding]                                                                   |
|[I like machine, like machine learning, machine learning very, learning very much]|
+----------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------+
|4-grams                                                                          |
+---------------------------------------------------------------------------------+
|[I like playing basketball]                                                      |
|[]                                                                               |
|[I like machine learning, like machine learning very, machine learning very much]|
+---------------------------------------------------------------------------------+
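
NLTK itself can score which bigrams behave like collocations. The sketch below is an illustration, not part of the original notebook; it ranks bigrams from austen-emma.txt (part of the Gutenberg corpus introduced in the next section) by pointwise mutual information after discarding rare pairs.

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import gutenberg

finder = BigramCollocationFinder.from_words(gutenberg.words('austen-emma.txt'))
finder.apply_freq_filter(5)                  # ignore bigrams seen fewer than 5 times
bigram_measures = BigramAssocMeasures()
finder.nbest(bigram_measures.pmi, 10)        # ten highest-scoring bigrams by PMI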

Access corpora from the NLTK package

The Gutenberg corpus

Get the file IDs in the Gutenberg corpus


In [6]:
from nltk.corpus import gutenberg

gutenberg_fileids = gutenberg.fileids()
gutenberg_fileids


Out[6]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Absolute path of a file


In [7]:
gutenberg.abspath(gutenberg_fileids[0])


Out[7]:
FileSystemPathPointer('/Users/mingchen/nltk_data/corpora/gutenberg/austen-emma.txt')

Raw text


In [8]:
gutenberg.raw(gutenberg_fileids[0])[:200]


Out[8]:
'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; an'

The words of the entire corpus


In [9]:
gutenberg.words()


Out[9]:
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]

In [10]:
len(gutenberg.words())


Out[10]:
2621613

Sentences of a specific file


In [11]:
gutenberg.sents(gutenberg_fileids[0])


Out[11]:
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]

In [12]:
len(gutenberg.sents(gutenberg_fileids[0]))


Out[12]:
7752
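
A frequency distribution (see the terminology above) over these words takes one line with NLTK. The sketch below is illustrative and not part of the original notebook.

from nltk import FreqDist

fdist = FreqDist(gutenberg.words(gutenberg_fileids[0]))   # frequency distribution over austen-emma.txt
fdist.most_common(10)                                     # the ten most frequent tokens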

Loading a custom corpus

Let's create a corpus consisting of all files in the ./data directory.


In [13]:
from nltk.corpus import PlaintextCorpusReader
corpus_data = PlaintextCorpusReader('./data', '.*')

Files in the corpus corpus_data


In [14]:
data_fileids = corpus_data.fileids()
data_fileids


Out[14]:
['Advertising.csv',
 'Credit.csv',
 'WineData.csv',
 'churn-bigml-20.csv',
 'churn-bigml-80.csv',
 'cuse_binary.csv',
 'horseshoe_crab.csv',
 'hsb2.csv',
 'hsb2_modified.csv',
 'iris.csv',
 'mtcars.csv',
 'prostate.csv',
 'twitter.txt']

Raw text in twitter.txt


In [15]:
corpus_data.raw('twitter.txt')


Out[15]:
'Fresh install of XP on new computer. Sweet relief! fuck vista\t1018769417\t1.0\nWell. Now I know where to go when I want my knives. #ChiChevySXSW http://post.ly/RvDl\t10284216536\t1.0\n"Literally six weeks before I can take off ""SSC Chair"" off my email. Its like the torturous 4th mile before everything stops hurting."\t10298589026\t1.0\nMitsubishi i MiEV - Wikipedia, the free encyclopedia - http://goo.gl/xipe Cutest car ever!\t109017669432377344\t1.0\n\'Cheap Eats in SLP\' - http://t.co/4w8gRp7\t109642968603963392\t1.0\nTeenage Mutant Ninja Turtle art is never a bad thing... http://bit.ly/aDMHyW\t10995492579\t1.0\nNew demographic survey of online video viewers: http://bit.ly/cx8b7I via @KellyOlexa\t11713360136\t1.0\nhi all - i\'m going to be tweeting things lookstat at the @lookstat twitter account. please follow me there\t1208319583\t1.0\nHoly carp, no. That movie will seriously suffer for it. RT @MouseInfo: Anyone excited for The Little Mermaid in 3D?\t121330835726155776\t1.0\n"Did I really need to learn ""I bought a box and put in it things"" in arabic? This is the most random book ever."\t12358025545\t1.0\n'

Words and sentences in file twitter.txt


In [16]:
corpus_data.words(fileids='twitter.txt')


Out[16]:
['Fresh', 'install', 'of', 'XP', 'on', 'new', ...]

In [17]:
len(corpus_data.words(fileids='twitter.txt'))


Out[17]:
253

In [18]:
corpus_data.sents(fileids='twitter.txt')


Out[18]:
[['Fresh', 'install', 'of', 'XP', 'on', 'new', 'computer', '.'], ['Sweet', 'relief', '!'], ...]

In [19]:
len(corpus_data.sents(fileids='twitter.txt'))


Out[19]:
14

WordNet

The nltk.corpus.wordnet.synsets() function loads all synsets for a given lemma and, optionally, a part-of-speech tag.

Load all synsets for the lemma car into a Spark data frame.


In [20]:
from nltk.corpus import wordnet

pdf = pd.DataFrame({
        'car_synsets': [synset.name() for synset in wordnet.synsets('car')]
    })
df = spark.createDataFrame(pdf)
df.show()


+--------------+
|   car_synsets|
+--------------+
|      car.n.01|
|      car.n.02|
|      car.n.03|
|      car.n.04|
|cable_car.n.01|
+--------------+
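
As an aside (not in the original notebook), wordnet.synsets() also accepts a part-of-speech filter, and each synset exposes a dictionary-style gloss.

wordnet.synsets('car', pos=wordnet.NOUN)    # restrict the lookup to noun senses
wordnet.synset('car.n.01').definition()     # the gloss of the first noun sense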

Get lemma names given a synset


In [41]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from nltk.corpus import wordnet

def lemma_names_from_synset(x):
    """Look up a synset by its name (e.g. 'car.n.02') and return its lemma names."""
    synset = wordnet.synset(x)
    return synset.lemma_names()

lemma_names_from_synset('car.n.02')
# synset_lemmas_udf = udf(lemma_names_from_synset, ArrayType(StringType()))


Out[41]:
['car', 'railcar', 'railway_car', 'railroad_car']
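
A possible next step, sketched here on the assumption that the NLTK WordNet data is available to the Spark workers, is to register the commented-out UDF and apply it to the car_synsets column built above.

synset_lemmas_udf = udf(lemma_names_from_synset, ArrayType(StringType()))
df.withColumn('lemma_names', synset_lemmas_udf(df['car_synsets'])).show(truncate=False)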