NLP and NLTK Basics


SparkContext and SparkSession


In [1]:
from pyspark import SparkContext
sc = SparkContext(master = 'local')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName("Python Spark SQL basic example") \
          .config("spark.some.config.option", "some-value") \
          .getOrCreate()

Many of the examples in this article are borrowed from the book by Bird et al. (2009). Where possible, I have reimplemented the book's examples with Spark.

Refer to the book for more details: Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.

Basic terminology

  • text: a sequence of words and punctuation.
  • frequency distribution: the frequency of words in a text object.
  • collocation: a sequence of words that occur together unusually often.
  • bigram: a pair of adjacent words. Highly frequent bigrams are often collocations.
  • corpus: a large body of text.
  • WordNet: a lexical database in which English words are grouped into sets of synonyms (also called synsets).
  • text normalization: the process of transforming text into a single canonical form, e.g., converting text to lowercase, removing punctuation and so on.
  • lemmatization: the process of grouping variant forms of the same word so that they can be analyzed as a single item.
  • stemming: the process of reducing inflected words to their word stem.
  • tokenization: the process of breaking text into words, punctuation symbols and other tokens (see the sketch after this list).
  • segmentation: the process of dividing text into sentences or other meaningful units.
  • chunking: the process of grouping tokens into short phrases such as noun phrases, also known as shallow parsing.
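
Several of these terms can be illustrated directly with NLTK. The sketch below is an illustration only, not part of the original notebook; the example sentence is made up, and it assumes the punkt and wordnet NLTK data packages have been downloaded.

# illustrative sketch (assumes nltk.download('punkt') and nltk.download('wordnet') have been run)
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

sentence = "The cats are running quickly."   # made-up example sentence

# tokenization: split the normalized (lowercased) string into tokens
tokens = word_tokenize(sentence.lower())
# ['the', 'cats', 'are', 'running', 'quickly', '.']

# stemming: crude suffix stripping that may produce non-words such as 'quickli'
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# lemmatization: map inflected forms back to dictionary forms
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('cats')           # -> 'cat'
lemmatizer.lemmatize('running', 'v')   # -> 'run' (treated as a verb)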

Texts as lists of words

Create a data frame consisting of text elements.


In [3]:
import pandas as pd
pdf = pd.DataFrame({
        'texts': [['I', 'like', 'playing', 'basketball'],
                 ['I', 'like', 'coding'],
                 ['I', 'like', 'machine', 'learning', 'very', 'much']]
    })
    
df = spark.createDataFrame(pdf)
df.show(truncate=False)


+----------------------------------------+
|texts                                   |
+----------------------------------------+
|[I, like, playing, basketball]          |
|[I, like, coding]                       |
|[I, like, machine, learning, very, much]|
+----------------------------------------+
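
The texts above were typed in as pre-tokenized word lists. If the data starts out as raw strings, Spark's Tokenizer transformer produces the same array-of-words column by lowercasing each string and splitting on whitespace. The sketch below is illustrative only; the raw_df name and its sentences are made up.

from pyspark.ml.feature import Tokenizer

# made-up raw-text data frame
raw_df = spark.createDataFrame([('I like playing basketball',),
                                ('I like coding',)], ['sentence'])

# Tokenizer lowercases each string and splits it on whitespace
tokenizer = Tokenizer(inputCol='sentence', outputCol='texts')
tokenizer.transform(raw_df).show(truncate=False)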

N-grams and collocations

Transform the texts into 2-grams, 3-grams, and 4-grams. (A sketch of scoring actual collocations with NLTK follows the n-gram output below.)


In [4]:
from pyspark.ml.feature import NGram
from pyspark.ml import Pipeline
ngrams = [NGram(n=n, inputCol='texts', outputCol=str(n)+'-grams') for n in [2,3,4]]

# build pipeline model
pipeline = Pipeline(stages=ngrams)

# transform data
texts_ngrams = pipeline.fit(df).transform(df)

In [5]:
# display result
texts_ngrams.select('2-grams').show(truncate=False)
texts_ngrams.select('3-grams').show(truncate=False)
texts_ngrams.select('4-grams').show(truncate=False)


+------------------------------------------------------------------+
|2-grams                                                           |
+------------------------------------------------------------------+
|[I like, like playing, playing basketball]                        |
|[I like, like coding]                                             |
|[I like, like machine, machine learning, learning very, very much]|
+------------------------------------------------------------------+

+----------------------------------------------------------------------------------+
|3-grams                                                                           |
+----------------------------------------------------------------------------------+
|[I like playing, like playing basketball]                                         |
|[I like coding]                                                                   |
|[I like machine, like machine learning, machine learning very, learning very much]|
+----------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------+
|4-grams                                                                          |
+---------------------------------------------------------------------------------+
|[I like playing basketball]                                                      |
|[]                                                                               |
|[I like machine learning, like machine learning very, machine learning very much]|
+---------------------------------------------------------------------------------+
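
NLTK itself can score which bigrams behave like collocations. The sketch below is an illustration, not part of the original notebook; it ranks bigrams from austen-emma.txt (part of the Gutenberg corpus introduced in the next section) by pointwise mutual information after discarding rare pairs.

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import gutenberg

finder = BigramCollocationFinder.from_words(gutenberg.words('austen-emma.txt'))
finder.apply_freq_filter(5)                  # ignore bigrams seen fewer than 5 times
bigram_measures = BigramAssocMeasures()
finder.nbest(bigram_measures.pmi, 10)        # ten highest-scoring bigrams by PMI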

Access corpora from the NLTK package

The Gutenberg corpus

Get the file IDs in the Gutenberg corpus


In [6]:
from nltk.corpus import gutenberg

gutenberg_fileids = gutenberg.fileids()
gutenberg_fileids


Out[6]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Absolute path of a file


In [7]:
gutenberg.abspath(gutenberg_fileids[0])


Out[7]:
FileSystemPathPointer('/Users/mingchen/nltk_data/corpora/gutenberg/austen-emma.txt')

Raw text


In [8]:
gutenberg.raw(gutenberg_fileids[0])[:200]


Out[8]:
'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; an'

The words of the entire corpus


In [9]:
gutenberg.words()


Out[9]:
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]

In [10]:
len(gutenberg.words())


Out[10]:
2621613

Sentences of a specific file


In [11]:
gutenberg.sents(gutenberg_fileids[0])


Out[11]:
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]

In [12]:
len(gutenberg.sents(gutenberg_fileids[0]))


Out[12]:
7752
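
A frequency distribution (see the terminology above) over these words takes one line with NLTK. The sketch below is illustrative and not part of the original notebook.

from nltk import FreqDist

fdist = FreqDist(gutenberg.words(gutenberg_fileids[0]))   # frequency distribution over austen-emma.txt
fdist.most_common(10)                                     # the ten most frequent tokens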

Loading a custom corpus

Let's create a corpus consisting of all files in the ./data directory.


In [13]:
from nltk.corpus import PlaintextCorpusReader
corpus_data = PlaintextCorpusReader('./data', '.*')

Files in the corpus corpus_data


In [14]:
data_fileids = corpus_data.fileids()
data_fileids


Out[14]:
['Advertising.csv',
 'Credit.csv',
 'WineData.csv',
 'churn-bigml-20.csv',
 'churn-bigml-80.csv',
 'cuse_binary.csv',
 'horseshoe_crab.csv',
 'hsb2.csv',
 'hsb2_modified.csv',
 'iris.csv',
 'mtcars.csv',
 'prostate.csv',
 'twitter.txt']

Raw text in twitter.txt


In [15]:
corpus_data.raw('twitter.txt')


Out[15]:
'Fresh install of XP on new computer. Sweet relief! fuck vista\t1018769417\t1.0\nWell. Now I know where to go when I want my knives. #ChiChevySXSW http://post.ly/RvDl\t10284216536\t1.0\n"Literally six weeks before I can take off ""SSC Chair"" off my email. Its like the torturous 4th mile before everything stops hurting."\t10298589026\t1.0\nMitsubishi i MiEV - Wikipedia, the free encyclopedia - http://goo.gl/xipe Cutest car ever!\t109017669432377344\t1.0\n\'Cheap Eats in SLP\' - http://t.co/4w8gRp7\t109642968603963392\t1.0\nTeenage Mutant Ninja Turtle art is never a bad thing... http://bit.ly/aDMHyW\t10995492579\t1.0\nNew demographic survey of online video viewers: http://bit.ly/cx8b7I via @KellyOlexa\t11713360136\t1.0\nhi all - i\'m going to be tweeting things lookstat at the @lookstat twitter account. please follow me there\t1208319583\t1.0\nHoly carp, no. That movie will seriously suffer for it. RT @MouseInfo: Anyone excited for The Little Mermaid in 3D?\t121330835726155776\t1.0\n"Did I really need to learn ""I bought a box and put in it things"" in arabic? This is the most random book ever."\t12358025545\t1.0\n'

Words and sentences in file twitter.txt


In [16]:
corpus_data.words(fileids='twitter.txt')


Out[16]:
['Fresh', 'install', 'of', 'XP', 'on', 'new', ...]

In [17]:
len(corpus_data.words(fileids='twitter.txt'))


Out[17]:
253

In [18]:
corpus_data.sents(fileids='twitter.txt')


Out[18]:
[['Fresh', 'install', 'of', 'XP', 'on', 'new', 'computer', '.'], ['Sweet', 'relief', '!'], ...]

In [19]:
len(corpus_data.sents(fileids='twitter.txt'))


Out[19]:
14

WordNet

The nltk.corpus.wordnet.synsets() function loads all synsets for a given lemma and, optionally, a part-of-speech tag.

Load all synsets for the lemma car into a Spark data frame.


In [20]:
from nltk.corpus import wordnet

pdf = pd.DataFrame({
        'car_synsets': [synset.name() for synset in wordnet.synsets('car')]
    })
df = spark.createDataFrame(pdf)
df.show()


+--------------+
|   car_synsets|
+--------------+
|      car.n.01|
|      car.n.02|
|      car.n.03|
|      car.n.04|
|cable_car.n.01|
+--------------+
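
As an aside (not in the original notebook), wordnet.synsets() also accepts a part-of-speech filter, and each synset exposes a dictionary-style gloss.

wordnet.synsets('car', pos=wordnet.NOUN)    # restrict the lookup to noun senses
wordnet.synset('car.n.01').definition()     # the gloss of the first noun sense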

Get lemma names given a synset


In [41]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from nltk.corpus import wordnet

def lemma_names_from_synset(x):
    """Look up a synset by its name (e.g. 'car.n.02') and return its lemma names."""
    synset = wordnet.synset(x)
    return synset.lemma_names()

lemma_names_from_synset('car.n.02')
# synset_lemmas_udf = udf(lemma_names_from_synset, ArrayType(StringType()))


Out[41]:
['car', 'railcar', 'railway_car', 'railroad_car']
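
A possible next step, sketched here on the assumption that the NLTK WordNet data is available to the Spark workers, is to register the commented-out UDF and apply it to the car_synsets column built above.

synset_lemmas_udf = udf(lemma_names_from_synset, ArrayType(StringType()))
df.withColumn('lemma_names', synset_lemmas_udf(df['car_synsets'])).show(truncate=False)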