Text processing

You can do many interesting things with text! There is even an entire area of Machine Learning called Natural Language Processing (NLP), which covers any kind of machine manipulation of natural human languages.
I want to show here a couple of pre-processing examples that are normally the basis for more sophisticated models and algorithms.

Let's start with a simple text in a Python string:


In [1]:
sampleText = " The Elephant's 4 legs: this is THE Pub. Hi, you, my super_friend! You can't believe what happened to our * common friend *, the butcher. How are you?"

Tokens

The basic atomic parts of a text are its tokens. A token is the NLP name for a sequence of characters that we want to treat as a group.
For example, we can consider groups of characters separated by blank spaces, therefore forming words.
Tokens can be extracted by splitting the text:


In [2]:
textTokens = sampleText.split() # by default split by spaces

In [3]:
textTokens


Out[3]:
['The',
 "Elephant's",
 '4',
 'legs:',
 'this',
 'is',
 'THE',
 'Pub.',
 'Hi,',
 'you,',
 'my',
 'super_friend!',
 'You',
 "can't",
 'believe',
 'what',
 'happened',
 'to',
 'our',
 '*',
 'common',
 'friend',
 '*,',
 'the',
 'butcher.',
 'How',
 'are',
 'you?']

In [4]:
print ("The sample text has {} tokens".format (len(textTokens)))


The sample text has 28 tokens

As you can see, tokens are words but also symbols like *. This is because we have split the string simply on blank spaces. But you can pass other separators to the split() function, such as commas.
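
For example, here is a quick illustrative sketch of splitting on commas instead (the string csvLike is made up for this example):

csvLike = "alpha,beta,gamma"
csvLike.split(',')  # returns ['alpha', 'beta', 'gamma']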

Tokens frequency

I have the number of tokens for this sample text. Let's say now that I want to count the frequency of each token.
This can be done quickly by using the Counter class from the Python collections module.


In [5]:
from collections import Counter

In [6]:
totalWords = Counter(textTokens); 
totalWords


Out[6]:
Counter({'*': 1,
         '*,': 1,
         '4': 1,
         "Elephant's": 1,
         'Hi,': 1,
         'How': 1,
         'Pub.': 1,
         'THE': 1,
         'The': 1,
         'You': 1,
         'are': 1,
         'believe': 1,
         'butcher.': 1,
         "can't": 1,
         'common': 1,
         'friend': 1,
         'happened': 1,
         'is': 1,
         'legs:': 1,
         'my': 1,
         'our': 1,
         'super_friend!': 1,
         'the': 1,
         'this': 1,
         'to': 1,
         'what': 1,
         'you,': 1,
         'you?': 1})

There are a number of problems:

  • some words (like The/THE or You/you) appear as two different tokens because of capital vs. small letters
  • some tokens contain punctuation marks, which makes the same word counted more than once
  • some tokens consist only of symbols (like *)

Remove capital letters

This is very easy to do in Python by using the string method lower().


In [7]:
loweredText = sampleText.lower()

In [8]:
loweredText


Out[8]:
" the elephant's 4 legs: this is the pub. hi, you, my super_friend! you can't believe what happened to our * common friend *, the butcher. how are you?"

In [9]:
textTokens = loweredText.split()
totalWords = Counter(textTokens); 
totalWords


Out[9]:
Counter({'*': 1,
         '*,': 1,
         '4': 1,
         'are': 1,
         'believe': 1,
         'butcher.': 1,
         "can't": 1,
         'common': 1,
         "elephant's": 1,
         'friend': 1,
         'happened': 1,
         'hi,': 1,
         'how': 1,
         'is': 1,
         'legs:': 1,
         'my': 1,
         'our': 1,
         'pub.': 1,
         'super_friend!': 1,
         'the': 3,
         'this': 1,
         'to': 1,
         'what': 1,
         'you': 1,
         'you,': 1,
         'you?': 1})

Now the token "the" is correctly counted 3 times!
But other words like you are still counted wrongly, because of punctuation such as the comma or the question mark.

Remove punctuation and extra spaces

Removing the extra spaces at the beginning and end of the text is very easy, using the string method strip():


In [10]:
strippedText = loweredText.strip()
strippedText


Out[10]:
"the elephant's 4 legs: this is the pub. hi, you, my super_friend! you can't believe what happened to our * common friend *, the butcher. how are you?"

To remove punctuation we can use regular expressions.
Regular expressions are very powerful for matching patterns in a sequence of characters.

The way to match specific characters in a regular expression is to list them inside square brackets. For example, [abc] will match only a single a, b or c letter and nothing else. A shorthand for matching ranges of sequential characters is the dash: for example, [a-z] will match any single lowercase letter and [0-9] any single digit character.
To exclude characters from the matching we use the ^ (caret) symbol: for example, [^abc] will match any single character except the letters a, b or c.
Finally, there is also a special symbol for whitespace, as it is so ubiquitous in text. Whitespace characters include blank spaces, tabs, newlines and carriage returns. The symbol \s will match any of them.
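
As a quick illustration of these character classes, here is a small sketch using re.findall(), which returns all non-overlapping matches:

import re
re.findall(r'[abc]', "cabbage")      # ['c', 'a', 'b', 'b', 'a']
re.findall(r'[^abc]', "cabbage")     # ['g', 'e']
re.findall(r'\s', "one\ttwo three")  # ['\t', ' ']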

The function re.sub() takes a pattern to match, a replacement string and an input string, and returns a copy of the input string where every match of the pattern has been replaced by the replacement. For example, re.sub(r'[s]', 'z', "whatsapp") will return "whatzapp".
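
We can verify this example directly:

import re
re.sub(r'[s]', 'z', "whatsapp")  # returns 'whatzapp'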

So one way to remove the punctuation is to replace any character that is NOT a letter, a number or a whitespace with an empty substring:


In [11]:
import re

In [12]:
processedText = re.sub(r'[^a-z0-9\s]', '', strippedText) # keep only numbers and letters
processedText


Out[12]:
'the elephants 4 legs this is the pub hi you my superfriend you cant believe what happened to our  common friend  the butcher how are you'

Another useful symbol is \w, which matches ANY alphanumeric character (letters, digits and the underscore), while \W matches any NON-alphanumeric character.
So, an alternative way could be:


In [13]:
processedText = re.sub(r'[^\s\w]', '', strippedText) # remove punctuation
processedText


Out[13]:
'the elephants 4 legs this is the pub hi you my super_friend you cant believe what happened to our  common friend  the butcher how are you'

Outside square brackets, the ^ symbol anchors the pattern at the beginning of the string, so r'^\s+' matches any whitespace at the start of the text (here nothing changes, since we already stripped it):


In [14]:
processedText4 = re.sub(r'^\s+', r'', processedText) # remove spaces at the beginning
processedText4


Out[14]:
'the elephants 4 legs this is the pub hi you my super_friend you cant believe what happened to our  common friend  the butcher how are you'

In [15]:
textTokens = processedText.split()
totalWords = Counter(textTokens); 
totalWords


Out[15]:
Counter({'4': 1,
         'are': 1,
         'believe': 1,
         'butcher': 1,
         'cant': 1,
         'common': 1,
         'elephants': 1,
         'friend': 1,
         'happened': 1,
         'hi': 1,
         'how': 1,
         'is': 1,
         'legs': 1,
         'my': 1,
         'our': 1,
         'pub': 1,
         'super_friend': 1,
         'the': 3,
         'this': 1,
         'to': 1,
         'what': 1,
         'you': 3})

Now the token you is also counted correctly 3 times.

A collection can be sorted easily:


In [16]:
print (totalWords.items())


dict_items([('believe', 1), ('pub', 1), ('this', 1), ('my', 1), ('legs', 1), ('to', 1), ('what', 1), ('are', 1), ('is', 1), ('common', 1), ('you', 3), ('hi', 1), ('4', 1), ('happened', 1), ('our', 1), ('super_friend', 1), ('the', 3), ('how', 1), ('friend', 1), ('butcher', 1), ('cant', 1), ('elephants', 1)])

In [17]:
sorted(totalWords.items(), key=lambda x:x[1],reverse=True)


Out[17]:
[('you', 3),
 ('the', 3),
 ('believe', 1),
 ('pub', 1),
 ('this', 1),
 ('my', 1),
 ('legs', 1),
 ('to', 1),
 ('what', 1),
 ('are', 1),
 ('is', 1),
 ('common', 1),
 ('hi', 1),
 ('4', 1),
 ('happened', 1),
 ('our', 1),
 ('super_friend', 1),
 ('how', 1),
 ('friend', 1),
 ('butcher', 1),
 ('cant', 1),
 ('elephants', 1)]

An alternative way (without lambda functions) to sort a collection:


In [18]:
from operator import itemgetter
sorted(totalWords.items(), key=itemgetter(1), reverse=True)


Out[18]:
[('you', 3),
 ('the', 3),
 ('believe', 1),
 ('pub', 1),
 ('this', 1),
 ('my', 1),
 ('legs', 1),
 ('to', 1),
 ('what', 1),
 ('are', 1),
 ('is', 1),
 ('common', 1),
 ('hi', 1),
 ('4', 1),
 ('happened', 1),
 ('our', 1),
 ('super_friend', 1),
 ('how', 1),
 ('friend', 1),
 ('butcher', 1),
 ('cant', 1),
 ('elephants', 1)]
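
Incidentally, Counter already provides a most_common() method that returns its items sorted by decreasing count (we will use it in a helper function below), so the explicit sorted() call is not strictly needed:

totalWords.most_common(5)  # the five most frequent tokens, e.g. [('the', 3), ('you', 3), ...] (tie order may vary)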

Let's put these results into functions we can re-use:


In [19]:
def tokenise(text):
    return text.split() # split by space; return a list

In [20]:
def removePunctuation(text):
    processedText = re.sub(r'([^\s\w_]|_)+', '', text.strip()) # remove punctuation AND underscores (\w alone would keep the underscore)

    return processedText

In [21]:
def preprocessText(text, lowercase=True):
    if lowercase:
        processedText = removePunctuation(text.lower())
    else:
        processedText = removePunctuation(text)

    return tokenise(processedText)

In [22]:
def getMostCommonWords(tokens, n=10):
    wordsCount = Counter(tokens) # count the occurrences
    
    return wordsCount.most_common()[:n]
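
A quick sanity check of these helpers on the sample text from the beginning (illustrative usage, assuming the cells above have been run):

tokens = preprocessText(sampleText)
getMostCommonWords(tokens, 2)  # [('the', 3), ('you', 3)] -- tie order may vary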

Let's process a text file

Python makes working with files pretty simple.
Let's use one of the books from Project Gutenberg, which I have already downloaded from ....


In [25]:
filePath = "../datasets/theprince.txt"

First of all, we need to open the file using the function open(), which returns a file object.
We can then read the file content using the file object's read() method.


In [27]:
f = open(filePath)
try:
    theText = f.read()  # this is a giant String
finally:
    f.close()  # we should always close the file once finished

In [29]:
print ("*** Analysing book The Prince:")  
print ("The book is {} chars long".format (len(theText)))

tokens = preprocessText(theText)

print ("The text has {} tokens".format (len(tokens)))


*** Analysing book The Prince:
The book is 300814 chars long
The text has 52536 tokens

This is an easy way to read the entire text, but:

  1. It's quite memory inefficient
  2. It's slower, because no processing can start until all the data has been read into memory, instead of processing it as it is read

A better way is to read the file line by line.
We can do this with a simple loop which will go through each line.
Note that the with statement will automatically close the file at the end of the block.


In [30]:
textTokens = [] # tokens will be added here
lines = 0

with open(filePath) as f:
    for line in f:
        # I know that the very first line is the book title
        if lines == 0:
            print ("Title: {}".format(line))

        # every line gets processed
        lineTokens = preprocessText(line)
        # append the tokens to my list
        textTokens.extend(lineTokens)

        lines += 1  # finally move to the next line


Title: The Project Gutenberg EBook of The Prince, by Nicolo Machiavelli


In [31]:
print ("The text has {} lines".format (lines))
print ("The text has {} tokens".format (len(textTokens)))


The text has 5064 lines
The text has 52536 tokens

In [32]:
textTokens[:10]


Out[32]:
['the',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'the',
 'prince',
 'by',
 'nicolo',
 'machiavelli']

Unique tokens

Another useful data structure in Python is the set, which is an unordered collection of distinct objects. Arranging the tokens in a set means that each one is kept only once, giving a smaller data collection that is useful to see how many distinct tokens are in a text, or to check whether a specific token appears in the text or not.


In [33]:
uniqueTokens = set(textTokens)
print ("The text has {} unique tokens".format (len(uniqueTokens)))
print (" -> lexical diversity: each token in average is repeated {} times".format(len(textTokens) / len(uniqueTokens)))


The text has 5630 unique tokens
 -> lexical diversity: on average each token is repeated 9.331438721136767 times

In [34]:
sorted(uniqueTokens)[200:205]


Out[34]:
['accumulate', 'accuse', 'accused', 'accustomed', 'accustoms']

In [35]:
'accuse' in uniqueTokens


Out[35]:
True

In [36]:
'phone' in uniqueTokens


Out[36]:
False

Remove stopwords


In [37]:
getMostCommonWords(textTokens, 5)


Out[37]:
[('the', 3109), ('to', 2107), ('and', 1935), ('of', 1802), ('in', 993)]

As you can see, the most common words are not really meaningful, but we can remove them.

Stopwords (https://en.wikipedia.org/wiki/Stop_words) are common (English) words that do not contribute much to the content or meaning of a document (e.g., "the", "a", "is", "to", etc.).


In [38]:
f = open("stopwords.txt")

In [39]:
stopWordsText = f.read().splitlines()  # splitlines is used to remove newlines

In [40]:
f.close()

In [41]:
stopWords = set(stopWordsText)

In [42]:
betterTokens = [token for token in textTokens if token not in stopWords]

In [43]:
betterTokens[:10]


Out[43]:
['project',
 'gutenberg',
 'ebook',
 'prince',
 'nicolo',
 'machiavelli',
 'ebook',
 'use',
 'anyone',
 'anywhere']

In [44]:
getMostCommonWords(betterTokens)


Out[44]:
[('prince', 222),
 ('men', 161),
 ('castruccio', 136),
 ('people', 115),
 ('many', 101),
 ('others', 96),
 ('time', 93),
 ('great', 89),
 ('duke', 88),
 ('project', 87)]

Generate a word cloud from a text

Small example: we generate a word cloud from a text.
To make it more interesting, we take as text a novel directly from the web (Project Gutenberg).


In [45]:
from urllib.request import urlopen

In [46]:
def getWordCloud(filename):
    textTokens = [] # tokens will be added here
    lines = 0
    path = "http://www.gutenberg.org/files/"

    url = path + filename + "/" + filename + "-0.txt"
    f = urlopen(url)

    for line in f:
        # every line gets processed (decoded from bytes to string first)
        lineTokens = preprocessText(line.decode('utf-8'))
        # append the tokens to my list
        textTokens.extend(lineTokens)

        lines += 1  # finally move to the next line

    # read the stopwords and filter them out of the tokens
    fs = open("stopwords.txt")
    stopWordsText = fs.read().splitlines()
    stopWords = set(stopWordsText)
    fs.close()
    betterTokens = [token for token in textTokens if token not in stopWords]

    wordsCount = Counter(betterTokens) # count the occurrences

    # put each token and its occurrence in a file
    with open("wordcloud_"+filename+".txt", 'a') as fw:
        for line in wordsCount.most_common():
            fw.write(str(line[1]) + ' ' + line[0] + '\n')

In [47]:
getWordCloud("1342") # Jane Austen Pride and Prejudice


In [48]:
getWordCloud("804") # Laurence Sterne A sentimental journey through France and Italy

