You can do many interesting things with text!
There is even an entire area of Machine Learning called Natural Language Processing (NLP), which covers any kind of machine manipulation of natural human language.
Here I want to show a couple of pre-processing examples that normally form the basis for more sophisticated models and algorithms.
Let's start with a simple text in a Python string:
In [1]:
sampleText = " The Elephant's 4 legs: this is THE Pub. Hi, you, my super_friend! You can't believe what happened to our * common friend *, the butcher. How are you?"
The basic atomic parts of a text are the tokens.
A token is the NLP name for a sequence of characters that we want to treat as a group.
For example, we can consider groups of characters separated by blank spaces, thus forming words.
Tokens can be extracted by splitting the text:
In [2]:
textTokens = sampleText.split() # by default split by spaces
In [3]:
textTokens
Out[3]:
In [4]:
print ("The sample text has {} tokens".format (len(textTokens)))
As you can see, tokens are words but also symbols like *. This is because we have split the string simply on blank spaces. But you can pass other separators to the split() function, such as commas.
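For example, a quick sketch (using a made-up comma-separated string, not one of the cells above) of splitting on a different separator:

csvLine = "alpha,beta,gamma"
csvLine.split(',')  # -> ['alpha', 'beta', 'gamma']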
I have the number of tokens for this sample text. Let's say now that I want to count the frequency for each token.
This can be done quickly by using the Python package Counter from collections.
In [5]:
from collections import Counter
In [6]:
totalWords = Counter(textTokens);
totalWords
Out[6]:
There are a number of problems: for example, the token "The" is counted separately from "the" because of the different letter case, and punctuation stays attached to some tokens.
Let's start by normalising the case: this is very easy to do in Python by using the lower() function.
In [7]:
loweredText = sampleText.lower()
In [8]:
loweredText
Out[8]:
In [9]:
textTokens = loweredText.split()
totalWords = Counter(textTokens);
totalWords
Out[9]:
Now the token "the" is counted correctly 3 times !
But other words like you are still wrongly counted because of the punctuation such as comma or question mark.
Removing the extra spaces at the beginning and end of the string is very easy with the string function strip():
In [10]:
strippedText = loweredText.strip()
strippedText
Out[10]:
To remove punctuation we can use regular expressions.
Regular expressions are very powerful for matching patterns in a sequence of characters.
The way to match specific characters in a regular expression is to list them inside square brackets: for example [abc] will match a single a, b or c letter and nothing else. A shorthand for matching ranges of characters is the dash: for example [a-z] will match any single lowercase letter and [0-9] any single digit character.
To exclude characters from the matching we use the ^ (caret) symbol: for example [^abc] will match any single character except the letters a, b or c.
Finally, there is also a special symbol for whitespace, as it is so ubiquitous in text. Whitespace characters are blank spaces, tabs, newlines and carriage returns; the symbol \s will match any of them.
The function re.sub() takes as input a pattern to match, a replacement string and the starting string, and returns the string with every match replaced by the replacement. For example, re.sub(r'[s]', 'z', "whatsapp") will return "whatzapp".
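As a quick illustration of these rules (the input strings are just made-up examples):

import re
re.findall(r'[abc]', "abracadabra")   # -> ['a', 'b', 'a', 'c', 'a', 'a', 'b', 'a']
re.findall(r'[0-9]', "4 legs")        # -> ['4']
re.sub(r'[^a-z]', '', "Hi, you!")     # -> 'iyou' (uppercase letters are removed too)
re.sub(r'\s', '-', "one\ttwo three")  # -> 'one-two-three'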
So one way to remove the punctuation is to replace any character that is NOT a letter, a number or a whitespace with an empty substring:
In [11]:
import re
In [12]:
processedText = re.sub(r'[^a-z0-9\s]', '', strippedText) # keep only numbers and letters
processedText
Out[12]:
Another useful symbol is \w, which matches ANY alphanumeric character (plus the underscore), while \W matches any NON-alphanumeric character.
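For instance (again a made-up string, just to show the difference):

re.findall(r'\w', "a_1!")  # -> ['a', '_', '1'] (note that \w also matches the underscore)
re.findall(r'\W', "a_1!")  # -> ['!']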
So, an alternative way could be:
In [13]:
processedText = re.sub(r'[^\s\w]', '', strippedText) # remove punctuation
processedText
Out[13]:
In [14]:
processedText4 = re.sub(r'^\s+', r'', processedText) # remove spaces at the beginning
processedText4
Out[14]:
In [15]:
textTokens = processedText.split()
totalWords = Counter(textTokens);
totalWords
Out[15]:
Now the token "you" is also correctly counted 3 times.
A collection can be sorted easily:
In [16]:
print (totalWords.items())
In [17]:
sorted(totalWords.items(), key=lambda x:x[1],reverse=True)
Out[17]:
An alternative way (without lambda functions) to sort a collection:
In [18]:
from operator import itemgetter
sorted(totalWords.items(), key=itemgetter(1), reverse=True)
Out[18]:
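As a side note, a Counter already provides a most_common() method that returns the same descending ordering, so the explicit sorting is not strictly needed:

totalWords.most_common()   # all tokens, most frequent first
totalWords.most_common(3)  # only the 3 most frequent tokens

Let's now wrap the whole pre-processing pipeline into a few small reusable helper functions: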
In [19]:
def tokenise(text):
    return text.split() # split by space; return a list
In [20]:
def removePunctuation(text):
    processedText = re.sub(r'([^\s\w_]|_)+', '', text.strip()) # remove punctuation
    return processedText
In [21]:
def preprocessText(text, lowercase=True):
    if lowercase:
        processedText = removePunctuation(text.lower())
    else:
        processedText = removePunctuation(text)
    return tokenise(processedText)
In [22]:
def getMostCommonWords(tokens, n=10):
    wordsCount = Counter(tokens) # count the occurrences
    return wordsCount.most_common()[:n]
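For example, chaining the helpers on the sample text from the beginning should give something like:

tokens = preprocessText(sampleText)
getMostCommonWords(tokens, 3)  # the top tokens, e.g. ('the', 3) and ('you', 3)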
Python makes working with files pretty simple.
Let's use one of the books from Project Gutenberg, which I have already downloaded from ....
In [25]:
filePath = "../datasets/theprince.txt"
First of all, we need to open the file using the function open(), which returns a file object.
We can then read the file content using the file object's method read().
In [27]:
f = open(filePath)
try:
    theText = f.read() # this is a giant String
finally:
    f.close() # we should always close the file once finished
In [29]:
print ("*** Analysing book The Prince:")
print ("The book is {} chars long".format (len(theText)))
tokens = preprocessText(theText)
print ("The text has {} tokens".format (len(tokens)))
This is an easy way to read the entire text, but it loads the whole file into memory at once as a single string, which can be a problem with very large files.
A better way is to read the file line by line.
We can do this with a simple loop that goes through each line.
Note that the with keyword will automatically close the file at the end of the block.
In [30]:
textTokens = [] # tokens will be added here
lines = 0
with open(filePath) as f:
    for line in f:
        # I know that the very first line is the book title
        if lines == 0:
            print ("Title: {}".format(line))
        # every line gets processed
        lineTokens = preprocessText(line)
        # append the tokens to my list
        textTokens.extend(lineTokens)
        lines += 1 # finally move to the next line
In [31]:
print ("The text has {} lines".format (lines))
print ("The text has {} tokens".format (len(textTokens)))
In [32]:
textTokens[:10]
Out[32]:
Another useful data structure in Python is the set, which is an unordered collection of distinct objects. Arranging the tokens in a set means that each of them is stored only once; the resulting, smaller collection is useful to see how many distinct tokens are in a text, or to check whether a specific token appears in the text or not.
In [33]:
uniqueTokens = set(textTokens)
print ("The text has {} unique tokens".format (len(uniqueTokens)))
print (" -> lexical diversity: each token in average is repeated {} times".format(len(textTokens) / len(uniqueTokens)))
In [34]:
sorted(uniqueTokens)[200:205]
Out[34]:
In [35]:
'accuse' in uniqueTokens
Out[35]:
In [36]:
'phone' in uniqueTokens
Out[36]:
In [37]:
getMostCommonWords(textTokens, 5)
Out[37]:
As you can see, the most common words are not really meaningful, but we can remove them.
Stopwords (https://en.wikipedia.org/wiki/Stop_words) are common (English) words that do not contribute much to the content or meaning of a document (e.g., "the", "a", "is", "to", etc.).
In [38]:
f = open("stopwords.txt")
In [39]:
stopWordsText = f.read().splitlines() # splitlines is used to remove newlines
In [40]:
f.close()
In [41]:
stopWords = set(stopWordsText)
In [42]:
betterTokens = [token for token in textTokens if token not in stopWords]
In [43]:
betterTokens[:10]
Out[43]:
In [44]:
getMostCommonWords(betterTokens)
Out[44]:
In [45]:
from urllib.request import urlopen
In [46]:
def getWordCloud(filename):
    textTokens = [] # tokens will be added here
    lines = 0
    path = "http://www.gutenberg.org/files/"
    url = path + filename + "/" + filename + "-0.txt"
    f = urlopen(url)
    for line in f:
        # every line gets processed
        lineTokens = preprocessText(line.decode('utf-8'))
        # append the tokens to my list
        textTokens.extend(lineTokens)
        lines += 1 # finally move to the next line
    fs = open("stopwords.txt")
    stopWordsText = fs.read().splitlines()
    stopWords = set(stopWordsText)
    fs.close()
    betterTokens = [token for token in textTokens if token not in stopWords]
    wordsCount = Counter(betterTokens) # count the occurrences
    # put each token and its occurrence in a file
    with open("wordcloud_"+filename+".txt", 'a') as fw:
        for line in wordsCount.most_common():
            # freq = (line[1] + 10 // 2) // 10
            # if freq > 2:
            fw.write(str(line[1]) + ' ' + line[0] + '\n')
In [47]:
getWordCloud("1342") # Jane Austen Pride and Prejudice
In [48]:
getWordCloud("804") # Laurence Sterne A sentimental journey through France and Italy
In [ ]: