Most of the data we've looked at so far has been structured, meaning essentially that the data looked like a table or Excel spreadsheet. Not all data looks like that, however. Human-readable text is an extremely common unstructured data source. From the text of a webpage, tweet, or document, businesses want to do things like count word frequencies, pull out key terms, and summarize content.
Later in MSAN 692, we'll learn how to extract the text from webpages or pieces of webpages such as the bestseller list at Amazon. For now, we can play with some prepared text files.
Text analysis uses words as data rather than numbers, which means tokenizing text; i.e., splitting the text string for a document into individual words. This problem is actually much harder than you might think. For example, if we split the document text on the space character, then "San Francisco" would be split into two words. For our purposes here, that'll work just fine. See Tokenization in this excellent information retrieval book for more information.
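For example (a quick sketch with a made-up phrase), splitting on whitespace breaks a multi-word name apart and leaves punctuation glued to words:

text = "I flew from San Francisco to Istanbul."
print(text.split())  # naive whitespace split
# ['I', 'flew', 'from', 'San', 'Francisco', 'to', 'Istanbul.']
# "San Francisco" becomes two tokens and "Istanbul." keeps its trailing period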
Let's use an article on Istanbul as our text file and then figure out how to get an appropriate list of words out of it.
In [18]:
! head data/IntroIstanbul.txt
In Loading files, we learned how to read the contents of such a file into a string and split it on the space character:
In [5]:
with open('data/IntroIstanbul.txt') as f:
    contents = f.read() # read all content of the file
words = contents.split()
print(words[:25]) # print first 25 words
That looks more like it, although it is still not very clean. Some of the words are capitalized. What we need is for all the words to be normalized so that "people" and "People" are considered the same word, and so on.
Exercise: Implement another filter pattern to convert the words to lowercase using lower(). E.g., 'The'.lower() is 'the'.
Here's one way to do it:
In [6]:
words = [w.lower() for w in words]
print(words[:25])
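The lowercased words still have punctuation attached (a trailing comma or period makes, say, "istanbul," and "istanbul" count as different words). Here is a minimal, optional sketch of one way to strip it using Python's string.punctuation; the name clean is just for illustration, and the rest of the lecture keeps working with words as-is:

import string
clean = [w.strip(string.punctuation) for w in words]  # strip leading/trailing punctuation
clean = [w for w in clean if w != '']                 # drop anything left empty
print(clean[:25])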
That's not the best we can do. For example "faces" and "face" should be the same. Let's stem the words:
In [7]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed[:45])
Let's create a bag of words representation. My work plan would have a description like "Walk through the words in a document, updating a dictionary that holds the count for each word." The plan pseudocode would have a loop over the words whose body increments a count in a dictionary. My code implementation would look like the following.
In [8]:
from collections import defaultdict
wfreqs = defaultdict(int) # missing entries yield value 0
for w in words:
    wfreqs[w] += 1
print(wfreqs['ottoman'])
print(wfreqs['the'])
Computing the frequency of elements in a list is common enough that Python's collections module provides a data structure called Counter that will do this for us:
In [9]:
from collections import Counter
ctr = Counter(words)
print(ctr['ottoman'])
print(ctr['the'])
That data structure is nice because it can give the list of, say, 10 most common words:
In [10]:
print(ctr.most_common(10))
In [11]:
print([p[0] for p in ctr.most_common(10)])
Python has a nice library called wordcloud that we can use to visualize the relative frequency of words. It should already be installed in your Anaconda Python directory, but if not, use the command line to install it:
$ pip install wordcloud
The key elements of the following code are the creation of the WordCloud object and the call to fit_words() with a dictionary (type dict) of word-frequency associations; here we pass the Counter ctr, which is a dict subclass.
In [12]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud()
wordcloud.fit_words(ctr)
fig=plt.figure(figsize=(6, 4)) # Prepare a plot 6x4 inches
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
That's kind of busy with all of those words in there, so let's focus on the top 30 words. To do that we will call most_common(), which gives us a list of tuples. Because fit_words() requires a dict, we convert the most common word list into a dictionary:
In [17]:
# Get 30 most common word-freq pairs then convert to dictionary for use by WordCloud
wtuples = ctr.most_common(30)
wdict = dict(wtuples)
wordcloud = WordCloud()
wordcloud.fit_words(wdict)
fig=plt.figure(figsize=(6, 4))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
That looks better, but common English words like "the" and "of" are dominating the visualization. To focus on the words most relevant to the document, let's filter out these so-called English stop words. scikit-learn, a machine learning library you will become very familiar with in future classes, provides a nice list of stop words we can use:
In [14]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
english = list(ENGLISH_STOP_WORDS) # Convert to a list so I can grab a subset
print(english[:25]) # Print 25 of the words
In [15]:
goodwords = [w for w in words if w not in ENGLISH_STOP_WORDS]
goodctr = Counter(goodwords)
print(goodctr.most_common(10))
In [16]:
wtuples = goodctr.most_common(30)
wdict = dict(wtuples)
wordcloud = WordCloud()
wordcloud.fit_words(wdict)
fig=plt.figure(figsize=(5, 3))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
You can play around with the list of stop words, removing words like "important" and others, to really make the key words pop out. There is also a technique called TF-IDF that automatically damps down common English words; we will learn about it soon in this class.
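As a rough preview only, here is a minimal hand-rolled sketch of one common TF-IDF formulation on a tiny made-up two-document corpus; it is not the implementation we will use later, just the idea that words appearing in every document get weighted down:

import math
from collections import Counter

docs = [['the', 'ottoman', 'empire'], ['the', 'city', 'of', 'istanbul']]  # made-up corpus
N = len(docs)

def tfidf(word, doc):
    tf = Counter(doc)[word] / len(doc)       # how often the word appears in this document
    df = sum(1 for d in docs if word in d)   # how many documents contain the word
    return tf * math.log(N / df)             # common words (high df) are damped toward 0

print(tfidf('the', docs[0]))      # 0.0 because 'the' appears in every document
print(tfidf('ottoman', docs[0]))  # > 0 because 'ottoman' is specific to this document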
Text files are an unstructured data source that we typically represent as a bag of words. A bag of words representation is a set of associations mapping words to their frequency or count. We typically use a dictionary data structure for bag of words because dictionary lookup is extremely efficient, versus linearly scanning an entire list of associations. We used word clouds to visualize the relative frequency of words in a document.
The data structures and techniques described in this lecture-lab form the basis of natural language processing (NLP).