Most of the data we've looked at so far has been structured, meaning essentially that the data looked like a table or Excel spreadsheet. Not all data looks like that, however. Human-readable text is an extremely common unstructured data source. From the text of a webpage, tweet, or document, businesses want to perform analyses such as identifying the most frequent or most important words.
Later in MSAN 692, we'll learn how to extract the text from webpages or pieces of webpages such as the bestseller list at Amazon. For now, we can play with some prepared text files.
Text analysis uses words as data rather than numbers, which means tokenizing text; i.e., splitting a document's text string into individual words. This problem is actually much harder than you might think. For example, if we split the document text on the space character, then "San Francisco" would be split into two words. For our purposes here, that'll work just fine. See Tokenization in this excellent information retrieval book for more information.
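To see the limitation concretely, here is a minimal sketch comparing a naive whitespace split with a simple regular-expression tokenizer that also strips punctuation (the sample sentence is made up for illustration):

```python
import re

text = "Welcome to San Francisco, the city by the bay!"

# Naive tokenization: split on whitespace. Punctuation sticks to words,
# and "San Francisco" becomes two separate tokens.
naive = text.split()
print(naive)  # ['Welcome', 'to', 'San', 'Francisco,', 'the', 'city', 'by', 'the', 'bay!']

# Slightly better: lowercase the text and keep only runs of letters,
# which drops the punctuation (but still splits "San Francisco" in two).
tokens = re.findall(r'[a-z]+', text.lower())
print(tokens)  # ['welcome', 'to', 'san', 'francisco', 'the', 'city', 'by', 'the', 'bay']
```

Recognizing multi-word names as single tokens requires fancier tools, which is why real tokenizers are more complicated than a call to split().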
In :! head data/IntroIstanbul.txt
The City and ITS People Istanbul is one of the worlds most venerable cities. Part of the citys allure is its setting, where Europe faces Asia across the winding turquoise waters of the Bosphorus, making it the only city in the world to bridge two continents.
In :
with open('data/IntroIstanbul.txt') as f:
    contents = f.read()  # read all content of the file
words = contents.split()
print(words[:25])  # print first 25 words
['The', 'City', 'and', 'ITS', 'People', 'Istanbul', 'is', 'one', 'of', 'the', 'worlds', 'most', 'venerable', 'cities.', 'Part', 'of', 'the', 'citys', 'allure', 'is', 'its', 'setting,', 'where', 'Europe', 'faces']
That looks more like it, although it is still not very clean. Some of the words are capitalized. We need to normalize the words so that, for example, "People" and "people" are considered the same word.
Exercise: Implement another filter pattern to convert the words to lowercase using a list comprehension.
Here's one way to do it:
In :
words = [w.lower() for w in words]
print(words[:25])
['the', 'city', 'and', 'its', 'people', 'istanbul', 'is', 'one', 'of', 'the', 'worlds', 'most', 'venerable', 'cities.', 'part', 'of', 'the', 'citys', 'allure', 'is', 'its', 'setting,', 'where', 'europe', 'faces']
That's still not the best we can do. For example, "faces" and "face" should be treated as the same word. Let's stem the words:
In :
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed[:45])
['the', 'citi', 'and', 'it', 'peopl', 'istanbul', 'is', 'one', 'of', 'the', 'world', 'most', 'vener', 'cities.', 'part', 'of', 'the', 'citi', 'allur', 'is', 'it', 'setting,', 'where', 'europ', 'face', 'asia', 'acr\xadoss', 'the', 'wind', 'turquois', 'water', 'of', 'the', 'bosphorus,', 'make', 'it', 'the', 'onli', 'citi', 'in', 'the', 'world', 'to', 'bridg', 'two']
Let's create a bag of words representation. My work plan would have a description like "Walk through the words in a document, updating a dictionary that holds the count for each word." The plan pseudocode would have a loop over the words whose body increments a count in a dictionary.
My code implementation would look like the following.
In :
from collections import defaultdict
wfreqs = defaultdict(int)  # missing entries yield value 0
for w in words:
    wfreqs[w] = wfreqs[w] + 1
print(wfreqs['ottoman'])
print(wfreqs['the'])
Computing the frequency of elements in a list is common enough that Python provides a built-in data structure called a Counter that will do this for us:
In :
from collections import Counter
ctr = Counter(words)
print(ctr['ottoman'])
print(ctr['the'])
That data structure is nice because it can give the list of, say, 10 most common words:
In :
print(ctr.most_common(10))
[('the', 123), ('of', 55), ('and', 40), ('to', 19), ('in', 16), ('is', 14), ('a', 13), ('city', 9), ('most', 9), ('from', 9)]
In :
print([p[0] for p in ctr.most_common(10)])
['the', 'of', 'and', 'to', 'in', 'is', 'a', 'city', 'most', 'from']
Python has a nice library called wordcloud that we can use to visualize the relative frequency of words. It should already be installed in your Anaconda Python distribution, but if not, use the command line to install it:
$ pip install wordcloud
The key elements of the following code are the creation of the WordCloud object and the call to fit_words() with a dictionary (type dict) of word-frequency associations.
In :
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud()
wordcloud.fit_words(ctr)
fig = plt.figure(figsize=(6, 4))  # prepare a plot 6x4 inches
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
That's kind of busy with all of those words in there, so let's focus on the top 30 words. To do that, we call most_common(), which gives us a list of tuples. Because fit_words() requires a dict, we convert the most common word list into a dictionary:
In :
# Get 30 most common word-freq pairs, then convert to a dictionary for use by WordCloud
wtuples = ctr.most_common(30)
wdict = dict(wtuples)
wordcloud = WordCloud()
wordcloud.fit_words(wdict)
fig = plt.figure(figsize=(6, 4))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
That looks better, but common English words like "the" and "of" are dominating the visualization. To focus on the words most relevant to the document, let's filter out these so-called English stop words. scikit-learn, a machine learning library you will become very familiar with in future classes, provides a nice list of stop words we can use:
In :
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
english = list(ENGLISH_STOP_WORDS)  # convert to a list so I can grab a subset
print(english[:25])  # print 25 of the words
['between', 'nobody', 'here', 'cannot', 'ourselves', 'hence', 'never', 'is', 'by', 'call', 'ie', 'which', 'off', 'due', 'this', 'some', 'any', 'another', 'why', 'one', 'ltd', 'become', 'first', 'anyone', 'during']
In :
goodwords = [w for w in words if w not in ENGLISH_STOP_WORDS]
goodctr = Counter(goodwords)
print(goodctr.most_common(10))
[('city', 9), ('istanbul', 7), ('worlds', 6), ('citys', 5), ('bosphorus', 5), ('sea', 4), ('ottoman', 4), ('important', 4), ('—', 4), ('europe', 3)]
In :
wtuples = goodctr.most_common(30)
wdict = dict(wtuples)
wordcloud = WordCloud()
wordcloud.fit_words(wdict)
fig = plt.figure(figsize=(5, 3))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
You can play around with the list of stop words, removing things like "important" and others, to really make the key words pop out. There is also a technique called TFIDF that automatically damps down common English words; we will learn about it soon in this class.
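As a preview, here is a minimal sketch of the TFIDF idea on a toy two-document corpus (the documents are made up for illustration): a word scores highly when it is frequent in one document but rare across documents, so words appearing everywhere score zero.

```python
import math
from collections import Counter

docs = [
    "the city of istanbul bridges two continents".split(),
    "the sea surrounds the city".split(),
]

def tfidf(word, doc, docs):
    """Term frequency times inverse document frequency."""
    tf = Counter(doc)[word] / len(doc)      # how often the word appears in this doc
    df = sum(1 for d in docs if word in d)  # how many docs contain the word
    idf = math.log(len(docs) / df)          # rarer across docs => larger weight
    return tf * idf

print(tfidf("istanbul", docs[0], docs))  # distinctive word: positive score
print(tfidf("the", docs[0], docs))       # appears in every doc: score 0.0
```

Because idf is log(N/df), any word that occurs in all N documents gets weight log(1) = 0, which is exactly the damping effect we want for words like "the" and "of".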
Text files are an unstructured data source that we typically represent as a bag of words. A bag of words representation is a set of associations mapping words to their frequency or count. We typically use a dictionary data structure for bag of words because dictionary lookup is extremely efficient, versus linearly scanning an entire list of associations. We used word clouds to visualize the relative frequency of words in a document.
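To see that efficiency difference in practice, here is a small illustrative timing comparison using synthetic words (the word list and sizes are arbitrary):

```python
import timeit

words = [f"word{i}" for i in range(10000)]
counts_list = [(w, 1) for w in words]  # associations as a list of (word, count) pairs
counts_dict = dict(counts_list)        # the same associations as a dictionary

# Looking up the last word: the list must be scanned end to end,
# while the dict lookup is a single hash-table probe.
t_list = timeit.timeit(lambda: [c for w, c in counts_list if w == "word9999"], number=100)
t_dict = timeit.timeit(lambda: counts_dict["word9999"], number=100)
print(f"list scan: {t_list:.4f}s   dict lookup: {t_dict:.6f}s")
```

The gap grows with vocabulary size: the list scan is linear in the number of distinct words, while the dictionary lookup stays essentially constant.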
The data structures and techniques described in this lecture-lab form the basis of natural language processing (NLP).