Text processing and visualization
We'll use the Python package wordcloud to make word clouds. To install it, type in your terminal:
pip install wordcloud
If you don't have pip (or aren't familiar with it), you can install the package manually:
Download the package from https://github.com/amueller/word_cloud/archive/master.zip and unzip the file.
In your terminal, go to that folder (cd word_cloud-master).
Install the requirements:
sudo pip install -r requirements.txt
Install the package:
python setup.py install
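Note that this notebook also uses NLTK, pandas, and scikit-learn. NLTK's tokenizer and stopword list rely on data files that are downloaded separately from the package itself; if you haven't fetched them before, a one-time download like the sketch below should be enough (if your NLTK version asks for a differently named resource, download whatever it asks for):
import nltk
nltk.download('punkt')      # data used by word_tokenize
nltk.download('stopwords')  # the stop word lists used later in this notebook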
In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from wordcloud import WordCloud
import pandas as pd
import numpy as np
Let's go back to the IMDB dataset. We'll try to extract some meaningful keywords from the movie titles and make a word cloud with them.
In [2]:
df = pd.read_csv('../imdb.csv', sep = '\t')
df.head(2)
Out[2]:
The wordcloud package takes a string as its input, so we'll make the titles into one big string.
In [3]:
text = ' '.join([i for i in df.Title.tolist()])
In [4]:
text[0:30]
Out[4]:
In [5]:
wordcloud = WordCloud(max_font_size=40, max_words = 200, background_color="white").generate(text)
Note that you can tune these parameters; the full list of options can be found here.
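For example, here is a sketch of a differently tuned cloud. The keyword arguments width, height, colormap, and collocations exist in reasonably recent wordcloud releases; collocations=False stops the package from pairing words into bigrams.
wc = WordCloud(width=800, height=500, max_words=100, max_font_size=60,
               colormap='viridis', background_color='white',
               collocations=False).generate(text)
plt.figure(figsize=(10, 7))
plt.imshow(wc)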
In [6]:
plt.figure(figsize = (10,7))
plt.imshow(wordcloud)
Out[6]:
I'll make a function because we'll do this multiple times.
In [7]:
def draw_wordcloud(s):
    wordcloud = WordCloud(max_font_size=40, max_words=200, background_color="white").generate(s)
    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud)
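If the axis ticks around the image bother you, matplotlib's plt.axis("off") hides them; a variant of the helper (otherwise identical) could be:
def draw_wordcloud(s):
    wordcloud = WordCloud(max_font_size=40, max_words=200, background_color="white").generate(s)
    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud)
    plt.axis("off")  # hide the tick marks and frame, which carry no information here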
What do you think about the word cloud? Are the words really meaningful?
As we discussed in class, the word cloud's quality is mostly determined by the quality of the text that we give it. We can do some text processing to improve the text's quality. NLTK is a classic Python package for such tasks.
The first step of any text processing is usually to tokenize the string: that is, break down the movie titles into words.
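For a quick illustration on a single title string, the tokenizer splits punctuation into separate tokens (the exact behavior can vary a little between NLTK versions):
from nltk import word_tokenize
word_tokenize("Star Wars: Episode IV")  # ['Star', 'Wars', ':', 'Episode', 'IV']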
In [8]:
from nltk import word_tokenize
In [9]:
tokens = word_tokenize(text)
In [10]:
tokens[0:10]
Out[10]:
There are a lot of non-word symbols. Let's get rid of them.
In [11]:
tokens = [word for word in tokens if word.isalpha()]
In [12]:
tokens[0:10]
Out[12]:
Some of these words, like "My", are not really meaningful: these are the so-called stopwords. We can pick them out and remove them. NLTK provides a list of common stop words.
In [13]:
from nltk.corpus import stopwords
In [14]:
st = stopwords.words('english')
st[0:5]
Out[14]:
In [15]:
#TODO: remove stop words in the tokens.
In [16]:
tokens = [word for word in tokens if word not in st]
In [17]:
tokens[0:10]
Out[17]:
Let's try the word cloud again.
In [18]:
text = ' '.join([i for i in tokens])
In [19]:
draw_wordcloud(text)
Maybe a little better, but not much?
We can also add words to the stopword list. In our case, we probably don't want words like "One" in the picture.
In [20]:
st.append("one")
In [21]:
#TODO: add other words that are not meaningful and re-create the word cloud.
st.append("de")
st.append("di")
text = ' '.join([i for i in tokens])
draw_wordcloud(text)
There are still a lot of words that we probably can't make sense of. Some of these, like "Der", "El", are likely non-English words.
To remove non-English words, a simple way is to check each word and see if it is in the English dictionary. One source of a collection of common English words is here. You can download and use it.
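If you'd rather not download a separate file, NLTK also ships an English word list (the words corpus, fetched once with nltk.download('words')). A sketch of building the same kind of lookup set from it:
import nltk
nltk.download('words')
from nltk.corpus import words
english = set(w.lower() for w in words.words())  # a large set of lowercase English words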
In [22]:
english = [line.strip() for line in open('wordsEn.txt')]
english = set(english)
In [23]:
# keep only the tokens that appear in the English word list
tokens = [word for word in tokens if word.lower() in english]
text = ' '.join([i for i in tokens])
draw_wordcloud(text)
It removes some non-English words, but some others are still there, likely because they also happen to appear in the English vocabulary for one reason or another.
We may have to manually remove these by adding them to the stop words list.
In [24]:
st.append("De")
st.append("Da")
st.append("Lo")
st.append("Le")
st.append("Da")
st.append("El")
st.append("du")
st.append("la")
st.append("au")
In [25]:
tokens = [word.lower() for word in tokens]
st = [word.lower() for word in st]
In [26]:
#TODO: remove the other non-English words and make a word cloud.
tokens = [word for word in tokens if word not in st]
text = ' '.join([i for i in tokens])
draw_wordcloud(text)
In [27]:
# keep a copy of the cleaned tokens so we can reuse them when tuning TF-IDF below
tokens_backup = list(tokens)
Now we finally have a plot where we can make sense of most of the words. However, it still doesn't seem very informative.
The most frequent words in a collection of texts are not necessarily the most meaningful ones. A popular method that addresses this is TF-IDF (term frequency-inverse document frequency). In short, it is a way of scoring the "importance" of words: we can pick out the most important ones and draw a word cloud with them. sklearn has an implementation of it.
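To get a feel for what TF-IDF does before applying it to the titles, here is a tiny made-up corpus (not from our data): a word that shows up in every document ("war") gets a lower weight than a word concentrated in a single document.
from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["war love", "war story", "war peace"]
v = TfidfVectorizer()
X = v.fit_transform(toy)          # 3 documents x 4 terms, stored as a sparse matrix
print(v.get_feature_names())      # ['love', 'peace', 'story', 'war']  (get_feature_names_out() in newer sklearn)
print(X.toarray().round(2))       # per-document TF-IDF weights; rows are L2-normalized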
In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
TF-IDF works on a collection of documents. Since we only have one long list of tokens, I'll just divide it into short sublists and treat each sublist as a document.
In [29]:
# array_split and integer division avoid errors when len(tokens) isn't a multiple of 3
text = np.array_split(np.asarray(tokens), len(tokens) // 3)
text = [' '.join(i.tolist()) for i in text]
In [30]:
vectorizer = TfidfVectorizer(max_features=1000)
vectorizer.fit_transform(text)
tokens = vectorizer.get_feature_names()
In [31]:
text = ' '.join([i for i in tokens])
draw_wordcloud(text)
By setting max_features=1000, I'm telling the vectorizer to keep only a vocabulary of the top 1000 terms (note that sklearn selects these by term frequency across the corpus rather than by TF-IDF score). You can change this parameter and see how it influences the output.
In [32]:
#TODO: tune the max_features parameter and make another word cloud.
tokens = tokens_backup
text = np.array_split(np.asarray(tokens), len(tokens) // 3)
text = [' '.join(i.tolist()) for i in text]
In [33]:
vectorizer = TfidfVectorizer(max_features=500)
vectorizer.fit_transform(text)
tokens = vectorizer.get_feature_names()
In [34]:
text = ' '.join([i for i in tokens])
draw_wordcloud(text)
In [ ]: