In [37]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
In [8]:
from nltk.util import ngrams

sentence = 'A black-dog and a spotted dog are fighting.'
n = 2
bigrams = ngrams(sentence.split(), n)  # generator of n-grams over whitespace tokens
for grams in bigrams:
    print(grams)
Some of these bigrams are obviously not informative. So we tokenize the sentence properly and exclude stop words (and punctuation) to get more relevant bigrams.
In [17]:
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', '-'])  # remove this line if you need punctuation
list_of_words = [i.lower() for i in wordpunct_tokenize(sentence) if i.lower() not in stop_words]
bigrams = ngrams(list_of_words, 2)
for grams in bigrams:
    print(grams)
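On a single toy sentence every bigram occurs exactly once, but on a larger corpus counting bigram frequencies is a simple way to surface the most relevant ones. A minimal sketch using NLTK's FreqDist (not part of the original analysis):
In [ ]:
# Sketch: count bigram frequencies; on a real corpus the most frequent
# stopword-free bigrams tend to be the informative ones.
from nltk import FreqDist
bigram_freq = FreqDist(ngrams(list_of_words, 2))
print(bigram_freq.most_common(3))  # every bigram occurs once in this toy sentence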
The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. This dataset is often used for text classification and text clustering. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). From: http://qwone.com/~jason/20Newsgroups/
In [18]:
%run fetch_data.py twenty_newsgroups
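If the fetch_data.py helper script is not available, scikit-learn's built-in fetch_20newsgroups loader can download the same data (an alternative path, shown only as a sketch; the rest of this notebook keeps loading from ./datasets/):
In [ ]:
# Alternative download sketch (assumes network access); the notebook below
# continues to use the files placed under ./datasets/ by fetch_data.py.
from sklearn.datasets import fetch_20newsgroups
sample = fetch_20newsgroups(subset='train', categories=['sci.space'])
print(len(sample.data), sample.target_names)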
In [22]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer # Tf IDF feature extraction
from sklearn.feature_extraction.text import CountVectorizer # Count and vectorize text feature
# Load the text data
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
twenty_train_small = load_files('./datasets/20news-bydate-train/',
                                categories=categories, encoding='latin-1')
twenty_test_small = load_files('./datasets/20news-bydate-test/',
                               categories=categories, encoding='latin-1')
# Let's display some of the data
def display_sample(i, dataset):
    target_id = dataset.target[i]
    print("Class id: %d" % target_id)
    print("Class name: " + dataset.target_names[target_id])
    print("Text content:\n")
    print(dataset.data[i])

display_sample(0, twenty_train_small)
Let's extract word counts to convert each text document into a vector of token counts.
In [41]:
count_vect = CountVectorizer(min_df=2)
X_train_counts = count_vect.fit_transform(twenty_train_small.data)
print(X_train_counts.shape)
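To see what these counts mean, we can look up a token's column index in the vectorizer's vocabulary and read off its count in one document (the token 'space' below is just an illustrative choice, not a term guaranteed to be present):
In [ ]:
# Sketch: inspect the raw count of one token in the first document.
# 'space' is an assumed example token; vocabulary_.get returns None if it is absent.
col = count_vect.vocabulary_.get('space')
if col is not None:
    print("count of 'space' in document 0:", X_train_counts[0, col])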
Let's extract TF-IDF features from the text data. The min_df option sets a lower bound on document frequency, so terms that occur in fewer documents than the cutoff are ignored.
In [26]:
# Extract features
# Turn the text documents into vectors of word frequencies with tf-idf weighting
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_small.data)
y_train = twenty_train_small.target
print(type(X_train))
print(X_train.shape)
As observed, X_train is a SciPy sparse matrix with 2034 rows (one per text document) and 17566 features (unique terms).
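To make the min_df cutoff concrete, we can refit the vectorizer without it and compare vocabulary sizes, and also check how sparse X_train actually is (a quick sketch; the exact numbers depend on the data on disk):
In [ ]:
# Sketch: effect of min_df and sparsity of the TF-IDF matrix (numbers are data-dependent).
vectorizer_all = TfidfVectorizer(min_df=1)  # keep every term, however rare
X_all = vectorizer_all.fit_transform(twenty_train_small.data)
print("vocabulary with min_df=1:", len(vectorizer_all.vocabulary_))
print("vocabulary with min_df=2:", len(vectorizer.vocabulary_))
print("fraction of non-zero entries in X_train: %.4f"
      % (X_train.nnz / float(X_train.shape[0] * X_train.shape[1])))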
In [32]:
print(type(vectorizer.vocabulary_))          # Type of the vocabulary mapping
print(len(vectorizer.vocabulary_))           # Size of the vocabulary
print(vectorizer.get_feature_names()[:10])   # First 10 feature names
print(vectorizer.get_feature_names()[-10:])  # Last 10 feature names
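Individual TF-IDF weights are easier to interpret per document: the highest-weighted terms in a row are the ones that characterize that document. A small sketch for the first training document (showing 10 terms is an arbitrary choice):
In [ ]:
# Sketch: the 10 highest-weighted terms of document 0.
feature_names = np.array(vectorizer.get_feature_names())
row = X_train[0].toarray().ravel()
top = row.argsort()[::-1][:10]
print(list(zip(feature_names[top], row[top])))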
Obviously, it's hard to make sense of such a high-dimensional feature space directly. A good technique for visualizing such data is to project it to lower dimensions with PCA (here TruncatedSVD, which works directly on sparse matrices) and then plot the low-dimensional space.
In [39]:
from sklearn.decomposition import TruncatedSVD

# Project the sparse TF-IDF matrix down to 2 components for plotting
X_train_pca = TruncatedSVD(n_components=2).fit_transform(X_train)

from itertools import cycle

colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y_train), cycle(colors)):
    plt.scatter(X_train_pca[y_train == i, 0],
                X_train_pca[y_train == i, 1],
                c=c, label=twenty_train_small.target_names[i], alpha=0.8)
_ = plt.legend(loc='best')
Obviously, the data is no longer linearly separable after this projection, but some interesting patterns can still be observed: for example, alt.atheism and talk.religion.misc overlap.
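Part of that overlap simply reflects how little variance two components retain; checking the explained variance ratio of the projection gives a rough sense of this (a quick sketch, refitting the SVD because the instance above was used inline and not stored):
In [ ]:
# Sketch: how much variance the 2-D projection keeps.
svd = TruncatedSVD(n_components=2).fit(X_train)
print("explained variance ratio:", svd.explained_variance_ratio_)
print("total: %.3f" % svd.explained_variance_ratio_.sum())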