Basic usage of Sklearn


In [2]:
import sklearn
import numpy as np
import matplotlib.pyplot as plt

data = np.array([[1,2], [2,3], [3,4], [4,5], [5,6]])
x = data[:,0]
y = data[:,1]

data, x, y


Out[2]:
(array([[1, 2],
        [2, 3],
        [3, 4],
        [4, 5],
        [5, 6]]), array([1, 2, 3, 4, 5]), array([2, 3, 4, 5, 6]))
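
The cell above only exercises NumPy, so as a minimal sketch of sklearn itself on this toy data, one could fit an ordinary least-squares line; since y = x + 1, the fitted slope and intercept should both come out close to 1.


In [ ]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x.reshape(-1, 1), y)   # scikit-learn expects a 2D feature array

model.coef_, model.intercept_    # slope and intercept of the fitted line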

Text processing with Scikit learn

We can use CountVectorizer to extract a bag-of-words representation from a collection of documents, using its fit_transform method. We will use a list of strings as documents.


In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df = 1)
content = ["How to format my hard disk", " Hard disk format problems "]

# fit_transform returns a sparse matrix with two rows, one per 'document'.
# Each row has 7 elements, one per feature, giving the number of times
# that feature occurs in the document.
X = vectorizer.fit_transform(content)

vectorizer.get_feature_names(), X.toarray()


Out[3]:
([u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to'],
 array([[1, 1, 1, 1, 1, 0, 1],
        [1, 1, 1, 0, 0, 1, 0]]))

The count vector for the first document


In [4]:
X.toarray()[0]


Out[4]:
array([1, 1, 1, 1, 1, 0, 1])

Number of times the word "hard" occurs in the second document


In [5]:
X.toarray()[1][vectorizer.get_feature_names().index('hard')]


Out[5]:
1
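
The fitted vectorizer can also be reused on unseen text. As a small sketch (the document below is made up), transform() counts only terms already in the vocabulary, and vocabulary_ gives the same term-to-column mapping as get_feature_names().


In [ ]:
# Apply the fitted vectorizer to a new document and look up the count of
# 'hard' via the vocabulary_ mapping instead of get_feature_names().
new_doc = ["format the hard disk and the other hard disk"]
X_new = vectorizer.transform(new_doc)
X_new.toarray()[0][vectorizer.vocabulary_['hard']]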

Using the 20 Newsgroups dataset

We are going to fetch just a few categories so that downloading the documents doesn't take too long.


In [6]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
                                 categories=categories, shuffle=True,
                                 random_state=42)
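
Before vectorizing, it can be useful to take a quick look at what fetch_20newsgroups returned: the raw documents are in .data, the integer labels in .target, and the label names in .target_names. A small sketch:


In [ ]:
# Number of training documents and the names of the four selected categories.
len(twenty_train.data), twenty_train.target_names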

Creating a CountVectorizer object


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(twenty_train.data)

We can now check that the word "algorithm" occurs in the vocabulary of the subset of the 20 Newsgroups collection we are considering. Note that vocabulary_ maps each term to its column index in the document-term matrix, so the value below is an index, not a frequency.


In [8]:
vectorizer.vocabulary_.get(u'algorithm')


Out[8]:
4690
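
Since vocabulary_ only gives the column index, the actual number of occurrences of "algorithm" can be obtained by summing that column of the document-term matrix, as in this sketch:


In [ ]:
# Sum the column of the sparse count matrix that corresponds to 'algorithm'
# to get its total number of occurrences across the training documents.
train_counts[:, vectorizer.vocabulary_.get(u'algorithm')].sum()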

How many terms were extracted? Use get_feature_names().


In [9]:
len(vectorizer.get_feature_names())


Out[9]:
35788

CountVectorizer can also do additional preprocessing, such as stop-word removal.


In [10]:
vectorizer = CountVectorizer(stop_words='english')
sorted(vectorizer.get_stop_words())[0:20]


Out[10]:
['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst']
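
To see the effect, the two toy documents from the beginning (content) can be re-vectorized with this stop-word-aware vectorizer; 'how', 'to' and 'my' are English stop words and disappear from the vocabulary. A short sketch:


In [ ]:
# Re-vectorize the two toy documents from earlier with stop-word removal.
X_stopped = vectorizer.fit_transform(content)
vectorizer.get_feature_names(), X_stopped.toarray()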

More preprocessing

For stemming and more advanced preprocessing, we supplement scikit-learn with another Python library, NLTK, which is covered next.

More advanced preprocessing with NLTK

NLTK is described in detail in a book by Bird, Klein and Loper, available online at http://www.nltk.org/book_1ed/ (this first edition targets Python 2.7).

About NLTK

  • It is not the fastest or most state-of-the-art NLP toolkit
  • It is very easy to use

You should read the book linked above to get familiar with the package and with text preprocessing.


In [11]:
import nltk

Create an English stemmer

http://www.nltk.org/howto/stem.html for general intro. http://www.nltk.org/api/nltk.stem.html for more details (including languages covered).


In [12]:
s = nltk.stem.SnowballStemmer('english')
s.stem("cats"), s.stem("ran"), s.stem("jumped")


Out[12]:
(u'cat', u'ran', u'jump')
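
SnowballStemmer is not limited to English; as a quick sketch, the supported languages can be listed directly on the class:


In [ ]:
# Languages covered by the Snowball stemmer implementation in NLTK.
nltk.stem.SnowballStemmer.languages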

NLTK for text analytics

  • Named entity recognition (NER); a sketch follows the POS-tagging example below
  • Sentiment analysis
  • Extracting information from social media.

In [13]:
from nltk.tokenize import word_tokenize
text = word_tokenize("And now for something completely different")

In [14]:
nltk.pos_tag(text)


Out[14]:
[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
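
As a taste of the named entity recognition mentioned above, NLTK's ne_chunk can mark entities in a POS-tagged sentence. A minimal sketch (the example sentence is made up, and the chunker requires the 'maxent_ne_chunker' and 'words' data packages, available via nltk.download()):


In [ ]:
# POS-tag a sentence and chunk named entities with ne_chunk.
sentence = word_tokenize("Mark works at Google in New York")
nltk.ne_chunk(nltk.pos_tag(sentence))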

Integrating NLTK with SciKit's vectorizer

NLTK Stemmer

The stemmer can be used to stem documents before they are fed into scikit-learn's vectorizer, thus obtaining a more compact index. One way to do this is to define a new class StemmedCountVectorizer that extends CountVectorizer by redefining the method build_analyzer(), which handles preprocessing and tokenization.

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

build_analyzer() returns an analyzer function that takes a string as input and outputs a list of tokens.


In [15]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words="english")
analyze = vectorizer.build_analyzer()
analyze("John bought carrots and potatoes")


Out[15]:
[u'john', u'bought', u'carrots', u'potatoes']

If we redefine build_analyzer() to apply the NLTK stemmer to the output of the default analyzer, we get a version that does stemming as well:


In [16]:
import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # get the default analyzer (preprocessing + tokenization) ...
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        # ... then stem every token it produces
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

So now we can create an instance of this class:


In [17]:
stem_vectorizer = StemmedCountVectorizer(min_df=1,
                                        stop_words='english')
stem_analyze = stem_vectorizer.build_analyzer()
Y = stem_analyze("John bought carrots and potatoes")

[tok for tok in Y]


Out[17]:
[u'john', u'bought', u'carrot', u'potato']

Use this vectorizer to extract features

Compare the number of features extracted here to the roughly 35,800 features we obtained with the unstemmed vectorizer.


In [18]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian',
             'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
                                 categories=categories,
                                 shuffle=True, random_state=42)
train_counts = stem_vectorizer.fit_transform(twenty_train.data)

len(stem_vectorizer.get_feature_names())   # number of stemmed features
print train_counts[:6]                     # first rows of the sparse count matrix


  (0, 21801)	1
  (0, 7414)	4
  (0, 3982)	2
  (0, 24860)	2
  (0, 16603)	3
  (0, 7664)	3
  (0, 23291)	1
  (0, 8048)	3
  (0, 13380)	1
  (0, 13029)	2
  (0, 15168)	2
  (0, 13343)	2
  (0, 17763)	1
  (0, 19575)	1
  (0, 13005)	1
  (0, 12402)	1
  (0, 18309)	1
  (0, 25122)	2
  (0, 15512)	1
  (0, 587)	1
  (0, 9528)	1
  (0, 14886)	1
  (0, 12005)	1
  (0, 26032)	1
  (0, 22970)	1
  :	:
  (5, 7893)	1
  (5, 8052)	2
  (5, 11738)	1
  (5, 10823)	1
  (5, 5559)	1
  (5, 4064)	1
  (5, 19573)	1
  (5, 21596)	1
  (5, 9606)	1
  (5, 22968)	1
  (5, 10061)	1
  (5, 10238)	1
  (5, 19197)	1
  (5, 12061)	1
  (5, 23254)	1
  (5, 21137)	1
  (5, 24451)	1
  (5, 25969)	1
  (5, 6408)	1
  (5, 13897)	1
  (5, 20641)	1
  (5, 9531)	1
  (5, 15677)	1
  (5, 14290)	1
  (5, 6821)	1

Notes

You should always experiment to see whether stemming actually helps on your problem and data set; it is not always the best thing to do.
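
One possible experiment, sketched here with a simple Naive Bayes classifier (not otherwise covered in this lab), is to train the same pipeline with and without stemming and compare accuracy on the held-out 'test' subset:


In [ ]:
# Compare accuracy of the same classifier with unstemmed and stemmed counts.
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)

for name, vec in [('unstemmed', CountVectorizer(stop_words='english')),
                  ('stemmed', StemmedCountVectorizer(stop_words='english'))]:
    model = Pipeline([('vec', vec), ('clf', MultinomialNB())])
    model.fit(twenty_train.data, twenty_train.target)
    print name, model.score(twenty_test.data, twenty_test.target)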

For processing larger datasets, Python and scikit-learn become less effective, and more industrial-strength software is required. One example of such software is Apache Solr, an open-source indexing package available from http://lucene.apache.org/solr/. It produces Lucene-style indices that can be used by text analytics packages such as Mahout.

Another option is Elastic: http://www.elastic.co/


In [44]:
!ipython nbconvert --to script Lab1\ Text\ processing\ with\ python.ipynb


[NbConvertApp] Converting notebook Lab1 Text processing with python.ipynb to script
[NbConvertApp] Writing 5943 bytes to Lab1 Text processing with python.py
