In [2]:
import sklearn
import numpy as np
import matplotlib.pyplot as plt
data = np.array([[1,2], [2,3], [3,4], [4,5], [5,6]])
x = data[:,0]
y = data[:,1]
data, x, y
Out[2]:
In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df = 1)
content = ["How to format my hard disk", " Hard disk format problems "]
# fit_transform returns a matrix with two rows, one per 'document'.
# Each row has 7 elements, one per extracted feature, giving the
# number of times that feature occurred in that document.
X = vectorizer.fit_transform(content)
vectorizer.get_feature_names(), X.toarray()
Out[3]:
In [4]:
X.toarray()[0]
Out[4]:
In [5]:
X.toarray()[1][vectorizer.get_feature_names().index('hard')]
Out[5]:
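As an aside, once the vectorizer has been fitted, further documents can be mapped into the same feature space with transform(). Below is a minimal sketch using a made-up sentence (the new_doc name is just illustrative); words not seen during fitting are simply ignored.
In [ ]:
# transform() reuses the vocabulary learned by fit_transform();
# unseen words (e.g. 'new') do not add columns, they are dropped
new_doc = ["A new hard disk"]
vectorizer.transform(new_doc).toarray()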
In [6]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories, shuffle=True,
                                  random_state=42)
In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(twenty_train.data)
We can now check that the word algorithm occurs in the subset of the 20 Newsgroups collection we are considering: vocabulary_ maps each extracted term to its column index in the count matrix, so a non-None result means the term was seen.
In [8]:
vectorizer.vocabulary_.get(u'algorithm')
Out[8]:
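Note that vocabulary_ returns an index, not a count. If we want the total number of occurrences of algorithm across the training subset, we can sum the corresponding column of train_counts; a minimal sketch (the idx name is just illustrative, and it assumes the term is present, as checked above):
In [ ]:
# vocabulary_ gives the column of 'algorithm' in train_counts;
# summing that column gives its total count over all documents
idx = vectorizer.vocabulary_.get(u'algorithm')
train_counts[:, idx].sum()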
How many terms were extracted? Use get_feature_names().
In [9]:
len(vectorizer.get_feature_names())
Out[9]:
CountVectorizer can do further preprocessing, such as stop-word removal.
In [10]:
vectorizer = CountVectorizer(stop_words='english')
sorted(vectorizer.get_stop_words())[0:20]
Out[10]:
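To see the effect on the toy example from earlier, we can refit this stop-word-aware vectorizer on the content list defined above. A minimal sketch (the X_stopped name is just illustrative); common words such as 'to' and 'my' should no longer appear among the features:
In [ ]:
# refit on the two example documents: English stop words are
# dropped, so fewer features remain than in the earlier run
X_stopped = vectorizer.fit_transform(content)
vectorizer.get_feature_names(), X_stopped.toarray()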
NLTK is described in detail in the book by Bird, Klein and Loper, available online at http://www.nltk.org/book_1ed/ (this edition targets Python 2.7).
You should read the book linked above to get familiar with the package and with text preprocessing.
In [11]:
import nltk
See http://www.nltk.org/howto/stem.html for a general introduction to stemming, and http://www.nltk.org/api/nltk.stem.html for more details (including the languages covered).
In [12]:
s = nltk.stem.SnowballStemmer('english')
s.stem("cats"), s.stem("ran"), s.stem("jumped")
Out[12]:
In [13]:
from nltk.tokenize import word_tokenize
text = word_tokenize("And now for something completely different")
In [14]:
nltk.pos_tag(text)
Out[14]:
The stemmer can be used to stem documents before feeding them into scikit-learn's vectorizer, thus obtaining a more compact index. One way to do this is to define a new class StemmedCountVectorizer extending CountVectorizer and redefining the method build_analyzer(), which handles preprocessing and tokenization.
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
build_analyzer() takes a string as input and outputs a list of tokens.
In [15]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words="english")
analyze = vectorizer.build_analyzer()
analyze("John bought carrots and potatoes")
Out[15]:
If we modify build_analyzer() to apply the NLTK stemmer to the output of the default build_analyzer(), we get a version that does stemming as well:
In [16]:
import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
So now we can create an instance of this class:
In [17]:
stem_vectorizer = StemmedCountVectorizer(min_df=1,
                                         stop_words='english')
stem_analyze = stem_vectorizer.build_analyzer()
Y = stem_analyze("John bought carrots and potatoes")
[tok for tok in Y]
Out[17]:
In [18]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True, random_state=42)
train_counts = stem_vectorizer.fit_transform(twenty_train.data)
print len(stem_vectorizer.get_feature_names())
print train_counts[:6]
You should always experiment to see whether stemming helps on your particular problem; it is not always the best thing to do.
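For instance, one simple check is to compare the vocabulary sizes obtained with and without stemming on the same training data. A rough sketch (the plain_vectorizer name is just illustrative), reusing twenty_train and the already fitted stem_vectorizer from above:
In [ ]:
# compare index sizes: the stemmed vocabulary is typically smaller,
# since inflected forms collapse onto a single stem
plain_vectorizer = CountVectorizer(stop_words='english')
plain_counts = plain_vectorizer.fit_transform(twenty_train.data)
len(plain_vectorizer.get_feature_names()), len(stem_vectorizer.get_feature_names())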
For larger datasets, Python and scikit-learn become less effective, and more industrial-strength software is required. One example is Apache Solr, an open-source indexing package available from http://lucene.apache.org/solr/. It produces Lucene-style indices that can be used by text-analytics packages such as Mahout.
Another example is Elasticsearch: http://www.elastic.co/
In [44]:
!ipython nbconvert --to script Lab1\ Text\ processing\ with\ python.ipynb
In [ ]: