In [1]:
%load_ext watermark

%watermark -a 'Vahid Mirjalili' -d -p scikit-learn,numpy,numexpr,pandas,matplotlib,plotly -v


Vahid Mirjalili 20/12/2014 

CPython 2.7.3
IPython 2.3.1

scikit-learn 0.15.2
numpy 1.9.1
numexpr 2.2.2
pandas 0.15.1
matplotlib 1.4.2
plotly 1.4.7

In [2]:
from matplotlib import pyplot as plt
%matplotlib inline

import logging

import numpy as np
import scipy

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

Loading the Data


In [3]:
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')

categories = None  # None loads all twenty newsgroup categories

# Strip headers, footers, and quoted replies, which otherwise make
# the classification task artificially easy:
remove = ('headers', 'footers', 'quotes')

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)

categories = data_train.target_names

print("Categories: %s" %categories)

y_train, y_test = data_train.target, data_test.target

print("Dataset size: Training: %d Testing: %d" % (y_train.shape[0], y_test.shape[0]))


Categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Dataset size: Training: 11314 Testing: 7532
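
Before vectorizing, it is worth peeking at a raw sample. The cell below (a minimal sketch; index 0 is arbitrary) prints the beginning of the first training document and its label:

In [ ]:
# Peek at one raw training document and its newsgroup label.
print(data_train.data[0][:300])
print("Label: %s" % categories[y_train[0]])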

Vectorization (TF-IDF)

TF-IDF stands for term frequency times inverse document frequency; it is a statistical measure of how important a word's appearance in a document is for classification. The term frequency measures how many times a term appears in a particular document, and the inverse document frequency is the logarithm of the total number of documents divided by the number of documents that contain the word. The augmented form of TF-IDF is used to avoid a bias towards longer documents; for a term $t$ appearing in a particular document $d$ it is defined as

$$TF(t, d) = 0.5 + \frac{0.5 \times freq(t, d)}{\max\{freq(w, d) : w \in d\}}$$

$$IDF(t, D_{train}) = \log \frac{|D_{train}|}{|\{d \in D_{train} : t \in d\}|}$$

and

$$TFIDF(t, d, D_{train}) = TF(t, d) \times IDF(t, D_{train})$$

where $D_{train}$ is the set of all training documents. It is important to note that TF-IDF depends on the entire set of documents under consideration (for example, the training set), and not just on the term counts within a single document.
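
As a concrete illustration, the toy example below applies these formulas directly (the three-document corpus and the queried term 'cat' are hypothetical, made up for this sketch):

In [ ]:
import math

# Hypothetical toy corpus: each document is a list of tokens.
docs = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
        ['the', 'dog', 'barked'],
        ['cat', 'and', 'dog']]

def tf(t, d):
    # Augmented term frequency as defined above: 0.5 + 0.5*freq/max_freq.
    return 0.5 + 0.5 * d.count(t) / float(max(d.count(w) for w in d))

def idf(t, D):
    # log( total documents / documents containing the term )
    return math.log(float(len(D)) / sum(1 for d in D if t in d))

print(tf('cat', docs[0]) * idf('cat', docs))  # TF-IDF of 'cat' in document 0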


In [4]:
### Vectorizing the training set:
vectorizer = TfidfVectorizer(sublinear_tf=True,   # use 1 + log(tf) rather than raw term counts
                             max_df=0.5,          # ignore terms that appear in over half the documents
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)

print("Number of samples N= %d,  Number of features d= %d" % X_train.shape)


### Transforming the test dataset:
X_test = vectorizer.transform(data_test.data)

print("Number of Test Documents: %d,  Number of features: %d" %X_test.shape)


Number of samples N= 11314,  Number of features d= 101323
Number of Test Documents: 7532,  Number of features: 101323
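
As a quick sanity check on the vectorization, the sketch below lists the ten terms with the largest TF-IDF weights in the first training document (get_feature_names() is the accessor in this scikit-learn version; newer releases rename it get_feature_names_out()):

In [ ]:
# Terms with the largest TF-IDF weights in the first training document.
feature_names = np.array(vectorizer.get_feature_names())
row = X_train[0].toarray().ravel()
top10 = row.argsort()[::-1][:10]
for term, weight in zip(feature_names[top10], row[top10]):
    print("%-20s %.3f" % (term, weight))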

A Generic Classifier Function


In [5]:
### Train a classifier object and test it on the test set:
def apply_classifier(clf):
    """Fit a classifier on the training set and return its F1 score on the test set."""
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    # Average the per-class F1 scores, weighted by class support
    # (the default behavior for multiclass in scikit-learn 0.15).
    score = metrics.f1_score(y_test, pred, average='weighted')

    return score

Apply Different Classifiers


In [6]:
scores = {}

%timeit scores["BernoulliNB"] = apply_classifier(BernoulliNB(alpha=.01))

%timeit scores["MultinomialNB"] = apply_classifier(MultinomialNB(alpha=.01))

%timeit scores["SGD-classification"] = apply_classifier(SGDClassifier(alpha=.0001, n_iter=50, penalty="elasticnet"))


1 loops, best of 3: 641 ms per loop
1 loops, best of 3: 376 ms per loop
1 loops, best of 3: 21.4 s per loop
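
KNeighborsClassifier was imported above but not run; it could be evaluated the same way (a sketch with an arbitrary choice of n_neighbors=10; its score and timing are not reported here):

In [ ]:
%timeit scores["kNN"] = apply_classifier(KNeighborsClassifier(n_neighbors=10))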

In [7]:
scores


Out[7]:
{'BernoulliNB': 0.56475507274497405,
 'MultinomialNB': 0.69070660701674913,
 'SGD-classification': 0.68722858277421361}
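
To compare the classifiers at a glance, the collected F1 scores can be plotted with matplotlib (a minimal sketch, not part of the original run):

In [ ]:
# Horizontal bar chart of the test-set F1 scores collected above.
names = sorted(scores, key=scores.get)
plt.barh(range(len(names)), [scores[n] for n in names], align='center')
plt.yticks(range(len(names)), names)
plt.xlabel('F1 score (test set)')
plt.show()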
