Udacity Machine Learning Feature Selection mini-project

Prep: imports, data loading, and TF-IDF vectorization:


In [1]:
import pickle
import numpy
from sklearn import cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
numpy.random.seed(42)
words_file = "../text_learning/your_word_data.pkl"
authors_file = "../text_learning/your_email_authors.pkl"
word_data = pickle.load(open(words_file, "r"))
authors = pickle.load(open(authors_file, "r"))

In [2]:
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()
print 'training data:',features_train.shape
print 'testing data:',features_test.shape


training data: (15820, 37863)
testing data: (1758, 37863)
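As a reminder of how the vectorizer is configured: `max_df=0.5` discards any word that appears in more than half of the documents (too common to discriminate between authors), and `sublinear_tf=True` replaces raw term counts with `1 + log(tf)`. A minimal sketch of the `max_df` behavior with toy documents (written for Python 3 and current scikit-learn, unlike the transcript cells):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["enron meeting power", "enron power price", "enron vacation plans"]
# max_df=0.5 drops any word whose document frequency is strictly above 0.5
vec = TfidfVectorizer(max_df=0.5)
vec.fit(docs)
# 'enron' (3/3 docs) and 'power' (2/3 docs) are pruned from the vocabulary
print(sorted(vec.vocabulary_))
```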

In [3]:
features_train = features_train[:150].toarray()   # keep only 150 training points to force overfitting
labels_train   = labels_train[:150]

Overfitting a decision tree (many features with few data points)
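With 150 samples and ~38,000 features, an unconstrained tree can memorize the training set outright. A self-contained illustration of the effect with synthetic data (hypothetical shapes, Python 3): even when the labels are pure noise, the tree fits them perfectly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
# 150 samples, 5000 purely random features; labels carry no signal at all
X_train = rng.rand(150, 5000)
y_train = rng.randint(0, 2, 150)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# with no depth limit, the tree memorizes every training point
print(clf.score(X_train, y_train))  # 1.0
```

Any decent *test* accuracy in this regime therefore warrants suspicion that a few features are leaking label information.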


In [4]:
clfOverfit = DecisionTreeClassifier()
clfOverfit.fit(features_train,labels_train)


Out[4]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')

Evaluating:


In [5]:
clfOverfit.score(features_test,labels_test)


Out[5]:
0.94766780432309439

Finding particularly relevant features
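The loop below scans every feature index; an equivalent and slightly more direct idiom is `numpy.argsort` on `feature_importances_`, which lists the suspects highest-first. A toy sketch (synthetic data, Python 3) in which only one feature is informative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 20)
y = (X[:, 3] > 0.5).astype(int)  # only feature 3 determines the label

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = clf.feature_importances_

# walk indices sorted by importance, highest first, down to the 0.1 threshold
for i in np.argsort(importances)[::-1]:
    if importances[i] <= 0.1:
        break
    print('feature', i, ':', importances[i])
```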


In [6]:
# list every feature whose importance exceeds the 0.1 threshold
for i in range(clfOverfit.feature_importances_.shape[0]):
    if clfOverfit.feature_importances_[i] > 0.1:
        print 'feature',i,':',clfOverfit.feature_importances_[i]


feature 33201 : 0.134028294862
feature 33614 : 0.764705882353

Identifying the problematic word:


In [7]:
vectorizer.get_feature_names()[33614]


Out[7]:
u'sshacklensf'

This looks rather like part of someone's email signature.

Loading modified data set where this has been pruned out:
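In the mini-project, the pruning happens upstream (in `vectorize_text.py`, where the flagged word would be added to the list of strings stripped from each email before pickling). A minimal sketch of the idea; the helper name here is hypothetical:

```python
def strip_signature_words(text, signature_words=("sshacklensf",)):
    """Remove known signature tokens from an email body before vectorizing."""
    for word in signature_words:
        text = text.replace(word, "")
    return text

print(strip_signature_words("meeting tomorrow sshacklensf"))
```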


In [8]:
words_file = "../text_learning/your_word_data2.pkl"
authors_file = "../text_learning/your_email_authors2.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

Decision Tree trained on modified data:


In [9]:
clfOverfit = DecisionTreeClassifier()
clfOverfit.fit(features_train,labels_train)


Out[9]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')

The new accuracy is still suspiciously high:


In [10]:
clfOverfit.score(features_test,labels_test)


Out[10]:
0.9704209328782708

The tree still leans on a couple of possible problem words:


In [11]:
for i in range(clfOverfit.feature_importances_.shape[0]):
    if clfOverfit.feature_importances_[i] > 0.1:
        print 'feature',i,':',clfOverfit.feature_importances_[i]


feature 8674 : 0.162601626016
feature 14343 : 0.666666666667

In [12]:
print vectorizer.get_feature_names()[8674]
print vectorizer.get_feature_names()[14343]


62502pst
cgermannsf

The first looks like a timestamp; the second appears to be another signature fragment. Removing the signature word.

Pruning and reloading again


In [14]:
words_file = "../text_learning/your_word_data3.pkl"
authors_file = "../text_learning/your_email_authors3.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

In [15]:
clfOverfit = DecisionTreeClassifier()
clfOverfit.fit(features_train,labels_train)


Out[15]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')

Notable drop in accuracy:


In [16]:
clfOverfit.score(features_test,labels_test)


Out[16]:
0.81626848691695109

Checking for overly influential words:


In [17]:
for i in range(clfOverfit.feature_importances_.shape[0]):
    if clfOverfit.feature_importances_[i] > 0.1:
        print 'feature',i,':',clfOverfit.feature_importances_[i]


feature 11975 : 0.105378579003
feature 18849 : 0.186927243449
feature 21323 : 0.363636363636

These look like (probably) genuine message content:


In [18]:
print vectorizer.get_feature_names()[11975]
print vectorizer.get_feature_names()[18849]
print vectorizer.get_feature_names()[21323]


attach
fax
houectect