Template for guiding principal component analysis (PCA) using Python's scikit-learn (sklearn) library. In this template, sklearn is imported directly and its tools are referenced with fully qualified names (e.g., sklearn.decomposition.PCA). The other common convention I typically see is from sklearn import [module] or from sklearn.module import [function]. In my experience this can get confusing, especially when custom tools are imported in a similar manner.

Typical imports when not using the fully qualified convention above:

from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in releases before 0.18
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn import model_selection  # replaces the removed sklearn.cross_validation module

PCA

This IPython notebook demonstrates the use of this workflow.

import numpy as np
import sklearn.decomposition   # importing the submodules lets the fully qualified
import sklearn.pipeline        # sklearn.<module>.<Class> calls below resolve
import sklearn.preprocessing

1. Make a feature array containing every field except the label or target.

  • There are a variety of approaches to this; they will be covered in another template (a minimal sketch follows this list in the meantime)
  • TODO: create and link template
  • Output: my_features (2D array of shape n_samples x n_features)
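  • Example: a minimal sketch; my_data, its field names, and the 'label' key are hypothetical placeholders
    import numpy as np

    # my_data is assumed to be a list of dicts keyed by field name; 'label' is excluded
    feature_fields = [f for f in my_data[0] if f != 'label']
    my_features = np.array([[row[f] for f in feature_fields] for row in my_data])
    my_labels = np.array([row['label'] for row in my_data])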

2. Scale or standardize features with the sklearn preprocessing module

  • Example:
    scaled_pca_data = sklearn.preprocessing.MinMaxScaler().fit_transform(my_features)
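  • If standardization (zero mean, unit variance) is preferred over min-max scaling, StandardScaler can be swapped in; a minimal sketch:
    scaled_pca_data = sklearn.preprocessing.StandardScaler().fit_transform(my_features)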

3. Perform dimensionality reduction, often by specifying either the number of components or the fraction of variance to retain.

  • Example:
    perc_var = 0.95
    # a float between 0 and 1 keeps enough components to explain that fraction of the variance
    pca = sklearn.decomposition.PCA(n_components=perc_var)
    
  • It's good practice to plot the cumulative explained variance vs. the number of components, or to track other metrics, during this step
  • TODO: helper code for this will be added in a another template and placed here
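  • Until that template exists, a minimal sketch of the plot (assumes matplotlib is installed and scaled_pca_data from step 2):
    import numpy as np
    import matplotlib.pyplot as plt
    import sklearn.decomposition

    # fit a full PCA so every component's variance contribution is available
    full_pca = sklearn.decomposition.PCA().fit(scaled_pca_data)
    cum_var = np.cumsum(full_pca.explained_variance_ratio_)
    plt.plot(range(1, len(cum_var) + 1), cum_var)
    plt.xlabel('number of components')
    plt.ylabel('cumulative explained variance')
    plt.show()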

4. Implement PCA with a pipeline

  • Pipeline objects chain multiple processing steps into a single estimator object that can then be passed to other sklearn modules (e.g., GridSearchCV)
  • TODO: Add example where standardizing or scaling is implemented (a preliminary sketch follows the example below)
  • TODO: Add example links (from references in enron project)
  • Example:
    my_pipe = sklearn.pipeline.Pipeline(steps=[('pca', pca),
                                               ('my_estimator', my_estimator)])
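  • A minimal sketch of a pipeline that also includes the scaling step (my_estimator is a placeholder for any sklearn estimator):
    import sklearn.decomposition
    import sklearn.pipeline
    import sklearn.preprocessing

    scaled_pipe = sklearn.pipeline.Pipeline(steps=[
        ('scale', sklearn.preprocessing.MinMaxScaler()),
        ('pca', sklearn.decomposition.PCA(n_components=perc_var)),
        ('my_estimator', my_estimator)])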

Grid Search Tuning

1. Set up a scoring function and cross validator (cv)

  • sklearn modules can be used to make both
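  • Example: a minimal sketch, assuming a recent sklearn release where both live in sklearn.model_selection; the f1 scorer and StratifiedShuffleSplit are just one reasonable choice

import sklearn.model_selection

my_scoring_function = 'f1'
my_cross_validator = sklearn.model_selection.StratifiedShuffleSplit(n_splits=100,
                                                                    test_size=0.3,
                                                                    random_state=42)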

2. Set up estimator pipeline

  • Example: this example uses PCA, but any dimensionality reduction or feature selection step could be used (or none at all)

estimator = [('reduce_dim', sklearn.decomposition.PCA()),
             ('dec_tree', base_estimator)]

my_estimator_pipe = sklearn.pipeline.Pipeline(estimator)

3. Set up the search parameter dictionary and create the grid search object

my_params = {'reduce_dim__n_components': [perc_var],
             'dec_tree__min_samples_split': [2, 10, 20]}  # placeholder list of estimator parameters

my_grid_search = sklearn.model_selection.GridSearchCV(my_estimator_pipe,
                                                      param_grid=my_params,
                                                      scoring=my_scoring_function,
                                                      cv=my_cross_validator)

4. Pass data into the grid search via the fit method

my_grid_search.fit(features, labels)

5. Select the best estimator

my_best_estimator = my_grid_search.best_estimator_
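The fitted search object also exposes the parameters chosen by the search and the corresponding cross-validated score:

print(my_grid_search.best_params_)   # parameter combination chosen by the search
print(my_grid_search.best_score_)    # mean cross-validated score of the best estimator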
