Classification 1: Generative methods

Stats 208: Lecture 6

Prof. Sharpnack

Lecture slides at course github page
Some content of these slides are from STA 251 notes and STA 141B lectures.
Some content is from Elements of Statistical Learning

\left{ \begin{array}{ll} \end{array}\right.

Bayes rule in classification

Recall from homework that Bayes rule is $$ g(x) = \left\{ \begin{array}{ll} 1, &\mathbb P \{Y = 1 | X = x \} > \mathbb P \{Y = 0 | X = x \} \\ 0, &{\rm otherwise}\end{array}\right. $$ Another way to write this event is (for $f_X(x) > 0$) $$ f_{Y,X}(1, x) = \mathbb P \{Y = 1 | X = x \} f_X(x) > \mathbb P \{Y = 0 | X = x \} f_X(x) = f_{Y,X} (0, x) $$ Let $\pi = \mathbb P \{ Y = 1\}$ then this is also $$ \pi f_{X|Y}(x | 1) > (1 - \pi) f_{X|Y} (x|0) $$ which is $$ \frac{f_{X|Y}(x | 1)}{f_{X|Y} (x|0)} > \tau = \frac{1-\pi}{\pi} $$

Bayes rule in classification

$$ \frac{f_{X|Y}(x | 1)}{f_{X|Y} (x|0)} > \tau = \frac{1-\pi}{\pi} $$

the Bayes rule is performing a likelihood ratio test

Generative methods

A generative method does the following

treats $Y=1$ and $Y=0$ as different datasets and tries to estimate the densities $\hat f_{X | Y}$.
then plug these in to the formula for the Bayes rule

Naive Bayes methods assume that each component of $X$ is independent of one another, but does non-parametric density estimation for the densities $\hat f_{X_j|Y}$
Parametric methods fit a parametric density to $X|Y$

Density estimation

Parametric maximum likelihood estimation
Nonparametric: Kernel density estimation (KDE), nearest neighbor methods,

Reasonable heuristic for estimating a density $\hat f_X$, based on data $x_1,\ldots,x_n$ is

Let $N(x,\epsilon)$ be the number of data points within $\epsilon$ of $x$
$\hat f(x) = N(x,\epsilon) / n$Vol$(B(\epsilon))$ divide by the volume of the ball of radius $\epsilon$

$$\mathbb E \left( \frac{N(x,\epsilon)}{n} \right)= \mathbb P\{X \in B(x,\epsilon) \} \approx f_x(x) \textrm{Vol}(B(\epsilon))$$

Kernel density estimation

Let the Boxcar kernel function be $$ k(\|x_0-x_1\|) = \frac{1\{ \| x_0 - x_1 \| \le 1 \}}{{\rm Vol}(B(1))} $$ then the number of pts within $\epsilon$ is $$ N(x,\epsilon) = {\rm Vol}(B(1)) \sum_i k\left( \frac{\| x - x_i \|}{\epsilon} \right) $$ and the density estimate is $$ \hat f(x) = \frac 1n \sum_i \frac{{\rm Vol}(B(1))}{{\rm Vol}(B(\epsilon))} \cdot k\left( \frac{\| x - x_i \|}{\epsilon} \right) $$ this is equal to $$ \hat f(x) = \frac 1n \sum_i \frac{1}{\epsilon^p} \cdot k\left( \frac{\| x - x_i \|}{\epsilon} \right) $$

Kernel density estimation

General kernel density estimate is based on a kernel such that $$ \int k(\|x-x_0\|) dx = 1 $$ a kernel that is a valid probability density is a common choice.

Then KDE is $$ \hat f(x) = \frac 1n \sum_i \frac{1}{\epsilon^p} \cdot k\left( \frac{\| x - x_i \|}{\epsilon} \right) $$ where $p$ is the dimensionality of the X space.

from wikipedia

Naive Bayes

For each $y = 0,1$ let $x_1,\ldots,x_{n_y}$ be the predictor data with $Y = y$

For each dimension j
- Let $\hat f_{y,j}$ be the KDE of $x_{1,j},\ldots,x_{n_y,j}$
Let $\hat f_y = \prod_j \hat f_{y,j}$

Let $\pi$ be the proportion of $Y = 1$ then let $\tau = (1 - \pi) / \pi$.

Predict $\hat y = 1$ for a new $x$ if $$ \frac{f_{X|Y}(x | 1)}{f_{X|Y} (x|0)} > \tau $$ and $\hat y=0$ otherwise.

from mathworks.org

Gaussian Generative Models

Fit parametric model for each class using likelihood based approach.
Assume a Gaussian distribution $$ X | Y = k \sim \mathcal N(\mu_k, \Sigma_k) $$ for mean and variance parameters $\mu_k, \Sigma_k$.



In [4]:

    
import pandas as pd
import numpy as np
import matplotlib as mpl
import plotnine as p9
import matplotlib.pyplot as plt
import itertools

import warnings
warnings.simplefilter("ignore")

from sklearn import neighbors, preprocessing, impute, metrics, model_selection, linear_model, svm, feature_selection



In [3]:

    
from matplotlib.pyplot import rcParams
rcParams['figure.figsize'] = 6,6



In [7]:

    
def train_bank_to_xy(bank):
    """standardize and impute training"""
    bank_sel = bank[['age','balance','duration','y']].values
    X,y = bank_sel[:,:-1], bank_sel[:,-1]
    scaler = preprocessing.StandardScaler().fit(X)
    imputer = impute.SimpleImputer(fill_value=0).fit(X)
    trans_prep = lambda Z: imputer.transform(scaler.transform(Z)) 
    X = trans_prep(X)
    y = 2*(y == 'yes')-1
    return (X, y), trans_prep

def test_bank_to_xy(bank, trans_prep):
    """standardize and impute test"""
    bank_sel = bank[['age','balance','duration','y']].values
    X,y = bank_sel[:,:-1], bank_sel[:,-1]
    X = trans_prep(X)
    y = 2*(y == 'yes')-1
    return (X, y)



In [9]:

    
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()



In [6]:

    
bank = pd.read_csv('../../data/bank.csv',sep=';',na_values=['unknown',999,'nonexistent'])
bank.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
age          4521 non-null int64
job          4483 non-null object
marital      4521 non-null object
education    4334 non-null object
default      4521 non-null object
balance      4519 non-null float64
housing      4521 non-null object
loan         4521 non-null object
contact      3197 non-null object
day          4521 non-null int64
month        4521 non-null object
duration     4521 non-null int64
campaign     4521 non-null int64
pdays        4521 non-null int64
previous     4521 non-null int64
poutcome     816 non-null object
y            4521 non-null object
dtypes: float64(1), int64(6), object(10)
memory usage: 600.5+ KB



In [10]:

    
bank_tr, bank_te = model_selection.train_test_split(bank,test_size=.33)



In [11]:

    
p9.ggplot(bank_tr, p9.aes(x = 'age',fill = 'y')) + p9.geom_density(alpha=.2)









    












    Out[11]:





<ggplot: (8729495701396)>



In [12]:

    
(X_tr, y_tr), trans_prep  = train_bank_to_xy(bank_tr)
X_te, y_te = test_bank_to_xy(bank_te, trans_prep)



In [13]:

    
def plot_conf_score(y_te,score,tau):
    y_pred = 2*(score > tau) - 1
    classes = [1,-1]
    conf = metrics.confusion_matrix(y_te, y_pred,labels=classes)
    plot_confusion_matrix(conf, classes)

Evaluating a classifier

Most classifiers are "soft" because they can output a score, higher means more likely to be $Y=1$

Logistic regression: output probability
SVM: distance from margin
kNN: percent of neighbors with $Y=1$
LDA/QDA/Naive bayes: estimated likelihood ratio

If we order from largest to smallest then this gives us the points to predict as 1 first.
Choose a cut-off to say all above this value are 1 and below are 0 can see different errors

Confusion matrix and classification metrics

	Pred 1	Pred 0
True 1	True Pos	False Neg
True 0	False Pos	True Neg

$$ \textrm{FPR} = \frac{FP}{FP+TN} $$$$ \textrm{TPR, Recall} = \frac{TP}{TP + FN} $$$$ \textrm{Precision} = \frac{TP}{TP + FP} $$



In [14]:

    
score_dur = X_te[:,2]



In [15]:

    
p9.ggplot(bank_tr[['duration','y']].dropna(axis=0)) + p9.aes(x = 'duration',fill = 'y')\
+ p9.geom_density(alpha=.5)









    












    Out[15]:





<ggplot: (-9223363307361480823)>



In [16]:

    
plot_conf_score(y_te,score_dur,1.)









    



Confusion matrix, without normalization
[[  80   94]
 [ 118 1200]]



In [17]:

    
plot_conf_score(y_te,score_dur,2.)









    



Confusion matrix, without normalization
[[  45  129]
 [  34 1284]]



In [22]:

    
## Fit and find NNs
nn = neighbors.NearestNeighbors(n_neighbors=10,metric="l2")
nn.fit(X_tr)
dists, NNs = nn.kneighbors(X_te)
NNs[1], y_tr[NNs[1]].mean(), y_te[1]









    Out[22]:





(array([ 136,  438, 2848, 2928, 2064, 1181, 1153, 1941,  998,  288]), -0.8, -1)



In [23]:

    
score_nn = np.array([(y_tr[knns] == 1).mean() for knns in NNs])
plot_conf_score(y_te,score_nn,.2)









    



Confusion matrix, without normalization
[[  94   80]
 [ 160 1158]]



In [25]:

    
nn = neighbors.KNeighborsClassifier(n_neighbors=10)
nn.fit(X_tr, y_tr)
score_nn = nn.predict_proba(X_te)[:,1]
plot_conf_score(y_te,score_nn,.2)









    



Confusion matrix, without normalization
[[  94   80]
 [ 160 1158]]



In [26]:

    
def print_top_k(score_dur,y_te,k_top):
    ordering = np.argsort(score_dur)[::-1]
    print("k: score, y")
    for k, (yv,s) in enumerate(zip(y_te[ordering],score_dur[ordering])):
        print("{}: {}, {}".format(k,s,yv))
        if k >= k_top - 1:
            break



In [27]:

    
print_top_k(score_dur,y_te,10)









    



k: score, y
0: 7.254928748788767, 1
1: 7.024561224272875, 1
2: 6.885546338789148, -1
3: 6.794193699756984, 1
4: 6.4208394358864025, 1
5: 5.876695455564383, -1
6: 5.801230232016074, 1
7: 5.674130908145238, 1
8: 5.352410744597183, 1
9: 5.153818051049001, 1

Confusion matrix and metrics

	Pred 1	Pred -1
True 1	True Pos	False Neg
True -1	False Pos	True Neg

$$ \textrm{FPR} = \frac{FP}{FP+TN} $$$$ \textrm{TPR, Recall} = \frac{TP}{TP + FN} $$$$ \textrm{Precision} = \frac{TP}{TP + FP} $$



In [28]:

    
plt.style.use('ggplot')



In [29]:

    
fpr_dur, tpr_dur, threshs = metrics.roc_curve(y_te,score_dur)
plt.figure(figsize=(6,6))
plt.plot(fpr_dur,tpr_dur)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title("ROC for 'duration'")









    Out[29]:





Text(0.5, 1.0, "ROC for 'duration'")



In [30]:

    
def plot_temp():
    plt.figure(figsize=(6,6))
    plt.plot(fpr_dur,tpr_dur,label='duration')
    plt.plot(fpr_nn,tpr_nn,label='knn')
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.legend()
    plt.title("ROC")



In [31]:

    
fpr_nn, tpr_nn, threshs = metrics.roc_curve(y_te,score_nn)
plot_temp()



In [32]:

    
def plot_temp():
    plt.figure(figsize=(6,6))
    plt.plot(rec_dur,prec_dur,label='duration')
    plt.plot(rec_nn,prec_nn,label='knn')
    plt.xlabel('recall')
    plt.ylabel('precision')
    plt.legend()
    plt.title("PR curve")



In [33]:

    
prec_dur, rec_dur, threshs = metrics.precision_recall_curve(y_te,score_dur)
prec_nn, rec_nn, threshs = metrics.precision_recall_curve(y_te,score_nn)
plot_temp()

Comments

"Good" ROC should be in top left
"Good" PR should be large for all recall values
PR is better for large class imbalance
ROC treats each type of error equally