Chapter 04
The response variable $Y$ may be qualitative rather than quantitative.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
# scatter plot of the Default data: balance vs. income, by default status
default = pd.read_csv('Default.csv', index_col=0)
yes_default = default[default['default'] == 'Yes']
no_default = default[default['default'] == 'No']
yes_balance = yes_default['balance'].values
yes_income = yes_default['income'].values
no_balance = no_default['balance'].values
no_income = no_default['income'].values
plt.scatter(yes_balance, yes_income, c='b', marker='*', label='Yes')
plt.scatter(no_balance, no_income, edgecolor='y', facecolor='none', marker='o', label='No')
plt.xlabel('balance')
plt.ylabel('income')
plt.legend()
plt.show()
To ensure that the probability falls between 0 and 1, we model $p(X)$ using the sigmoid function $f(x)=\frac{e^x}{1+e^x}$, which extends the linear regression model to: $$p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$ After a bit of manipulation, we conclude that: $$\frac{p(X)}{1-p(X)}=e^{\beta_0+\beta_1X} \rightarrow \log\Big(\frac{p(X)}{1-p(X)}\Big) = \beta_0+\beta_1X$$ The left-hand side is called the log-odds or logit, and it is linear in $X$.
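As a quick sanity check of these formulas, here is a minimal sketch; the coefficient values below are made up for illustration, not fitted from data:
In [ ]:
# evaluate p(X) via the sigmoid and verify that the log-odds is linear in x
beta0, beta1 = -10.0, 0.005            # illustrative values, not estimates
x = np.linspace(0, 3000, 5)
p = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))
log_odds = np.log(p / (1 - p))         # recovers beta0 + beta1 * x
print(np.round(p, 4))
print(np.round(log_odds, 4))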
To estimate the coefficients, the more general method of maximum likelihood is preferred: $$\ell(\beta_0,\beta_1)=\prod_{i:y_i=1}p(x_i)\prod_{i':y_{i'}=0}\big(1-p(x_{i'})\big)$$ The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize this likelihood.
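A sketch of evaluating this likelihood (in log form, which is maximized at the same coefficients) on a tiny made-up data set:
In [ ]:
# log-likelihood of candidate coefficients on toy data (illustrative values only)
x_toy = np.array([500.0, 1200.0, 2000.0, 2500.0])
y_toy = np.array([0, 0, 1, 1])

def log_likelihood(beta0, beta1, x, y):
    p = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_likelihood(-10.0, 0.005, x_toy, y_toy))
print(log_likelihood(-5.0, 0.003, x_toy, y_toy))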
In [4]:
# fit logistic regression of default (Yes/No) on balance
X = default['balance'].values.astype('float')
X = X.reshape((len(X),1))
y = default['default'].values
y = [1 if item=='Yes' else 0 for item in y]
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
lg.fit(X,y)
print(lg.coef_, lg.intercept_)
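With the fitted model, predict_proba evaluates $p(X)$ for new balance values; a short sketch (the balance values below are arbitrary):
In [ ]:
# estimated probability of default for a few arbitrary balances
new_balance = np.array([[1000.0], [1500.0], [2000.0]])
print(lg.predict_proba(new_balance))   # columns: P(default = No), P(default = Yes)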
A comparison of logistic regression and LDA:
Let $\pi_k$ represent the overall or prior probability that a randomly chosen observation comes from the $k$th class, and let $f_k(x)=Pr(X=x|Y=k)$ denote the density function of $X$ for an observation that comes from the $k$th class. Bayes' theorem states that: $$ Pr(Y=k|X=x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^K\pi_lf_l(x)} $$ We use the abbreviation $p_k(x)=Pr(Y=k|X=x)$ and refer to $p_k(x)$ as the posterior probability that an observation $X=x$ belongs to the $k$th class. Classifying an observation to the class for which $p_k(x)$ is largest (the Bayes classifier) has the lowest possible error rate out of all classifiers.
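A minimal sketch of this posterior calculation, with made-up priors and density values (the numbers are purely illustrative):
In [ ]:
# Bayes' theorem for K = 2 classes with illustrative priors and densities
pi = np.array([0.3, 0.7])                 # prior probabilities pi_k
f_x = np.array([0.05, 0.60])              # densities f_k(x) evaluated at some fixed x
posterior = pi * f_x / np.sum(pi * f_x)   # p_k(x)
print(posterior, posterior.argmax())      # the Bayes classifier picks the largest p_k(x)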
Assume that $f_k(x)$ is normal or Gaussian. In the one-dimensional setting, the normal density takes the form: $$ f_k(x)=\frac{1}{\sqrt{2\pi}\sigma_k}\exp\Big(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2\Big) $$ where $\mu_k$ and $\sigma_k^2$ are the mean and variance parameters for the $k$th class. Let us further assume that $\sigma_1^2=\cdots=\sigma_K^2$, i.e. all $K$ classes share the same variance $\sigma^2$. Taking the log and rearranging the terms, it is not hard to show that this is equivalent to assigning the observation to the class for which $$ \delta_k(x)=x\cdot \frac{\mu_k}{\sigma^2} -\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k) $$ is largest, where the parameters are estimated by $$\hat{\mu}_k=\frac{1}{n_k}\sum_{i:y_i=k}x_i$$ $$\hat{\sigma}^2=\frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2$$
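A sketch of these one-dimensional estimates on the balance column, assuming yes_balance and no_balance from the first cell are still in scope (the test value $x_0$ is arbitrary):
In [ ]:
# estimate mu_k, pi_k and the pooled sigma^2, then evaluate delta_k(x)
n_yes, n_no = len(yes_balance), len(no_balance)
n, K = n_yes + n_no, 2
mu_yes, mu_no = yes_balance.mean(), no_balance.mean()
pi_yes, pi_no = n_yes / n, n_no / n
sigma2 = (((yes_balance - mu_yes) ** 2).sum() + ((no_balance - mu_no) ** 2).sum()) / (n - K)

def delta(x, mu, pi):
    # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)

x0 = 1500.0
print(delta(x0, mu_yes, pi_yes), delta(x0, mu_no, pi_no))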
To indicate that a $p$-dimensional random variable $X$ has a multivariate Gaussian distribution, we write $X \sim N(\mu,\Sigma)$. Here $E(X)=\mu$ is the mean of $X$, and $Cov(X)=\Sigma$ is the $p\times p$ covariance matrix of $X$. Formally, the multivariate Gaussian density is defined as $$ f(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\big(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\big) $$ The Bayes classifier assigns an observation $X=x$ to the class for which $$ \delta_k(x)=x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k $$ is largest, where $\mu_k$ is a class-specific mean vector and $\Sigma$ is a covariance matrix that is common to all $K$ classes.
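A sketch of evaluating this multivariate discriminant with numpy, using made-up parameters (purely illustrative):
In [ ]:
# delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log(pi_k)
mu_k = np.array([1.0, 2.0])                   # class-specific mean vector
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])    # common covariance matrix
pi_k = 0.4
x0 = np.array([1.5, 1.0])
Sigma_inv = np.linalg.inv(Sigma)
print(x0 @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k))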
In [18]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
# LDA of default status on income and student status
incomes = default['income'].values.astype('float')
students = default['student'].values
students = np.array([1.0 if student == 'Yes' else 0.0 for student in students])
defaults = default['default'].values
incomes = incomes.reshape((len(incomes), 1))
students = students.reshape((len(students), 1))
y = [1 if d == 'Yes' else 0 for d in defaults]
X = np.hstack((incomes, students))
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
pred = lda.predict(X)
print(1 - accuracy_score(y, pred))   # training error rate
Unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the $k$th class is of the form $X \sim N(\mu_k, \Sigma_k)$, where $\Sigma_k$ is the covariance matrix for the $k$th class. Under this assumption, the Bayes classifier assigns an observation $X=x$ to the class for which $$ \delta_k(x)=-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{1}{2}\log|\Sigma_k|+\log\pi_k $$ is largest. The quantity $x$ now appears as a quadratic function, which is where QDA gets its name.
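A corresponding sketch of the quadratic discriminant with made-up class-specific parameters (illustrative only); note the quadratic term in $x$:
In [ ]:
# delta_k(x) = -(1/2)(x - mu_k)^T Sigma_k^{-1} (x - mu_k) - (1/2) log|Sigma_k| + log(pi_k)
mu_k = np.array([1.0, 2.0])
Sigma_k = np.array([[2.0, 0.3], [0.3, 1.5]])  # covariance matrix of the k-th class only
pi_k = 0.4
x0 = np.array([1.5, 1.0])
Sigma_k_inv = np.linalg.inv(Sigma_k)
diff = x0 - mu_k
print(-0.5 * diff @ Sigma_k_inv @ diff - 0.5 * np.log(np.linalg.det(Sigma_k)) + np.log(pi_k))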
In [21]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda = QuadraticDiscriminantAnalysis()
qda.fit(X,y)
pred = qda.predict(X)
print(1-accuracy_score(pred, y))
In [22]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [50]:
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
# read the Weekly data and fit logistic regression on Lag1-Lag5 and Volume
weekly = pd.read_csv('Weekly.csv', index_col=0)
X = weekly[["Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume"]].values
y = weekly['Direction'].values
lg = LogisticRegression()
lg.fit(X,y)
down_indice = y=='Down'
up_indice = y=='Up'
X_down = X[down_indice,:]
X_up = X[up_indice,:]
y_down = y[down_indice]
y_up = y[up_indice]
# compute the confusion matrix entries by hand, treating 'Up' as the positive class
True_Positive = (lg.predict(X_up) == y_up).sum()
False_Negative = len(y_up) - True_Positive
True_Negative = (lg.predict(X_down) == y_down).sum()
False_Positive = len(y_down) - True_Negative
print('TP %d;' % True_Positive, 'FN %d;' % False_Negative, 'TN %d;' % True_Negative, 'FP %d' % False_Positive)
print('Overall fraction of correct predictions: %f' % ((True_Positive + True_Negative) / len(y)))
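The same counts can be cross-checked against sklearn's confusion matrix; note that its rows and columns follow the sorted class labels ('Down' then 'Up'), so the layout differs from the manual one above:
In [ ]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, lg.predict(X)))   # [[TN, FP], [FN, TP]] with 'Up' as positive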
In [60]:
from sklearn.metrics import confusion_matrix
weekly = pd.read_csv('Weekly.csv', index_col=0)
# hold out 2009-2010 as the test set; train on 1990 through 2008
years = np.arange(1990, 2009, 1)
weekly_1990_2008 = weekly[weekly['Year'].isin(years)]
weekly_2009_2010 = weekly[weekly['Year'].isin([2009, 2010])]
# prepare the data
train_X = weekly_1990_2008['Lag2'].values
train_X = train_X.reshape((len(train_X),1))
train_y = weekly_1990_2008['Direction'].values
train_y = np.array([1.0 if y=='Up' else 0.0 for y in train_y])
test_X = weekly_2009_2010['Lag2'].values
test_X = test_X.reshape((len(test_X),1))
test_y = weekly_2009_2010['Direction'].values
test_y = np.array([1.0 if y=='Up' else 0.0 for y in test_y])
lg = LogisticRegression()
lg.fit(train_X, train_y)
pred_y = lg.predict(test_X)
print(confusion_matrix(pred_y, test_y))
In [65]:
# using LDA
lda = LinearDiscriminantAnalysis()
lda.fit(train_X, train_y)
pred_y = lda.predict(test_X)
print(confusion_matrix(pred_y, test_y))
print(accuracy_score(pred_y,test_y))
In [66]:
# using QDA
qda = QuadraticDiscriminantAnalysis()
qda.fit(train_X, train_y)
pred_y = qda.predict(test_X)
print(confusion_matrix(pred_y, test_y))
print(accuracy_score(pred_y,test_y))
In [67]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)   # KNN with K = 1
knn.fit(train_X, train_y)
pred = knn.predict(test_X)
print(confusion_matrix(pred, test_y))
print(accuracy_score(pred,test_y))
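For comparison, the same train/test split can be rerun for a few other values of $K$ (the values chosen here are arbitrary):
In [ ]:
# test accuracy of KNN for several values of K on the same split
for k in (1, 3, 5, 10):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(train_X, train_y)
    print(k, accuracy_score(test_y, knn_k.predict(test_X)))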