QDA and LDA

QDA

QDA (quadratic discriminant analysis) assumes that the conditional distribution of the independent variable X given class Y = k is a multivariate Gaussian (normal) distribution.

$$ p(x \mid y = k) = \dfrac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp \left( -\dfrac{1}{2} (x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) \right) $$

Once these distributions are known, the conditional distribution of the class Y given the independent variable X follows from Bayes' rule:

$$ P(y=k \mid x) = \dfrac{p(x \mid y = k)P(y=k)}{p(x)} = \dfrac{p(x \mid y = k)P(y=k)}{\sum_l p(x \mid y = l)P(y=l) } $$

For example, suppose Y takes three classes 1, 2, and 3, and that X within each class has the following mean vectors and covariance matrices.

$$ \mu_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \;\; \mu_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \;\; \mu_3 = \begin{bmatrix}-1 \\ 1 \end{bmatrix} $$

$$ \Sigma_1 = \begin{bmatrix} 0.7 & 0 \\ 0 & 0.7 \end{bmatrix}, \;\; \Sigma_2 = \begin{bmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{bmatrix}, \;\; \Sigma_3 = \begin{bmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{bmatrix} $$

The prior probabilities of Y are all equal:

$$ P(Y=1) = P(Y=2) = P(Y=3) = \dfrac{1}{3} $$
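As a quick illustration of Bayes' rule with these parameters, the sketch below evaluates the posterior at a single point. The helper name `posterior` and the query point are assumptions made for this example, not part of the original notebook.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Class-conditional parameters and priors taken from the example above
means = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 1.0])]
covs = [
    np.array([[0.7, 0.0], [0.0, 0.7]]),
    np.array([[0.8, 0.2], [0.2, 0.8]]),
    np.array([[0.8, 0.2], [0.2, 0.8]]),
]
priors = np.array([1 / 3, 1 / 3, 1 / 3])


def posterior(x):
    """Return P(y=k | x) for k = 1, 2, 3 using Bayes' rule (hypothetical helper)."""
    likelihoods = np.array(
        [multivariate_normal(m, c).pdf(x) for m, c in zip(means, covs)]
    )
    joint = likelihoods * priors   # p(x | y=k) P(y=k)
    return joint / joint.sum()     # divide by p(x), the sum over classes


x_query = np.array([1.0, 1.0])  # hypothetical query point (the mean of class 2)
post = posterior(x_query)
print(post)
```

The three values sum to one, and at this query point the largest posterior belongs to class 2 (index 1), whose mean coincides with the point.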

Open question: this time, is there a relationship between the individual students?



In [1]:

    
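import numpy as np
import scipy as sp
import scipy.stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

N = 100  # samples per class
np.random.seed(0)
X1 = sp.stats.multivariate_normal([ 0, 0], [[0.7, 0],[0, 0.7]]).rvs(N)
X2 = sp.stats.multivariate_normal([ 1, 1], [[0.8, 0.2],[0.2, 0.8]]).rvs(N)
X3 = sp.stats.multivariate_normal([-1, 1], [[0.8, 0.2],[0.2, 0.8]]).rvs(N)
y1 = np.zeros(N)  # labels 0, 1, 2 stand for classes 1, 2, 3
y2 = np.ones(N)
y3 = 2*np.ones(N)
X = np.vstack([X1, X2, X3])
y = np.hstack([y1, y2, y3])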



In [4]:

    
len(X1), X1.shape









    Out[4]:





(100, (100, 2))



In [5]:

    
plt.scatter(X1[:, 0], X1[:, 1], alpha=0.8, s=50, color='r', label='class1')
plt.scatter(X2[:, 0], X2[:, 1], alpha=0.8, s=50, color='g', label='class2')
plt.scatter(X3[:, 0], X3[:, 1], alpha=0.8, s=50, color='b', label='class3')
# recent seaborn versions require x and y to be passed as keyword arguments
sns.kdeplot(x=X1[:, 0], y=X1[:, 1], alpha=0.3, cmap=mpl.cm.hot)
sns.kdeplot(x=X2[:, 0], y=X2[:, 1], alpha=0.3, cmap=mpl.cm.summer)
sns.kdeplot(x=X3[:, 0], y=X3[:, 1], alpha=0.3, cmap=mpl.cm.cool)
plt.xlim(-5, 5)
plt.ylim(-4, 5)
plt.legend()
plt.show()

Trying a Gaussian naive Bayes model first



In [7]:

    
from sklearn.naive_bayes import GaussianNB
model = GaussianNB().fit(X,y)



In [8]:

    
from sklearn.metrics import confusion_matrix, classification_report



In [9]:

    
confusion_matrix(y, model.predict(X))









    Out[9]:





array([[63, 15, 22],
       [20, 76,  4],
       [18, 12, 70]])



In [10]:

    
print(classification_report(y, model.predict(X)))









    



             precision    recall  f1-score   support

        0.0       0.62      0.63      0.63       100
        1.0       0.74      0.76      0.75       100
        2.0       0.73      0.70      0.71       100

avg / total       0.70      0.70      0.70       300
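The numbers in the report can be recovered from the confusion matrix alone. The sketch below does this with plain NumPy for the matrix shown above; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class)
cm = np.array([[63, 15, 22],
               [20, 76,  4],
               [18, 12, 70]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums = predicted counts
recall = true_positives / cm.sum(axis=1)     # row sums = true counts (support)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(float(accuracy), 3))
```

Rounded to two decimals, the per-class precision, recall, and F1 values agree with the rows of the classification report.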

Now QDA



In [11]:

    
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

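# store_covariance=True makes the model compute and keep the per-class covariance
# matrices; without it the covariance attribute inspected below is not available.
# (scikit-learn before 0.19 spelled these store_covariances and covariances_.)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)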






In [12]:

    
qda.means_









    Out[12]:





array([[ -8.01254084e-04,   1.19457204e-01],
       [  1.16303727e+00,   1.03930605e+00],
       [ -8.64060404e-01,   1.02295794e+00]])



In [13]:

    
qda.covariance_[0]









    Out[13]:





array([[ 0.73846319, -0.01762041],
       [-0.01762041,  0.72961278]])



In [14]:

    
qda.covariance_[1]









    Out[14]:





array([[ 0.66534246,  0.21132313],
       [ 0.21132313,  0.78806006]])



In [15]:

    
qda.covariance_[2]









    Out[15]:





array([[ 0.9351386 ,  0.22880955],
       [ 0.22880955,  0.79142383]])
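To see how the stored means and covariances relate to the model's predictions, the following sketch rebuilds the posterior by hand and compares it with `predict_proba`. It assumes a recent scikit-learn (parameter `store_covariance`, attribute `covariance_`) and uses a small synthetic data set and variable names that are not part of the original notebook.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
# Small synthetic two-class data set, used only for this check
Xa = rng.multivariate_normal([0, 0], [[0.7, 0.0], [0.0, 0.7]], size=200)
Xb = rng.multivariate_normal([1, 1], [[0.8, 0.2], [0.2, 0.8]], size=200)
X_demo = np.vstack([Xa, Xb])
y_demo = np.repeat([0, 1], 200)

qda_demo = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X_demo, y_demo)

# Rebuild the posterior from the fitted parameters with Bayes' rule
likelihoods = np.column_stack([
    multivariate_normal(m, c).pdf(X_demo)
    for m, c in zip(qda_demo.means_, qda_demo.covariance_)
])
joint = likelihoods * qda_demo.priors_
manual_posterior = joint / joint.sum(axis=1, keepdims=True)

matches = np.allclose(manual_posterior, qda_demo.predict_proba(X_demo), atol=1e-6)
print(matches)
```

If the comparison holds, it suggests that QDA prediction is just Bayes' rule applied to one fitted Gaussian per class, weighted by the class priors.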



In [16]:

    
confusion_matrix(y, qda.predict(X))









    Out[16]:





array([[63, 15, 22],
       [19, 77,  4],
       [18,  7, 75]])



In [17]:

    
print(classification_report(y, qda.predict(X)))









    



             precision    recall  f1-score   support

        0.0       0.63      0.63      0.63       100
        1.0       0.78      0.77      0.77       100
        2.0       0.74      0.75      0.75       100

avg / total       0.72      0.72      0.72       300



In [18]:

    
xmin, xmax = -5, 5
ymin, ymax = -4, 5
XX, YY = np.meshgrid(np.arange(xmin, xmax, (xmax-xmin)/1000), np.arange(ymin, ymax, (ymax-ymin)/1000))
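# predict expects one row per point, so the grid is flattened with ravel (same result as flatten here);
# the labels are reshaped back to the grid shape because contourf needs ZZ to match XX and YY
ZZ = np.reshape(qda.predict(np.array([XX.ravel(), YY.ravel()]).T), XX.shape)
cmap = mpl.colors.ListedColormap(sns.color_palette("Set3"))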
plt.contourf(XX, YY, ZZ, cmap=cmap, alpha=0.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap=cmap)
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)
plt.show()
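The flatten-then-reshape step in the plotting cell is easier to follow on a tiny grid. The sketch below uses a 3 x 4 grid and a stand-in labelling rule in place of a fitted classifier; both are assumptions made for illustration.

```python
import numpy as np

# A tiny 3 x 4 grid instead of the 1000 x 1000 grid used for plotting
xs = np.linspace(-1, 1, 4)
ys = np.linspace(0, 2, 3)
XX, YY = np.meshgrid(xs, ys)                 # both have shape (3, 4)

# Classifiers expect one row per sample: shape (n_points, n_features)
grid_points = np.c_[XX.ravel(), YY.ravel()]  # shape (12, 2)

# Stand-in for model.predict: one label per grid point, shape (12,)
labels = (grid_points[:, 0] > 0).astype(int)

# contourf needs Z laid out on the same 2-D grid as XX and YY
ZZ = labels.reshape(XX.shape)                # back to shape (3, 4)
print(grid_points.shape, labels.shape, ZZ.shape)
```

Because `ravel` and `reshape` use the same element order, `ZZ[i, j]` is the label of the point `(XX[i, j], YY[i, j])`.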

LDA

LDA (linear discriminant analysis) assumes that the conditional distribution of the independent variable X given each class Y is a multivariate Gaussian (normal) distribution with a covariance matrix shared by all classes. That is,

$$ \Sigma_k = \Sigma \;\;\; \text{ for all } k $$

In this case the log of the class-conditional density can be arranged as follows.

$$ \begin{eqnarray} \log p(x \mid y = k) &=& \log \dfrac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} - \dfrac{1}{2} (x-\mu_k)^T \Sigma^{-1} (x-\mu_k) \\ &=& C_0 - \dfrac{1}{2} (x-\mu_k)^T \Sigma^{-1} (x-\mu_k) \\ &=& C_0 - \dfrac{1}{2} \left( x^T\Sigma^{-1}x - 2\mu_k^T \Sigma^{-1}x + \mu_k^T \Sigma^{-1}\mu_k \right) \\ \end{eqnarray} $$

Here the constant

$$ C_0 = \log \dfrac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} $$

depends on neither x nor the class k.




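Exponentiating and grouping the factors that do not depend on the class k into C(x) gives

$$ \begin{eqnarray} p(x \mid y = k) &=& C(x)\exp(w_k^Tx + w_{k0}) \\ \end{eqnarray} $$

where

$$ C(x) = \dfrac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp \left( -\dfrac{1}{2} x^T\Sigma^{-1}x \right), \;\; w_k = \Sigma^{-1}\mu_k, \;\; w_{k0} = -\dfrac{1}{2} \mu_k^T \Sigma^{-1}\mu_k $$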

$$ \begin{eqnarray} P(y=k \mid x) &=& \dfrac{p(x \mid y = k)P(y=k)}{\sum_l p(x \mid y = l)P(y=l) } \\ &=& \dfrac{C(x)\exp(w_k^Tx + w_{k0}) P(y=k)}{\sum_l C(x)\exp(w_l^Tx + w_{l0})P(y=l) } \\ &=& \dfrac{P_k \exp(w_k^Tx + w_{k0}) }{\sum_l P_l \exp(w_l^Tx + w_{l0})} \\ \end{eqnarray} $$

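$$ \log P(y=k \mid x) = \log P_k + w_k^Tx + w_{k0} - \log \sum_l P_l \exp(w_l^Tx + w_{l0}) $$

The last term is the same for every class, so it cancels when two classes k and j are compared:

$$ \log \dfrac{P(y=k \mid x)}{P(y=j \mid x)} = (w_k - w_j)^T x + (w_{k0} - w_{j0}) + \log \dfrac{P_k}{P_j} = w^T x + w_0 $$

That is, the log ratio of the posterior probabilities of any two classes is a linear function of x, so the boundaries between classes are linear.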
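The linearity can be checked numerically. The sketch below assumes the weights $w_k = \Sigma^{-1}\mu_k$ and $w_{k0} = -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k$ (the values obtained by expanding the quadratic form under a shared covariance), with a shared covariance, means, and test points chosen only for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed shared covariance, class means, and priors (illustrative values)
Sigma = np.array([[0.8, 0.2], [0.2, 0.8]])
mus = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
priors = np.array([1 / 3, 1 / 3, 1 / 3])

Sigma_inv = np.linalg.inv(Sigma)
W = mus @ Sigma_inv                         # row k is w_k = Sigma^{-1} mu_k
w0 = -0.5 * np.einsum("kd,kd->k", W, mus)   # w_k0 = -1/2 mu_k^T Sigma^{-1} mu_k

rng = np.random.default_rng(0)
points = rng.uniform(-3, 3, size=(200, 2))

# Linear scores: log P_k + w_k^T x + w_k0
linear_scores = points @ W.T + w0 + np.log(priors)

# Full Gaussian log joint: log p(x | y=k) + log P_k
log_joint = np.column_stack(
    [multivariate_normal(mu, Sigma).logpdf(points) for mu in mus]
) + np.log(priors)

# The two differ only by a per-point term that is identical for all classes
gap = log_joint - linear_scores
same_offset = np.allclose(gap, gap[:, [0]])
same_decision = (linear_scores.argmax(axis=1) == log_joint.argmax(axis=1)).all()
print(same_offset, same_decision)
```

Since the class-independent offset does not affect the argmax, classifying with the linear scores gives the same labels as classifying with the full Gaussian densities.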



In [20]:

    
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# n_components can be at most min(n_classes - 1, n_features), which is 2 here
lda = LinearDiscriminantAnalysis(n_components=2, solver="svd", store_covariance=True).fit(X, y)



In [21]:

    
lda.means_









    Out[21]:





array([[ -8.01254084e-04,   1.19457204e-01],
       [  1.16303727e+00,   1.03930605e+00],
       [ -8.64060404e-01,   1.02295794e+00]])



In [22]:

    
lda.covariance_









    Out[22]:





array([[ 0.7718516 ,  0.13942905],
       [ 0.13942905,  0.7620019 ]])



In [23]:

    
confusion_matrix(y, lda.predict(X))









    Out[23]:





array([[60, 15, 25],
       [20, 76,  4],
       [17,  8, 75]])



In [24]:

    
print(classification_report(y, lda.predict(X)))



             precision    recall  f1-score   support

        0.0       0.62      0.60      0.61       100
        1.0       0.77      0.76      0.76       100
        2.0       0.72      0.75      0.74       100

avg / total       0.70      0.70      0.70       300



In [25]:

    
xmin, xmax = -5, 5
ymin, ymax = -4, 5
XX, YY = np.meshgrid(np.arange(xmin, xmax, (xmax-xmin)/1000), np.arange(ymin, ymax, (ymax-ymin)/1000))
ZZ = np.reshape(lda.predict(np.array([XX.ravel(), YY.ravel()]).T), XX.shape)
cmap = mpl.colors.ListedColormap(sns.color_palette("Set3"))
plt.contourf(XX, YY, ZZ, cmap=cmap, alpha=0.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap=cmap)
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)
plt.show()

Pipeline?
- vect = CountVectorizer()
- vect.fit(X0): X0 is a list of sentences, still raw text at this point.
- X = vect.transform(X0): this turns the sentences into numeric count vectors.
- mnb = MultinomialNB().fit(X, y)
- For new sentences Xnew0: Xnew = vect.transform(Xnew0)
- mnb.predict(Xnew)
Without a pipeline, the steps run in this order. With a pipeline:
- model = Pipeline([
- ('vect', CountVectorizer()),
- ('clf', MultinomialNB()),
- ])
- model.fit(X0, y) performs these steps internally:
  - vect = CountVectorizer()
  - vect.fit(X0)
  - X = vect.transform(X0)
  - mnb = MultinomialNB().fit(X, y)
- model.predict(Xnew0) performs these steps internally:
  - Xnew = vect.transform(Xnew0)
  - mnb.predict(Xnew)
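The two workflows in the notes above can be put side by side in a runnable sketch. The tiny corpus, labels, and new sentences are made up for illustration; only `CountVectorizer`, `MultinomialNB`, and `Pipeline` come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus and labels, for illustration only
X0 = ["good movie", "great film", "bad movie", "terrible film"]
y = [1, 1, 0, 0]
Xnew0 = ["good film", "bad film"]

# Without a pipeline: vectorize first, then classify
vect = CountVectorizer().fit(X0)
mnb = MultinomialNB().fit(vect.transform(X0), y)
manual_pred = mnb.predict(vect.transform(Xnew0))

# With a pipeline: the same two steps behind a single fit/predict interface
model = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])
model.fit(X0, y)
pipeline_pred = model.predict(Xnew0)

print(manual_pred, pipeline_pred)
```

Both approaches yield the same predictions; the pipeline mainly removes the need to remember to apply `vect.transform` to new text before calling `predict`.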