Given a training set of size $N$, consisting of samples from $K$ classes with $n_k$ training samples in class $k$, LDA uses Bayes' rule to model
$$P(Y = k \mid x) = \frac{\pi_k f_k(x)}{\sum\limits_{i=1}^{K}{\pi_i f_i(x)}}$$where
$$\pi_k = \frac{n_k}{N}$$is the prior probability of class $k$, and
$$f_k(x) = \frac{1}{(2\pi)^{p/2} \lvert \Sigma_k \rvert^{\frac{1}{2}}} e^{-\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) } $$is the density of the $p$-dimensional Gaussian (aka Normal) distribution evaluated at $x$.
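As a quick sanity check, this density can be evaluated directly with NumPy. The sketch below is illustrative only; gaussian_density, mu, and Sigma are hypothetical names, and the result should agree with scipy.stats.multivariate_normal(mean=mu, cov=Sigma).pdf(x).

import numpy as np

def gaussian_density(x, mu, Sigma):
    # Multivariate Gaussian density f(x) for a p-dimensional x.
    p = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (p / 2.0) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff.dot(np.linalg.inv(Sigma)).dot(diff)) / norm_const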
Given a sample $x$, we predict its class to be
$$ C = \underset{k}{\arg\max}\; P(Y = k \mid x) $$The key assumption in LDA is that all $K$ classes share the same covariance matrix, i.e. $\Sigma_k = \Sigma$ for every $k$. This is a form of parameter tying, which can be viewed as a way of limiting the complexity of the model.
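A minimal sketch of this prediction rule, reusing the hypothetical gaussian_density helper above (priors, means, and Sigma stand in for already-estimated parameters): the denominator of Bayes' rule is shared by all classes, so it can be dropped inside the argmax.

def predict_class(x, priors, means, Sigma):
    # Unnormalized posteriors pi_k * f_k(x); the common normalizing constant
    # does not change which class attains the maximum.
    scores = [priors[k] * gaussian_density(x, means[k], Sigma)
              for k in range(len(priors))]
    return int(np.argmax(scores))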
We estimate the model parameters, $\Sigma$ and $\{ \mu_k \}_{k=1}^K$, by maximum likelihood (using the usual unbiased pooled estimate for $\Sigma$).
$\Sigma$ is computed as
$$\Sigma = \frac{1}{N - K} \sum\limits_{k=1}^{K}{\sum\limits_{i : y_i = k}{(x_i - \mu_k)(x_i - \mu_k)^T} } $$and $\mu_k$ is the mean vector of class $k$,
$$\mu_k = \frac{1}{n_k} \sum\limits_{i : y_i = k}{ x_i } $$If we take the $\log$ of $P(Y = k \mid x)$ and keep only the terms involving the class-specific parameters $\mu_k$ and $\pi_k$, we see that finding $ \underset{k}{\arg\max}\; P(Y = k \mid x) $ reduces to finding
$$ C = \underset{k}{\arg\max} \delta_k(x) $$where $\delta_k(x)$ is the discriminant function,
$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k)$$$\delta_k(x)$ is linear in $x$, since $\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k$ and $\log(\pi_k)$ are just scalar constants.
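Spelling the reduction out: the denominator of $P(Y = k \mid x)$ is the same for every class, so maximizing the posterior is equivalent to maximizing
$$\log\big(\pi_k f_k(x)\big) = \log \pi_k - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log\lvert\Sigma\rvert - \frac{1}{2}\left(x^T\Sigma^{-1}x - 2\,x^T\Sigma^{-1}\mu_k + \mu_k^T\Sigma^{-1}\mu_k\right)$$The terms $-\frac{p}{2}\log(2\pi)$, $-\frac{1}{2}\log\lvert\Sigma\rvert$ and $-\frac{1}{2}x^T\Sigma^{-1}x$ do not depend on $k$ (this is where the shared covariance matters), and dropping them leaves exactly $\delta_k(x)$.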
In [2]:
import numpy as np

def compute_means(X, y):
    # Class means: row k is the mean vector mu_k of class k.
    # Assumes integer labels 0, 1, ..., K-1, as in the iris data below.
    _, d = X.shape[:2]
    classes = np.unique(y)
    means = np.empty((len(classes), d))
    for k in classes:
        means[k] = np.mean(X[y == k], axis=0)
    return means

def compute_cov(X, y):
    # Pooled within-class covariance: sum of (x_i - mu_k)(x_i - mu_k)^T over
    # all training samples, divided by (N - K).
    N, d = X.shape[:2]
    classes = np.unique(y)
    C = np.zeros((d, d))
    for k in classes:
        n_k = np.sum(y == k)
        # np.cov divides by (n_k - 1), so rescale to recover the sum of outer products.
        C += (n_k - 1) * np.cov(X[y == k], rowvar=0)
    return C / (N - len(classes))
In [5]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=13)
N_train, d = X_train.shape[:2] # d-dim vectors
N_test, _ = X_test.shape[:2]
K = len(np.unique(y_train)) # Number of classes
# Compute model parameters
means = compute_means(X_train, y_train)
covariance = compute_cov(X_train, y_train)
pk = np.asarray([np.sum(y_train == k) / float(N_train) for k in range(K)])  # class priors pi_k = n_k / N
invCov = np.linalg.pinv(covariance) # Inverse of covariance
# Compute the parts of the discriminant functions
b = np.log(pk) - 0.5 * np.diag(means.dot(invCov).dot(means.T))  # log(pi_k) - 0.5 * mu_k^T Sigma^{-1} mu_k
b = b.reshape(K, 1)
W = means.dot(invCov) # Row k contains weight vector for class k
y_pred_train = np.argmax(W.dot(X_train.T) + b, axis=0)
y_pred_test = np.argmax(W.dot(X_test.T) + b, axis=0)
print("Training accuracy: {}".format(np.sum(y_pred_train == y_train) / float(N_train)))
print("Testing accuracy: {}".format(np.sum(y_pred_test == y_test) / float(N_test)))
In [7]:
import GDA  # the author's GDA module (source not shown in this notebook)
from importlib import reload  # reload is a built-in in Python 2
reload(GDA)
clf = GDA.GDA()
clf.fit(X_train, y_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
print("Training accuracy: {}".format(np.sum(y_pred_train == y_train) / float(X_train.shape[0])))
print("Testing accuracy: {}".format(np.sum(y_pred_test == y_test) / float(X_test.shape[0])))
In [94]:
# Sanity check on random inputs: accuracy should sit near chance level, 1/K (about 0.33 here).
X_rand = np.random.randn(100, 4)
y_rand = np.random.randint(0, K, size=100)  # 1-D labels so the comparison below stays element-wise
y_pred_rand = clf.predict(X_rand)
print("Random data accuracy: {}".format(np.sum(y_pred_rand == y_rand) / float(X_rand.shape[0])))