In the tutorial, we used a Gaussian Naive Bayes classifier to separate quasars and stars. In this exercise, we will extend this classification scheme using Gaussian Mixture Models.

The Gaussian Naive Bayes method works by computing a Gaussian approximation to the distribution of each training class, and using these distributions to predict the most probable label for any new point.

Here we're going to extend this, and rather than fitting a *single* Gaussian
to each label distribution, we're going to fit a *mixture* of Gaussians, which
should better approximate the distribution of each class of points.

In [ ]:

```
%pylab inline
```

In [ ]:

```
import numpy as np
import pylab as pl

np.random.seed(0)
X = np.vstack([np.random.normal(0, 1, size=(100, 2)),
               np.random.normal(3, 1, size=(100, 2))])
y = np.zeros(200)
y[100:] = 1

pl.scatter(X[:, 0], X[:, 1], c=y, linewidth=0)
```

If you look at the internals of `sklearn.naive_bayes.GaussianNB`, you'll see that it finds a best-fit Gaussian for each training class, and uses these as a smooth description of each distribution. We can use the internals of `GaussianNB` to visualize the distributions it uses:

In [ ]:

```
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(X, y)
print(gnb.theta_)  # per-class means (centers) of the fitted Gaussians
print(gnb.sigma_)  # per-class variances (widths) of the fitted Gaussians
```

In [ ]:

```
# create a grid on which to evaluate the distributions
grid = np.linspace(-3, 6, 100)
xgrid, ygrid = np.meshgrid(grid, grid)
Xgrid = np.vstack([xgrid.ravel(), ygrid.ravel()]).T

# now evaluate and plot the probability grid
prob_grid = np.exp(gnb._joint_log_likelihood(Xgrid))
for i, c in enumerate(['blue', 'red']):
    pl.contour(xgrid, ygrid, prob_grid[:, i].reshape((100, 100)), 3, colors=c)

# plot the points as above
pl.scatter(X[:, 0], X[:, 1], c=y, linewidth=0)
```

When a new item is to be classified, its probability is evaluated under each class distribution, and the class with the highest probability wins. We can now see why this is called "naive": it assumes each class is well described by a single uncorrelated (axis-aligned) Gaussian. What if the distribution is not well fit by such a Gaussian?
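
To make the decision rule concrete, here is a minimal sketch of that argmax step, reusing the private `_joint_log_likelihood` method we called above (illustrative only; it is not part of the public API):

```
# evaluate the joint log-probability of a few new points under each class,
# then choose the class with the largest value for each point
new_points = np.array([[0.5, 0.5], [3.0, 2.5]])
joint_logL = gnb._joint_log_likelihood(new_points)  # shape (n_points, n_classes)
labels = np.argmax(joint_logL, axis=1)
print(labels)  # should agree with gnb.predict(new_points)
```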

In the following, we'll develop a classifier which solves this issue
by fitting a *sum* of Gaussians to each distribution. This should
lead to improved classifications. For our data, we'll use photometric
observations of stars and quasars from the Sloan Digital Sky Survey.

We'll assume that the `fetch_data.py` script has been run to download the data locally. If the data is in a different location, you can change the `DATA_HOME` variable below.

In [ ]:

```
import os
DATA_HOME = os.path.abspath('../data/sdss_colors')
```

In [ ]:

```
import numpy as np

train_data = np.load(os.path.join(DATA_HOME,
                                  'sdssdr6_colors_class_train.npy'))
test_data = np.load(os.path.join(DATA_HOME,
                                 'sdssdr6_colors_class.200000.npy'))
```

In [ ]:

```
# set the number of training points: using all points leads to a very
# long running time.  We'll start with 10000 training points.  This
# can be increased if desired.
Ntrain = 10000
#Ntrain = len(train_data)
```

In [ ]:

```
# Split training data into training and cross-validation sets.
# Note that the cross-validation set must be extracted *before*
# the training array is truncated.
np.random.seed(0)
np.random.shuffle(train_data)
train_data = train_data[:Ntrain]

N_crossval = Ntrain // 5
crossval_data = train_data[-N_crossval:]
train_data = train_data[:-N_crossval]
```

In [ ]:

```
# construct training data
X_train = np.zeros((train_data.size, 4), dtype=float)
X_train[:, 0] = train_data['u-g']
X_train[:, 1] = train_data['g-r']
X_train[:, 2] = train_data['r-i']
X_train[:, 3] = train_data['i-z']
y_train = (train_data['redshift'] > 0).astype(int)
Ntrain = len(y_train)
```

In [ ]:

```
# construct cross-validation data
X_crossval = np.zeros((crossval_data.size, 4), dtype=float)
X_crossval[:, 0] = crossval_data['u-g']
X_crossval[:, 1] = crossval_data['g-r']
X_crossval[:, 2] = crossval_data['r-i']
X_crossval[:, 3] = crossval_data['i-z']
y_crossval = (crossval_data['redshift'] > 0).astype(int)
Ncrossval = len(y_crossval)
```

In [ ]:

```
pl.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=4, linewidths=0)
pl.xlim(-2, 5)
pl.ylim(-1, 3)
```

Gaussian Naive Bayes is a very fast estimator, and predicted labels can be computed as follows:

In [ ]:

```
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_gnb = gnb.predict(X_crossval)
```

Now we'll create two `sklearn.mixture.GMM` classifier instances, named `clf_0` and `clf_1`. Each should be initialized with a single component and diagonal covariance (hint: look at the doc string for `sklearn.mixture.GMM` to see how to set this up). The results should be compared to Gaussian Naive Bayes to check if they're correct.

In [ ]:

```
from sklearn.mixture import gmm
```

In [ ]:

```
# Objects to create:
#  - clf_0 : trained on the portion of the training data with y == 0
#  - clf_1 : trained on the portion of the training data with y == 1
```

In [ ]:

```
%load soln/01-01.py
```

In [ ]:

```
# 01-01.py
clf_0 = gmm.GMM(1, 'diag')
i0 = (y_train == 0)
clf_0.fit(X_train[i0])

clf_1 = gmm.GMM(1, 'diag')
i1 = (y_train == 1)
clf_1.fit(X_train[i1])
```

Next we must construct the prior. The prior is the fraction of training points of each type.

In [ ]:

```
# variables to compute:
#  - prior0 : fraction of training points with y == 0
#  - prior1 : fraction of training points with y == 1
```

In [ ]:

```
%load soln/01-02.py
```
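
If you want to check your work, a minimal sketch of this step (simply counting the fraction of each class among the training labels) might look like the following; the variable names match the comment cell above:

```
# fraction of training points in each class
prior0 = np.sum(y_train == 0) / float(Ntrain)
prior1 = np.sum(y_train == 1) / float(Ntrain)
print(prior0, prior1)
```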

Now we use the prior and the fitted mixtures to compute the log-likelihoods of the cross-validation points. The log-likelihood for a test point `x` is given by

```
logL(x) = clf.score(x) + log(prior)
```

You can use the function `np.log()` to compute the logarithm of the prior.

In [ ]:

```
# variables to compute:
#  - logL : array, shape = (2, Ncrossval)
#    logL[0] is the log-likelihood for y == 0
#    logL[1] is the log-likelihood for y == 1
```

In [ ]:

```
%load soln/01-03.py
```
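
For reference, here is one possible solution sketch. It assumes the old `GMM.score` method, which returns the per-sample log-likelihood of the data under the fitted mixture:

```
# per-sample log-likelihood under each class model, plus the log-prior
logL = np.zeros((2, Ncrossval))
logL[0] = clf_0.score(X_crossval) + np.log(prior0)
logL[1] = clf_1.score(X_crossval) + np.log(prior1)
```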

Once `logL` is computed, the predicted value for each sample is the index with the largest log-likelihood.

In [ ]:

```
y_pred = np.argmax(logL, 0)
```

Finally, we can compare these results to `y_gnb`, the labels we computed above with Gaussian Naive Bayes. We'll use the built-in classification report function in `sklearn.metrics`, which computes the precision, recall, and f1-score for each class.

In [ ]:

```
from sklearn import metrics

print("-----------------------------------------------------")
print("One-component Gaussian Mixture:")
print(metrics.classification_report(y_crossval, y_pred,
                                    target_names=['stars', 'QSOs']))
print("-----------------------------------------------------")
print("Gaussian Naive Bayes:")
print(metrics.classification_report(y_crossval, y_gnb,
                                    target_names=['stars', 'QSOs']))
```

Now we'll take some time to experiment with the *hyperparameters* of our GMM Bayesian classifier. These include the number of components for each model and the covariance type for each model (i.e. parameters that are decided prior to fitting the model on the training data).

Note that for a large number of components, the fit can take a long time, and will depend on the starting position. Use the documentation string of GMM to determine the options for covariance.

Note that there are tools within scikit-learn to perform hyperparameter estimation, in the module `sklearn.grid_search`. Here we will be doing it by hand.

Ideally, we would wrap this GMM Bayes classifier in a class which supports the usual scikit-learn estimator interface: `fit()`, `predict()`, `predict_proba()`, etc. This would be an interesting project (and could even be a useful contribution to scikit-learn!). For now, we'll take a shortcut and just define a stand-alone function.

In [ ]:

```
# finish this function.  For n_components=1 and
# covariance_type='diag', it should give results
# identical to what we saw above.
# The function should return the predicted labels y_pred.
def GMMBayes(X_test, n_components, covariance_type):
    pass
```

In [ ]:

```
%load soln/01-04.py
```
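
For reference, here is a minimal sketch of how `GMMBayes` might be completed, simply packaging the steps above (per-class GMM fits, class priors, and the argmax decision rule) into one function. It again assumes the old `sklearn.mixture` GMM API:

```
def GMMBayes(X_test, n_components, covariance_type):
    # fit one mixture model per class
    clf_0 = gmm.GMM(n_components, covariance_type)
    clf_0.fit(X_train[y_train == 0])
    clf_1 = gmm.GMM(n_components, covariance_type)
    clf_1.fit(X_train[y_train == 1])

    # class priors: fraction of training points in each class
    prior0 = np.mean(y_train == 0)
    prior1 = np.mean(y_train == 1)

    # per-sample log-likelihood plus log-prior, then argmax over classes
    logL = np.zeros((2, X_test.shape[0]))
    logL[0] = clf_0.score(X_test) + np.log(prior0)
    logL[1] = clf_1.score(X_test) + np.log(prior1)
    return np.argmax(logL, 0)
```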

In [ ]:

```
n_components = 3
covariance_type = 'full'

y_pred = GMMBayes(X_crossval, n_components, covariance_type)
f1 = metrics.f1_score(y_crossval, y_pred)
print(f1)
```

Try changing the number of components and the covariance type. To see a description of the various `covariance_type` options, you can type

```
gmm.GMM?
```

in a code cell to see the documentation. You might also wish to loop over several values of the hyperparameters and plot the learning curves for the data.
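
As a starting point for that exploration, here is a sketch of a simple by-hand loop over `n_components` (a loop over covariance types is analogous), using the `GMMBayes` function defined above:

```
# compare cross-validation F1 scores for several mixture sizes
for n_components in [1, 2, 4, 8]:
    y_pred = GMMBayes(X_crossval, n_components, 'full')
    print(n_components, metrics.f1_score(y_crossval, y_pred))
```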

In [ ]:

```
X_test = np.zeros((test_data.size, 4), dtype=float)
X_test[:, 0] = test_data['u-g']
X_test[:, 1] = test_data['g-r']
X_test[:, 2] = test_data['r-i']
X_test[:, 3] = test_data['i-z']
y_pred_literature = (test_data['label'] == 0).astype(int)
Ntest = len(y_pred_literature)
print(Ntest)
```

In [ ]:

```
# variables to compute:
#  - y_pred_gmm : predicted labels for X_test from the GMM Bayes model
#  - y_pred_gnb : predicted labels for X_test from the Naive Bayes model
```

In [ ]:

```
%load soln/01-05.py
```
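
One possible solution for this cell, assuming the hyperparameters chosen above, is simply:

```
# predicted labels on the test set from each model
y_pred_gmm = GMMBayes(X_test, n_components, covariance_type)
y_pred_gnb = gnb.predict(X_test)
```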

In [ ]:

```
print("------------------------------------------------------------------")
print("Comparison of current results with published results (Naive Bayes)")
print(metrics.classification_report(y_pred_literature, y_pred_gnb,
                                    target_names=['stars', 'QSOs']))
print("------------------------------------------------------------------")
print("Comparison of current results with published results (GMM Bayes)")
print(metrics.classification_report(y_pred_literature, y_pred_gmm,
                                    target_names=['stars', 'QSOs']))
```