In this notebook, we introduce the core functionality of mclearn. For more information on the package, refer to the documentation and the API Reference.
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from mclearn.active import run_active_learning_with_heuristic
from mclearn.classifier import (train_classifier,
grid_search_logistic,
grid_search_svm_poly,
grid_search_svm_rbf,
learning_curve)
from mclearn.heuristics import (random_h, margin_h, entropy_h,
qbb_margin_h, qbb_kl_h)
from mclearn.photometry import (fetch_sloan_data,
reddening_correction_sfd98,
reddening_correction_sf11,
reddening_correction_w14,
correct_magnitudes)
from mclearn.viz import (plot_filters_and_spectrum,
                         plot_class_distribution,
                         plot_hex_map,
                         plot_scatter_with_classes,
                         plot_learning_curve,
                         plot_average_learning_curve)
%matplotlib inline
plt.style.use('ggplot')
There are two ways we can measure objects on the sky. The first option is to obtain a full spectrum. This is expensive, but it contains a lot of valuable information. For example, astronomers can look at a spectrum and determine what kind of star or galaxy the object is. A cheaper method is to obtain photometric measurements. Here we set up a filter and only allow light from certain wavelengths to pass through. We then add up the amount of light (i.e. integrate) and obtain a photometric measurement. For example, below we plot the spectrum of Vega (one of the brightest stars in the sky) and the five ugriz photometric filters. Using these five filters, we end up with five photometric values of Vega, one in each band. Since we now have only five numbers (instead of a full spectrum), it is much more difficult to manually classify the object. Let's see if machine learning can help us. For more information about the differences between photometry and spectroscopy, there is a good description on StackExchange.
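As a toy illustration of the idea (made-up arrays, not real data), a photometric measurement is roughly the spectrum integrated through a filter response curve, converted to a magnitude up to a zero-point constant:
In [ ]:
# toy sketch only: hypothetical spectrum and filter response, no zero point applied
import numpy as np
wavelength = np.linspace(3000, 11000, 2000)                    # angstroms
spectrum = np.exp(-0.5 * ((wavelength - 5000) / 1500) ** 2)    # made-up spectrum
r_filter = np.exp(-0.5 * ((wavelength - 6200) / 600) ** 2)     # made-up r-band response
r_flux = np.trapz(spectrum * r_filter, wavelength)             # light collected in the band
r_mag = -2.5 * np.log10(r_flux)                                # magnitude scale (no zero point)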
In [10]:
vega_spectrum_url = 'http://www.astro.washington.edu/users/ivezic/DMbook/data/1732526_nic_002.ascii'
ugriz_filter_url = 'http://www.sdss.org/dr7/instruments/imager/filters/%s.dat'
filter_dir = '../data/filters'
spectra_dir = '../data/spectra'
fig = plt.figure(figsize=(10,5))
ax = plot_filters_and_spectrum(ugriz_filter_url, vega_spectrum_url, filter_dir=filter_dir, spectra_dir=spectra_dir)
The Sloan Digital Sky Survey (SDSS) is a comprehensive survey of the northern sky. In this notebook, we will explore a small subset of the survey, namely the photometric measurements and spectroscopic labels of around 2,000 objects in a small section of the sky. Let's submit an SQL query to the Sloan Sky Server to get the dataset.
In [3]:
output = '../data/sdss_small.csv'
sql_query = '''
SELECT
p.ra, p.dec,
CASE s.class WHEN 'GALAXY' THEN 'Galaxy'
WHEN 'STAR' THEN 'Star'
WHEN 'QSO' THEN 'Quasar'
END AS class,
p.psfMag_u, p.psfMag_g, p.psfMag_r, p.psfMag_i, p.psfMag_z,
p.petroMag_u, p.petroMag_g, p.petroMag_r, p.petroMag_i, p.petroMag_z,
p.petroRad_r, extinction_r
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE
p.psfMag_r < 21 AND
p.ra BETWEEN 202 AND 207 AND
p.dec BETWEEN 0 AND 2 '''
fetch_sloan_data(sql_query, output, verbose=True)
Okay, let's read in the data and see what we have.
In [4]:
sdss = pd.read_csv("../data/sdss_small.csv")
sdss.head()
Out[4]:
The first two columns, ra and dec, are the coordinates of the objects (see the wiki page for more information). Our target column is called class; it can be one of Galaxy, Quasar, or Star. Looking at the distribution of the classes, we can see that the dataset is very unbalanced. We have, for example, five times as many galaxies as stars and quasars.
In [5]:
fig = plt.figure(figsize=(10, 5))
ax = plot_class_distribution(sdss['class'])
Given this extreme imbalance, we will be mainly using the balanced accuracy rate when reporting the performance of a classifier. To compute this measure, we first define $A_i$, the accuracy rate of class $i$, as:
\begin{align} A_i = \dfrac{C_{ii}}{m_i} \end{align}
where $C$ is the confusion matrix, with entry $C_{ij}$ being the number of objects in class $i$ that have been classified as class $j$, and $m_i$ is the number of objects in class $i$. Before running a classifier, we know nothing of its performance, so we can assume the accuracy rate follows a flat prior distribution. In particular, the Beta distribution with parameters $\alpha = \beta = 1$ (i.e. a uniform distribution) seems appropriate here:
\begin{align} A_i \sim Beta(1, 1) \qquad \forall i \end{align}
Given an accuracy rate $A_i$ for each class $i$, the number of correct predictions in class $i$ will follow a Binomial distribution with $A_i$ as the probability of success:
\begin{align} \big( C_{ii} \mid A_i \big) \sim Bin\big(m_i, A_i\big) \qquad \forall i \end{align}
In Bayesian terms, this is our likelihood. Since the Beta distribution is the conjugate prior of the Binomial likelihood, the posterior distribution of $A_i$ is also Beta, with parameters:
\begin{align} \big( A_i \mid C_{ii} \big) \sim Beta \big( 1 + C_{ii}, 1 + m_i - C_{ii} \big) \qquad \forall i \end{align}
The (posterior) balanced accuracy rate $A$ can now be defined as the average of the individual accuracy rates $A_i$:
\begin{align} A = \dfrac{1}{k} \sum_i A_i \end{align}
where $k$ is the number of classes.
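To make this concrete, here is a minimal sketch (not mclearn's implementation) of the posterior mean of the balanced accuracy, computed from a confusion matrix whose rows are the true classes, exactly as in the formulas above. The helper name and the toy matrix are made up for illustration.
In [ ]:
import numpy as np

def posterior_balanced_accuracy(confusion):
    """Posterior mean of the balanced accuracy for a confusion matrix with true
    classes as rows: each A_i ~ Beta(1 + C_ii, 1 + m_i - C_ii), whose mean is
    (1 + C_ii) / (2 + m_i)."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)           # C_ii
    class_totals = confusion.sum(axis=1)   # m_i
    return np.mean((1 + correct) / (2 + class_totals))

# toy example with three classes
toy_confusion = np.array([[90,  5,  5],
                          [10, 30, 10],
                          [ 5,  5, 40]])
print(posterior_balanced_accuracy(toy_confusion))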
When presenting our results, we will often be using the Mollweide projection. Here's the layout of the coordinates, with the red line being the plane of the Milky Way.
In [6]:
fig = plt.figure(figsize=(10,5))
zero_values = np.zeros(1)
ax = plot_hex_map(zero_values, zero_values, axisbg=None, colorbar=False, labels=True)
For example, this is where the objects in our dataset are located in the sky. It's a very small region roughly in the middle. For now, it seems that the map is not very useful. However, when we look at the whole SDSS dataset, being able to plot the results on a map will become quite handy.
In [7]:
fig = plt.figure(figsize=(10,5))
ax = plot_hex_map(sdss['ra'], sdss['dec'], vmax=5)
In the dataset, we have 11 features: five PSF magnitudes, five Petrosian magnitudes, and a size measurement (petroRad_r, the r-band Petrosian radius). To get a feel for how well separated the three classes are, let's use PCA to reduce these 11 features down to 2 dimensions.
In [8]:
feature_cols = ['psfMag_u', 'psfMag_g', 'psfMag_r', 'psfMag_i', 'psfMag_z', 'petroMag_u',
'petroMag_g', 'petroMag_r', 'petroMag_i', 'petroMag_z', 'petroRad_r']
target_col = 'class'
classes = ['Galaxy', 'Quasar', 'Star']
pca = PCA(n_components=2)
projection = pca.fit_transform(sdss[feature_cols])
target = np.array(sdss[target_col])
fig = plt.figure(figsize=(10, 5))
ax = plot_scatter_with_classes(projection, target, classes, size=30, alpha=0.3,
scatterpoints=3)
ax.set_xlim(-70, -40)
ax.set_ylim(-15, 0)
plt.show()
As light travels through space to Earth, some of it gets absorbed and scattered by galactic dust. Light at higher frequencies is more affected, hence extinction causes an object to appear redder. This gives the phenomenon its name, 'interstellar reddening'.
The amount of extinction varies across the sky. For example, there is a lot of dust in the Milky Way band, as can be seen in the following reddening map made by LAMBDA, where red indicates a high level of extinction.
<img src="plots/galatic_reddening_ebv_map_sfd98.png" width="500" style="float:left">
There are three sets of extinction corrections that we can apply: the SFD98, SF11, and W14 sets. Refer to the documentation to learn how these corrections are computed. We will use a random forest to see which set offers the best improvement in the balanced accuracy rate.
In [9]:
# compute the three sets of reddening correction
sfd98_corrections = reddening_correction_sfd98(sdss['extinction_r'])
sf11_corrections = reddening_correction_sf11(sdss['extinction_r'])
w14_corrections = reddening_correction_w14(sdss['extinction_r'])
# column names of the magnitudes to be corrected
psf_magnitudes = ['psfMag_u', 'psfMag_g', 'psfMag_r', 'psfMag_i', 'psfMag_z']
petro_magnitudes = ['petroMag_u', 'petroMag_g', 'petroMag_r', 'petroMag_i', 'petroMag_z']
# calculate the corrected magnitudes
correct_magnitudes(sdss, psf_magnitudes, sfd98_corrections, '_sfd98')
correct_magnitudes(sdss, petro_magnitudes, sfd98_corrections, '_sfd98')
correct_magnitudes(sdss, psf_magnitudes, sf11_corrections, '_sf11')
correct_magnitudes(sdss, petro_magnitudes, sf11_corrections, '_sf11')
correct_magnitudes(sdss, psf_magnitudes, w14_corrections, '_w14')
correct_magnitudes(sdss, petro_magnitudes, w14_corrections, '_w14')
Here are the four sets of features and the target.
In [10]:
uncorrected_features = ['psfMag_u', 'psfMag_g', 'psfMag_r', 'psfMag_i', 'psfMag_z',
'petroMag_u', 'petroMag_g', 'petroMag_r', 'petroMag_i',
'petroMag_z', 'petroRad_r']
sfd98_features = ['psfMag_u_sfd98', 'psfMag_g_sfd98', 'psfMag_r_sfd98', 'psfMag_i_sfd98',
'psfMag_z_sfd98', 'petroMag_u_sfd98', 'petroMag_g_sfd98',
'petroMag_r_sfd98', 'petroMag_i_sfd98', 'petroMag_z_sfd98', 'petroRad_r']
sf11_features = ['psfMag_u_sf11', 'psfMag_g_sf11', 'psfMag_r_sf11', 'psfMag_i_sf11',
'psfMag_z_sf11', 'petroMag_u_sf11', 'petroMag_g_sf11', 'petroMag_r_sf11',
'petroMag_i_sf11', 'petroMag_z_sf11', 'petroRad_r']
w14_features = ['psfMag_u_w14', 'psfMag_g_w14', 'psfMag_r_w14', 'psfMag_i_w14',
'psfMag_z_w14', 'petroMag_u_w14', 'petroMag_g_w14', 'petroMag_r_w14',
'petroMag_i_w14', 'petroMag_z_w14', 'petroRad_r']
target_col = 'class'
Let's now train a random forest on each of the four sets of features.
In [11]:
correct_predictions, confusion_baseline = train_classifier(sdss, uncorrected_features, target_col,
train_size=0.7, test_size=0.3, balanced=False, recall_maps=False, coords=False, report=False, random_state=2)
In [12]:
correct_predictions, confusion_sfd98 = train_classifier(sdss, sfd98_features, target_col,
train_size=0.7, test_size=0.3, balanced=False, recall_maps=False, coords=False, report=False, random_state=2)
In [13]:
correct_predictions, confusion_sf11 = train_classifier(sdss, sf11_features, target_col,
train_size=0.7, test_size=0.3, balanced=False, recall_maps=False, coords=False, report=False, random_state=2)
In [14]:
correct_predictions, confusion_w14 = train_classifier(sdss, w14_features, target_col,
train_size=0.7, test_size=0.3, balanced=False, recall_maps=False, coords=False, report=False, random_state=2)
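Assuming the confusion matrices returned above have true classes as rows (an assumption about train_classifier's output, not something stated here), the posterior_balanced_accuracy helper sketched earlier could be used to compare the four runs directly:
In [ ]:
# compare the posterior mean balanced accuracies of the four feature sets,
# reusing the posterior_balanced_accuracy sketch from earlier (and assuming the
# returned confusion matrices have true classes as rows)
for name, confusion in [('uncorrected', confusion_baseline), ('SFD98', confusion_sfd98),
                        ('SF11', confusion_sf11), ('W14', confusion_w14)]:
    print(name, posterior_balanced_accuracy(confusion))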
It seems like there is not much difference between the three sets of corrections (which is expected given the very small sample size). The SF11 correction produces slightly better results, so let's use this from now on.
Another thing we might want to consider is whether to use colours as features. Magnitudes themselves are correlated with the distance of the objects: if a star is closer to us, it appears brighter, resulting in a lower magnitude (yes, astronomers are weird!). But this doesn't change the fact that the object is a star. Thus we could take differences between magnitudes (and call them colours), which removes the dependence on distance. To keep it simple, let's ignore colours for now, although a sketch of how they could be constructed is shown below.
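For reference, here is a minimal sketch of how colour features could be constructed from the SF11-corrected PSF magnitudes, should we want them later. The new column names are hypothetical and are not used elsewhere in this notebook.
In [ ]:
# sketch only: colours as differences of adjacent bands in the SF11-corrected
# PSF magnitudes (the resulting column names are hypothetical)
bands = ['u', 'g', 'r', 'i', 'z']
for blue, red in zip(bands[:-1], bands[1:]):
    sdss['psf_{0}-{1}_sf11'.format(blue, red)] = (
        sdss['psfMag_{0}_sf11'.format(blue)] - sdss['psfMag_{0}_sf11'.format(red)])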
We now want to compare the performance of Logistic Regression, SVM, and Random Forest. We start by normalising the features to zero mean and unit variance.
In [16]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(sdss[sf11_features])
y = sdss[target_col]
We need to determine the best hyperparameters for Logistic Regression and SVM. This can be done via a grid search (essentially an exhaustive search through the parameter space). For each combination of parameters, we use 5-fold cross-validation, with each fold having 50 examples in the training set and 50 in the test set. Each search takes a few minutes to complete.
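Roughly speaking, this is what the mclearn grid search helpers do under the hood. The generic scikit-learn sketch below is not mclearn's implementation, and the grid values are hypothetical; with newer versions of scikit-learn, GridSearchCV lives in sklearn.model_selection instead of sklearn.grid_search.
In [ ]:
# generic sketch of a grid search with cross-validation (not mclearn's code);
# the parameter grid below is hypothetical
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [1, 100, 10000], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X_scaled, y)
print(search.best_params_)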
In [17]:
grid_search_svm_rbf(X_scaled, y, train_size=50, test_size=50)
In [18]:
grid_search_svm_poly(X_scaled, y, train_size=50, test_size=50)
In [19]:
grid_search_logistic(X_scaled, y, train_size=50, test_size=50)
Let's now compute some learning curves.
In [22]:
train_sizes = np.concatenate((np.arange(10, 100, 10), np.arange(100, 600, 100)))
test_sizes = [500] * len(train_sizes)
# initialise the classifiers
rbf_svm = SVC(kernel='rbf', gamma=1, C=1000, cache_size=200)
poly_svm = LinearSVC(C=10000, loss='squared_hinge', penalty='l1', dual=False, multi_class='ovr',
fit_intercept=True, random_state=21)
logistic = LogisticRegression(penalty='l1', dual=False, C=10000, multi_class='ovr', solver='liblinear', random_state=21)
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, class_weight='subsample', random_state=21)
# train the classifiers
lc_forest = learning_curve(sdss, sfd98_features, target_col, forest, train_sizes,
test_sizes, random_state=2, normalise=False, balanced=False, degree=1)
lc_rbf_svm = learning_curve(sdss, sfd98_features, target_col, rbf_svm, train_sizes,
test_sizes, random_state=2, normalise=True, balanced=False, degree=1)
lc_poly_svm_2 = learning_curve(sdss, sfd98_features, target_col, poly_svm, train_sizes,
test_sizes, random_state=2, normalise=True, balanced=False, degree=2)
lc_poly_svm_3 = learning_curve(sdss, sfd98_features, target_col, poly_svm, train_sizes,
test_sizes, random_state=2, normalise=True, balanced=False, degree=3)
lc_logistic_2 = learning_curve(sdss, sfd98_features, target_col, logistic, train_sizes,
test_sizes, random_state=2, normalise=True, balanced=False, degree=2)
lc_logistic_3 = learning_curve(sdss, sfd98_features, target_col, logistic, train_sizes,
test_sizes, random_state=2, normalise=True, balanced=False, degree=3)
In [23]:
learning_curves = [lc_forest, lc_rbf_svm, lc_poly_svm_2, lc_poly_svm_3, lc_logistic_2, lc_logistic_3]
curve_labels = ['Random Forest', 'SVM RBF', 'SVM Polynomial Degree 2', 'SVM Polynomial Degree 3',
'Logistic Regression Degree 2', 'Logistic Regression Degree 3']
fig = plt.figure(figsize=(10, 5))
ax = plot_learning_curve(train_sizes, learning_curves, curve_labels, xscale='linear')
Let's see if we can beat random sampling. The goal here is to get the best possible accuracy rate with the smallest possible training set. Thus we will need to be smart when choosing the next candidate to put in the training pool.
We start by training the logistic classifier with 20 random examples. Then, at each step, we score a random subsample of the unlabelled pool, query the example with the best score, and retrain the classifier. This is repeated until we have 200 examples in the training set. Normally we would run around 10 trials and take the average, but in this notebook, let's do just 1 trial of the experiment.
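For concreteness, here is a minimal sketch of that pool-based loop. The experiments below use mclearn.active.run_active_learning_with_heuristic instead; the score argument here is a hypothetical stand-in for any of the heuristics listed next, with the convention that a smaller score marks a more informative candidate.
In [ ]:
# sketch only, not mclearn's implementation of the active learning loop
import numpy as np

def active_learn(classifier, score, pool_X, pool_y, test_X, test_y,
                 initial_n=20, total_n=200, random_n=30, seed=2):
    """Pool-based active learning; returns the test accuracy after each query."""
    rng = np.random.RandomState(seed)
    labelled = list(rng.choice(len(pool_X), initial_n, replace=False))
    unlabelled = [i for i in range(len(pool_X)) if i not in labelled]
    accuracies = []
    while True:
        classifier.fit(pool_X[labelled], pool_y[labelled])
        accuracies.append(classifier.score(test_X, test_y))
        if len(labelled) >= total_n or not unlabelled:
            break
        # only score a random subsample of the unlabelled pool
        candidates = rng.choice(unlabelled, min(random_n, len(unlabelled)), replace=False)
        best = candidates[np.argmin(score(classifier, pool_X[candidates]))]
        labelled.append(best)
        unlabelled.remove(best)
    return accuracies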
Two uncertainty sampling heuristics are tested (illustrative sketches of all four scores are given after the KL divergence note below):
Picking the candidate with the smallest margin, that is, the smallest difference between the two most probable class predictions.
Picking the candidate whose predicted class distribution has the largest entropy.
And two query by bagging heuristics are tested:
Picking the candidate with the smallest average margin across the committee.
Picking the candidate with the largest value of the average Kullback-Leibler divergence from the average:
Note that the KL divergence of $Q$ from $P$ is defined as
\begin{align} D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \ln\frac{P(i)}{Q(i)}. \end{align}
This KL divergence measures the amount of information lost when $Q$ is used to approximate $P$. In the active learning context, $Q$ is the average prediction probability of the committee, while $P$ is the prediction of a particular committee member.
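Here are small illustrative versions of these scores (not mclearn's implementation). Below, probs is assumed to have shape (n_candidates, n_classes), e.g. from predict_proba, and member_probs has shape (n_members, n_candidates, n_classes) for the committee.
In [ ]:
# sketch only: heuristic scores for uncertainty sampling and query by bagging
import numpy as np

def margin_score(probs):
    # difference between the two largest class probabilities; query the smallest
    top_two = np.sort(probs, axis=1)[:, -2:]
    return top_two[:, 1] - top_two[:, 0]

def entropy_score(probs):
    # Shannon entropy of the predicted distribution; query the largest
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def qbb_kl_score(member_probs):
    # average KL divergence of each member's prediction P from the committee
    # average Q; query the candidate with the largest value
    member_probs = np.asarray(member_probs, dtype=float)
    consensus = member_probs.mean(axis=0)
    kl = (member_probs * np.log((member_probs + 1e-12) / (consensus + 1e-12))).sum(axis=2)
    return kl.mean(axis=0)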
Okay, let's run the experiments. This will take roughly 5 minutes.
In [24]:
# initialise the main classifier
logistic = LogisticRegression(multi_class='ovr', penalty='l1', C=10000, random_state=2, class_weight='auto')
# list of all heuristics to be used (and the paths where results are saved)
heuristics = [random_h, entropy_h, margin_h, qbb_margin_h, qbb_kl_h]
# set up the logistic committee
committee = []
committee_size = 3
for i in range(committee_size):
member = LogisticRegression(multi_class='ovr', penalty='l1', C=10000, random_state=2, class_weight='auto')
committee.append(member)
# 70/30 split of training and test sets
training_pool, testing_pool, training_oracle, testing_oracle = train_test_split(
np.array(sdss[sfd98_features]), np.array(sdss[target_col]), train_size=0.7)
# shuffle and randomise data
training_pool, training_oracle = shuffle(training_pool, training_oracle, random_state=14)
# run active learning experiment for each heuristic
learning_curves = []
for heuristic in heuristics:
lc = run_active_learning_with_heuristic(heuristic, logistic, training_pool,
testing_pool, training_oracle, testing_oracle, n_trials=1, full_sample_size=200, total_n=200,
random_n=30, initial_n=20, bag_size=30, committee=committee, pickle_path=None, balanced_pool=False)
learning_curves.append(lc)
In [26]:
curve_labels = ['Random', 'Entropy', 'Margin', 'QBB Margin', 'QBB KL']
fig = plt.figure(figsize=(10, 5))
ax = plot_average_learning_curve(np.arange(20, 201), learning_curves, curve_labels)