From notebook 14 we have radio features, and from notebook 13 we have astronomical features and potential host galaxies. It's now time to combine all of these into a set of feature vectors and train a classifier.
I'll quickly go over the pipeline up to now. First, make sure you have MongoDB running with the radio database containing the Radio Galaxy Zoo data. Then, convert all of the raw RGZ classifications into sanitised, nice-to-work-with classifications:
python -m crowdastro raw_classifications crowdastro-data/processed.db classifications
Next, compile the consensus database. For now, I'm only dealing with ATLAS data, so remember to specify the --atlas flag.
python -m crowdastro consensuses crowdastro-data/processed.db classifications atlas_consensuses_raw --atlas
Next, we generate the training data. If you don't have a Gator cache, one will be generated for you.
python -m crowdastro training_data \
    crowdastro-data/processed.db atlas_consensuses_raw \
    gator_cache \
    crowdastro-data/training.h5 \
    --atlas
This dumps a file with astronomy features and potential hosts. Then, run 15_cnn to get CNN features (or just use the h5 and json files I already prepared) and run 16_pca to get a PCA matrix.
The pipeline is as follows:
In [379]:
import itertools
import sys
import bson
import h5py
import keras.layers
import keras.models
import matplotlib.pyplot
import numpy
import pandas
import sklearn.cross_validation
import sklearn.dummy
import sklearn.linear_model
import sklearn.metrics
sys.path.insert(1, '..')
import crowdastro.data
import crowdastro.show
In [173]:
with pandas.HDFStore('../crowdastro-data/training.h5') as store:
    data = store['data']

data.head()
Out[173]:
We'll just look at a small number of potential hosts for now; I'll have to process these in batches to scale up (see the sketch after the next cell), but at this point I just want to check that everything works.
In [316]:
n = 5000

# I'm gathering up the radio patches first so I can run them through the CNN
# at the same time as one big matrix operation. In principle this would run
# on the GPU.
radio_patches = numpy.zeros((n, 80, 80))
labels = numpy.zeros((n,))
radius = 40
padding = 150

for idx, row in data.head(n).iterrows():
    sid = bson.objectid.ObjectId(row['subject_id'][0].decode('ascii'))
    x = row['x'][0]
    y = row['y'][0]
    label = row['is_host'][0]
    labels[idx] = label

    subject = crowdastro.data.db.radio_subjects.find_one({'_id': sid})
    radio = crowdastro.data.get_radio(subject, size='5x5')
    patch = radio[x - radius + padding : x + radius + padding,
                  y - radius + padding : y + radius + padding]
    radio_patches[idx, :] = patch
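As a note for scaling up later: a minimal sketch of how the patch gathering could be batched, so we never hold more than a batch's worth of patches in memory at once. The patch_batches helper and its batch_size parameter are my own, hypothetical additions, not part of the crowdastro pipeline.
In [ ]:
def patch_batches(frame, batch_size=1000):
    # Yield (patches, labels) in chunks of at most batch_size rows, reusing
    # the radius and padding constants defined above.
    for start in range(0, len(frame), batch_size):
        batch = frame.iloc[start:start + batch_size]
        patches = numpy.zeros((len(batch), 80, 80))
        batch_labels = numpy.zeros((len(batch),))
        for offset, (_, row) in enumerate(batch.iterrows()):
            sid = bson.objectid.ObjectId(row['subject_id'][0].decode('ascii'))
            subject = crowdastro.data.db.radio_subjects.find_one({'_id': sid})
            radio = crowdastro.data.get_radio(subject, size='5x5')
            x, y = row['x'][0], row['y'][0]
            patches[offset] = radio[x - radius + padding : x + radius + padding,
                                    y - radius + padding : y + radius + padding]
            batch_labels[offset] = row['is_host'][0]
        yield patches, batch_labels
Each yielded batch could then be pushed through the CNN and the PCA projection exactly as below.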
In [254]:
# Load the CNN.
with open('../crowdastro-data/cnn_model_2.json', 'r') as f:
    cnn = keras.models.model_from_json(f.read())

cnn.load_weights('../crowdastro-data/cnn_weights_2.h5')
In [255]:
cnn.layers = cnn.layers[:5] # Pop the layers after the second convolution's activation.
cnn.add(keras.layers.Flatten())
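To sanity-check where the truncation landed, we can list the remaining layer types. This only uses standard Keras attributes; the exact layers depend on how 15_cnn built the model.
In [ ]:
# The last two entries should be the second convolution's activation and
# the Flatten we just appended.
for layer in cnn.layers:
    print(type(layer).__name__)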
In [256]:
cnn.compile(optimizer='sgd', loss='mse') # I don't actually care about the optimiser or loss.
In [263]:
# Load the PCA.
with h5py.File('../crowdastro-data/pca.h5') as f:
    pca = f['conv_2'][:]
In [318]:
# Find the radio features.
radio_features = cnn.predict(radio_patches.reshape(n, 1, 80, 80)) @ pca.T
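A quick shape sanity check: the number of columns in radio_features equals the number of PCA components that 16_pca kept (pca.shape[0]).
In [ ]:
# n patches in, n reduced radio feature vectors out.
print(radio_patches.shape, '->', radio_features.shape)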
In [320]:
# Add on the astronomy features.
features = numpy.hstack([radio_features, data.ix[:n-1, 'flux_ap2_24':'flux_ap2_80'].as_matrix()])
features = numpy.nan_to_num(features)
In [321]:
# Split into training and testing data.
xs_train, xs_test, ts_train, ts_test = sklearn.cross_validation.train_test_split(features, labels, test_size=0.2)
In [322]:
# Classify!
lr = sklearn.linear_model.LogisticRegression(class_weight='balanced')
lr.fit(xs_train, ts_train)
Out[322]:
In [323]:
lr.score(xs_test, ts_test)
Out[323]:
In [324]:
sklearn.metrics.confusion_matrix(ts_test, lr.predict(xs_test), [0, 1])
Out[324]:
So we get ~84% accuracy on just predicting labels. Let's compare to a random classifier.
In [326]:
dc = sklearn.dummy.DummyClassifier(strategy='stratified')
dc.fit(xs_train, ts_train)
Out[326]:
In [327]:
dc.score(xs_test, ts_test)
Out[327]:
A stratified random classifier gets 88% accuracy, which doesn't look good for our logistic regression!
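The catch is that the classes are heavily imbalanced: most potential hosts are not the true host, so raw accuracy mostly rewards saying "not host". A quick check of the base rate and the per-class recalls, computed from the same confusion matrix as above:
In [ ]:
# A classifier that always answers "not host" would score the negative
# base rate as its accuracy.
print('Negative fraction: {:.02%}'.format(1 - ts_test.mean()))

# Per-class recall is a fairer comparison than raw accuracy here.
cm = sklearn.metrics.confusion_matrix(ts_test, lr.predict(xs_test), [0, 1])
recalls = cm.diagonal() / cm.sum(axis=1)
print('Recall (not host): {:.02%}'.format(recalls[0]))
print('Recall (host): {:.02%}'.format(recalls[1]))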
I am curious whether we can do better by considering the full problem, i.e. where we know that exactly one potential host is the true host. (Note that I'm ignoring the problem of multiple radio emitters for now.) Let's try that: we'll take a subject, find its potential hosts, get their patches, and use the logistic regression and dummy classifiers to predict the associated probabilities, and hence find the host of the radio emission. I'll only look at subjects that don't appear in the first n potential hosts; otherwise we'd overlap with the training data.
To get a feel for how the predictor works, I'll try colour-coding potential hosts based on how likely they are to be the true host. To do that, I'll softmax the scores.
In [376]:
def softmax(x):
    exp = numpy.exp(x)
    return exp / numpy.sum(exp, axis=0)
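An aside: the scores fed in here are probabilities in [0, 1], so overflow isn't a concern, but for unbounded scores the usual trick is to subtract the maximum before exponentiating. A sketch of that variant, should we ever need it:
In [ ]:
def softmax_stable(x):
    # Subtracting the max doesn't change the result (it cancels in the
    # normalisation) but keeps numpy.exp from overflowing.
    exp = numpy.exp(x - numpy.max(x))
    return exp / numpy.sum(exp, axis=0)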
In [330]:
subject_ids = set()
for idx, row in data.ix[n:n * 2].iterrows():
    sid = row['subject_id'][0]
    subject_ids.add(sid)
In [382]:
for subject_id in itertools.islice(subject_ids, 0, 10):
    # Pandas *really* doesn't like fancy indexing against string comparisons.
    indices = (data['subject_id'] == subject_id).as_matrix().reshape(-1)
    potential_hosts = numpy.nan_to_num(data.as_matrix()[indices][:, 1:-1].astype(float))

    subject = crowdastro.data.db.radio_subjects.find_one(
            {'_id': bson.objectid.ObjectId(subject_id.decode('ascii'))})
    radio = crowdastro.data.get_radio(subject, size='5x5')

    radio_patches = numpy.zeros((len(potential_hosts), 1, radius * 2, radius * 2))
    for index, (x, y, *astro) in enumerate(potential_hosts):
        patch = radio[x - radius + padding : x + radius + padding,
                      y - radius + padding : y + radius + padding]
        radio_patches[index, 0, :] = patch

    radio_features = cnn.predict(radio_patches) @ pca.T
    astro_features = potential_hosts[:, 2:]
    features = numpy.hstack([radio_features, astro_features])

    scores = lr.predict_proba(features)[:, 1].T
    probs = softmax(scores)

    crowdastro.show.subject(subject)
    matplotlib.pyplot.scatter(potential_hosts[:, 0], potential_hosts[:, 1], c=probs)
    matplotlib.pyplot.show()
This is quite interesting! Lots of points (blue) receive essentially no weight, and sometimes there are a few strong candidates (red). These usually look pretty reasonable, but it also seems a lot like the predictor is just looking for bright things.
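One quick, rough way to probe the brightness hypothesis for the last subject plotted above: correlate the host probabilities with the first astronomy column, which should be flux_ap2_24 given the column layout used earlier. This is my own quick check, not a rigorous feature-importance analysis.
In [ ]:
# Pearson correlation between host probability and a flux (brightness proxy)
# for the last subject's candidates.
print(numpy.corrcoef(astro_features[:, 0], probs)[0, 1])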
Let's try to get an accuracy figure out. There is still the problem of multiple radio sources, so I'll just say that if the predictor hits any true host, that's a hit.
In [396]:
hits = 0
attempts = 0

for subject_id in subject_ids:
    indices = (data['subject_id'] == subject_id).as_matrix().reshape(-1)
    potential_hosts = numpy.nan_to_num(data.as_matrix()[indices][:, 1:-1].astype(float))
    labels = numpy.nan_to_num(data.as_matrix()[indices][:, -1].astype(bool))

    subject = crowdastro.data.db.radio_subjects.find_one(
            {'_id': bson.objectid.ObjectId(subject_id.decode('ascii'))})
    radio = crowdastro.data.get_radio(subject, size='5x5')

    radio_patches = numpy.zeros((len(potential_hosts), 1, radius * 2, radius * 2))
    for index, (x, y, *astro) in enumerate(potential_hosts):
        patch = radio[x - radius + padding : x + radius + padding,
                      y - radius + padding : y + radius + padding]
        radio_patches[index, 0, :] = patch

    radio_features = cnn.predict(radio_patches) @ pca.T
    astro_features = potential_hosts[:, 2:]
    features = numpy.hstack([radio_features, astro_features])

    scores = lr.predict_proba(features)[:, 1].reshape(-1)
    predicted_host = scores.argmax()
    if labels[predicted_host]:
        hits += 1
    attempts += 1
In [397]:
print('Accuracy: {:.02%}'.format(hits / attempts))
Against a random classifier...
In [398]:
hits = 0
attempts = 0

for subject_id in subject_ids:
    indices = (data['subject_id'] == subject_id).as_matrix().reshape(-1)
    potential_hosts = numpy.nan_to_num(data.as_matrix()[indices][:, 1:-1].astype(float))
    labels = numpy.nan_to_num(data.as_matrix()[indices][:, -1].astype(bool))

    subject = crowdastro.data.db.radio_subjects.find_one(
            {'_id': bson.objectid.ObjectId(subject_id.decode('ascii'))})
    radio = crowdastro.data.get_radio(subject, size='5x5')

    radio_patches = numpy.zeros((len(potential_hosts), 1, radius * 2, radius * 2))
    for index, (x, y, *astro) in enumerate(potential_hosts):
        patch = radio[x - radius + padding : x + radius + padding,
                      y - radius + padding : y + radius + padding]
        radio_patches[index, 0, :] = patch

    radio_features = cnn.predict(radio_patches) @ pca.T
    astro_features = potential_hosts[:, 2:]
    features = numpy.hstack([radio_features, astro_features])

    scores = dc.predict_proba(features)[:, 1].reshape(-1)
    predicted_host = scores.argmax()
    if labels[predicted_host]:
        hits += 1
    attempts += 1
In [399]:
print('Accuracy: {:.02%}'.format(hits / attempts))
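For reference, picking a candidate uniformly at random should score roughly the mean of 1/k over subjects, where k is the number of candidates in each subject (assuming exactly one true host per subject). A quick check using the same indexing as above:
In [ ]:
candidate_counts = []
for subject_id in subject_ids:
    indices = (data['subject_id'] == subject_id).as_matrix().reshape(-1)
    candidate_counts.append(indices.sum())
print('Mean candidates per subject: {:.01f}'.format(numpy.mean(candidate_counts)))
print('Expected random accuracy: {:.02%}'.format(
        numpy.mean([1 / k for k in candidate_counts])))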
It would also be useful to know what the classifier considers "hard" to classify. I think an entropy approach might work (though there are problems with this; for one, entropy tends to be higher for subjects with more candidates). Let's find the highest-entropy subject.
In [400]:
max_entropy = float('-inf')
max_subject = None

for subject_id in subject_ids:
    indices = (data['subject_id'] == subject_id).as_matrix().reshape(-1)
    potential_hosts = numpy.nan_to_num(data.as_matrix()[indices][:, 1:-1].astype(float))
    labels = numpy.nan_to_num(data.as_matrix()[indices][:, -1].astype(bool))

    subject = crowdastro.data.db.radio_subjects.find_one(
            {'_id': bson.objectid.ObjectId(subject_id.decode('ascii'))})
    radio = crowdastro.data.get_radio(subject, size='5x5')

    radio_patches = numpy.zeros((len(potential_hosts), 1, radius * 2, radius * 2))
    for index, (x, y, *astro) in enumerate(potential_hosts):
        patch = radio[x - radius + padding : x + radius + padding,
                      y - radius + padding : y + radius + padding]
        radio_patches[index, 0, :] = patch

    radio_features = cnn.predict(radio_patches) @ pca.T
    astro_features = potential_hosts[:, 2:]
    features = numpy.hstack([radio_features, astro_features])

    probabilities = softmax(lr.predict_proba(features)[:, 1].reshape(-1))
    entropy = -(probabilities * numpy.log(probabilities)).sum()
    if entropy > max_entropy:
        max_entropy = entropy
        max_subject = subject
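One of the problems alluded to above: entropy grows with the number of candidates, so this search is biased towards crowded subjects. A possible correction, my suggestion rather than part of the original analysis, is to divide by the maximum attainable entropy, log k:
In [ ]:
def normalised_entropy(probabilities):
    # Scale entropy into [0, 1] by its maximum, log(k), so subjects with
    # many candidates aren't automatically judged "harder".
    entropy = -(probabilities * numpy.log(probabilities)).sum()
    return entropy / numpy.log(len(probabilities))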
In [401]:
crowdastro.show.subject(max_subject)
In [402]:
indices = (data['subject_id'] == str(max_subject['_id']).encode('ascii')).as_matrix().reshape(-1)
potential_hosts = numpy.nan_to_num(data.as_matrix()[indices][:, 1:-1].astype(float))

subject = max_subject
radio = crowdastro.data.get_radio(subject, size='5x5')

radio_patches = numpy.zeros((len(potential_hosts), 1, radius * 2, radius * 2))
for index, (x, y, *astro) in enumerate(potential_hosts):
    patch = radio[x - radius + padding : x + radius + padding,
                  y - radius + padding : y + radius + padding]
    radio_patches[index, 0, :] = patch

radio_features = cnn.predict(radio_patches) @ pca.T
astro_features = potential_hosts[:, 2:]
features = numpy.hstack([radio_features, astro_features])

scores = lr.predict_proba(features)[:, 1].T
probs = softmax(scores)

crowdastro.show.subject(subject)
matplotlib.pyplot.scatter(potential_hosts[:, 0], potential_hosts[:, 1], c=probs)
matplotlib.pyplot.show()
In [411]:
matplotlib.pyplot.plot(sorted(probs), marker='x')
Out[411]: