I'm curious as to how much information there is in each output of the CNN. I remember from ISML that PCA can work as a crude measure of this, so I'll grab some potential hosts, extract radio patches, run the CNN on them, and see how many principal components I need for each to recover most of the variance.
In [2]:
import functools
import numpy
import operator
import sys
import bson.objectid
import keras.models
import pandas
import sklearn.decomposition
sys.path.insert(1, '..')
import crowdastro.data
with open('../crowdastro-data/cnn_model_2.json', 'r') as f:
cnn = keras.models.model_from_json(f.read())
cnn.load_weights('../crowdastro-data/cnn_weights_2.h5')
In [3]:
# Load the potential hosts.
with pandas.HDFStore('../crowdastro-data/training.h5') as store:
dataframe = store['metadata'] # New training data table is 'data'. Haven't run that script yet.
dataframe.head()
Out[3]:
In [4]:
get_convolutional_output_1 = keras.backend.function([cnn.layers[0].input],
[cnn.layers[1].get_output()])
get_max_pooling_output_1 = keras.backend.function([cnn.layers[0].input],
[cnn.layers[2].get_output()])
get_convolutional_output_2 = keras.backend.function([cnn.layers[0].input],
[cnn.layers[4].get_output()])
get_max_pooling_output_2 = keras.backend.function([cnn.layers[0].input],
[cnn.layers[5].get_output()])
In [5]:
# Take the first thousand potential hosts, and get their associated radio patches.
patches = []
radius = 40
padding = 150
for potential_host in dataframe.to_records()[:1000]: # Is there a better way to do this?
oid = bson.objectid.ObjectId(potential_host[1].tobytes().decode('ascii'))
subject = crowdastro.data.db.radio_subjects.find_one({'_id': oid})
radio = crowdastro.data.get_radio(subject, size='5x5')
x = potential_host[2]
y = potential_host[3]
patch = radio[x-radius+padding:x+radius+padding, y-radius+padding:y+radius+padding]
patches.append(patch)
patches = numpy.array(patches).reshape(1000, 1, 80, 80)
In [6]:
# Run the patches through the CNN to get the intermediate outputs.
conv_1 = get_convolutional_output_1([patches])[0]
pool_1 = get_max_pooling_output_1([patches])[0]
conv_2 = get_convolutional_output_2([patches])[0]
pool_2 = get_max_pooling_output_2([patches])[0]
In [7]:
# Reshape everything to be single images. We don't care about pixel order.
conv_1 = conv_1.reshape(1000, -1)
conv_2 = conv_2.reshape(1000, -1)
pool_1 = pool_1.reshape(1000, -1)
pool_2 = pool_2.reshape(1000, -1)
In [8]:
# How many dimensions does each layer have?
print('conv 1 dimension:', functools.reduce(operator.mul, conv_1.shape[1:]))
print('pool 1 dimension:', functools.reduce(operator.mul, pool_1.shape[1:]))
print('conv 2 dimension:', functools.reduce(operator.mul, conv_2.shape[1:]))
print('pool 2 dimension:', functools.reduce(operator.mul, pool_2.shape[1:]))
In [9]:
# Run PCA on each of these.
pca_conv_1 = sklearn.decomposition.PCA(n_components=0.999)
pca_conv_1.fit(conv_1)
pca_conv_1.n_components_
Out[9]:
In [10]:
pca_pool_1 = sklearn.decomposition.PCA(n_components=0.999)
pca_pool_1.fit(pool_1)
pca_pool_1.n_components_
Out[10]:
In [11]:
pca_conv_2 = sklearn.decomposition.PCA(n_components=0.999)
pca_conv_2.fit(conv_2)
pca_conv_2.n_components_
Out[11]:
In [12]:
pca_pool_2 = sklearn.decomposition.PCA(n_components=0.999)
pca_pool_2.fit(pool_2)
pca_pool_2.n_components_
Out[12]:
So to recover 99.9% of the variance, we need to have less than 200 features. However, our convolutional neural network outputs way more than this at each layer (and a lot less at the last layer). Maybe we can take the second convolutional output and run PCA on it as a form of compression? I'd rather have 155 dimensions than 800.
With that in mind, I'll run the PCA again on just the second convolutional output and freeze it. Unfortunately, it seems sklearn doesn't support anything except pickle — which I definitely don't want to use — so I'm going to have to serialise it myself. I'll use h5py.
In [16]:
pca_conv_2.components_.shape
Out[16]:
In [18]:
import h5py
with h5py.File('../crowdastro-data/pca.h5', 'w') as f:
dset = f.create_dataset("conv_2", (155, 800), dtype=float)
dset[:] = pca_conv_2.components_
In [ ]: