PCA

I'm curious about how much information there is in the output of each layer of the CNN. I remember from ISML that PCA can serve as a crude measure of this, so I'll grab some potential hosts, extract their radio patches, run them through the CNN, and see how many principal components each layer's output needs to recover most of the variance.
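
(To be concrete about what I mean: sklearn lets you pass a variance fraction as n_components and it keeps however many components that takes. The snippet below is just an illustration on a random placeholder array, not real CNN output.)

import numpy
import sklearn.decomposition

features = numpy.random.random((1000, 800))  # placeholder for a layer's flattened outputs
pca = sklearn.decomposition.PCA(n_components=0.999)  # keep 99.9% of the variance
pca.fit(features)
print(pca.n_components_)  # number of components needed to hit that threshold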


In [2]:
import functools
import numpy
import operator
import sys

import bson.objectid
import keras.models
import pandas
import sklearn.decomposition

sys.path.insert(1, '..')
import crowdastro.data

with open('../crowdastro-data/cnn_model_2.json', 'r') as f:
    cnn = keras.models.model_from_json(f.read())

cnn.load_weights('../crowdastro-data/cnn_weights_2.h5')


Using Theano backend.
K:\Languages\Anaconda3\lib\site-packages\theano\tensor\signal\downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")

In [3]:
# Load the potential hosts.
with pandas.HDFStore('../crowdastro-data/training.h5') as store:
    dataframe = store['metadata']  # New training data table is 'data'. Haven't run that script yet.

dataframe.head()


Out[3]:
subject_id x y is_host
0 0 0 0
0 b'54b7f9ee0136916b75000002' 32.412434 65.519434 0
1 b'54b7f9ee0136916b75000002' 161.776174 130.278864 0
2 b'54b7f9ee0136916b75000002' 101.349871 100.381640 1
3 b'54b7f9ee0136916b75000002' 54.028687 186.482581 0
4 b'54b7f9ee0136916b75000002' 102.648851 71.261272 0

In [4]:
# Build functions that return the outputs of the CNN's intermediate layers.
get_convolutional_output_1 = keras.backend.function([cnn.layers[0].input],
                                                     [cnn.layers[1].get_output()])
get_max_pooling_output_1 = keras.backend.function([cnn.layers[0].input],
                                                   [cnn.layers[2].get_output()])
get_convolutional_output_2 = keras.backend.function([cnn.layers[0].input],
                                                     [cnn.layers[4].get_output()])
get_max_pooling_output_2 = keras.backend.function([cnn.layers[0].input],
                                                   [cnn.layers[5].get_output()])

In [5]:
# Take the first thousand potential hosts, and get their associated radio patches.
patches = []
radius = 40
padding = 150
for potential_host in dataframe.to_records()[:1000]:  # Is there a better way to do this?
    oid = bson.objectid.ObjectId(potential_host[1].tobytes().decode('ascii'))
    subject = crowdastro.data.db.radio_subjects.find_one({'_id': oid})
    radio = crowdastro.data.get_radio(subject, size='5x5')
    x = potential_host[2]
    y = potential_host[3]
    patch = radio[x-radius+padding:x+radius+padding, y-radius+padding:y+radius+padding]
    patches.append(patch)

patches = numpy.array(patches).reshape(1000, 1, 80, 80)


K:\Languages\Anaconda3\lib\site-packages\astropy\io\fits\util.py:578: UserWarning: Could not find appropriate MS Visual C Runtime library or library is corrupt/misconfigured; cannot determine whether your file object was opened in append mode.  Please consider using a file object opened in write mode instead.
  'Could not find appropriate MS Visual C Runtime '
K:\Languages\Anaconda3\lib\site-packages\ipykernel\__main__.py:11: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future

In [6]:
# Run the patches through the CNN to get the intermediate outputs.
conv_1 = get_convolutional_output_1([patches])[0]
pool_1 = get_max_pooling_output_1([patches])[0]
conv_2 = get_convolutional_output_2([patches])[0]
pool_2 = get_max_pooling_output_2([patches])[0]

In [7]:
# Reshape everything to be single images. We don't care about pixel order.
conv_1 = conv_1.reshape(1000, -1)
conv_2 = conv_2.reshape(1000, -1)
pool_1 = pool_1.reshape(1000, -1)
pool_2 = pool_2.reshape(1000, -1)

In [8]:
# How many dimensions does each layer have?
print('conv 1 dimension:', functools.reduce(operator.mul, conv_1.shape[1:]))
print('pool 1 dimension:', functools.reduce(operator.mul, pool_1.shape[1:]))
print('conv 2 dimension:', functools.reduce(operator.mul, conv_2.shape[1:]))
print('pool 2 dimension:', functools.reduce(operator.mul, pool_2.shape[1:]))


conv 1 dimension: 161312
pool 1 dimension: 6272
conv 2 dimension: 800
pool 2 dimension: 32

In [9]:
# Run PCA on each of these.
pca_conv_1 = sklearn.decomposition.PCA(n_components=0.999)
pca_conv_1.fit(conv_1)
pca_conv_1.n_components_


Out[9]:
189

In [10]:
pca_pool_1 = sklearn.decomposition.PCA(n_components=0.999)
pca_pool_1.fit(pool_1)
pca_pool_1.n_components_


Out[10]:
168

In [11]:
pca_conv_2 = sklearn.decomposition.PCA(n_components=0.999)
pca_conv_2.fit(conv_2)
pca_conv_2.n_components_


Out[11]:
155

In [12]:
pca_pool_2 = sklearn.decomposition.PCA(n_components=0.999)
pca_pool_2.fit(pool_2)
pca_pool_2.n_components_


Out[12]:
20

So to recover 99.9% of the variance, we need fewer than 200 principal components for each layer. However, our convolutional neural network outputs far more dimensions than this at every layer except the last, which outputs only 32. Maybe we can take the second convolutional output and run PCA on it as a form of compression? I'd rather have 155 dimensions than 800.
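
To make that concrete, the compression is just the transform of pca_conv_2 fitted above; nothing extra needs training. Roughly:

# Project the 800-dimensional conv_2 outputs onto the 155 retained components.
conv_2_compressed = pca_conv_2.transform(conv_2)
print(conv_2_compressed.shape)  # (1000, 155)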

With that in mind, I'll take the PCA fitted on the second convolutional output and freeze it. Unfortunately, it seems sklearn doesn't support any model persistence other than pickle (which I definitely don't want to use), so I'm going to have to serialise the components myself. I'll use h5py.


In [16]:
pca_conv_2.components_.shape


Out[16]:
(155, 800)

In [18]:
import h5py
with h5py.File('../crowdastro-data/pca.h5', 'w') as f:
    dset = f.create_dataset("conv_2", (155, 800), dtype=float)
    dset[:] = pca_conv_2.components_
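
For reference, a rough sketch of how I'd read the frozen components back in and apply them later. This isn't run here, and it ignores the fitted mean (pca_conv_2.mean_), which sklearn's own transform would subtract before projecting, so it only approximates the original transform unless the mean is saved too.

import h5py
import numpy

# Read the frozen principal axes back out of the HDF5 file.
with h5py.File('../crowdastro-data/pca.h5', 'r') as f:
    components = f['conv_2'][:]  # shape (155, 800)

# Project conv_2-style features (as in cell In [7]) onto the components.
compressed = numpy.dot(conv_2, components.T)  # shape (1000, 155)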
