PCA

I'm curious about how much information there is in the output of each layer of the CNN. I remember from ISML that PCA can serve as a crude measure of this, so I'll grab some potential hosts, extract their radio patches, run them through the CNN, and see how many principal components each layer's output needs to recover most of the variance.
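
(To be concrete about what I mean: sklearn lets you pass a variance fraction as n_components and it keeps however many components that takes. The snippet below is just an illustration on a random placeholder array, not real CNN output.)

import numpy
import sklearn.decomposition

features = numpy.random.random((1000, 800))  # placeholder for a layer's flattened outputs
pca = sklearn.decomposition.PCA(n_components=0.999)  # keep 99.9% of the variance
pca.fit(features)
print(pca.n_components_)  # number of components needed to hit that threshold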


In [2]:
import functools
import numpy
import operator
import sys

import bson.objectid
import keras.models
import pandas
import sklearn.decomposition

sys.path.insert(1, '..')
import crowdastro.data

with open('../crowdastro-data/cnn_model_2.json', 'r') as f:
    cnn = keras.models.model_from_json(f.read())

cnn.load_weights('../crowdastro-data/cnn_weights_2.h5')


Using Theano backend.
K:\Languages\Anaconda3\lib\site-packages\theano\tensor\signal\downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")

In [3]:
# Load the potential hosts.
with pandas.HDFStore('../crowdastro-data/training.h5') as store:
    dataframe = store['metadata']  # New training data table is 'data'. Haven't run that script yet.

dataframe.head()


Out[3]:
subject_id x y is_host
0 0 0 0
0 b'54b7f9ee0136916b75000002' 32.412434 65.519434 0
1 b'54b7f9ee0136916b75000002' 161.776174 130.278864 0
2 b'54b7f9ee0136916b75000002' 101.349871 100.381640 1
3 b'54b7f9ee0136916b75000002' 54.028687 186.482581 0
4 b'54b7f9ee0136916b75000002' 102.648851 71.261272 0

In [4]:
# Build functions that return the outputs of the CNN's intermediate layers.
get_convolutional_output_1 = keras.backend.function([cnn.layers[0].input],
                                                     [cnn.layers[1].get_output()])
get_max_pooling_output_1 = keras.backend.function([cnn.layers[0].input],
                                                   [cnn.layers[2].get_output()])
get_convolutional_output_2 = keras.backend.function([cnn.layers[0].input],
                                                     [cnn.layers[4].get_output()])
get_max_pooling_output_2 = keras.backend.function([cnn.layers[0].input],
                                                   [cnn.layers[5].get_output()])

In [5]:
# Take the first thousand potential hosts, and get their associated radio patches.
patches = []
radius = 40
padding = 150
for potential_host in dataframe.to_records()[:1000]:  # Is there a better way to do this?
    oid = bson.objectid.ObjectId(potential_host[1].tobytes().decode('ascii'))
    subject = crowdastro.data.db.radio_subjects.find_one({'_id': oid})
    radio = crowdastro.data.get_radio(subject, size='5x5')
    x = potential_host[2]
    y = potential_host[3]
    patch = radio[x-radius+padding:x+radius+padding, y-radius+padding:y+radius+padding]
    patches.append(patch)

patches = numpy.array(patches).reshape(1000, 1, 80, 80)


K:\Languages\Anaconda3\lib\site-packages\astropy\io\fits\util.py:578: UserWarning: Could not find appropriate MS Visual C Runtime library or library is corrupt/misconfigured; cannot determine whether your file object was opened in append mode.  Please consider using a file object opened in write mode instead.
  'Could not find appropriate MS Visual C Runtime '
K:\Languages\Anaconda3\lib\site-packages\ipykernel\__main__.py:11: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future

In [6]:
# Run the patches through the CNN to get the intermediate outputs.
conv_1 = get_convolutional_output_1([patches])[0]
pool_1 = get_max_pooling_output_1([patches])[0]
conv_2 = get_convolutional_output_2([patches])[0]
pool_2 = get_max_pooling_output_2([patches])[0]

In [7]:
# Reshape everything to be single images. We don't care about pixel order.
conv_1 = conv_1.reshape(1000, -1)
conv_2 = conv_2.reshape(1000, -1)
pool_1 = pool_1.reshape(1000, -1)
pool_2 = pool_2.reshape(1000, -1)

In [8]:
# How many dimensions does each layer have?
print('conv 1 dimension:', functools.reduce(operator.mul, conv_1.shape[1:]))
print('pool 1 dimension:', functools.reduce(operator.mul, pool_1.shape[1:]))
print('conv 2 dimension:', functools.reduce(operator.mul, conv_2.shape[1:]))
print('pool 2 dimension:', functools.reduce(operator.mul, pool_2.shape[1:]))


conv 1 dimension: 161312
pool 1 dimension: 6272
conv 2 dimension: 800
pool 2 dimension: 32

In [9]:
# Run PCA on each of these.
pca_conv_1 = sklearn.decomposition.PCA(n_components=0.999)
pca_conv_1.fit(conv_1)
pca_conv_1.n_components_


Out[9]:
189

In [10]:
pca_pool_1 = sklearn.decomposition.PCA(n_components=0.999)
pca_pool_1.fit(pool_1)
pca_pool_1.n_components_


Out[10]:
168

In [11]:
pca_conv_2 = sklearn.decomposition.PCA(n_components=0.999)
pca_conv_2.fit(conv_2)
pca_conv_2.n_components_


Out[11]:
155

In [12]:
pca_pool_2 = sklearn.decomposition.PCA(n_components=0.999)
pca_pool_2.fit(pool_2)
pca_pool_2.n_components_


Out[12]:
20

So to recover 99.9% of the variance, we need fewer than 200 principal components for each layer. However, our convolutional neural network outputs far more dimensions than this at every layer except the last, which outputs only 32. Maybe we can take the second convolutional output and run PCA on it as a form of compression? I'd rather have 155 dimensions than 800.
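
To make that concrete, the compression is just the transform of pca_conv_2 fitted above; nothing extra needs training. Roughly:

# Project the 800-dimensional conv_2 outputs onto the 155 retained components.
conv_2_compressed = pca_conv_2.transform(conv_2)
print(conv_2_compressed.shape)  # (1000, 155)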

With that in mind, I'll take the PCA fitted on the second convolutional output and freeze it. Unfortunately, it seems sklearn doesn't support any model persistence other than pickle (which I definitely don't want to use), so I'm going to have to serialise the components myself. I'll use h5py.


In [16]:
pca_conv_2.components_.shape


Out[16]:
(155, 800)

In [18]:
import h5py
with h5py.File('../crowdastro-data/pca.h5', 'w') as f:
    dset = f.create_dataset("conv_2", (155, 800), dtype=float)
    dset[:] = pca_conv_2.components_
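
For reference, a rough sketch of how I'd read the frozen components back in and apply them later. This isn't run here, and it ignores the fitted mean (pca_conv_2.mean_), which sklearn's own transform would subtract before projecting, so it only approximates the original transform unless the mean is saved too.

import h5py
import numpy

# Read the frozen principal axes back out of the HDF5 file.
with h5py.File('../crowdastro-data/pca.h5', 'r') as f:
    components = f['conv_2'][:]  # shape (155, 800)

# Project conv_2-style features (as in cell In [7]) onto the components.
compressed = numpy.dot(conv_2, components.T)  # shape (1000, 155)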
