Loading the training and test set

The first step is to load the training and test set that was saved when running the Training and test feature set generation notebook.


In [1]:
cd ../../features/


/data/opencast/MRes/features

In [2]:
posdata = loadtxt("training.HIPPIE.positive.Entrez.vectors.txt", delimiter="\t", dtype="str")

Training with the full negative set is impossible on this computer, as the file cannot be loaded into RAM. To get around this, we can use a negative training set that is only 10 times larger than the positive training set:


In [3]:
negdata = loadtxt("training.HIPPIE.negative.Entrez.vectors.txt", delimiter="\t", dtype="str")
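The reduced negative set loaded above was prepared in advance when the feature files were generated. For reference, a subsample of this kind could be produced without holding the full file in memory by streaming through it and keeping each line with a fixed probability. This is only a sketch: the full-file name and the sampling fraction below are assumptions.

import random

#hypothetical sketch: stream through the full negative vector file and keep a
#random fraction of the lines, so the full file never has to be held in RAM
keep_fraction = 0.01  #assumed value, chosen to leave roughly 10x as many negatives as positives
with open("training.HIPPIE.negative.Entrez.vectors.full.txt") as fullfile, \
        open("training.HIPPIE.negative.Entrez.vectors.txt", "w") as samplefile:
    for line in fullfile:
        if random.random() < keep_fraction:
            samplefile.write(line)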

Then we need to remove the "missing" label strings and convert the numerical values to floats. Column 101 must also be deleted, as it contains string class labels of unknown origin from the ENTS classifier. More information about this can be found in the previous version of this notebook.


In [4]:
X = np.concatenate((posdata,negdata))
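Note that the deletion of column 101 mentioned above does not appear in the cells below; if it still needs to be applied, it could be done at this point. This is only a sketch, and it assumes that 101 is the correct zero-based index of the ENTS label column:

#sketch: drop the ENTS string-label column before converting values to floats
X = np.delete(X, 101, axis=1)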

In [5]:
#find the rows that contain no "missing" values; only these can be plotted
plottablerows = X=="missing"
plottablerows = where(plottablerows.max(axis=1)==False)

Unfortunately, some "NA" strings have also sneaked in somehow, so we will have to zero these. This appears to be due to a feature in Gene_Ontology:


In [8]:
NAinds = where(X=="NA")
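To check which feature columns these NA values come from, and so whether they really originate in the Gene_Ontology features, the column indices in NAinds can be inspected. This is a quick check that was not part of the original run:

#columns that contain "NA" strings; compare these against the Gene_Ontology feature columns
print(np.unique(NAinds[1]))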

In [13]:
X[NAinds] = 0.0

In [14]:
X[X=="missing"] = np.nan
X = X.astype(float)

Finally we can create the target vector y from what we know about the lengths of the positive and negative sets:


In [15]:
#create the target y vector
y = array([1]*len(posdata)+[0]*len(negdata))

Visualizing the data

The data has extremely high dimensionality and is therefore difficult to plot directly. The options that have been identified are parallel coordinates plots and Andrews curves.

Alternatively, we can try to reduce the dimensionality before plotting, for example using Principal Components Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE).

Parallel coordinates plot

This is implemented in pandas and the documentation for it can be found here. We will use the example code on a sub-sample of the data.

First, though, we can build a parallel coordinates plot by simply normalising and plotting coloured feature vectors:


In [16]:
Xplot,yplot = X[plottablerows],y[plottablerows]
import sklearn.utils
#shuffle the features and labels together so the rows stay aligned
Xplot,yplot = sklearn.utils.shuffle(Xplot,yplot)

In [17]:
#indices of the positive (1) and negative (0) examples after shuffling
oneindexes = where(yplot>0.5)
zeroindexes = where(yplot<0.5)
rowindexes = concatenate([oneindexes[0][:10],zeroindexes[0][:6000]])

In [18]:
#per-feature maxima for normalisation; the small constant avoids division by zero
maxes = amax(abs(Xplot),axis=0) + 1e-14

In proportion


In [19]:
for rowi in rowindexes:
    #normalise row values
    row = Xplot[rowi,:]/(maxes)
    #then just plot it
    if yplot[rowi] > 0.5:
        plot(row,color='green',alpha=0.5)
    else:
        plot(row,color='red',alpha=0.05)


Out of proportion


In [20]:
rowindexes = concatenate([oneindexes[0][:100],zeroindexes[0][:100]])

In [21]:
for rowi in rowindexes:
    #normalise row values
    row = Xplot[rowi,:]/(maxes)
    #then just plot it
    if yplot[rowi] > 0.5:
        plot(row,color='green',alpha=0.1)
    else:
        plot(row,color='red',alpha=0.1)


In the above graphs green corresponds to positive training examples and red corresponds to negative training examples.

Unfortunately, due to the extreme dimensionality this cannot be interpreted in as much detail as a parallel coordinates plot normally could be. As the spacing between features is very small, it is difficult to discern correlations between adjacent features. However, we can see that some features are more useful than others for discriminating between the positive and negative classes.
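The pandas implementation mentioned earlier could also be used here; it takes a DataFrame with a class-label column and handles the per-class colouring itself. A minimal sketch, building the DataFrame in the same way as for the Andrews curves in the next section:

import pandas as pd
from pandas.plotting import parallel_coordinates

#labelled DataFrame of normalised feature vectors
frame = pd.DataFrame(Xplot[rowindexes,:]/maxes)
frame['training labels'] = yplot[rowindexes]
parallel_coordinates(frame, 'training labels')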

Andrews Curves

Andrews curve plots are also implemented in pandas.


In [22]:
import pandas as pd
from pandas.plotting import andrews_curves

In proportion


In [23]:
rowindexes = concatenate([oneindexes[0][:10],zeroindexes[0][:6000]])
plotdata = pd.DataFrame(Xplot[rowindexes,:]/maxes)
plotdata['training labels'] = yplot[rowindexes]

In [24]:
andrews_curves(plotdata,'training labels')


Out[24]:
<matplotlib.axes.AxesSubplot at 0x4bcf04050>

Out of proportion


In [25]:
rowindexes = concatenate([oneindexes[0][:100],zeroindexes[0][:100]])
plotdata = pd.DataFrame(Xplot[rowindexes,:]/maxes)
plotdata['training labels'] = yplot[rowindexes]

In [26]:
andrews_curves(plotdata,'training labels')


Out[26]:
<matplotlib.axes.AxesSubplot at 0x4c1a0bcd0>

Unfortunately, interpreting this result is a bit more difficult.

Principal Components Analysis (PCA)

Before moving on to more complicated ways to reduce the dimensionality of our data, we should first try a simple technique. Luckily, PCA is implemented in Scikit-learn, so all we need to do is import it and apply it to our dataset:


In [27]:
import sklearn.decomposition

In proportion


In [28]:
nones,nzeros = 1000, 600000
rowindexes = concatenate([oneindexes[0][:nones],zeroindexes[0][:nzeros]])

In [29]:
X_pca = sklearn.decomposition.PCA().fit_transform(Xplot[rowindexes,:]/maxes)

In [30]:
ones = scatter(X_pca[:nones,0],X_pca[:nones,1],c='red',alpha=0.1)
zeros = scatter(X_pca[nones:,0],X_pca[nones:,1],c='blue',marker="x",alpha=0.005)
l=legend((ones,zeros),("interactions","non-interactions"),loc=0)



In [31]:
X_pcaip = X_pca[:]

Out of proportion


In [32]:
nones,nzeros = 1000, 1000
rowindexes = concatenate([oneindexes[0][:nones],zeroindexes[0][:nzeros]])

In [33]:
X_pca = sklearn.decomposition.PCA().fit_transform(Xplot[rowindexes,:]/maxes)

In [34]:
ones = scatter(X_pca[:nones,0],X_pca[:nones,1],c='red',alpha=0.05)
zeros = scatter(X_pca[nones:,0],X_pca[nones:,1],c='blue',marker="x",alpha=0.05)
l=legend((ones,zeros),("interactions","non-interactions"),loc=0)



In [35]:
X_pcaoop = X_pca[:]
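Only the first two principal components are plotted above, so it is worth checking how much of the total variance those two components actually explain. A quick check that was not part of the original run:

pca = sklearn.decomposition.PCA()
pca.fit(Xplot[rowindexes,:]/maxes)
#fraction of the total variance captured by the first two components
print(pca.explained_variance_ratio_[:2].sum())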

t-Distributed Stochastic Neighbor Embedding (t-SNE)

This method does not plot the data directly, but reduces the dimensionality of the data until it can be plotted in 2D or 3D. In our case we would like to produce a 2D scatterplot. Scikit-learn includes a Python implementation, and an example of using it in an IPython notebook can be found here.

For sparse data the documentation recommends first reducing the number of dimensions to approximately 50 using TruncatedSVD.


In [36]:
import sklearn.manifold

In proportion


In [37]:
nones,nzeros = 10, 6000
rowindexes = concatenate([oneindexes[0][:nones],zeroindexes[0][:nzeros]])

In [38]:
X_tsvd = sklearn.decomposition.TruncatedSVD(n_components=50).fit_transform(Xplot[rowindexes,:]/maxes)

In [39]:
X_tsne = sklearn.manifold.TSNE(learning_rate=100).fit_transform(X_tsvd)

In [40]:
ones = scatter(X_tsne[:nones,0],X_tsne[:nones,1],c='red',alpha=0.6)
zeros = scatter(X_tsne[nones:,0],X_tsne[nones:,1],c='blue',marker="x",alpha=0.1)
l=legend((ones,zeros),("interactions","non-interactions"),loc=3)



In [41]:
X_tsneip = X_tsne[:]

Out of proportion


In [42]:
nones,nzeros = 100, 100
rowindexes = concatenate([oneindexes[0][:nones],zeroindexes[0][:nzeros]])

In [43]:
X_tsvd = sklearn.decomposition.TruncatedSVD(n_components=50).fit_transform(Xplot[rowindexes,:]/maxes)

In [44]:
X_tsne = sklearn.manifold.TSNE(learning_rate=100).fit_transform(X_tsvd)

In [45]:
ones = scatter(X_tsne[:nones,0],X_tsne[:nones,1],c='red')
zeros = scatter(X_tsne[nones:,0],X_tsne[nones:,1],c='blue',marker="x")
l=legend((ones,zeros),("interactions","non-interactions"),loc=4)



In [46]:
X_tsneoop = X_tsne[:]

Saving the results

As support for directly pickling these plots is currently incomplete, we will simply save the data used to produce the above graphs. With reference to the above code it will then be easy to recreate the graphs with annotations and titles, in a form better suited for publication.

High-dimensional plots


In [47]:
#save in proportion arrays
rowindexes = concatenate([oneindexes[0][:10],zeroindexes[0][:6000]])
savez("../plots/hippie/parrallel.coordinates.plot.ip.npz",Xplot[rowindexes,:],y[rowindexes])
#save out of proportion arrays
rowindexes = concatenate([oneindexes[0][:100],zeroindexes[0][:100]])
savez("../plots/hippie/parrallel.coordinates.plot.oop.npz",Xplot[rowindexes,:],y[rowindexes])

Low-dimensional plots


In [48]:
savez("../plots/hippie/pca.ip.npz",X_pcaip)
savez("../plots/hippie/pca.oop.npz",X_pcaoop)
savez("../plots/hippie/tdsne.ip.npz",X_tsneip)
savez("../plots/hippie/tdsne.oop.npz",X_tsneoop)