The first step is to load the training set that was saved when running the Training and test feature set generation notebook.
In [1]:
cd ../../features/
In [2]:
posdata = loadtxt("human.iRefIndex.positive.vectors.txt", delimiter="\t", dtype="str")
Training with the full negative set is impossible on this computer, as it is unable to load the file into RAM. To get around this, we can use a negative training set only 10 times larger than the positive training set.
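Such a reduced set was prepared before this notebook was run. A minimal sketch of how a sub-sample like this could be produced without loading the whole file into memory is given below; the input file name full.negative.vectors.txt is a placeholder and the use of reservoir sampling here is an assumption, not the original preparation step.

import random

#assumed sketch, not from the original pipeline: reservoir-sample a negative
#set ten times the size of the positive set from the full negative vectors
#file, one line at a time so the whole file never has to sit in RAM
n_required = 10 * len(posdata)
reservoir = []
with open("full.negative.vectors.txt") as infile:
    for i, line in enumerate(infile):
        if i < n_required:
            reservoir.append(line)
        else:
            j = random.randint(0, i)
            if j < n_required:
                reservoir[j] = line
with open("human.iRefIndex.negative.vectors.txt", "w") as outfile:
    outfile.writelines(reservoir)

The reduced file can then be loaded in the same way as the positive set: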
In [3]:
negdata = loadtxt("human.iRefIndex.negative.vectors.txt", delimiter="\t", dtype="str")
Then we need to process out the "missing" label strings and convert the numerical values to floats. Also, column 101 must be deleted, as it contains string class labels from the ENTS classifier of unknown origin. More information about this can be found in the previous version of this notebook.
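Note that the column deletion mentioned above is not shown in the cells that follow. A minimal sketch of how it could be done on the loaded string arrays is given here; whether the stated column index should be used as-is (0-based) or shifted by one is an assumption to check against how the vectors were written out.

import numpy as np

#hypothetical sketch, not an original cell: drop the string class-label
#column before concatenating; adjust the index if the stated "column 101"
#is counted from 1 rather than 0
posdata = np.delete(posdata, 101, axis=1)
negdata = np.delete(negdata, 101, axis=1)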
In [4]:
X = np.concatenate((posdata,negdata))
In [5]:
#find the rows that contain no "missing" values, as only these can be plotted
plottablerows = X=="missing"
plottablerows = where(plottablerows.max(axis=1)==False)
Unfortunately, some "NA" strings have sneaked in somehow, so we will have to zero these. This appears to be due to a feature in Gene_Ontology:
In [6]:
NAinds = where(X=="NA")
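To support the point above about where these values come from, we can check which feature columns the "NA" strings fall in; a quick sketch, not part of the original run:

#where returns a (row_indices, column_indices) pair, so the unique column
#indices show which features contain the "NA" strings
print(np.unique(NAinds[1]))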
In [7]:
X[NAinds] = 0.0
In [8]:
X[X=="missing"] = np.nan
X = X.astype(np.float)
Finally we can create the target vector y from what we know about the lengths of the positive and negative sets:
In [9]:
#create the target y vector
y = array([1]*len(posdata)+[0]*len(negdata))
The data has extremely high dimensionality and is therefore difficult to plot directly. The options that have been identified are parallel coordinates plots and Andrews curve plots.
Alternatively, we can try to reduce the dimensionality before plotting, for example using PCA or t-SNE.
Parallel coordinates plotting is implemented in pandas and the documentation for it can be found here. We will use the example code on a sub-sample of the data.
First, though, we can build a parallel coordinates plot by just normalising and plotting coloured feature vectors:
In [10]:
Xplot,yplot = X[plottablerows],y[plottablerows]
import sklearn.utils
Xplot,yplot = sklearn.utils.shuffle(Xplot,yplot)
In [11]:
oneindexes = where(yplot>0.5)
zeroindexes = where(yplot<0.5)
rowindexes = concatenate([oneindexes[0][:10],zeroindexes[0][:6000]])
In [12]:
#column-wise maxima used to normalise each feature; the small offset avoids division by zero
maxes = amax(abs(Xplot),axis=0) + 1e-14
In [14]:
for rowi in rowindexes:
    #normalise row values
    row = Xplot[rowi,:]/(maxes)
    #then just plot it
    if yplot[rowi] > 0.5:
        plot(row,color='green',alpha=0.5)
    else:
        plot(row,color='red',alpha=0.05)
In [15]:
rowindexes = concatenate([oneindexes[0][:100],zeroindexes[0][:100]])
In [16]:
for rowi in rowindexes:
    #normalise row values
    row = Xplot[rowi,:]/(maxes)
    #then just plot it
    if yplot[rowi] > 0.5:
        plot(row,color='green',alpha=0.1)
    else:
        plot(row,color='red',alpha=0.1)
In the above graphs, green corresponds to positive training examples and red corresponds to negative training examples.
Unfortunately, due to the extreme dimensionality these plots cannot be interpreted in as much detail as a parallel coordinates plot normally could be. As the spaces between features are very small, it is difficult to discern correlations between adjacent features. However, we can see that some features are more useful than others for discriminating between the positive and negative classes.
Andrews curve plots are also implemented in pandas.
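For reference, an Andrews curve maps each feature vector $x = (x_1, x_2, x_3, \ldots)$ to the function

$$f_x(t) = \frac{x_1}{\sqrt{2}} + x_2\sin(t) + x_3\cos(t) + x_4\sin(2t) + x_5\cos(2t) + \cdots, \qquad -\pi \le t \le \pi,$$

so each training example becomes a single curve, and examples that are close in feature space produce similar curves.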
In [17]:
import pandas as pd
from pandas.tools.plotting import andrews_curves
In [18]:
rowindexes = concatenate([oneindexes[0][:10],zeroindexes[0][:6000]])
plotdata = pd.DataFrame(Xplot[rowindexes,:]/maxes)
plotdata['training labels'] = yplot[rowindexes]
In [19]:
andrews_curves(plotdata,'training labels')
Out[19]:
In [20]:
rowindexes = concatenate([oneindexes[0][:100],zeroindexes[0][:100]])
plotdata = pd.DataFrame(Xplot[rowindexes,:]/maxes)
plotdata['training labels'] = yplot[rowindexes]
In [21]:
andrews_curves(plotdata,'training labels')
Out[21]:
Unfortunately, interpreting this result is a bit more difficult.
Before moving on to more complicated ways of reducing the dimensionality of our data, we should first try a simple technique. Luckily, PCA is implemented in Scikit-learn, so all we need to do is import it and apply it to our dataset:
In [22]:
import sklearn.decomposition
In [23]:
nones,nzeros = 1000, 600000
rowindexes = concatenate([oneindexes[0][:nones],zeroindexes[0][:nzeros]])
In [24]:
X_pca = sklearn.decomposition.PCA().fit_transform(Xplot[rowindexes,:]/maxes)
In [25]:
ones = scatter(X_pca[:nones,0],X_pca[:nones,1],c='red',alpha=0.1)
zeros = scatter(X_pca[nones:,0],X_pca[nones:,1],c='blue',marker="x",alpha=0.005)
l=legend((ones,zeros),("interactions","non-interactions"),loc=0)
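The PCA object above is discarded after fit_transform, so the explained variance of the leading components is not kept. A small sketch to check how much of the total variance the two plotted components actually capture, refitting on the same normalised sub-sample:

pca = sklearn.decomposition.PCA()
pca.fit(Xplot[rowindexes,:]/maxes)
#fraction of the total variance captured by the first two components,
#i.e. how representative the 2d scatter above is likely to be
print(pca.explained_variance_ratio_[:2].sum())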
In [26]:
X_pcaip = X_pca[:]
In [27]:
nones,nzeros = 1000, 1000
rowindexes = concatenate([oneindexes[0][:nones],zeroindexes[0][:nzeros]])
In [28]:
X_pca = sklearn.decomposition.PCA().fit_transform(Xplot[rowindexes,:]/maxes)
In [29]:
ones = scatter(X_pca[:nones,0],X_pca[:nones,1],c='red',alpha=0.05)
zeros = scatter(X_pca[nones:,0],X_pca[nones:,1],c='blue',marker="x",alpha=0.05)
l=legend((ones,zeros),("interactions","non-interactions"),loc=0)
In [30]:
X_pcaoop = X_pca[:]
This method, t-SNE, does not plot the data directly, but instead reduces the dimensionality of the data until it can be plotted in 2d or 3d. In our case we would like to produce a 2d scatter plot. Scikit-learn has integrated a Python implementation, and an example of using it in an ipython notebook can be found here.
For sparse data the documentation recommends first reducing the number of dimensions to approximately 50 using TruncatedSVD.
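One parameter worth keeping in mind for the small sub-samples used below is the perplexity, roughly the effective number of neighbours per point (30 by default in scikit-learn). A sketch of the recommended TruncatedSVD-then-t-SNE pipeline with the perplexity exposed; this helper is an illustration, not part of the original analysis:

import sklearn.decomposition
import sklearn.manifold

#assumed helper: reduce to ~50 dimensions with TruncatedSVD, then embed in
#2d with t-SNE, keeping the perplexity adjustable for small sub-samples
def tsne_embed(X, perplexity=30, learning_rate=100):
    X50 = sklearn.decomposition.TruncatedSVD(n_components=50).fit_transform(X)
    return sklearn.manifold.TSNE(perplexity=perplexity,
                                 learning_rate=learning_rate).fit_transform(X50)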
In [31]:
import sklearn.manifold
In [32]:
nones,nzeros = 10, 6000
rowindexes = concatenate([oneindexes[0][:nones],zeroindexes[0][:nzeros]])
In [33]:
X_tsvd = sklearn.decomposition.TruncatedSVD(n_components=50).fit_transform(Xplot[rowindexes,:]/maxes)
In [34]:
X_tsne = sklearn.manifold.TSNE(learning_rate=100).fit_transform(X_tsvd)
In [35]:
ones = scatter(X_tsne[:nones,0],X_tsne[:nones,1],c='red',alpha=0.6)
zeros = scatter(X_tsne[nones:,0],X_tsne[nones:,1],c='blue',marker="x",alpha=0.1)
l=legend((ones,zeros),("interactions","non-interactions"),loc=3)
In [36]:
X_tsneip = X_tsne[:]
In [37]:
nones,nzeros = 100, 100
rowindexes = concatenate([oneindexes[0][:nones],zeroindexes[0][:nzeros]])
In [38]:
X_tsvd = sklearn.decomposition.TruncatedSVD(n_components=50).fit_transform(Xplot[rowindexes,:]/maxes)
In [39]:
X_tsne = sklearn.manifold.TSNE(learning_rate=100).fit_transform(X_tsvd)
In [40]:
ones = scatter(X_tsne[:nones,0],X_tsne[:nones,1],c='red')
zeros = scatter(X_tsne[nones:,0],X_tsne[nones:,1],c='blue',marker="x")
l=legend((ones,zeros),("interactions","non-interactions"),loc=4)
In [41]:
X_tsneoop = X_tsne[:]
As the current support for directly pickling these plots is not complete, we will simply save the data used to plot the above graphs. With reference to the above code it will then be easy to recreate the graphs with annotations, titles and generally in a form better suited for publication.
In [18]:
!git annex unlock ../plots/bayes/parrallel.coordinates.plot.ip.npz
!git annex unlock ../plots/bayes/parrallel.coordinates.plot.oop.npz
In [19]:
#save in proportion arrays
rowindexes = concatenate([oneindexes[0][:10],zeroindexes[0][:6000]])
savez("../plots/bayes/parrallel.coordinates.plot.ip.npz",Xplot[rowindexes,:],yplot[rowindexes])
#save out of proportion arrays
rowindexes = concatenate([oneindexes[0][:100],zeroindexes[0][:100]])
savez("../plots/bayes/parrallel.coordinates.plot.oop.npz",Xplot[rowindexes,:],yplot[rowindexes])
In [43]:
savez("../plots/bayes/pca.ip.npz",X_pcaip)
savez("../plots/bayes/pca.oop.npz",X_pcaoop)
savez("../plots/bayes/tdsne.ip.npz",X_tsneip)
savez("../plots/bayes/tdsne.oop.npz",X_tsneoop)