In [10]:
%autosave 0
from __future__ import print_function
In [11]:
import glob
import barnaba as bb
import numpy as np
flist = glob.glob("snippet/*.pdb")
if(len(flist)==0):
print("# You need to run the example example8_snippet.ipynb")
exit()
# calculate G-VECTORS for all files
gvecs = []
for f in flist:
gvec,seq = bb.dump_gvec(f)
assert len(seq)==4
gvecs.extend(gvec)
Then, we reshape the array so that has the dimension $(N,n \ast 4\ast 4)$, where N is the number of frames and n is the number of nucleotides
In [12]:
gvecs = np.array(gvecs)
gvecs = gvecs.reshape(149,-1)
print(gvecs.shape)
C. We project the data using a simple principal component analysis on the g-vectors
In [13]:
import barnaba.cluster as cc
# calculate PCA
v,w = cc.pca(gvecs,nevecs=3)
print("# Cumulative explained variance of component: 1=%5.1f 2:=%5.1f 3=%5.1f" % (v[0]*100,v[1]*100,v[2]*100))
In [14]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
plt.scatter(w[:,0],w[:,1])
plt.xlabel("PC1")
plt.ylabel("PC2")
Out[14]:
D. We make use of DBSCAN in sklearn to perform clustering. The function cc.dbscan takes four arguments:
i. the list of G-vectors gvec
ii. the list of labels for each point
iii. the eps value
iv. min_samples
v. (optional) the weight of the samples for non-uniform clustering
The function outputs some information on the clustering: the number of clusters, the number of samples assigned to clusters (non noise), and silouetthe.
For each cluster it reports the size, the maximum eRMSD distance between samples in a cluster (IC=intra-cluster), the median intra-cluster eRMSD, the maximum and median distance from the centroid.
In [15]:
new_labels, center_idx = cc.dbscan(gvecs,range(gvecs.shape[0]),eps=0.35,min_samples=8)
We can now color the PCA according to the different cluster and display the centroid as a label:
In [16]:
cp = sns.color_palette("hls",len(center_idx)+1)
colors = [cp[j-1] if(j!=0) else (0.77,0.77,0.77) for j in new_labels]
size = [40 if(j!=0) else 10 for j in new_labels]
#do scatterplot
plt.scatter(w[:,0],w[:,1],s=size,c=colors)
for i,k in enumerate(center_idx):
plt.text(w[k,0],w[k,1],str(i),ha='center',va='center',fontsize=25)
E. We finally visualise the 4 centroids:
In [17]:
import py3Dmol
cluster_0 = open(flist[center_idx[0]],'r').read()
cluster_1 = open(flist[center_idx[1]],'r').read()
cluster_2 = open(flist[center_idx[2]],'r').read()
cluster_3 = open(flist[center_idx[3]],'r').read()
p = py3Dmol.view(width=900,height=600,viewergrid=(2,2))
#p = py3Dmol.view(width=900,height=600)
#p.addModel(query_s,'pdb')
p.addModel(cluster_0,'pdb',viewer=(0,0))
p.addModel(cluster_1,'pdb',viewer=(0,1))
p.addModel(cluster_2,'pdb',viewer=(1,0))
p.addModel(cluster_3,'pdb',viewer=(1,1))
#p.addModel(hit_0,'pdb',viewer=(0,1))
p.setStyle({'stick':{}})
p.setBackgroundColor('0xeeeeee')
p.zoomTo()
p.show()
Out[17]:
It is interesting to observe that cluster 0 and 1 (in close proximity in the PCA projection), correspond to A-form-like structures. Cluster 2 corresponds to the classic GNRA fold.