The pulldown experiments let us identify proteins localised to the presynaptic active zone, which we are interested in studying. Beyond just identifying the proteins, they also produce a couple of measures which can be useful for predicting whether proteins interact.
This notebook looks at the second of these two: the affinity measure.
The affinity measure, which has already been extracted from the pulldown data, is based on the paper by Xie et al. Their co-complexed score, which they call the C2S score, has a probabilistic basis, which is ideal for our use. All we need to do is extract a value for each protein pair we're looking at and see what kind of coverage these values give us.
Thanks to the convenient structure of the file we are using, we can extract this feature with the functionality of ocbio.extract. Each row of the file holds a protein pair in Entrez Gene ID format together with its C2S value.
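To make that layout concrete, here is a minimal sketch of reading the file into an order-independent lookup (the exact column order and delimiter of unique_data_ppi_coor_C2S_entrez.csv are assumptions here; ocbio.extract handles this for us below):

import csv

def load_c2s(path):
    # assumed row layout: EntrezA, EntrezB, C2S (comma-separated)
    scores = {}
    with open(path) as f:
        for row in csv.reader(f):
            a, b, c2s = row[0], row[1], float(row[2])
            # frozenset keys make lookups order-independent
            scores[frozenset((a, b))] = c2s
    return scores

# e.g. c2s = load_c2s("affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv")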
In [2]:
cd ../../
Unfortunately, when running the ocbio.extract code naively we find that the coverage of this feature over the pulldown protein pairs is poor.
To improve the coverage we will first use the zeromissinginternal option of the FeatureVectorAssembler. This replaces the value with zero for any protein pair whose proteins the database knows about but for which no value is stored.
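A minimal sketch of that behaviour (an illustration of the description above, not the actual ocbio.extract implementation):

def lookup_zeromissinginternal(pair, scores, known_proteins):
    if pair in scores:
        return scores[pair]
    if all(p in known_proteins for p in pair):
        # both proteins are known to the database, just not this pairing
        return 0.0
    # at least one protein is outside the database entirely
    return "missing"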
Secondly, we will interpolate the expected value of this feature from other features with better coverage. To do this, we first generate a set of feature vectors over the bait and prey combinations and use them to train a linear regression model. We can then pickle the trained regressor, along with the table describing the feature vectors it expects, and pass both as parser options when generating the feature vectors for the classification task; this plan is sketched below.
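Sketched roughly, the plan looks like this (the function and file names are hypothetical; the real wiring happens through the ocbio.extract parser options):

import pickle

def train_and_store(regressor, X, y, path="interpolation.regressor.pickle"):
    # fit on pairs where the affinity value is present...
    regressor.fit(X, y)
    # ...and pickle it for the feature-vector parser to use later
    with open(path, "wb") as f:
        pickle.dump(regressor, f)

def interpolate(regressor, X, y, missing):
    # predict the affinity value for missing pairs from the
    # better-covered features
    y = y.copy()
    y[missing] = regressor.predict(X[missing])
    return y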
Loading the features which were used for classification in the pipeline prototype notebook:
In [2]:
import sys,csv
In [4]:
sys.path.append("opencast-bio/")
In [5]:
import ocbio.extract
In [6]:
reload(ocbio.extract)
Out[6]:
In [7]:
!git annex unlock datasource.proto.tab
In [8]:
f = open("datasource.proto.tab","w")
c = csv.writer(f,delimiter="\t")
# the HIPPIE feature
c.writerow(["HIPPIE/hippie_current.txt","HIPPIE/feature.HIPPIE.2.db","protindexes=(1,3);valindexes=(4);zeromissing=1"])
# Gene Ontology features
c.writerow(["Gene_Ontology","Gene_Ontology","generator=geneontology/testgen.pickle"])
# iRefIndex features
c.writerow(["iRefIndex","iRefIndex","generator=iRefIndex/human.iRefIndex.Entrez.1ofk.pickle"])
# STRING features
c.writerow(["STRING","STRING","generator=string/human.Entrez.string.pickle"])
# ENTS summary feature
c.writerow(['ENTS_summary','ENTS_summary','generator=ents/human.Entrez.ENTS.summary.pickle'])
f.close()
In [16]:
assembler = ocbio.extract.FeatureVectorAssembler("datasource.proto.tab", verbose=True)
In [12]:
assembler.regenerate(verbose=True)
In [17]:
assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
"features/pulldown.interactions.interpolate.vectors.txt", verbose=True)
In [18]:
f = open("datasource.affinity.tab","w")
c = csv.writer(f,delimiter="\t")
# just the affinity feature
c.writerow(["affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv",
"affinityresults/results2/affinity.Entrez.db","zeromissinginternal=1"])
f.close()
In [7]:
!git annex unlock affinityresults/results2/affinity.Entrez.db
In [8]:
assembler = ocbio.extract.FeatureVectorAssembler("datasource.affinity.tab", verbose=True)
In [9]:
assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
"features/pulldown.interactions.interpolate.affinity.targets.txt", verbose=True)
Scikit-learn has implementations of linear regression we can train to interpolate the missing values; we start with ordinary least squares below and later try ridge regression, which penalises large coefficients to avoid overfitting on this large dataset. We will also use 10-fold cross-validation, scoring each model by the coefficient of determination (R^2) on the held-out test set to estimate its effectiveness.
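For reference, the manual fold loop further down is roughly equivalent to this one-liner (a sketch assuming X and y are loaded as in the following cells; cross_val_score uses the estimator's R^2 score by default for regressors):

import sklearn.linear_model
import sklearn.cross_validation

cv_scores = sklearn.cross_validation.cross_val_score(
    sklearn.linear_model.LinearRegression(), X, y, cv=10)
print cv_scores.mean()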
In [3]:
import sklearn.linear_model
In [4]:
#load vectors into memory (the bare loadtxt, plot, mean, etc. in this
#notebook presumably come from running under %pylab inline)
#loading X vectors
X = loadtxt("features/pulldown.interactions.interpolate.vectors.txt")
In [3]:
#loading y targets
y = loadtxt("features/pulldown.interactions.interpolate.affinity.targets.txt",dtype=str)
In [4]:
missinglabels = y == "missing"
In [5]:
#zero missing values
y[missinglabels] = 0.0
In [6]:
#convert to float
y = y.astype(np.float)
In [9]:
import sklearn.utils
In [10]:
X,y = sklearn.utils.shuffle(X,y)
In [11]:
import sklearn.cross_validation
In [12]:
kf = sklearn.cross_validation.KFold(y.shape[0],10)
In [13]:
for train,test in kf:
    print train,test
In [29]:
scores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the regressor
    linreg = sklearn.linear_model.LinearRegression()
    linreg.fit(X_train,y_train)
    #test the regressor (score is R^2 on the held-out fold)
    scores.append(linreg.score(X_test,y_test))
In [30]:
print scores
In [31]:
mean(scores)
Out[31]:
According to the documentation:
The coefficient R^2 is defined as (1 - u/v), where u is the regression sum of squares ((y_true - y_pred) ** 2).sum() and v is the residual sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
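Note that, despite the names in this (older) docstring, u is what is usually called the residual sum of squares and v the total sum of squares. A tiny worked example of the formula:

import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
u = ((y_true - y_pred) ** 2).sum()         # 0.1
v = ((y_true - y_true.mean()) ** 2).sum()  # 5.0
print 1 - u / v                            # 0.98, close to a perfect fit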
So this isn't a very good regression model. Next, we try an ensemble method: the AdaBoost regressor.
In [15]:
import sklearn.ensemble
In [17]:
adascores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the regressor
    ada = sklearn.ensemble.AdaBoostRegressor()
    ada.fit(X_train,y_train)
    #test the regressor
    adascores.append(ada.score(X_test,y_test))
    break
Had to break the above loop after one iteration as it was taking too long to run.
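AdaBoostRegressor fits 50 decision-tree estimators by default, which goes some way to explaining the run time on a dataset of this size. A cheaper run (an untested sketch) would cut the number of boosting rounds:

# trade accuracy for speed with fewer boosting rounds
ada = sklearn.ensemble.AdaBoostRegressor(n_estimators=10)
ada.fit(X_train, y_train)
print ada.score(X_test, y_test)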
In [18]:
adascores
Out[18]:
So the regression sum of squares must be three times larger than the residual sum of squares. Obviously, that's worse than the above.
In [24]:
Xycov = cov(X.T,y.T)
In [28]:
print Xycov.shape
In [38]:
print Xycov[:,-1][Xycov[:,-1]>1]
So several of the features appear strongly associated with the target, though raw covariances are scale-dependent; it is worth checking that they are actually correlated, as sketched below.
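One way to check (a sketch): normalise to Pearson correlation coefficients, which are scale-free, and look for large values:

import numpy as np

# correlation of every feature column with the target
corr = np.corrcoef(X.T, y)[:-1, -1]
print corr[np.abs(corr) > 0.5]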
Trying a ridge regression model with different values for alpha (only alpha=1.0 appears below; a sweep is sketched afterwards):
In [123]:
ridgescores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the regressor
    ridge = sklearn.linear_model.Ridge(alpha=1.0)
    ridge.fit(X_train,y_train)
    #test the regressor
    ridgescores.append(ridge.score(X_test,y_test))
    break
In [47]:
ridgescores
Out[47]:
Still not better than simple linear regression.
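A sweep over the regularisation strength, run on a single fold, would look something like this (a sketch; these alpha values are arbitrary choices):

train, test = next(iter(kf))
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    ridge = sklearn.linear_model.Ridge(alpha=alpha)
    ridge.fit(X[train], y[train])
    print alpha, ridge.score(X[test], y[test])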
In [49]:
#plot the second-to-last feature against the target for the first 100 pairs
plot(X[:100,-2],y[:100])
Out[49]:
In [61]:
import time
In [125]:
N = 100
In [191]:
XN = X[N-100:N,:]
#normalise each feature column to its maximum for plotting
plot(XN/amax(XN,axis=0))
plot(y[N-100:N]/max(abs(y[N-100:N])), label="target")
ypred = ridge.predict(X[N-100:N,:])
plot(ypred/max(abs(ypred)))
legend()
#advance the window for the next run of this cell
N += 100
In [159]:
amax(XN,axis=0)
Out[159]:
In [156]:
len(amax(X,axis=0))
Out[156]:
So it looks like there is not much point to this approach; it would be just as useful to simply fill missing values with an average value.
In [7]:
print "Average value of pulldown affinity score over pulldown data: {0}".format(mean(y))
In [8]:
import pickle
In [9]:
f = open("affinityresults/results2/affinity.pulldown.average.pickle","wb")
pickle.dump([mean(y)],f)
f.close()
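Downstream, the stored average might be consumed like this (a hypothetical sketch; the real hookup is through the ocbio.extract parser options):

import pickle

with open("affinityresults/results2/affinity.pulldown.average.pickle", "rb") as f:
    average_c2s = pickle.load(f)[0]

def fill_missing(value):
    # substitute the dataset-wide average wherever no C2S value exists
    return average_c2s if value == "missing" else float(value)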