The HMR (Human-Mouse-Rat) database, developed at the University of Edinburgh, combines the BioGRID, CCSB, HPRD, IntAct and MDC databases. If you are on the Informatics local network the web service to query this database can be found here. The web service takes a list of proteins and returns the interactions between them.
To cover as many proteins as possible, we simply use all human Entrez IDs that we know map to Ensembl IDs. The first step is to load the dictionary that holds this mapping:
In [6]:
import csv, pickle
In [3]:
cd ../../geneconversion/
In [7]:
# pickle files should be opened in binary mode
f = open("human.gene2ensemble.pickle", "rb")
gene2ensembl = pickle.load(f)
f.close()
As we ultimately plan to use this feature on the bait and prey proteins, we should make sure that these proteins are included in the list. We load their gene IDs and put all of the IDs into a single set:
In [20]:
cd ../forGAVIN/pulldown_data/BAITS/
In [21]:
f = open("baits_entrez_ids.csv")
# flatten the CSV rows into a single list of IDs
baits = [entry for row in csv.reader(f) for entry in row]
f.close()
In [14]:
cd ../PREYS/
In [18]:
f = open("prey_entrez_ids.csv")
# flatten the CSV rows into a single list of IDs
preys = [entry for row in csv.reader(f) for entry in row]
f.close()
In [22]:
# union of the dictionary keys with the bait and prey IDs
proteinIDs = set(gene2ensembl) | set(baits) | set(preys)
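As a quick sanity check on the step above, the following sketch builds the union in the same way and lists any bait or prey IDs that have no entry in the mapping. The `mapping` dictionary and all IDs here are made-up stand-ins for `gene2ensembl`, `baits` and `preys`.

```python
# Hypothetical stand-ins for gene2ensembl, baits and preys.
mapping = {"1": ["ENSG_A"], "2": ["ENSG_B"]}
example_baits = ["2", "3"]
example_preys = ["4"]

# Iterating over a dict yields its keys, so set(mapping) is the set of Entrez IDs.
all_ids = set(mapping) | set(example_baits) | set(example_preys)

# Pulldown IDs that are missing from the mapping.
unmapped = (set(example_baits) | set(example_preys)) - set(mapping)

print(sorted(all_ids))
print(sorted(unmapped))
```

Note that the union only removes duplicates if every source stores its IDs as the same type; `csv.reader` always yields strings.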
In [23]:
cd ../../../geneconversion/
In [27]:
f=open("human.entrez.HMR.flat.txt","w")
csv.writer(f,delimiter="\n").writerow(list(proteinIDs))
f.close()
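Setting `delimiter="\n"` makes `csv.writer` emit a single row as one ID per line. A roughly equivalent sketch using plain string joining is given here; the file name `example_ids.txt` and the IDs are arbitrary examples, and the file is written to a temporary directory.

```python
import os
import tempfile

example_ids = {"1275", "4124", "10"}  # arbitrary example IDs

# Write into a temporary directory so the working tree is left untouched.
path = os.path.join(tempfile.mkdtemp(), "example_ids.txt")
with open(path, "w") as f:
    # A set has no defined order, so sort for a reproducible file.
    f.write("\n".join(sorted(example_ids)) + "\n")

# Read the file back, skipping any blank lines.
with open(path) as f:
    read_back = [line.strip() for line in f if line.strip()]

print(read_back)
```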
In [30]:
cd ../HMR/
In [31]:
!head webformoutput.csv
The fastest way to create a usable feature from the web form output (webformoutput.csv) is to do what was done in the STRING notebook: pickle an object that returns a 1 if the pair is present and a 0 otherwise. This is slightly suboptimal, in that any Entrez IDs we did not supply to the web form won't be represented, but the list we provided was fairly comprehensive, so almost all human IDs should be covered.
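The behaviour described above (1 for a known pair, 0 otherwise, regardless of the order of the two IDs) can be sketched with a small dictionary subclass. This is not the `ocbio.ppipred.features` implementation, whose internals are not shown in this notebook; the class `PairFeature` and the example pairs are hypothetical.

```python
class PairFeature(dict):
    """Maps frozenset({id_a, id_b}) to ['1'], with a default for unseen pairs."""

    def __init__(self, pairs, default="0"):
        super(PairFeature, self).__init__((frozenset(p), ["1"]) for p in pairs)
        self.default = [default]

    def __missing__(self, key):
        # dict.__getitem__ calls this when the key is absent.
        return self.default


lookup = PairFeature([("1", "2"), ("3", "4")])  # hypothetical interactions

print(lookup[frozenset(["2", "1"])])        # known pair, queried in reverse order
print(lookup[frozenset(["1275", "4124"])])  # pair not in the table
```

Because `__missing__` only returns the default and does not store it, looking up unknown pairs does not grow the table.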
In [32]:
import sys
In [33]:
sys.path.append("../opencast-bio/")
In [34]:
import ocbio.ppipred
In [35]:
featuredict = {}
f = open("webformoutput.csv")
for line in csv.reader(f):
    # key on the unordered set of IDs in the row, so (a, b) and (b, a) match
    featuredict[frozenset(line)] = ['1']
f.close()
In [36]:
features = ocbio.ppipred.features(featuredict,1)
In [37]:
# take an arbitrary key that is known to be present
realkey = next(iter(featuredict))
fakekey = frozenset(["1275","4124"])
In [38]:
features[realkey]
Out[38]:
In [39]:
features[fakekey]
Out[39]:
In [40]:
f = open("human.HMR.features.pickle","wb")
pickle.dump(features,f)
f.close()
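To reuse the pickled object later, the file has to be opened in binary mode, and the class of the pickled object must be importable at load time (so `ocbio.ppipred` needs to be on the path, as set up earlier). A minimal round trip, with a plain dictionary standing in for the feature object and an in-memory buffer standing in for the file:

```python
import io
import pickle

# A plain dict stands in for the ocbio.ppipred.features object (assumption).
example = {frozenset(["1", "2"]): ["1"]}

buf = io.BytesIO()  # in-memory substitute for a file opened with "wb"
pickle.dump(example, buf)
buf.seek(0)         # rewind before reading, as if reopening with "rb"
restored = pickle.load(buf)

print(restored[frozenset(["2", "1"])])
```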