The aim is to load the Entrez IDs for the active-zone proteins of interest from the lists Colin provided, then query the NCBI database with them so I can fetch the database entry for any one of them. Seems like a fairly useful thing to do.
Note: NCBI requires you to specify an email address when accessing their services.
In [2]:
from Bio import Entrez
Entrez.email = "gavingray1729@gmail.com"
Navigate to the right file:
In [4]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS/
In [5]:
ls
Then load in the Entrez-IDs from the file:
In [17]:
import csv
baitids = []
# the with-block closes the file once all rows have been read
with open("baits_entrez_ids.csv") as f:
    for row in csv.reader(f):
        # skip empty rows
        if row:
            # the ID is in the first column
            baitids.append(row[0])
The code below is practically copied from the IPython notebook tutorial.
In [20]:
efetch_handle = Entrez.efetch(db="nucleotide", id=baitids[0], rettype="gb", retmode="text")
Ok, so the next step uses Biopython to parse the file it grabs from the server. Since that is what the tutorial covers, I'm going to focus on sequence file parsing.
Basically, you just point SeqIO at the file, tell it what type of file it's looking at, and it does the rest.
The code above retrieves a GenBank file, so we can parse that:
In [21]:
from Bio import SeqIO
In [22]:
ncbi_record = SeqIO.read(efetch_handle, 'genbank')
print(ncbi_record)
So it works!
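To fetch every bait rather than just the first, the same call can probably be batched, since efetch accepts a comma-separated list of IDs and SeqIO.parse reads a handle holding several records. A minimal sketch, written for Python 3 and assuming the IDs are all valid for the nucleotide database; the names chunked and fetch_genbank_records and the batch size of 50 are my own, not from the tutorial:

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_genbank_records(ids, email, batch_size=50):
    """Fetch GenBank records for many IDs, one efetch call per batch.

    Hypothetical helper: the batch size is an arbitrary choice.
    """
    # Imported here so chunked() can be used without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address
    records = []
    for batch in chunked(ids, batch_size):
        handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                               rettype="gb", retmode="text")
        try:
            # parse() iterates over every record in the handle,
            # whereas read() expects exactly one
            records.extend(SeqIO.parse(handle, "genbank"))
        finally:
            handle.close()
    return records
```

Calling fetch_genbank_records(baitids, Entrez.email) should then give a list of SeqRecord objects, one per ID that NCBI returns a record for.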
Now we want to use the Entrez IDs to retrieve time course gene expression data from (to start with) GEO. Luckily, there is a section on this in the Biopython tutorial and cookbook. Unluckily, it doesn't appear to retrieve exactly what I want. Better check that exactly what I want is actually on there.
This page was given as an example. I think the .txt file at the bottom of that page might be the experimental time series data. Could open it up and have a look at it.
First, I think I'll follow through the cookbook example:
In [32]:
# think I just need to change the database to gds
efetch_handle = Entrez.efetch(db="gds", id=baitids[0])
In [33]:
record = efetch_handle.read()
In [34]:
print(record)
So that doesn't work exactly. According to the cookbook the GEO files can't be accessed in this way anyway:
Unfortunately, at the time of writing the NCBI don’t seem to support downloading GEO files using Entrez (not as XML, nor in the Simple Omnibus Format in Text (SOFT) format).
The solution apparently is to track down the link yourself and grab the file over FTP. Of course, we can do that with Python as well, probably.
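Grabbing a GEO file over FTP from Python could look something like the sketch below. It is written for Python 3 (urllib.request) and assumes the GEO FTP layout in which a dataset sits in a folder named after its accession with the last three digits replaced by "nnn"; worth checking against the FTP site before relying on it. The function names are hypothetical, and GDS507 is just an example accession, not one of the baits:

```python
import re
import urllib.request

GEO_FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/geo/datasets"


def geo_dataset_soft_url(accession):
    """Build the FTP URL of the gzipped SOFT file for a GDS accession.

    Assumes the layout .../datasets/GDSnnn/GDS507/soft/GDS507.soft.gz
    """
    match = re.fullmatch(r"GDS(\d+)", accession)
    if match is None:
        raise ValueError("expected an accession like GDS507, got %r" % accession)
    digits = match.group(1)
    # Folder name: drop the last three digits and append 'nnn'
    folder = "GDS" + digits[:-3] + "nnn"
    return "%s/%s/%s/soft/%s.soft.gz" % (GEO_FTP_ROOT, folder, accession, accession)


def download_geo_dataset(accession, destination):
    """Download the SOFT file for `accession` to the path `destination`."""
    urllib.request.urlretrieve(geo_dataset_soft_url(accession), destination)
    return destination
```

For example, download_geo_dataset("GDS507", "GDS507.soft.gz") would save the compressed file in the current directory, which can then be opened with gzip.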
As an example, searching the GEO database with the Entrez ID for the first entry above, X55968, turns up a list of experiments involving this gene (as you might expect). It's unclear how to get time series experiment data from this, plus I thought the time series data we were looking for covered a whole load of genes at the same time.
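The same search can probably be done from Python with Entrez.esearch against the gds database, narrowing it with extra keywords. A rough sketch, written for Python 3; the helper names and the choice of "time course" as the narrowing phrase are my own assumptions, and I haven't checked how well free-text keywords pick out time series experiments:

```python
def build_geo_query(identifier, keywords="time course"):
    """Combine an identifier with a quoted phrase using Entrez's AND operator."""
    return '%s AND "%s"' % (identifier, keywords)


def search_geo(identifier, email, keywords="time course", retmax=20):
    """Return the GEO (gds) record IDs matching the identifier and keywords.

    Hypothetical helper; the query wording is a guess, not a tested recipe.
    """
    # Imported here so build_geo_query() works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="gds",
                            term=build_geo_query(identifier, keywords),
                            retmax=retmax)
    try:
        result = Entrez.read(handle)
    finally:
        handle.close()
    return result["IdList"]
```

Something like search_geo(baitids[0], Entrez.email) would return a list of ID strings, which still leaves the job of working out which of those experiments are actually time series.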
In [40]:
print(baitids)