I'm aiming to load the Entrez IDs for the active zone proteins of interest from the lists Colin has provided, and to connect these to the NCBI database so I can fetch the database entry for any one of them. Seems like a fairly useful thing to do.

Note: NCBI requires you to specify an email address when accessing their services.


In [2]:
from Bio import Entrez
Entrez.email = "gavingray1729@gmail.com"

Navigate to the right file:


In [4]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS/


/home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS

In [5]:
ls


baits.csv  baits_entrez_ids_ActiveZone.csv  baits_entrez_ids.csv  ensembl_bait_human_ids.csv

Then load in the Entrez-IDs from the file:


In [17]:
import csv
baitids = []
# the with-block closes the file once the IDs have been read
with open("baits_entrez_ids.csv") as f:
    for row in csv.reader(f):
        # skip empty rows
        if row:
            baitids.append(row[0])

The code below is practically copied from the IPython notebook tutorial.


In [20]:
efetch_handle = Entrez.efetch(db="nucleotide", id=baitids[0], rettype="gb", retmode="text")

OK, so the next step uses Biopython to parse the file it's going to grab from the server. Because that's what the tutorial covers, I'm going to focus on sequence file parsing.

Basically, you just point SeqIO at the file and tell it what type of file it's looking at and it does the rest.

The code above retrieves a GenBank file, so we can parse that:


In [21]:
from Bio import SeqIO

In [22]:
ncbi_record = SeqIO.read(efetch_handle, 'genbank')
print ncbi_record


ID: X55968.1
Name: X55968
Description: Mouse mRNA for cGMP phosphodiesterase beta-subunit.
Number of features: 3
/sequence_version=1
/source=Mus musculus (house mouse)
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Glires', 'Rodentia', 'Sciurognathi', 'Muroidea', 'Muridae', 'Murinae', 'Mus', 'Mus']
/keywords=["3',5'-cyclic-nucleotide phosphodiesterase", 'membrane protein', 'transducin']
/references=[Reference(title='Retinal degeneration in the rd mouse is caused by a defect in the beta subunit of rod cGMP-phosphodiesterase', ...), Reference(title='Direct Submission', ...)]
/accessions=['X55968']
/data_file_division=ROD
/date=21-OCT-2008
/organism=Mus musculus
/gi=53616
Seq('ATGAGCNNNAGTGGGGAACAGGTACGCAGCTTCCTGGATGGGAACCCCACGTTT...ATA', IUPACAmbiguousDNA())

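So the fetch works! Note, though, that the record has /gi=53616: the nucleotide database has treated the ID as a GI number rather than as an Entrez Gene ID, so this mouse mRNA is probably not the bait gene itself.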
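The IDs in baits_entrez_ids.csv are Entrez Gene IDs, so a lookup against the gene database may be closer to the original aim than a nucleotide fetch. Below is a minimal sketch, assuming Biopython's Entrez.esummary and Entrez.read. The helper names batched and fetch_gene_summaries and the batch size of 20 are made up for illustration, and the parsed result is returned as-is because its structure depends on what NCBI sends back.

```python
from itertools import islice


def batched(ids, size):
    """Yield successive lists of at most `size` IDs."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fetch_gene_summaries(gene_ids, email, batch_size=20):
    """Request document summaries from the NCBI gene database.

    Hypothetical helper: needs Biopython and network access.
    """
    # imported here so that batched() can be used without Biopython
    from Bio import Entrez

    Entrez.email = email
    results = []
    for batch in batched(gene_ids, batch_size):
        # Entrez accepts several IDs per request as a comma-separated string
        handle = Entrez.esummary(db="gene", id=",".join(batch))
        results.append(Entrez.read(handle))
        handle.close()
    return results
```

Something like fetch_gene_summaries(baitids[:5], Entrez.email) would then be a cheap way to check that the first few IDs resolve to the expected genes before requesting the whole list.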


Using Entrez-IDs to retrieve gene expression data

Now we want to use the Entrez-IDs to retrieve time course gene expression data from (to start with) GEO. Luckily, there is a section on this in the Biopython tutorial and cookbook. Unluckily, it doesn't appear to retrieve exactly what I want. Better check that exactly what I want is on there.

This page was given as an example. I think the .txt file at the base of that might be the experimental time series data. Could open it up and have a look at it.

First, I think I'll follow through the cookbook example:


In [32]:
# think I just need to change the database to gds
efetch_handle = Entrez.efetch(db="gds", id=baitids[0])

In [33]:
record  = efetch_handle.read()

In [34]:
print record


1.  Error occurred: cannot get document summary
Accession: 	ID: 53616

So that doesn't work exactly. According to the cookbook the GEO files can't be accessed in this way anyway:

Unfortunately, at the time of writing the NCBI don’t seem to support downloading GEO files using Entrez (not as XML, nor in the Simple Omnibus Format in Text (SOFT) format).

The solution apparently is to track down the link yourself and grab it through ftp. Of course, we can do that with Python as well, probably.
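A rough sketch of what grabbing a file over FTP with Python might look like. It assumes the GEO FTP layout in which GDS datasets are grouped into directories by thousands (GDS507 under GDSnnn, GDS1001 under GDS1nnn); that layout, the function names geo_soft_url and download_soft, and the GDS507 accession are assumptions for illustration rather than anything taken from the cookbook. It is written for Python 3's urllib.request; under Python 2 the equivalent call is urllib.urlretrieve.

```python
import gzip
import shutil
from urllib.request import urlretrieve

# Assumed root of the GEO datasets tree on the NCBI FTP server
GEO_FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/geo/datasets"


def geo_soft_url(accession):
    """Build the FTP URL of the gzipped SOFT file for a GDS accession."""
    if not accession.startswith("GDS") or not accession[3:].isdigit():
        raise ValueError("expected an accession like 'GDS507'")
    number = accession[3:]
    # Datasets are grouped by thousands: GDS507 -> GDSnnn, GDS1001 -> GDS1nnn
    range_dir = "GDS" + number[:-3] + "nnn"
    return "{0}/{1}/{2}/soft/{2}.soft.gz".format(GEO_FTP_ROOT, range_dir, accession)


def download_soft(accession, destination):
    """Download a GDS SOFT file and write it uncompressed to `destination`."""
    # urlretrieve saves to a temporary file and returns its path
    compressed_path, _headers = urlretrieve(geo_soft_url(accession))
    with gzip.open(compressed_path, "rb") as src, open(destination, "wb") as out:
        shutil.copyfileobj(src, out)
```

After download_soft("GDS507", "GDS507.soft"), the uncompressed file could be opened and handed to Bio.Geo.parse, which is how the cookbook reads SOFT files.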

As an example, searching the GEO database with the accession of the first record above (X55968) turns up a list of experiments involving this gene, as you might expect. It's unclear how to get time series experiment data from this, plus I thought the time series data we were looking for covered a whole load of genes at the same time.
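The same kind of search can probably be scripted with Entrez.esearch against the gds database, which is what the cookbook does for its GEO example. A tentative sketch follows: the function names gds_query and search_gds, the free-text "time course" phrase, and the [Entry Type] restriction are all assumptions about what might narrow the results, not a tested recipe for finding time series experiments.

```python
def gds_query(keyword, extra_terms=("time course",), entry_type="gds"):
    """Compose an Entrez search term for the GEO DataSets database."""
    parts = [keyword]
    # quote each extra phrase so it is searched as a phrase
    parts.extend('"{0}"'.format(term) for term in extra_terms)
    if entry_type:
        # assumed field name for restricting hits to curated datasets
        parts.append("{0}[Entry Type]".format(entry_type))
    return " AND ".join(parts)


def search_gds(keyword, email, retmax=20):
    """Return the GEO DataSets UIDs matching the composed query.

    Hypothetical helper: needs Biopython and network access.
    """
    # imported here so that gds_query() can be used without Biopython
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.esearch(db="gds", term=gds_query(keyword), retmax=retmax)
    result = Entrez.read(handle)
    handle.close()
    return result["IdList"]
```

Calling gds_query("X55968") gives the string X55968 AND "time course" AND gds[Entry Type], which can also be pasted into the GEO DataSets search box to see whether the filter is too strict.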


In [40]:
print baitids


['53616', '118', '273', '161', '320', '321', '8874', '523', '526', '8927', '8618', '8573', '57524', '8476', '26507', '8506', '26047', '10815', '10814', '1457', '1460', '1495', '1496', '1499', '1500', '23191', '23312', '80331', '1759', '1808', '56896', '29924', '58513', '28964', '348980', '6453', '3736', '3737', '8514', '8825', '64130', '57497', '78999', '145581', '4035', '23263', '4355', '8775', '63908', '54550', '4842', '9722', '4905', '9253', '5058', '27445', '8500', '8499', '8541', '5534', '5864', '10928', '22999', '22895', '6305', '23513', '56904', '246213', '6616', '9892', '6801', '6804', '6812', '134957', '9515', '9900', '6853', '6854', '8224', '8867', '6855', '6857', '7249', '6844', '7415', '143187', '10490', '8936', '10444']