Author: Florian Wagner
Email: florian.wagner@duke.edu
This notebook demonstrates the application of GO-PCA to the DMAP dataset by Novershtern et al..
In [1]:
%%capture output
# get information about GO-PCA package versions
!pip show genometools
!pip show goparser
!pip show xlmhg
!pip show gopca
In [2]:
lines = output.stdout.split('\r\n')
first = 1
for i in range(len(lines)):
if lines[i] == '---':
if not first: print '---'
else: first = 0
print '\n'.join(lines[(i+2):(i+5)])
In [3]:
# location of data files
data_file = 'DMap_data.gct' # the location of the raw data
genome_annotation_file = 'Homo_sapiens.GRCh38.79.gtf.gz' # Ensembl annotations of the human genome
gene2accession_file = 'gene2accession_2015-05-26_human.tsv.gz'
ontology_file = 'go-basic_2015-05-25.obo'
association_file = 'gene_association.goa_human_2015-05-26.gz'
# location of output files
gene_file = 'protein_coding_genes_human.tsv'
expression_file = 'dmap_expression.tsv'
go_annotation_file = 'go_annotations_human.tsv'
gopca_file = 'dmap_gopca.pickle'
signature_matrix_image_file = 'dmap_signatures_matrix.png'
signature_file = 'dmap_signatures.tsv'
signature_excel_file = 'dmap_signatures.xlsx'
The notebook will attempt to automatically download the data to the locations specified in the configuration section (see above) using curl. If you don't have curl installed, or you're not working on Linux, you can download the data yourself using the direct download links provided.
We're first downloading the (processed) DMAP data from the Differentiation Map Portal (direct download).
In [4]:
!curl -o "$data_file" \
"http://www.broadinstitute.org/dmap/downloadFile/DefaultSystemRoot/exp_1/ds_1/DMap_data.gct?downloadff=true&fileId=38"
Next, we're downloading the Ensembl human genome annotations (direct download).
In [5]:
!curl -o "$genome_annotation_file" \
"ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/Homo_sapiens.GRCh38.79.gtf.gz"
We also need data from the NCBI's Gene database (gene2accession.gz), containing a mapping between Entrez IDs and gene symbols. The full-sized file is over 700 MB in size, contains data for multiple species, and is updated daily. The NCBI does not appear to keep snapshots from older versions. To ensure that this analysis produces a consistent result, I am providing a version of this file that I downloaded on 5/26/2015, and filtered to only contain information for human (taxon ID 9606; direct download).
In [6]:
!curl -L -o "$gene2accession_file" \
"https://www.dropbox.com/s/ggjrvnigtrfue3x/gene2accession_human_2015-05-26.tsv.gz?dl=1"
Next, we're downloading the Gene Ontology (direct download).
In [7]:
!curl -o "$ontology_file" \
"http://viewvc.geneontology.org/viewvc/GO-SVN/ontology-releases/2015-05-25/go-basic.obo?revision=26059"
And finally, we're downloading the human Uniprot-GOA gene association file (direct download).
In [8]:
!curl -o "$association_file" "ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/HUMAN/gene_association.goa_human.145.gz"
In this step, we will extract a list of the names of all human protein-coding genes, and filter the DMAP data against this list. However, we will do this by first converting the converted Entrez IDs from the DMAP data to gene symbols, instead of relying on the gene symbols provided in the DMAP data. (This results in a greater number of protein-coding genes being identified.)
First, we extract the list of protein-coding genes.
In [9]:
!extract_protein_coding_genes.py -s human -a "$genome_annotation_file" -o "$gene_file"
Now, we're using the Entrez ID->Gene Symbol conversion table in combination with our list of protein-coding genes to filter the DMAP data for protein-coding genes.
In [10]:
import csv
import gzip
import numpy as np
from genometools import misc
from gopca import common
def read_dmap_expression(fn):
entrez = []
genes = []
expr = []
n = None
p = None
with open(fn) as fh:
reader = csv.reader(fh,dialect='excel-tab')
reader.next()
n,p = [int(f) for f in reader.next()]
samples = reader.next()[2:]
assert len(samples) == p
for l in reader:
entrez.append(l[0])
genes.append(l[1])
expr.append(l[2:])
E = np.float64(expr)
assert E.shape[0] == n
print E.shape
return entrez, genes,samples,E
def read_entrez2gene(fn):
e2g = {}
with gzip.open(fn) as fh:
reader = csv.reader(fh,dialect='excel-tab')
for l in reader:
e2g[l[1]] = l[15]
return e2g
protein_coding = misc.read_single(gene_file)
all_genes = set(protein_coding)
#print len(all_genes)
e2g = read_entrez2gene(gene2accession_file)
entrez_dmap, genes_dmap, samples_dmap, E_dmap = read_dmap_expression(data_file)
p,n = E_dmap.shape
known_entrez = 0
unknown = 0
genes = []
E = []
for i in range(p):
# try gene conversion...
# if it fails, or if we don't recognize the converted name as a protein-coding gene, skip it
g = None
try:
converted = e2g[entrez_dmap[i]]
except KeyError:
pass
else:
known_entrez += 1
if converted in all_genes:
genes.append(converted)
E.append(E_dmap[i,:])
E = np.float64(E)
# sort genes alphabetically
a = np.lexsort([genes])
genes = [genes[i] for i in a]
E = E[a,:]
print known_entrez
print E.shape
common.write_expression(expression_file,genes,samples_dmap,E)
In [11]:
min_genes = 5
max_genes = 200
evidence = ['IDA','IGI','IMP','ISO','ISS','IC','NAS','TAS']
ev_str = ' '.join(evidence)
!gopca_extract_go_annotations.py -o "$go_annotation_file" -g "$gene_file" -t "$ontology_file" -a "$association_file" \
-e $ev_str --min-genes-per-term $min_genes --max-genes-per-term $max_genes --part-of-cc-only
In [12]:
!go-pca.py -L 1000 -e "$expression_file" -t "$ontology_file" -a "$go_annotation_file" -o "$gopca_file" \
-s 123456789 --go-part-of-cc-only
In [13]:
from IPython.display import Image
dpi = 90.0
!gopca_plot_signature_matrix.py -g "$gopca_file" -o "$signature_matrix_image_file" -r $dpi -t
Image(filename=signature_matrix_image_file,width=600)
Out[13]:
In [14]:
!gopca_extract_signatures.py -g "$gopca_file" -o "$signature_file"
In [15]:
!gopca_extract_signatures_excel.py -g "$gopca_file" -o "$signature_excel_file"
Copyright (c) 2015 Florian Wagner.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.