Download gene expression and clinical data from the UCSC Xena Toil re-compute dataset and the Treehouse Childhood Cancer Initiative, wrangle, and store in an hdf5 file for quick loading machine learning. This dataset comprises gene expression data for twenty thousand tumor and normal samples processed using the exact same genomics pipeline and therefore can be compared to each other. Treehouse contains many of the same samples from TCGA and TARGET as Toil which we can use to verify our conversion. It also includes unique samples (all prefixed with TH or TR) which we can use as a hold-out set.
Each of the source data set consists of a float vector, log2(TPM+0.001) in the case of TCGA+TARGET+GTEX or log2(TPM+1.0) in the case of Treehouse normalized, of gene expression for each of ~60k genes. Toil expression is labeled using Ensembl gene ids vs. Treehouse which uses Hugo. Associated with these data is clinical information on each sample such as type (tumor vs. normal), disease, primary site (where the sample came from in the human body) etc... We use this information to label the samples normal/0 vs. tumor/1 as well as to provide additional information for visualization and interpretation of models.
In [1]:
import os
import requests
import numpy as np
import pandas as pd
import h5py
if not os.path.exists("data"):
os.makedirs("data")
Download expression data files from Xena and save in an hdf5 file. This can take around 30 minutes each between the download and the conversion from tsv into float32 dataframes. We download manually vs. passing read_csv a url directly as the latter times out with this size file.
In [2]:
%%time
if not os.path.exists("data/TcgaTargetGtex_rsem_gene_tpm.gz"):
print("Downloading TCGA, TARGET and GTEX expression data from UCSC Xena")
r = requests.get(
"https://toil.xenahubs.net/download/TcgaTargetGtex_rsem_gene_tpm.gz",
stream=True)
response.raise_for_status()
with open("data/TcgaTargetGtex_rsem_gene_tpm.gz", "wb") as f:
for chunk in r.iter_content(chunk_size=32768):
f.write(chunk)
if not os.path.exists("data/TcgaTargetGtex_rsem_gene_tpm.h5"):
print("Converting expression to dataframe and storing in hdf5 file")
pd.read_csv("data/TcgaTargetGtex_rsem_gene_tpm.gz", sep="\t", index_col=0) \
.astype(np.float32).to_hdf("data/TcgaTargetGtex_rsem_gene_tpm.h5",
"expression", mode="w", format="fixed")
tcga_target_gtex_expression = pd.read_hdf(
"data/TcgaTargetGtex_rsem_gene_tpm.h5", "expression").dropna(
axis="index").sort_index(axis="columns")
print("tcga_target_gtex_expression: samples={} genes={}".format(
*tcga_target_gtex_expression.shape))
In [3]:
tcga_target_gtex_expression.head()
Out[3]:
In [4]:
# Verify sum of all expression for a sample in TPM space sums to ~1 million
tcga_target_gtex_expression[["GTEX-146FH-1726-SM-5QGQ2",
"GTEX-WZTO-2926-SM-3NM9I",
"TCGA-ZS-A9CE-01", "TCGA-AB-2965-03"]].apply(
np.exp2).apply(lambda x: x - 0.001).sum()
Out[4]:
Toil's expression values are per Ensembl gene id, which have a one or more to one relationship to Hugo gene names so we need to convert back into TPM, average (or add?), and then convert back to log2(tpm+0.001). We're using an assembled table from John Vivian @ UCSC here. Another option would be ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
In [5]:
if not os.path.exists("data/ensembl_to_hugo.tsv"):
with open("data/ensembl_to_hugo.tsv", "wb") as f:
f.write(requests.get("https://github.com/jvivian/docker_tools/blob/master/gencode_hugo_mapping/attrs.tsv?raw=true").content)
ensemble_to_hugo = pd.read_table(
"data/ensembl_to_hugo.tsv",index_col=0).sort_index(axis="index")
# Remove duplicates
ensemble_to_hugo = ensemble_to_hugo[~ensemble_to_hugo.index.duplicated(keep='first')]
ensemble_to_hugo.head()
Out[5]:
In [6]:
# Create a new data frame replacing the ensembl based index with hugo dropping any where there is no conversion
tcga_target_gtex_expression_hugo = tcga_target_gtex_expression.copy()
tcga_target_gtex_expression_hugo.index = ensemble_to_hugo.reindex(tcga_target_gtex_expression.index).geneName.values
tcga_target_gtex_expression_hugo = tcga_target_gtex_expression_hugo[tcga_target_gtex_expression_hugo.index.notnull()]
tcga_target_gtex_expression_hugo.head()
Out[6]:
In [7]:
%%time
# Multiple Ensemble genes map to the same Hugo name. Each of these values has been normalized via log2(TPM+0.001)
# so we convert back into TPM, sum, and re-normalize.
tcga_target_gtex_expression_hugo_tpm = tcga_target_gtex_expression_hugo \
.apply(np.exp2).subtract(0.001).groupby(level=0).aggregate(np.sum).add(0.001).apply(np.log2)
In [8]:
tcga_target_gtex_expression_hugo_tpm.head()
Out[8]:
In [9]:
# Verify that the sum of all expression levels for a sample in TPM space still sum to ~1 million
tcga_target_gtex_expression_hugo[
[
"GTEX-146FH-1726-SM-5QGQ2", "GTEX-WZTO-2926-SM-3NM9I", "TCGA-AB-2965-03"
]].apply(np.exp2).apply(lambda x: x - 0.001).sum()
Out[9]:
In [10]:
%%time
if not os.path.exists("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz"):
print("Downloading Treehouse Public Compendium")
r = requests.get("https://treehouse.xenahubs.net/download/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz",
stream=True)
r.raise_for_status()
with open("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz", "wb") as f:
for chunk in r.iter_content(chunk_size=32768):
f.write(chunk)
if not os.path.exists("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.h5"):
print("Converting expression to dataframe and storing in hdf5 file")
pd.read_csv("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz", sep="\t", index_col=0) \
.astype(np.float32).to_hdf("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.h5",
"expression", mode="w", format="fixed")
treehouse_expression = pd.read_hdf("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.h5", "expression")
print("treehouse_expression: samples={} genes={}".format(*treehouse_expression.shape))
In [11]:
# Check that we don't have any null/nan at this point
assert not tcga_target_gtex_expression_hugo_tpm.isnull().values.any()
assert not treehouse_expression.isnull().values.any()
# Make sure they have identical hugo gene indexes
assert np.array_equal(tcga_target_gtex_expression_hugo_tpm.index, treehouse_expression.index)
In [12]:
# Convert into log2(tpm+0.001)
treehouse_expression_hugo_tpm = treehouse_expression.apply(np.exp2).subtract(1.0).add(0.001).apply(np.log2)
The current public Treehouse compendium was created by combining expression values that map to the same Hugo gene identify by calculating the mean of their log2(tpm+1) values. As a result those values will not match perfectly with the same samples in the TCGA+TARGET+GTEX dataset. The next public compendium from Treehouse will calculate mean in TPM space. Continue on here but later we need to come back and update this - or calculate things the right way for the TH and TR samples from the raw data.
In [13]:
# Check to verify the TCGA+TARGET samples in the Treehouse compendium
# match TPM wise with our conversions above
sample_id = "TCGA-ZQ-A9CR-01"
np.allclose(tcga_target_gtex_expression_hugo_tpm[sample_id],
treehouse_expression_hugo_tpm[sample_id], 1, 1)
argmax = (tcga_target_gtex_expression_hugo_tpm[sample_id]
- treehouse_expression_hugo_tpm[sample_id]).values.argmax()
gene = tcga_target_gtex_expression_hugo_tpm.index[argmax]
print("Gene with maximum delta:", gene,
tcga_target_gtex_expression_hugo_tpm[sample_id][gene]
- treehouse_expression_hugo_tpm[sample_id][gene])
(tcga_target_gtex_expression_hugo_tpm[sample_id]
- treehouse_expression_hugo_tpm[sample_id]).describe()
Out[13]:
In [14]:
# Read in the sample labels from Xena ie clinical/phenotype information on each sample
if not os.path.exists("data/TcgaTargetGTEX_phenotype.txt.gz"):
with open("data/TcgaTargetGTEX_phenotype.txt.gz", "wb") as f:
f.write(requests.get("https://toil.xenahubs.net/download/TcgaTargetGTEX_phenotype.txt.gz").content)
tcga_target_gtex_labels = pd.read_table(
"data/TcgaTargetGTEX_phenotype.txt.gz", compression="gzip",
header=0, names=["id", "category", "disease", "primary_site", "sample_type", "gender", "study"],
sep="\t", encoding="ISO-8859-1", index_col=0, dtype="str").sort_index(axis="index")
# Compute and add a tumor/normal column - TCGA and TARGET have some normal samples, GTEX is all normal.
tcga_target_gtex_labels["tumor_normal"] = tcga_target_gtex_labels.apply(
lambda row: "Normal" if row["sample_type"] in ["Cell Line", "Normal Tissue", "Solid Tissue Normal"]
else "Tumor", axis=1)
tcga_target_gtex_labels.head()
Out[14]:
In [15]:
# Remove rows where we have no labels or the primary_site label is null
intersection = tcga_target_gtex_expression_hugo_tpm.columns.intersection(
tcga_target_gtex_labels[pd.notnull(tcga_target_gtex_labels["primary_site"])].index)
tcga_target_gtex_expression_common = tcga_target_gtex_expression_hugo_tpm.loc[:, tcga_target_gtex_expression_hugo_tpm.columns.isin(intersection)]
tcga_target_gtex_labels_common = tcga_target_gtex_labels[tcga_target_gtex_labels.index.isin(intersection)]
# Make sure the label and example samples are in the same order
assert(tcga_target_gtex_expression_common.columns.equals(tcga_target_gtex_labels_common.index))
In [16]:
# Read in the sample labels from Xena ie clinical/phenotype information on each sample
if not os.path.exists("data/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz"):
with open("data/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz", "wb") as f:
f.write(requests.get("https://treehouse.xenahubs.net/download/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz").content)
treehouse_labels = pd.read_table(
"data/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz", compression="gzip",
header=0, names=["id", "age_in_years", "gender", "disease"],
sep="\t", encoding="ISO-8859-1", index_col=0, dtype="str").sort_index(axis=0)
# Title case to match TARGET+TCGA+GTEX
treehouse_labels["disease"] = treehouse_labels["disease"].str.title()
treehouse_labels["gender"] = treehouse_labels["gender"].str.title()
treehouse_labels.head()
Out[16]:
In [17]:
# Treehouse samples are prefixed with TH (prospective patient) or THR (pediatric research study)
# so filter for only these excluding the TCGA and TARGET that are already in our other dataset
treehouse_labels_pruned = treehouse_labels.filter(regex='\ATH|\ATHR', axis="index")
In [18]:
# See how many disease labels overlap
set(tcga_target_gtex_labels.disease).intersection(treehouse_labels_pruned.disease)
Out[18]:
In [19]:
# Subset treeshouse expression to those not in TCGA+TARGET+GTEX
treehouse_expression_hugo_tpm_pruned = \
treehouse_expression_hugo_tpm.loc[:, treehouse_labels_pruned.index]
print(treehouse_expression_hugo_tpm_pruned.shape)
assert(tcga_target_gtex_expression_common.index.equals(
treehouse_expression_hugo_tpm_pruned.index))
In [20]:
# Export h5 format files
with pd.HDFStore("data/tcga_target_gtex.h5", "w") as store:
store["expression"] = tcga_target_gtex_expression_common.T.sort_index(axis="columns")
store["labels"] = tcga_target_gtex_labels_common.astype(str)
with pd.HDFStore("data/treehouse.h5", "w") as store:
store["expression"] = treehouse_expression_hugo_tpm_pruned.T.sort_index(axis="columns")
store["labels"] = treehouse_labels_pruned.astype(str)