Ingest Gene Expression and Clinical Data from TCGA+TARGET+GTEX and Treehouse

Download gene expression and clinical data from the UCSC Xena Toil re-compute dataset and the Treehouse Childhood Cancer Initiative, wrangle, and store in an hdf5 file for quick loading machine learning. This dataset comprises gene expression data for twenty thousand tumor and normal samples processed using the exact same genomics pipeline and therefore can be compared to each other. Treehouse contains many of the same samples from TCGA and TARGET as Toil which we can use to verify our conversion. It also includes unique samples (all prefixed with TH or TR) which we can use as a hold-out set.

Each of the source data set consists of a float vector, log2(TPM+0.001) in the case of TCGA+TARGET+GTEX or log2(TPM+1.0) in the case of Treehouse normalized, of gene expression for each of ~60k genes. Toil expression is labeled using Ensembl gene ids vs. Treehouse which uses Hugo. Associated with these data is clinical information on each sample such as type (tumor vs. normal), disease, primary site (where the sample came from in the human body) etc... We use this information to label the samples normal/0 vs. tumor/1 as well as to provide additional information for visualization and interpretation of models.


In [1]:
import os
import requests
import numpy as np
import pandas as pd
import h5py

if not os.path.exists("data"):
    os.makedirs("data")


/opt/conda/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

Download TGCA+TARGET+GTEX Expression

Download expression data files from Xena and save in an hdf5 file. This can take around 30 minutes each between the download and the conversion from tsv into float32 dataframes. We download manually vs. passing read_csv a url directly as the latter times out with this size file.


In [2]:
%%time
if not os.path.exists("data/TcgaTargetGtex_rsem_gene_tpm.gz"):
    print("Downloading TCGA, TARGET and GTEX expression data from UCSC Xena")
    r = requests.get(
        "https://toil.xenahubs.net/download/TcgaTargetGtex_rsem_gene_tpm.gz", 
        stream=True)
    response.raise_for_status()
    with open("data/TcgaTargetGtex_rsem_gene_tpm.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=32768):
            f.write(chunk)

if not os.path.exists("data/TcgaTargetGtex_rsem_gene_tpm.h5"):
    print("Converting expression to dataframe and storing in hdf5 file")
    pd.read_csv("data/TcgaTargetGtex_rsem_gene_tpm.gz", sep="\t", index_col=0) \
        .astype(np.float32).to_hdf("data/TcgaTargetGtex_rsem_gene_tpm.h5", 
                                   "expression", mode="w", format="fixed")

tcga_target_gtex_expression = pd.read_hdf(
    "data/TcgaTargetGtex_rsem_gene_tpm.h5", "expression").dropna(
    axis="index").sort_index(axis="columns")
print("tcga_target_gtex_expression: samples={} genes={}".format(
    *tcga_target_gtex_expression.shape))


tcga_target_gtex_expression: samples=60498 genes=19260
CPU times: user 6.79 s, sys: 9.09 s, total: 15.9 s
Wall time: 18.2 s

In [3]:
tcga_target_gtex_expression.head()


Out[3]:
GTEX-1117F-0226-SM-5GZZ7 GTEX-1117F-0426-SM-5EGHI GTEX-1117F-0526-SM-5EGHJ GTEX-1117F-0626-SM-5N9CS GTEX-1117F-0726-SM-5GIEN GTEX-1117F-1326-SM-5EGHH GTEX-1117F-2226-SM-5N9CH GTEX-1117F-2426-SM-5EGGH GTEX-1117F-2826-SM-5GZXL GTEX-1117F-3026-SM-5GZYU ... TCGA-ZR-A9CJ-01 TCGA-ZS-A9CD-01 TCGA-ZS-A9CE-01 TCGA-ZS-A9CF-01 TCGA-ZS-A9CF-02 TCGA-ZS-A9CG-01 TCGA-ZT-A8OM-01 TCGA-ZU-A8S4-01 TCGA-ZU-A8S4-11 TCGA-ZX-AA5X-01
sample
ENSG00000242268.2 -9.9658 -9.9658 -9.9658 -1.2481 -3.8160 -1.7809 -9.9658 -9.9658 -3.6259 -9.9658 ... -9.9658 -4.6082 -9.9658 -9.9658 -4.6082 -9.9658 -3.6259 -9.9658 -9.9658 -9.9658
ENSG00000259041.1 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 ... -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658
ENSG00000270112.3 -4.2934 0.0014 -9.9658 -5.5735 0.3573 -9.9658 -6.5064 -5.0116 -9.9658 -5.0116 ... -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -6.5064
ENSG00000167578.16 5.1190 4.1277 4.4067 5.6860 4.0357 4.6849 4.5009 5.3954 4.9402 5.4683 ... 4.1780 4.5547 3.6737 4.9331 3.6254 3.7646 5.5201 5.4216 3.3647 4.7991
ENSG00000278814.1 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 ... -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658

5 rows × 19260 columns


In [4]:
# Verify sum of all expression  for a sample in TPM space sums to ~1 million
tcga_target_gtex_expression[["GTEX-146FH-1726-SM-5QGQ2", 
                             "GTEX-WZTO-2926-SM-3NM9I", 
                             "TCGA-ZS-A9CE-01", "TCGA-AB-2965-03"]].apply(
    np.exp2).apply(lambda x: x - 0.001).sum()


Out[4]:
GTEX-146FH-1726-SM-5QGQ2    1.000001e+06
GTEX-WZTO-2926-SM-3NM9I     9.999974e+05
TCGA-ZS-A9CE-01             1.000000e+06
TCGA-AB-2965-03             9.999970e+05
dtype: float64

Covert Ensembl to Hugo

Toil's expression values are per Ensembl gene id, which have a one or more to one relationship to Hugo gene names so we need to convert back into TPM, average (or add?), and then convert back to log2(tpm+0.001). We're using an assembled table from John Vivian @ UCSC here. Another option would be ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt


In [5]:
if not os.path.exists("data/ensembl_to_hugo.tsv"):
    with open("data/ensembl_to_hugo.tsv", "wb") as f:
        f.write(requests.get("https://github.com/jvivian/docker_tools/blob/master/gencode_hugo_mapping/attrs.tsv?raw=true").content)
ensemble_to_hugo = pd.read_table(
    "data/ensembl_to_hugo.tsv",index_col=0).sort_index(axis="index")

# Remove duplicates
ensemble_to_hugo = ensemble_to_hugo[~ensemble_to_hugo.index.duplicated(keep='first')]
ensemble_to_hugo.head()


Out[5]:
geneName geneType geneStatus transcriptId transcriptName transcriptType transcriptStatus havanaGeneId havanaTranscriptId ccdsId level transcriptClass
geneId
ENSG00000000003.14 TSPAN6 protein_coding KNOWN ENST00000612152.4 TSPAN6-201 protein_coding KNOWN OTTHUMG00000022002.1 NaN CCDS76001.1 3 coding
ENSG00000000005.5 TNMD protein_coding KNOWN ENST00000373031.4 TNMD-001 protein_coding KNOWN OTTHUMG00000022001.1 OTTHUMT00000057481.1 CCDS14469.1 2 coding
ENSG00000000419.12 DPM1 protein_coding KNOWN ENST00000371582.8 DPM1-005 protein_coding KNOWN OTTHUMG00000032742.2 OTTHUMT00000079720.2 NaN 2 coding
ENSG00000000457.13 SCYL3 protein_coding KNOWN ENST00000470238.1 SCYL3-004 processed_transcript KNOWN OTTHUMG00000035941.4 OTTHUMT00000087552.1 NaN 2 nonCoding
ENSG00000000460.16 C1orf112 protein_coding KNOWN ENST00000466580.6 C1orf112-008 processed_transcript KNOWN OTTHUMG00000035821.7 OTTHUMT00000087524.1 NaN 2 nonCoding

In [6]:
# Create a new data frame replacing the ensembl based index with hugo dropping any where there is no conversion
tcga_target_gtex_expression_hugo = tcga_target_gtex_expression.copy()
tcga_target_gtex_expression_hugo.index = ensemble_to_hugo.reindex(tcga_target_gtex_expression.index).geneName.values
tcga_target_gtex_expression_hugo = tcga_target_gtex_expression_hugo[tcga_target_gtex_expression_hugo.index.notnull()]
tcga_target_gtex_expression_hugo.head()


Out[6]:
GTEX-1117F-0226-SM-5GZZ7 GTEX-1117F-0426-SM-5EGHI GTEX-1117F-0526-SM-5EGHJ GTEX-1117F-0626-SM-5N9CS GTEX-1117F-0726-SM-5GIEN GTEX-1117F-1326-SM-5EGHH GTEX-1117F-2226-SM-5N9CH GTEX-1117F-2426-SM-5EGGH GTEX-1117F-2826-SM-5GZXL GTEX-1117F-3026-SM-5GZYU ... TCGA-ZR-A9CJ-01 TCGA-ZS-A9CD-01 TCGA-ZS-A9CE-01 TCGA-ZS-A9CF-01 TCGA-ZS-A9CF-02 TCGA-ZS-A9CG-01 TCGA-ZT-A8OM-01 TCGA-ZU-A8S4-01 TCGA-ZU-A8S4-11 TCGA-ZX-AA5X-01
RP11-368I23.2 -9.9658 -9.9658 -9.9658 -1.2481 -3.8160 -1.7809 -9.9658 -9.9658 -3.6259 -9.9658 ... -9.9658 -4.6082 -9.9658 -9.9658 -4.6082 -9.9658 -3.6259 -9.9658 -9.9658 -9.9658
RP11-167B3.1 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 ... -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658
RP11-742D12.2 -4.2934 0.0014 -9.9658 -5.5735 0.3573 -9.9658 -6.5064 -5.0116 -9.9658 -5.0116 ... -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -6.5064
RAB4B 5.1190 4.1277 4.4067 5.6860 4.0357 4.6849 4.5009 5.3954 4.9402 5.4683 ... 4.1780 4.5547 3.6737 4.9331 3.6254 3.7646 5.5201 5.4216 3.3647 4.7991
AC104071.1 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 ... -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658 -9.9658

5 rows × 19260 columns


In [7]:
%%time
# Multiple Ensemble genes map to the same Hugo name. Each of these values has been normalized via log2(TPM+0.001)
# so we convert back into TPM, sum, and re-normalize.
tcga_target_gtex_expression_hugo_tpm = tcga_target_gtex_expression_hugo \
    .apply(np.exp2).subtract(0.001).groupby(level=0).aggregate(np.sum).add(0.001).apply(np.log2)


CPU times: user 1min 11s, sys: 1min 4s, total: 2min 16s
Wall time: 2min 4s

In [8]:
tcga_target_gtex_expression_hugo_tpm.head()


Out[8]:
GTEX-1117F-0226-SM-5GZZ7 GTEX-1117F-0426-SM-5EGHI GTEX-1117F-0526-SM-5EGHJ GTEX-1117F-0626-SM-5N9CS GTEX-1117F-0726-SM-5GIEN GTEX-1117F-1326-SM-5EGHH GTEX-1117F-2226-SM-5N9CH GTEX-1117F-2426-SM-5EGGH GTEX-1117F-2826-SM-5GZXL GTEX-1117F-3026-SM-5GZYU ... TCGA-ZR-A9CJ-01 TCGA-ZS-A9CD-01 TCGA-ZS-A9CE-01 TCGA-ZS-A9CF-01 TCGA-ZS-A9CF-02 TCGA-ZS-A9CG-01 TCGA-ZT-A8OM-01 TCGA-ZU-A8S4-01 TCGA-ZU-A8S4-11 TCGA-ZX-AA5X-01
5S_rRNA -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 ... -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 -9.966042 -9.966042
5_8S_rRNA -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 ... -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 -9.965816 -9.965816
7SK -9.965880 -9.965880 -9.965880 -9.965880 -0.833902 -9.965880 -9.965880 0.576259 0.334600 -0.886300 ... -9.965880 -9.965880 -9.965880 -9.965880 -9.965880 -9.965880 -9.965880 -9.965880 -9.965880 -9.965880
A1BG 4.459500 1.151200 5.241100 5.475800 4.553400 4.622400 4.502800 5.613000 4.447600 4.730200 ... 3.646300 11.151800 11.893000 8.946600 7.634700 9.145400 7.642600 3.599400 10.959300 3.593500
A1BG-AS1 0.934300 -1.282800 0.848800 2.632500 1.305100 1.551400 0.864700 3.114500 2.176600 2.459700 ... 1.623400 0.556800 0.903800 -0.471900 -1.318300 -1.552200 2.508700 -0.452100 -1.430500 1.803600

5 rows × 19260 columns


In [9]:
# Verify that the sum of all expression levels for a sample in TPM space still sum to ~1 million
tcga_target_gtex_expression_hugo[
    [
        "GTEX-146FH-1726-SM-5QGQ2", "GTEX-WZTO-2926-SM-3NM9I", "TCGA-AB-2965-03"
    ]].apply(np.exp2).apply(lambda x: x - 0.001).sum()


Out[9]:
GTEX-146FH-1726-SM-5QGQ2    1.000001e+06
GTEX-WZTO-2926-SM-3NM9I     9.999974e+05
TCGA-AB-2965-03             9.999970e+05
dtype: float64

Download Treehouse Expression

The Treehouse public compendium is in Hugo log2(tpm+1). We need to download and convert into lot2(tpm+0.001) to match our TCGA+TARGET+GTEXt dataset above.


In [10]:
%%time
if not os.path.exists("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz"):
    print("Downloading Treehouse Public Compendium")
    r = requests.get("https://treehouse.xenahubs.net/download/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz",
                     stream=True)
    r.raise_for_status()
    with open("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=32768):
            f.write(chunk)

if not os.path.exists("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.h5"):
    print("Converting expression to dataframe and storing in hdf5 file")
    pd.read_csv("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz", sep="\t", index_col=0) \
    .astype(np.float32).to_hdf("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.h5", 
                               "expression", mode="w", format="fixed")

treehouse_expression = pd.read_hdf("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.h5", "expression")
print("treehouse_expression: samples={} genes={}".format(*treehouse_expression.shape))


treehouse_expression: samples=58581 genes=11078
CPU times: user 179 ms, sys: 2.08 s, total: 2.26 s
Wall time: 4.11 s

In [11]:
# Check that we don't have any null/nan at this point
assert not tcga_target_gtex_expression_hugo_tpm.isnull().values.any()
assert not treehouse_expression.isnull().values.any()

# Make sure they have identical hugo gene indexes
assert np.array_equal(tcga_target_gtex_expression_hugo_tpm.index, treehouse_expression.index)

In [12]:
# Convert into log2(tpm+0.001)
treehouse_expression_hugo_tpm = treehouse_expression.apply(np.exp2).subtract(1.0).add(0.001).apply(np.log2)

NOTE and REMINDER

The current public Treehouse compendium was created by combining expression values that map to the same Hugo gene identify by calculating the mean of their log2(tpm+1) values. As a result those values will not match perfectly with the same samples in the TCGA+TARGET+GTEX dataset. The next public compendium from Treehouse will calculate mean in TPM space. Continue on here but later we need to come back and update this - or calculate things the right way for the TH and TR samples from the raw data.


In [13]:
# Check to verify the TCGA+TARGET samples in the Treehouse compendium 
# match TPM wise with our conversions above
sample_id = "TCGA-ZQ-A9CR-01"

np.allclose(tcga_target_gtex_expression_hugo_tpm[sample_id], 
            treehouse_expression_hugo_tpm[sample_id], 1, 1)

argmax = (tcga_target_gtex_expression_hugo_tpm[sample_id] 
          - treehouse_expression_hugo_tpm[sample_id]).values.argmax()
gene = tcga_target_gtex_expression_hugo_tpm.index[argmax]
print("Gene with maximum delta:", gene,
      tcga_target_gtex_expression_hugo_tpm[sample_id][gene] 
      - treehouse_expression_hugo_tpm[sample_id][gene])

(tcga_target_gtex_expression_hugo_tpm[sample_id] 
 - treehouse_expression_hugo_tpm[sample_id]).describe()


Gene with maximum delta: Metazoa_SRP 13.833809
Out[13]:
count    58581.000000
mean         0.004649
std          0.143267
min         -0.013058
25%         -0.000108
50%         -0.000016
75%         -0.000016
max         13.833809
Name: TCGA-ZQ-A9CR-01, dtype: float64

Download Labels and Conform to Expression Indexes


In [14]:
# Read in the sample labels from Xena ie clinical/phenotype information on each sample
if not os.path.exists("data/TcgaTargetGTEX_phenotype.txt.gz"):
    with open("data/TcgaTargetGTEX_phenotype.txt.gz", "wb") as f:
        f.write(requests.get("https://toil.xenahubs.net/download/TcgaTargetGTEX_phenotype.txt.gz").content)

tcga_target_gtex_labels = pd.read_table(
    "data/TcgaTargetGTEX_phenotype.txt.gz", compression="gzip", 
    header=0, names=["id", "category", "disease", "primary_site", "sample_type", "gender", "study"],
    sep="\t", encoding="ISO-8859-1", index_col=0, dtype="str").sort_index(axis="index")

# Compute and add a tumor/normal column - TCGA and TARGET have some normal samples, GTEX is all normal.
tcga_target_gtex_labels["tumor_normal"] = tcga_target_gtex_labels.apply(
    lambda row: "Normal" if row["sample_type"] in ["Cell Line", "Normal Tissue", "Solid Tissue Normal"]
    else "Tumor", axis=1)

tcga_target_gtex_labels.head()


Out[14]:
category disease primary_site sample_type gender study tumor_normal
id
GTEX-1117F-0226-SM-5GZZ7 Adipose - Subcutaneous Adipose - Subcutaneous Adipose Tissue Normal Tissue Female GTEX Normal
GTEX-1117F-0426-SM-5EGHI Muscle - Skeletal Muscle - Skeletal Muscle Normal Tissue Female GTEX Normal
GTEX-1117F-0526-SM-5EGHJ Artery - Tibial Artery - Tibial Blood Vessel Normal Tissue Female GTEX Normal
GTEX-1117F-0626-SM-5N9CS Artery - Coronary Artery - Coronary Blood Vessel Normal Tissue Female GTEX Normal
GTEX-1117F-0726-SM-5GIEN Heart - Atrial Appendage Heart - Atrial Appendage Heart Normal Tissue Female GTEX Normal

In [15]:
# Remove rows where we have no labels or the primary_site label is null
intersection = tcga_target_gtex_expression_hugo_tpm.columns.intersection(
    tcga_target_gtex_labels[pd.notnull(tcga_target_gtex_labels["primary_site"])].index)
tcga_target_gtex_expression_common = tcga_target_gtex_expression_hugo_tpm.loc[:, tcga_target_gtex_expression_hugo_tpm.columns.isin(intersection)]
tcga_target_gtex_labels_common = tcga_target_gtex_labels[tcga_target_gtex_labels.index.isin(intersection)]

# Make sure the label and example samples are in the same order
assert(tcga_target_gtex_expression_common.columns.equals(tcga_target_gtex_labels_common.index))

In [16]:
# Read in the sample labels from Xena ie clinical/phenotype information on each sample
if not os.path.exists("data/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz"):
    with open("data/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz", "wb") as f:
        f.write(requests.get("https://treehouse.xenahubs.net/download/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz").content)

treehouse_labels = pd.read_table(
    "data/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz", compression="gzip", 
    header=0, names=["id", "age_in_years", "gender", "disease"],
    sep="\t", encoding="ISO-8859-1", index_col=0, dtype="str").sort_index(axis=0)

# Title case to match TARGET+TCGA+GTEX
treehouse_labels["disease"] = treehouse_labels["disease"].str.title()
treehouse_labels["gender"] = treehouse_labels["gender"].str.title()

treehouse_labels.head()


Out[16]:
age_in_years gender disease
id
TARGET-10-PAKSWW-03 15.11 Male Acute Lymphoblastic Leukemia
TARGET-10-PAMXHJ-09 6.08 Male Acute Lymphoblastic Leukemia
TARGET-10-PAMXSP-09 3.35 Male Acute Lymphoblastic Leukemia
TARGET-10-PANCVR-03 6.4 Male Acute Lymphoblastic Leukemia
TARGET-10-PANCVR-04 6.4 Male Acute Lymphoblastic Leukemia

In [17]:
# Treehouse samples are prefixed with TH (prospective patient) or THR (pediatric research study)
# so filter for only these excluding the TCGA and TARGET that are already in our other dataset
treehouse_labels_pruned = treehouse_labels.filter(regex='\ATH|\ATHR', axis="index")

In [18]:
# See how many disease labels overlap
set(tcga_target_gtex_labels.disease).intersection(treehouse_labels_pruned.disease)


Out[18]:
{'Acute Lymphoblastic Leukemia',
 'Acute Myeloid Leukemia',
 'Glioblastoma Multiforme',
 'Lung Adenocarcinoma',
 'Neuroblastoma',
 'Sarcoma',
 'Thyroid Carcinoma',
 'Wilms Tumor'}

In [19]:
# Subset treeshouse expression to those not in TCGA+TARGET+GTEX
treehouse_expression_hugo_tpm_pruned = \
    treehouse_expression_hugo_tpm.loc[:, treehouse_labels_pruned.index]
print(treehouse_expression_hugo_tpm_pruned.shape)
assert(tcga_target_gtex_expression_common.index.equals(
    treehouse_expression_hugo_tpm_pruned.index))


(58581, 549)

Export

Write expression and labels for both datasets out to hdf5 files wrangles and in machine learning format of rows = samples


In [20]:
# Export h5 format files
with pd.HDFStore("data/tcga_target_gtex.h5", "w") as store:
    store["expression"] = tcga_target_gtex_expression_common.T.sort_index(axis="columns")
    store["labels"] = tcga_target_gtex_labels_common.astype(str)
    
with pd.HDFStore("data/treehouse.h5", "w") as store:
    store["expression"] = treehouse_expression_hugo_tpm_pruned.T.sort_index(axis="columns")
    store["labels"] = treehouse_labels_pruned.astype(str)