Ingest Gene Expression and Clinical Data from TCGA+TARGET+GTEX and Treehouse

Download gene expression and clinical data from the UCSC Xena Toil recompute dataset and the Treehouse Childhood Cancer Initiative, wrangle it, and store it in an hdf5 file for quick loading into machine learning workflows. The Toil dataset comprises gene expression data for roughly twenty thousand tumor and normal samples, all processed with the same genomics pipeline and therefore directly comparable to each other. Treehouse contains many of the same TCGA and TARGET samples as Toil, which we can use to verify our conversion. It also includes unique samples (all prefixed with TH or THR), which we can use as a hold-out set.

Each source dataset consists of a float vector of gene expression values for each of ~60k genes per sample, normalized as log2(TPM+0.001) in the case of TCGA+TARGET+GTEX or log2(TPM+1.0) in the case of Treehouse. Toil expression is labeled with Ensembl gene ids, whereas Treehouse uses Hugo gene names. Associated with these data is clinical information on each sample, such as type (tumor vs. normal), disease, and primary site (where in the human body the sample came from). We use this information to label the samples normal/0 vs. tumor/1 and to provide additional context for visualizing and interpreting models.
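
As a small illustration of the two normalizations described above, the sketch below shows where an unexpressed gene (0 TPM) lands under each pseudocount. The helper name `log2_tpm` is made up for this example and is not part of either dataset's tooling. It helps explain why -9.9658 appears so often in the Toil tables.

```python
import math

def log2_tpm(tpm, pseudocount):
    """Return log2(tpm + pseudocount) for a single TPM value."""
    return math.log2(tpm + pseudocount)

# An unexpressed gene (0 TPM) lands on a different floor in each dataset.
toil_floor = log2_tpm(0.0, 0.001)      # TCGA+TARGET+GTEX convention
treehouse_floor = log2_tpm(0.0, 1.0)   # Treehouse convention

print(round(toil_floor, 4))   # -9.9658
print(treehouse_floor)        # 0.0
```

Because the floors differ, values from the two sources should only be compared after they are brought onto the same pseudocount.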



In [1]:

    
import os
import requests
import numpy as np
import pandas as pd
import h5py

if not os.path.exists("data"):
    os.makedirs("data")









    



/opt/conda/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

Download TCGA+TARGET+GTEX Expression

Download the expression data files from Xena and save them in an hdf5 file. Between the download and the conversion from tsv into float32 dataframes, this can take around 30 minutes per dataset. We download the file manually rather than passing a url directly to read_csv, because the latter times out on a file of this size.
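
One weakness of a chunked manual download is that an interrupted transfer can leave a partial file behind, which a later `os.path.exists` check would treat as complete. A minimal sketch of one way to guard against that, assuming the chunks come from something like `r.iter_content(chunk_size=32768)`: write to a temporary file first and rename it only when every chunk has been written. The helper `save_chunks` is hypothetical and not part of this notebook.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path, via a temporary file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)
        os.replace(tmp_path, path)  # the final name appears only on success
    except BaseException:
        os.remove(tmp_path)  # do not leave a partial file behind
        raise

# Demonstration with in-memory chunks instead of a network response.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.bin")
save_chunks([b"abc", b"", b"def"], demo_path)
with open(demo_path, "rb") as f:
    demo_bytes = f.read()
print(demo_bytes)  # b'abcdef'
```

With a real response object the call would look like `save_chunks(r.iter_content(chunk_size=32768), "data/TcgaTargetGtex_rsem_gene_tpm.gz")`.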



In [2]:

    
%%time
if not os.path.exists("data/TcgaTargetGtex_rsem_gene_tpm.gz"):
    print("Downloading TCGA, TARGET and GTEX expression data from UCSC Xena")
    r = requests.get(
        "https://toil.xenahubs.net/download/TcgaTargetGtex_rsem_gene_tpm.gz", 
        stream=True)
    r.raise_for_status()
    with open("data/TcgaTargetGtex_rsem_gene_tpm.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=32768):
            f.write(chunk)

if not os.path.exists("data/TcgaTargetGtex_rsem_gene_tpm.h5"):
    print("Converting expression to dataframe and storing in hdf5 file")
    pd.read_csv("data/TcgaTargetGtex_rsem_gene_tpm.gz", sep="\t", index_col=0) \
        .astype(np.float32).to_hdf("data/TcgaTargetGtex_rsem_gene_tpm.h5",
                                   key="expression", mode="w", format="fixed")

tcga_target_gtex_expression = pd.read_hdf(
    "data/TcgaTargetGtex_rsem_gene_tpm.h5", "expression").dropna(
    axis="index").sort_index(axis="columns")
# The frame is genes (rows) x samples (columns), so shape unpacks in that order
print("tcga_target_gtex_expression: genes={} samples={}".format(
    *tcga_target_gtex_expression.shape))



tcga_target_gtex_expression: genes=60498 samples=19260
CPU times: user 6.79 s, sys: 9.09 s, total: 15.9 s
Wall time: 18.2 s



In [3]:

    
tcga_target_gtex_expression.head()









    Out[3]:







  
    
      
sample	GTEX-1117F-0226-SM-5GZZ7	GTEX-1117F-0426-SM-5EGHI	GTEX-1117F-0526-SM-5EGHJ	GTEX-1117F-0626-SM-5N9CS	GTEX-1117F-0726-SM-5GIEN	GTEX-1117F-1326-SM-5EGHH	GTEX-1117F-2226-SM-5N9CH	GTEX-1117F-2426-SM-5EGGH	GTEX-1117F-2826-SM-5GZXL	GTEX-1117F-3026-SM-5GZYU	...	TCGA-ZR-A9CJ-01	TCGA-ZS-A9CD-01	TCGA-ZS-A9CE-01	TCGA-ZS-A9CF-01	TCGA-ZS-A9CF-02	TCGA-ZS-A9CG-01	TCGA-ZT-A8OM-01	TCGA-ZU-A8S4-01	TCGA-ZU-A8S4-11	TCGA-ZX-AA5X-01
ENSG00000242268.2	-9.9658	-9.9658	-9.9658	-1.2481	-3.8160	-1.7809	-9.9658	-9.9658	-3.6259	-9.9658	...	-9.9658	-4.6082	-9.9658	-9.9658	-4.6082	-9.9658	-3.6259	-9.9658	-9.9658	-9.9658
ENSG00000259041.1	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658
ENSG00000270112.3	-4.2934	0.0014	-9.9658	-5.5735	0.3573	-9.9658	-6.5064	-5.0116	-9.9658	-5.0116	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-6.5064
ENSG00000167578.16	5.1190	4.1277	4.4067	5.6860	4.0357	4.6849	4.5009	5.3954	4.9402	5.4683	...	4.1780	4.5547	3.6737	4.9331	3.6254	3.7646	5.5201	5.4216	3.3647	4.7991
ENSG00000278814.1	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658

5 rows × 19260 columns



In [4]:

    
# Verify that the expression for a sample sums to ~1 million in TPM space
tcga_target_gtex_expression[["GTEX-146FH-1726-SM-5QGQ2", 
                             "GTEX-WZTO-2926-SM-3NM9I", 
                             "TCGA-ZS-A9CE-01", "TCGA-AB-2965-03"]].apply(
    np.exp2).apply(lambda x: x - 0.001).sum()









    Out[4]:





GTEX-146FH-1726-SM-5QGQ2    1.000001e+06
GTEX-WZTO-2926-SM-3NM9I     9.999974e+05
TCGA-ZS-A9CE-01             1.000000e+06
TCGA-AB-2965-03             9.999970e+05
dtype: float64

Convert Ensembl to Hugo

Toil's expression values are per Ensembl gene id, and one or more Ensembl ids can map to the same Hugo gene name. To combine them we convert back into TPM, sum the values that share a name (averaging would be the alternative), and then convert back to log2(TPM+0.001). We're using a mapping table assembled by John Vivian @ UCSC here. Another option would be ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
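
The round trip described above can be sketched on toy data. The gene names `GENE_X` and `GENE_Y` and the TPM values are invented for illustration. Summing is shown here because TPM values are additive across ids that describe the same gene; this is a sketch of the idea, not a statement about how either source pipeline handles it.

```python
import numpy as np
import pandas as pd

# Toy data: two hypothetical Ensembl ids that map to the same Hugo name.
log_expr = pd.DataFrame(
    {"sample_a": np.log2([3.0 + 0.001, 5.0 + 0.001, 2.0 + 0.001])},
    index=["GENE_X", "GENE_X", "GENE_Y"],
)

tpm = np.exp2(log_expr) - 0.001        # back to TPM
summed = tpm.groupby(level=0).sum()    # combine rows that share a name
relog = np.log2(summed + 0.001)        # back to log2(TPM + 0.001)

print(summed)
print(relog)
```

Summing the log values directly would instead multiply the underlying quantities, which is why the conversion back to TPM comes first.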



In [5]:

    
if not os.path.exists("data/ensembl_to_hugo.tsv"):
    with open("data/ensembl_to_hugo.tsv", "wb") as f:
        f.write(requests.get("https://github.com/jvivian/docker_tools/blob/master/gencode_hugo_mapping/attrs.tsv?raw=true").content)
ensemble_to_hugo = pd.read_table(
    "data/ensembl_to_hugo.tsv",index_col=0).sort_index(axis="index")

# Remove duplicates
ensemble_to_hugo = ensemble_to_hugo[~ensemble_to_hugo.index.duplicated(keep='first')]
ensemble_to_hugo.head()









    Out[5]:







  
    
      
geneId	geneName	geneType	geneStatus	transcriptId	transcriptName	transcriptType	transcriptStatus	havanaGeneId	havanaTranscriptId	ccdsId	level	transcriptClass
ENSG00000000003.14	TSPAN6	protein_coding	KNOWN	ENST00000612152.4	TSPAN6-201	protein_coding	KNOWN	OTTHUMG00000022002.1	NaN	CCDS76001.1	3	coding
ENSG00000000005.5	TNMD	protein_coding	KNOWN	ENST00000373031.4	TNMD-001	protein_coding	KNOWN	OTTHUMG00000022001.1	OTTHUMT00000057481.1	CCDS14469.1	2	coding
ENSG00000000419.12	DPM1	protein_coding	KNOWN	ENST00000371582.8	DPM1-005	protein_coding	KNOWN	OTTHUMG00000032742.2	OTTHUMT00000079720.2	NaN	2	coding
ENSG00000000457.13	SCYL3	protein_coding	KNOWN	ENST00000470238.1	SCYL3-004	processed_transcript	KNOWN	OTTHUMG00000035941.4	OTTHUMT00000087552.1	NaN	2	nonCoding
ENSG00000000460.16	C1orf112	protein_coding	KNOWN	ENST00000466580.6	C1orf112-008	processed_transcript	KNOWN	OTTHUMG00000035821.7	OTTHUMT00000087524.1	NaN	2	nonCoding



In [6]:

    
# Create a new data frame that replaces the Ensembl-based index with Hugo names,
# dropping any rows that have no conversion
tcga_target_gtex_expression_hugo = tcga_target_gtex_expression.copy()
tcga_target_gtex_expression_hugo.index = ensemble_to_hugo.reindex(tcga_target_gtex_expression.index).geneName.values
tcga_target_gtex_expression_hugo = tcga_target_gtex_expression_hugo[tcga_target_gtex_expression_hugo.index.notnull()]
tcga_target_gtex_expression_hugo.head()









    Out[6]:







  
    
      
	GTEX-1117F-0226-SM-5GZZ7	GTEX-1117F-0426-SM-5EGHI	GTEX-1117F-0526-SM-5EGHJ	GTEX-1117F-0626-SM-5N9CS	GTEX-1117F-0726-SM-5GIEN	GTEX-1117F-1326-SM-5EGHH	GTEX-1117F-2226-SM-5N9CH	GTEX-1117F-2426-SM-5EGGH	GTEX-1117F-2826-SM-5GZXL	GTEX-1117F-3026-SM-5GZYU	...	TCGA-ZR-A9CJ-01	TCGA-ZS-A9CD-01	TCGA-ZS-A9CE-01	TCGA-ZS-A9CF-01	TCGA-ZS-A9CF-02	TCGA-ZS-A9CG-01	TCGA-ZT-A8OM-01	TCGA-ZU-A8S4-01	TCGA-ZU-A8S4-11	TCGA-ZX-AA5X-01
RP11-368I23.2	-9.9658	-9.9658	-9.9658	-1.2481	-3.8160	-1.7809	-9.9658	-9.9658	-3.6259	-9.9658	...	-9.9658	-4.6082	-9.9658	-9.9658	-4.6082	-9.9658	-3.6259	-9.9658	-9.9658	-9.9658
RP11-167B3.1	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658
RP11-742D12.2	-4.2934	0.0014	-9.9658	-5.5735	0.3573	-9.9658	-6.5064	-5.0116	-9.9658	-5.0116	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-6.5064
RAB4B	5.1190	4.1277	4.4067	5.6860	4.0357	4.6849	4.5009	5.3954	4.9402	5.4683	...	4.1780	4.5547	3.6737	4.9331	3.6254	3.7646	5.5201	5.4216	3.3647	4.7991
AC104071.1	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658

5 rows × 19260 columns



In [7]:

    
%%time
# Multiple Ensembl genes map to the same Hugo name. Each of these values has been normalized via log2(TPM+0.001)
# so we convert back into TPM, sum, and re-normalize.
tcga_target_gtex_expression_hugo_tpm = tcga_target_gtex_expression_hugo \
    .apply(np.exp2).subtract(0.001).groupby(level=0).aggregate(np.sum).add(0.001).apply(np.log2)









    



CPU times: user 1min 11s, sys: 1min 4s, total: 2min 16s
Wall time: 2min 4s



In [8]:

    
tcga_target_gtex_expression_hugo_tpm.head()









    Out[8]:







  
    
      
	GTEX-1117F-0226-SM-5GZZ7	GTEX-1117F-0426-SM-5EGHI	GTEX-1117F-0526-SM-5EGHJ	GTEX-1117F-0626-SM-5N9CS	GTEX-1117F-0726-SM-5GIEN	GTEX-1117F-1326-SM-5EGHH	GTEX-1117F-2226-SM-5N9CH	GTEX-1117F-2426-SM-5EGGH	GTEX-1117F-2826-SM-5GZXL	GTEX-1117F-3026-SM-5GZYU	...	TCGA-ZR-A9CJ-01	TCGA-ZS-A9CD-01	TCGA-ZS-A9CE-01	TCGA-ZS-A9CF-01	TCGA-ZS-A9CF-02	TCGA-ZS-A9CG-01	TCGA-ZT-A8OM-01	TCGA-ZU-A8S4-01	TCGA-ZU-A8S4-11	TCGA-ZX-AA5X-01
5S_rRNA	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	...	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042	-9.966042
5_8S_rRNA	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	...	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816	-9.965816
7SK	-9.965880	-9.965880	-9.965880	-9.965880	-0.833902	-9.965880	-9.965880	0.576259	0.334600	-0.886300	...	-9.965880	-9.965880	-9.965880	-9.965880	-9.965880	-9.965880	-9.965880	-9.965880	-9.965880	-9.965880
A1BG	4.459500	1.151200	5.241100	5.475800	4.553400	4.622400	4.502800	5.613000	4.447600	4.730200	...	3.646300	11.151800	11.893000	8.946600	7.634700	9.145400	7.642600	3.599400	10.959300	3.593500
A1BG-AS1	0.934300	-1.282800	0.848800	2.632500	1.305100	1.551400	0.864700	3.114500	2.176600	2.459700	...	1.623400	0.556800	0.903800	-0.471900	-1.318300	-1.552200	2.508700	-0.452100	-1.430500	1.803600

5 rows × 19260 columns



In [9]:

    
# Verify that each sample still sums to ~1 million in TPM space after re-indexing by Hugo name
# (this checks tcga_target_gtex_expression_hugo, the frame before duplicate names are summed)
tcga_target_gtex_expression_hugo[
    [
        "GTEX-146FH-1726-SM-5QGQ2", "GTEX-WZTO-2926-SM-3NM9I", "TCGA-AB-2965-03"
    ]].apply(np.exp2).apply(lambda x: x - 0.001).sum()









    Out[9]:





GTEX-146FH-1726-SM-5QGQ2    1.000001e+06
GTEX-WZTO-2926-SM-3NM9I     9.999974e+05
TCGA-AB-2965-03             9.999970e+05
dtype: float64

Download Treehouse Expression

The Treehouse public compendium is in Hugo log2(TPM+1). We need to download it and convert it into log2(TPM+0.001) to match our TCGA+TARGET+GTEX dataset above.
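
The pseudocount change described above can be sketched as a small function. The name `rebase_pseudocount` and the clipping step are assumptions added for this illustration: clipping guards against float round-off pushing a recovered TPM slightly below zero, which the notebook's own conversion does not do.

```python
import numpy as np

def rebase_pseudocount(values, old=1.0, new=0.001):
    """Convert log2(tpm + old) values into log2(tpm + new)."""
    tpm = np.exp2(values) - old
    tpm = np.clip(tpm, 0.0, None)  # guard against tiny negative round-off
    return np.log2(tpm + new)

# Toy TPM values of 0, 1 and 1023 in the Treehouse convention: 0, 1 and 10.
treehouse_style = np.log2(np.array([0.0, 1.0, 1023.0]) + 1.0)
toil_style = rebase_pseudocount(treehouse_style)
print(toil_style)
```

The first value moves from 0 to about -9.9658, the floor used by the TCGA+TARGET+GTEX data, while large values barely change.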



In [10]:

    
%%time
if not os.path.exists("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz"):
    print("Downloading Treehouse Public Compendium")
    r = requests.get("https://treehouse.xenahubs.net/download/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz",
                     stream=True)
    r.raise_for_status()
    with open("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=32768):
            f.write(chunk)

if not os.path.exists("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.h5"):
    print("Converting expression to dataframe and storing in hdf5 file")
    pd.read_csv("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz", sep="\t", index_col=0) \
    .astype(np.float32).to_hdf("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.h5",
                               key="expression", mode="w", format="fixed")

treehouse_expression = pd.read_hdf("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.h5", "expression")
# The frame is genes (rows) x samples (columns), so shape unpacks in that order
print("treehouse_expression: genes={} samples={}".format(*treehouse_expression.shape))



treehouse_expression: genes=58581 samples=11078
CPU times: user 179 ms, sys: 2.08 s, total: 2.26 s
Wall time: 4.11 s



In [11]:

    
# Check that we don't have any null/nan at this point
assert not tcga_target_gtex_expression_hugo_tpm.isnull().values.any()
assert not treehouse_expression.isnull().values.any()

# Make sure they have identical hugo gene indexes
assert np.array_equal(tcga_target_gtex_expression_hugo_tpm.index, treehouse_expression.index)



In [12]:

    
# Convert into log2(tpm+0.001)
treehouse_expression_hugo_tpm = treehouse_expression.apply(np.exp2).subtract(1.0).add(0.001).apply(np.log2)

NOTE and REMINDER

The current public Treehouse compendium was created by combining expression values that map to the same Hugo gene identifier by taking the mean of their log2(TPM+1) values. As a result, those values will not match perfectly with the same samples in the TCGA+TARGET+GTEX dataset. The next public compendium from Treehouse will calculate the mean in TPM space. We continue here, but later we need to come back and update this, or compute the values correctly for the TH and THR samples from the raw data.
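
A short worked example of why the two ways of averaging disagree, using two invented TPM values for identifiers that share one Hugo name. This only illustrates the arithmetic; it does not reproduce the Treehouse pipeline.

```python
import math

# Two hypothetical TPM values for identifiers that share one Hugo name.
tpm_values = [1.0, 7.0]

# Mean taken in log2(tpm+1) space, as the note above describes.
mean_of_logs = sum(math.log2(v + 1.0) for v in tpm_values) / len(tpm_values)

# Mean taken in TPM space, then transformed.
log_of_mean = math.log2(sum(tpm_values) / len(tpm_values) + 1.0)

print(mean_of_logs)  # 2.0
print(log_of_mean)   # log2(5), roughly 2.32
```

The mean of the logs is never larger than the log of the mean, so genes with several identifiers at different expression levels will read lower in the current compendium than a TPM-space mean would give.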



In [13]:

    
# Check to verify the TCGA+TARGET samples in the Treehouse compendium 
# match TPM wise with our conversions above
sample_id = "TCGA-ZQ-A9CR-01"

np.allclose(tcga_target_gtex_expression_hugo_tpm[sample_id],
            treehouse_expression_hugo_tpm[sample_id], rtol=1, atol=1)

argmax = (tcga_target_gtex_expression_hugo_tpm[sample_id] 
          - treehouse_expression_hugo_tpm[sample_id]).values.argmax()
gene = tcga_target_gtex_expression_hugo_tpm.index[argmax]
print("Gene with maximum delta:", gene,
      tcga_target_gtex_expression_hugo_tpm[sample_id][gene] 
      - treehouse_expression_hugo_tpm[sample_id][gene])

(tcga_target_gtex_expression_hugo_tpm[sample_id] 
 - treehouse_expression_hugo_tpm[sample_id]).describe()









    



Gene with maximum delta: Metazoa_SRP 13.833809






    Out[13]:





count    58581.000000
mean         0.004649
std          0.143267
min         -0.013058
25%         -0.000108
50%         -0.000016
75%         -0.000016
max         13.833809
Name: TCGA-ZQ-A9CR-01, dtype: float64

Download Labels and Conform to Expression Indexes



In [14]:

    
# Read in the sample labels from Xena, i.e. clinical/phenotype information on each sample
if not os.path.exists("data/TcgaTargetGTEX_phenotype.txt.gz"):
    with open("data/TcgaTargetGTEX_phenotype.txt.gz", "wb") as f:
        f.write(requests.get("https://toil.xenahubs.net/download/TcgaTargetGTEX_phenotype.txt.gz").content)

tcga_target_gtex_labels = pd.read_table(
    "data/TcgaTargetGTEX_phenotype.txt.gz", compression="gzip", 
    header=0, names=["id", "category", "disease", "primary_site", "sample_type", "gender", "study"],
    sep="\t", encoding="ISO-8859-1", index_col=0, dtype="str").sort_index(axis="index")

# Compute and add a tumor/normal column - TCGA and TARGET have some normal samples, GTEX is all normal.
tcga_target_gtex_labels["tumor_normal"] = tcga_target_gtex_labels.apply(
    lambda row: "Normal" if row["sample_type"] in ["Cell Line", "Normal Tissue", "Solid Tissue Normal"]
    else "Tumor", axis=1)

tcga_target_gtex_labels.head()









    Out[14]:







  
    
      
id	category	disease	primary_site	sample_type	gender	study	tumor_normal
GTEX-1117F-0226-SM-5GZZ7	Adipose - Subcutaneous	Adipose - Subcutaneous	Adipose Tissue	Normal Tissue	Female	GTEX	Normal
GTEX-1117F-0426-SM-5EGHI	Muscle - Skeletal	Muscle - Skeletal	Muscle	Normal Tissue	Female	GTEX	Normal
GTEX-1117F-0526-SM-5EGHJ	Artery - Tibial	Artery - Tibial	Blood Vessel	Normal Tissue	Female	GTEX	Normal
GTEX-1117F-0626-SM-5N9CS	Artery - Coronary	Artery - Coronary	Blood Vessel	Normal Tissue	Female	GTEX	Normal
GTEX-1117F-0726-SM-5GIEN	Heart - Atrial Appendage	Heart - Atrial Appendage	Heart	Normal Tissue	Female	GTEX	Normal



In [15]:

    
# Keep only samples that have labels and a non-null primary_site label
intersection = tcga_target_gtex_expression_hugo_tpm.columns.intersection(
    tcga_target_gtex_labels[pd.notnull(tcga_target_gtex_labels["primary_site"])].index)
tcga_target_gtex_expression_common = tcga_target_gtex_expression_hugo_tpm.loc[:, tcga_target_gtex_expression_hugo_tpm.columns.isin(intersection)]
tcga_target_gtex_labels_common = tcga_target_gtex_labels[tcga_target_gtex_labels.index.isin(intersection)]

# Make sure the label and example samples are in the same order
assert(tcga_target_gtex_expression_common.columns.equals(tcga_target_gtex_labels_common.index))
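
The alignment step can be sketched on toy tables. The sample ids `s1` to `s4` and the values are invented. This variant selects both tables with the same `common` index through `.loc`, which keeps them in the same order without relying on both having been sorted beforehand; it is an alternative sketch, not the cell above rewritten.

```python
import pandas as pd

# Toy stand-ins: expression has genes as rows and samples as columns.
expression = pd.DataFrame(
    [[1.0, 2.0, 3.0]], index=["GENE_X"], columns=["s1", "s2", "s3"])
labels = pd.DataFrame(
    {"primary_site": ["Lung", None, "Brain"]}, index=["s1", "s2", "s4"])

# Keep only samples present in both tables with a non-null primary_site.
labelled = labels[labels["primary_site"].notnull()]
common = expression.columns.intersection(labelled.index)

expression_common = expression.loc[:, common]
labels_common = labels.loc[common]

print(list(common))  # ['s1']
```

Here `s2` is dropped for its missing primary site, `s3` for having no label row, and `s4` for having no expression column.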



In [16]:

    
# Read in the Treehouse sample labels from Xena, i.e. clinical metadata on each sample
if not os.path.exists("data/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz"):
    with open("data/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz", "wb") as f:
        f.write(requests.get("https://treehouse.xenahubs.net/download/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz").content)

treehouse_labels = pd.read_table(
    "data/treehouse_public_samples_clinical_metadata.2017-09-11.tsv.gz", compression="gzip", 
    header=0, names=["id", "age_in_years", "gender", "disease"],
    sep="\t", encoding="ISO-8859-1", index_col=0, dtype="str").sort_index(axis=0)

# Title case to match TARGET+TCGA+GTEX
treehouse_labels["disease"] = treehouse_labels["disease"].str.title()
treehouse_labels["gender"] = treehouse_labels["gender"].str.title()

treehouse_labels.head()









    Out[16]:







  
    
      
id	age_in_years	gender	disease
TARGET-10-PAKSWW-03	15.11	Male	Acute Lymphoblastic Leukemia
TARGET-10-PAMXHJ-09	6.08	Male	Acute Lymphoblastic Leukemia
TARGET-10-PAMXSP-09	3.35	Male	Acute Lymphoblastic Leukemia
TARGET-10-PANCVR-03	6.4	Male	Acute Lymphoblastic Leukemia
TARGET-10-PANCVR-04	6.4	Male	Acute Lymphoblastic Leukemia



In [17]:

    
# Treehouse samples are prefixed with TH (prospective patient) or THR (pediatric research study),
# so keep only these, excluding the TCGA and TARGET samples that are already in our other dataset
treehouse_labels_pruned = treehouse_labels.filter(regex=r'\ATH|\ATHR', axis="index")
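
A small sketch of how an anchored regex behaves with `DataFrame.filter`. The TH and THR identifiers below are made up to show the prefixes only and may not match the real Treehouse id layout. Since every id that starts with THR also starts with TH, the single pattern `\ATH` selects the same rows as the two-part pattern.

```python
import pandas as pd

# Two real-looking TARGET/TCGA ids plus two hypothetical Treehouse-style ids.
ids = ["TARGET-10-PAKSWW-03", "TCGA-ZQ-A9CR-01", "TH01_0001_S01", "THR08_0002_S01"]
toy_labels = pd.DataFrame({"disease": ["a", "b", "c", "d"]}, index=ids)

# \A anchors the match at the start of each index label.
kept = toy_labels.filter(regex=r"\ATH", axis="index")
kept_two_part = toy_labels.filter(regex=r"\ATH|\ATHR", axis="index")

print(list(kept.index))  # ['TH01_0001_S01', 'THR08_0002_S01']
```

The raw-string prefix `r` keeps Python from treating `\A` as a string escape.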



In [18]:

    
# See how many disease labels overlap
set(tcga_target_gtex_labels.disease).intersection(treehouse_labels_pruned.disease)









    Out[18]:





{'Acute Lymphoblastic Leukemia',
 'Acute Myeloid Leukemia',
 'Glioblastoma Multiforme',
 'Lung Adenocarcinoma',
 'Neuroblastoma',
 'Sarcoma',
 'Thyroid Carcinoma',
 'Wilms Tumor'}



In [19]:

    
# Subset Treehouse expression to the samples not in TCGA+TARGET+GTEX
treehouse_expression_hugo_tpm_pruned = \
    treehouse_expression_hugo_tpm.loc[:, treehouse_labels_pruned.index]
print(treehouse_expression_hugo_tpm_pruned.shape)
assert(tcga_target_gtex_expression_common.index.equals(
    treehouse_expression_hugo_tpm_pruned.index))









    



(58581, 549)

Export

Write the wrangled expression and labels for both datasets out to hdf5 files in machine learning format, with rows = samples.
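
To make "rows = samples" concrete, here is a sketch on toy tables of the transposition and of turning the tumor/normal label into the 0/1 target mentioned in the introduction. The names `X` and `y` and the toy values are assumptions for this example; nothing is written to or read from hdf5 here.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the wrangled tables (genes x samples, plus labels).
expression = pd.DataFrame(
    [[1.5, -9.97], [0.2, 4.1]],
    index=["GENE_B", "GENE_A"], columns=["s1", "s2"])
labels = pd.DataFrame({"tumor_normal": ["Normal", "Tumor"]}, index=["s1", "s2"])

# Transpose so each row is a sample, with gene columns in sorted order.
X = expression.T.sort_index(axis="columns")

# Encode Normal as 0 and Tumor as 1, in the same sample order as X.
y = labels.loc[X.index, "tumor_normal"].map({"Normal": 0, "Tumor": 1}).to_numpy()

print(X.shape, y)
```

Sorting the gene columns gives both datasets the same feature order, so a model fitted on one can be applied to the other.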



In [20]:

    
# Export h5 format files
with pd.HDFStore("data/tcga_target_gtex.h5", "w") as store:
    store["expression"] = tcga_target_gtex_expression_common.T.sort_index(axis="columns")
    store["labels"] = tcga_target_gtex_labels_common.astype(str)
    
with pd.HDFStore("data/treehouse.h5", "w") as store:
    store["expression"] = treehouse_expression_hugo_tpm_pruned.T.sort_index(axis="columns")
    store["labels"] = treehouse_labels_pruned.astype(str)
