Extraction of TCGA Metadata from Sevenbridges Genomics

The purpose of this IPython notebook is to pull the majority of relevant TCGA metadata from a SPARQL ontology that is currently hosted by Sevenbridges Genomics. By the end of this notebook, we will have stored them in the following data structures:

  1. A hierarchical dictionary of Pandas DataFrames, indexed by TCGA case (i.e. patient) universally unique identifiers (UUIDs)
  2. A hierarchical dictionary of Pandas DataFrames, indexed by TCGA file universally unique identifiers (UUIDs)
  3. One large, sorted CSV file that has been flattened to a 2D format but no longer has unique metadata fields (i.e. columns) due to mixed numbers of entries per patient.

Note that the hierarchical dictionaries can be easily formatted for NoSQL databases, such as MongoDB, if desired. If you want more information on the ontology, it can be found by clicking here. In addition to this notebook, we have written a Python script that pulls a subset of this metadata (~20 fields vs. 59 fields here) directly from your file directory on Sevenbridges Genomics website (first link above). A number of important considerations went into this analysis, and they are as follows:

  1. The SPARQL query had to be broken up into 2 separate queries, specifically two "long" queries. The practical reason for this is that a single POST-based retrieval was too large for the involved server to handle (likely due to the OPTIONAL query statements).
  2. Based on our desired analysis, we have implemented a variety of filters in the SPARQL query. These are given below and are the basis of the 2 queries. Note, if you want to run similar metadata query(ies) but have different filters, you should change these in every query below.

     # First filter by sample type
     filter(?sample_type_label='Primary Tumor' || ?sample_type_label='Additional - New Primary' || ?sample_type_label='Additional Metastatic' || ?sample_type_label='Metastatic' || ?sample_type_label='Primary Blood Derived Cancer - Peripheral Blood' || ?sample_type_label='Recurrent Tumor' || ?sample_type_label='Blood Derived Normal' || ?sample_type_label='Bone Marrow Normal' || ?sample_type_label='Buccal Cell Normal' || ?sample_type_label='Solid Tissue Normal')
    
     # Second filter by experimental strategy --> 'RNA-Seq' OR 'WGS'
     filter(?experimental_strategy_label='RNA-Seq' || ?experimental_strategy_label='WGS') .
    
     # Third filter by data type --> only 'Raw sequencing data'
     filter(?data_type_label='Raw sequencing data') .
    
     # Fourth filter by data format --> only 'BAM' files
     filter(?data_format_label='BAM') .
  3. One major consideration of this script (and likely the difficulty of this problem for most) is that the queried metadata is stored in a multiplexed, hierarchical, nested array structures (e.g. XML, JSON). Due to the variety of sampled fields, these data comprise a mix of static (e.g. demographic), dynamic (e.g. performance evaluations), and time-dependent (e.g. days until death) information that are not easily translatable to data formats most researchers are used to. For instance, take a moment to examine the table below. If one queries the metadata of a particular file, it is generally expected that there will be just one Karnofsky performance score for the file in question. Yet, as shown below, one file apparently had 3 performance scores, another had none, and the last one had 2 scores diametrically opposite of one another (0 vs. 100). Without context, this result looks more like a corrupted database than a query problem, but it's actually the opposite. The issue actually arose because the query focused on file name and not the case UUID for the patient; in reality, the first file name belongs to a patient that received multiple performance scores across a duration of time. Thus, if all you selected for in your query was file name associated with patients' metadata, this is the type of result you would get.

file_name gdc_file_uuid perf_score_karnofsky_label
TCGA-KM-8442-10A-01D-2311-10_wgs_Illumina.bam 0273C55A-A89A-432E-A324-B90E1B28CED0 90
TCGA-KM-8442-10A-01D-2311-10_wgs_Illumina.bam 0273C55A-A89A-432E-A324-B90E1B28CED0 100
TCGA-KM-8442-10A-01D-2311-10_wgs_Illumina.bam 0273C55A-A89A-432E-A324-B90E1B28CED0 80
UNCID_1147822.fef221ff-634b-4188-92cf-5723b9a3c04d.sorted_genome_alignments.bam 0273D20C-0DB7-44E0-B159-2A2D5FEB03D2 Not available
UNCID_1076554.a6af2ef5-6774-422c-99d1-826a686334d0.sorted_genome_alignments.bam 2FE88F0F-62BB-41BB-8B49-1B6E1E7EF368 0
UNCID_1076554.a6af2ef5-6774-422c-99d1-826a686334d0.sorted_genome_alignments.bam 2FE88F0F-62BB-41BB-8B49-1B6E1E7EF368 100

The Takeaway

Based on the above discussion, a decision was made to extract as much metadata as possible while retaining a minimum amount of the hierarchical structure. Through manipulations of Pandas data structures, this came about as a dictionary object with key-value pairs linked to their associated DataFrames. It should be noted that while they take a while to build (20-40 min), they permit the user to subsequently pull up all the metadata on a particular file or case (i.e. patient) by simply entering their file or case UUIDs as the dictionary key. Since these two key types (file UUID vs. case UUID) have different numbers of entries represented in the TCGA database and their associated dictionaries, they were created and packaged in separate Pickle binaries (i.e. Python dictionary files).

For those who are uncomfortable with any hierarchical data manipulation, we have also produced a 2D table of all the data herein (see very bottom of notebook). It is important to note that none of the metadata fields in this 2D table are entirely unique due to the metadata problems described above, meaning some level of grouping or summarization will need to be done by the end-user. It will also not be indexable by SQL-based formats, for those who use SQL-light or MySQL pipelines.

Despite these limitations, we hope this notebook will be of use to you for future metadata analyses.


In [187]:
# Initialize
from SPARQLWrapper import SPARQLWrapper, JSON
from pandas.io.json import json_normalize
import pandas as pd
import numpy as np
import json, csv, tqdm, pickle

In [188]:
# Use the public endpoint

sparql_endpoint = "https://opensparql.sbgenomics.com/bigdata/namespace/tcga_metadata_kb/sparql"

# Initialize the SPARQL wrapper with the endpoint
sparql = SPARQLWrapper(sparql_endpoint)

In [189]:
longQuery1 = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX tcga: <https://www.sbgenomics.com/ontologies/2014/11/tcga#> 
 
select distinct ?gdc_file_uuid ?file_name ?file_submitter_id ?case_name ?aliquot_name ?ref_genome_label ?sample_type_label ?experimental_strategy_label ?data_type_label ?data_format_label ?gender_label ?race_label ?ethnicity_label ?ageAtDiagnosis ?disease_type_label ?sample_label ?investigation_label ?histological_diagnosis_label ?primary_site_label ?prior_dx ?clinical_m_label ?clinical_n_label ?clinical_t_label ?pathologic_m_label ?pathologic_n_label ?pathologic_t_label ?pathologic_stage_label ?perf_score_karnofsky_label ?perf_score_eastern_cancer_oncology_group_label ?perf_score_timing_label ?primary_therapy_outcome_success_label ?vital_status_label ?new_tumor_event_label ?new_tumor_event_after_initial_trtmt ?radiation_therapy_code_label ?radiation_therapy_site_label ?radiation_therapy_type_label ?days_to_last_followup ?year_of_diagnosis ?icd10 ?icd03_histology_label ?icd03_histology_site ?data_submitting_center_label ?seq_platform_label ?aliquot_concentration  ?analyte_A260A280Ratio ?analyte_type_label ?analyte_amount ?analyte_well_number ?spectrophotometer_method_label ?file_upload_date ?file_published_date ?file_last_modified_date ?portion_weight ?portion_is_ffpe ?portion_number ?portion_slide_label ?freezing_method_label ?tissue_source_site_label ?country_of_sample_procurement
{ 
  ?file a tcga:File .
  ?file rdfs:label ?file_name .
  
  ?file tcga:hasGDCFileUUID ?gdc_file_uuid .
  
  ?file tcga:hasSample ?sample .
  ?sample tcga:hasSampleType ?st .
  ?st rdfs:label ?sample_type_label 
  filter(?sample_type_label='Primary Tumor' || ?sample_type_label='Additional - New Primary' || ?sample_type_label='Additional Metastatic' || ?sample_type_label='Metastatic' || ?sample_type_label='Primary Blood Derived Cancer - Peripheral Blood' || ?sample_type_label='Recurrent Tumor' || ?sample_type_label='Blood Derived Normal' || ?sample_type_label='Bone Marrow Normal' || ?sample_type_label='Buccal Cell Normal' || ?sample_type_label='Solid Tissue Normal')
      
  ?file tcga:hasExperimentalStrategy ?xs .
  ?xs rdfs:label ?experimental_strategy_label .
  filter(?experimental_strategy_label='RNA-Seq' || ?experimental_strategy_label='WGS') .
  
  ?file tcga:hasDataType ?type .
  ?type rdfs:label ?data_type_label .
  filter(?data_type_label='Raw sequencing data') .
  
  ?file tcga:hasDataFormat ?format .
  ?format rdfs:label ?data_format_label .
  filter(?data_format_label='BAM') .
  
  OPTIONAL {
  ?file tcga:hasSubmitterId ?file_submitter_id .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasPerformanceStatusScoreKarnofsky ?perf_score_karnofsky .
  ?perf_score_karnofsky rdfs:label ?perf_score_karnofsky_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .
  ?case rdfs:label ?case_name .
  }
  
  OPTIONAL {
  ?file tcga:hasAliquot ?aliquot .  
  ?aliquot rdfs:label ?aliquot_name .
  }
  
  OPTIONAL {
  ?file tcga:hasReferenceGenome ?ref_genome .   
  ?ref_genome rdfs:label ?ref_genome_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .
  ?case tcga:hasGender ?gender .
  ?gender rdfs:label ?gender_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasEthnicity ?ethnicity .
  ?ethnicity rdfs:label ?ethnicity_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasRace ?race .
  ?race rdfs:label ?race_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasAgeAtDiagnosis ?ageAtDiagnosis .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasDiseaseType ?diseaseType .
  ?diseaseType rdfs:label ?disease_type_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasSample ?sample .
  ?sample rdfs:label ?sample_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .  
  ?case tcga:hasInvestigation ?investigation .
  ?investigation rdfs:label ?inv_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case . 
  ?case tcga:hasHistologicalDiagnosis ?hd .
  ?hd rdfs:label ?histological_diagnosis_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasPrimarySite ?primary_site .
  ?primary_site rdfs:label ?primary_site_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasPriorDiagnosis ?prior_dx_base .
  ?prior_dx_base rdfs:label ?prior_dx .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasClinicalM ?clinical_m .
  ?clinical_m rdfs:label ?clinical_m_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasClinicalM ?clinical_n .
  ?clinical_n rdfs:label ?clinical_n_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasClinicalM ?clinical_t .
  ?clinical_t rdfs:label ?clinical_t_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasClinicalStage ?clinical_stage .
  ?clinical_stage rdfs:label ?clinical_stage_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasPathologicM ?pathologic_m .
  ?pathologic_m rdfs:label ?pathologic_m_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasPathologicN ?pathologic_n .
  ?pathologic_n rdfs:label ?pathologic_n_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasPathologicT ?pathologic_t .
  ?pathologic_t rdfs:label ?pathologic_t_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasPathologicStage ?pathologic_stage .
  ?pathologic_stage rdfs:label ?pathologic_stage_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasPerformanceStatusScoreECOG ?perf_score_eastern_cancer_oncology_group .
  ?perf_score_eastern_cancer_oncology_group rdfs:label ?perf_score_eastern_cancer_oncology_group_label .
  }
}

"""

In [190]:
# Set SPARQL extraction method to POST, return in the JSON format, and then convert JSON to a Python dictionary object
sparql.method = 'POST'
sparql.setReturnFormat(JSON)
sparql.setQuery(longQuery1)
resultsLong1 = sparql.query().convert()

In [191]:
longQuery2 = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX tcga: <https://www.sbgenomics.com/ontologies/2014/11/tcga#> 
 
select distinct ?gdc_file_uuid ?file_name ?file_submitter_id ?case_name ?aliquot_name ?ref_genome_label ?sample_type_label ?experimental_strategy_label ?data_type_label ?data_format_label ?gender_label ?race_label ?ethnicity_label ?ageAtDiagnosis ?disease_type_label ?sample_label ?investigation_label ?histological_diagnosis_label ?primary_site_label ?prior_dx ?clinical_m_label ?clinical_n_label ?clinical_t_label ?pathologic_m_label ?pathologic_n_label ?pathologic_t_label ?pathologic_stage_label ?perf_score_karnofsky_label ?perf_score_eastern_cancer_oncology_group_label ?perf_score_timing_label ?primary_therapy_outcome_success_label ?vital_status_label ?new_tumor_event_label ?new_tumor_event_after_initial_trtmt ?radiation_therapy_code_label ?radiation_therapy_site_label ?radiation_therapy_type_label ?days_to_last_followup ?year_of_diagnosis ?icd10 ?icd03_histology_label ?icd03_histology_site ?data_submitting_center_label ?seq_platform_label ?aliquot_concentration  ?analyte_A260A280Ratio ?analyte_type_label ?analyte_amount ?analyte_well_number ?spectrophotometer_method_label ?file_upload_date ?file_published_date ?file_last_modified_date ?portion_weight ?portion_is_ffpe ?portion_number ?portion_slide_label ?freezing_method_label ?tissue_source_site_label ?country_of_sample_procurement
{ 
  ?file a tcga:File .
  ?file rdfs:label ?file_name .
  
  ?file tcga:hasGDCFileUUID ?gdc_file_uuid .
  
  ?file tcga:hasSample ?sample .
  ?sample tcga:hasSampleType ?st .
  ?st rdfs:label ?sample_type_label 
  filter(?sample_type_label='Primary Tumor' || ?sample_type_label='Additional - New Primary' || ?sample_type_label='Additional Metastatic' || ?sample_type_label='Metastatic' || ?sample_type_label='Primary Blood Derived Cancer - Peripheral Blood' || ?sample_type_label='Recurrent Tumor' || ?sample_type_label='Blood Derived Normal' || ?sample_type_label='Bone Marrow Normal' || ?sample_type_label='Buccal Cell Normal' || ?sample_type_label='Solid Tissue Normal')
      
  ?file tcga:hasExperimentalStrategy ?xs .
  ?xs rdfs:label ?experimental_strategy_label .
  filter(?experimental_strategy_label='RNA-Seq' || ?experimental_strategy_label='WGS') .
  
  ?file tcga:hasDataType ?type .
  ?type rdfs:label ?data_type_label .
  filter(?data_type_label='Raw sequencing data') .
  
  ?file tcga:hasDataFormat ?format .
  ?format rdfs:label ?data_format_label .
  filter(?data_format_label='BAM') .
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasPerformanceStatusScoreTiming ?perf_score_timing .
  ?perf_score_timing rdfs:label ?perf_score_timing_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasPrimaryTherapyOutcomeSuccess ?primary_therapy_outcome_success .
  ?primary_therapy_outcome_success rdfs:label ?primary_therapy_outcome_success_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasVitalStatus ?vital_status .
  ?vital_status rdfs:label ?vital_status_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasNewTumorEvent ?new_tumor_event .
  ?new_tumor_event rdfs:label ?new_tumor_event_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasRadiationTherapy ?radiation_therapy .
  ?radiation_therapy rdfs:label ?radiation_therapy_code_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasRadiationTherapy ?radiation_therapy .
  ?radiation_therapy tcga:hasRadiationTherapySite ?radiation_therapy_site .
  ?radiation_therapy_site rdfs:label ?radiation_therapy_site_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasRadiationTherapy ?radiation_therapy .
  ?radiation_therapy tcga:hasRadiationType ?radiation_therapy_type .
  ?radiation_therapy_type rdfs:label ?radiation_therapy_type_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasNewTumorEventAfterInitialTreatment ?new_tumor_event_after_initial_trtmt .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasDaysToLastFollowUp ?days_to_last_followup .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasYearOfDiagnosis ?year_of_diagnosis .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .   
  ?case tcga:hasIcd10 ?icd10 .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case . 
  ?case tcga:hasIcdO3Histology ?icd03_histology_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case . 
  ?case tcga:hasIcdO3Site ?icd03_histology_site .
  }
  
  OPTIONAL {
  ?file tcga:hasDataSubmittingCenter ?data_submitting_center .   
  ?data_submitting_center rdfs:label ?data_submitting_center_label .
  }
  
  OPTIONAL {
  ?file tcga:hasPlatform ?seq_platform .   
  ?seq_platform rdfs:label ?seq_platform_label .
  }
  
  OPTIONAL {
  ?file tcga:hasAliquot ?aliquot .   
  ?aliquot tcga:hasConcentration ?aliquot_concentration .
  }
  
  OPTIONAL {
  ?file tcga:hasAnalyte ?analyte .   
  ?analyte tcga:hasA260A280Ratio ?analyte_A260A280Ratio .
  }
  
  OPTIONAL {
  ?file tcga:hasAnalyte ?analyte .   
  ?analyte tcga:hasAnalyteType ?analyte_type .
  ?analyte_type rdfs:label ?analyte_type_label .
  }
  
  OPTIONAL {
  ?file tcga:hasAnalyte ?analyte .   
  ?analyte tcga:hasAmount ?analyte_amount .
  }
  
  OPTIONAL {
  ?file tcga:hasAnalyte ?analyte .   
  ?analyte rdfs:label ?analyte_well_number .
  }
  
  OPTIONAL {
  ?file tcga:hasAnalyte ?analyte .   
  ?analyte tcga:hasSpectrophotometerMethod ?spectrophotometer_method .
  ?spectrophotometer_method rdfs:label ?spectrophotometer_method_label .
  }
  
  OPTIONAL {
  ?file tcga:uploadDate ?file_upload_date .
  }
  
  OPTIONAL {
  ?file tcga:publishedDate ?file_published_date .
  }
  
  OPTIONAL {
  ?file tcga:lastModifiedDate ?file_last_modified_date .
  }
  
  OPTIONAL {
  ?file tcga:hasAnalyte ?analyte .
  ?analyte tcga:hasPortion ?portion .
  ?portion tcga:hasWeight ?portion_weight .
  }
  
  OPTIONAL {
  ?file tcga:hasAnalyte ?analyte .
  ?analyte tcga:hasPortion ?portion .
  ?portion tcga:hasIsFFPE ?portion_is_ffpe .
  }
  
  OPTIONAL {
  ?file tcga:hasAnalyte ?analyte .
  ?analyte tcga:hasPortion ?portion .
  ?portion tcga:hasPortionNumber ?portion_number .
  }
  
  OPTIONAL {
  ?file tcga:hasAnalyte ?analyte .
  ?analyte tcga:hasPortion ?portion .
  ?portion tcga:hasSlide ?portion_slide .
  ?portion_slide rdfs:label ?portion_slide_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .
  ?case tcga:hasSample ?sample .
  ?sample tcga:hasFreezingMethod ?freezing_method .
  ?freezing_method rdfs:label ?freezing_method_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .
  ?case tcga:hasSample ?sample .
  ?sample tcga:hasTissueSourceSite ?tissue_source_site .
  ?tissue_source_site rdfs:label ?tissue_source_site_label .
  }
  
  OPTIONAL {
  ?file tcga:hasCase ?case .
  ?case tcga:hasSample ?sample .
  ?sample tcga:hasCountryOfSampleProcurement ?country_of_sample_procurement .
  }
}

"""

In [192]:
# Set SPARQL extraction method to POST, return in the JSON format, and then convert JSON to a Python dictionary object
sparql.method = 'POST'
sparql.setReturnFormat(JSON)
sparql.setQuery(longQuery2)
resultsLong2 = sparql.query().convert()

In [193]:
# Find the length of the "long" queries
# Note that the extended lengths are likely due to the duplicate entries discussed at the beginning of this notebook
print("Length of long query 1 list is %r" %len(resultsLong1['results']['bindings']))
print("Length of long query 2 list is %r" %len(resultsLong2['results']['bindings']))


Length of long query 1 list is 18669
Length of long query 2 list is 31766

In [194]:
# Convert individual Python dictionaries into Pandas DataFrame objects, and then concatenate the result
# This is for "long" query 1
framesListLong1 = []
for ind in tqdm.trange(0,len(resultsLong1['results']['bindings'])):
    framesListLong1.append(pd.DataFrame.from_dict(resultsLong1['results']['bindings'][ind]))

# Note that the dictionaries are nested and that the data we want are stored with the 'value' key.
# The issue is that sub-dictionaries (i.e. in a key:value format) is that the their values cannot be extracted directly.
# We solve this by concatenating all of the nested keys (e.g. 'type', 'value') as rows in the DataFrame objects and 
# then slicing the resultant object based on the index key 'value'. One should note that this does not yield a unique
# index, but this is fixed via the join procedures later on before the CSV file is written.
concatDFLong1 = pd.concat(framesListLong1).loc['value',:]
print(concatDFLong1.shape)


100%|██████████| 18669/18669 [01:19<00:00, 236.11it/s]
(18669, 28)

In [195]:
# Convert individual Python dictionaries into Pandas DataFrame objects, and then concatenate the result
# This is for "long" query 2
framesListLong2 = []
for ind in tqdm.trange(0,len(resultsLong2['results']['bindings'])):
    framesListLong2.append(pd.DataFrame.from_dict(resultsLong2['results']['bindings'][ind]))
    
# As described before, note that the dictionaries are nested and that the data we want are stored with the 'value' key.
# The issue is that sub-dictionaries (i.e. in a key:value format) is that the their values cannot be extracted directly.
# We solve this by concatenating all of the nested keys (e.g. 'type', 'value') as rows in the DataFrame objects and 
# then slicing the resultant object based on the index key 'value'. One should note that this does not yield a unique
# index, but this is fixed via the join procedures later on before the CSV file is written.
concatDFLong2 = pd.concat(framesListLong2).loc['value',:]
print(concatDFLong2.shape)


100%|██████████| 31766/31766 [01:43<00:00, 306.39it/s]
(31766, 37)

In [196]:
# Now, we're going to create the equivalent of a 3D lookup table
# using Pandas DataFrames and a Python dictionary.
# First, drop all duplicate columns except 'gdc_file"uuid' on the 'right-hand' file and 
# then do a left join on the results of the two long queries:

reformattedConcatDFLong2 = concatDFLong2.drop(['file_name','data_format_label','data_type_label', 'experimental_strategy_label', 'sample_type_label'], axis=1)

print(concatDFLong1.shape)
print(formattedConcatDFLong2.shape)

# Note that 'gdc_file_uuid' is the only, truly unique identifier in this dataset
combinedDFLong = pd.merge(concatDFLong1, reformattedConcatDFLong2, on = 'gdc_file_uuid')

# Check the resulting DataFrame dimensions and number of unique files
print(combinedDFLong.shape)
print("The number of unique files are: %r (should be 18154)" %len(combinedDFLong['gdc_file_uuid'].unique()))
print("The number of unique cases are: %r (should be 10499)" %len(combinedDFLong['case_name'].unique()))


(18669, 28)
(31766, 32)
(33263, 59)
The number of unique files are: 18154 (should be 18154)
The number of unique cases are: 10499 (should be 10499)

In [197]:
# Next, create groupby objects, one by 'gdc_file_uuid' and one by 'case_name'
fileGroupedDF = combinedDFLong.groupby('gdc_file_uuid')
caseGroupedDF = combinedDFLong.groupby('case_name')

In [198]:
# Lastly, create dictionaries of DataFrames for each 'gdc_file_uuid' and 'case_name'.
# The dictionary for 'case_name' is created in this cell and 'gdc_file_uuid' in the next cell.
# Takes a while to run: est. 18 min on Mac OS X
# with 2.9 GHz dual-core Intel i7 and 8 GB DDR3 RAM.
caseGroupedDict = {}
for v in tqdm.tqdm(list(caseGroupedDF.groups.keys())):
    caseGroupedDict.update({v: combinedDFLong.groupby(['case_name']).get_group(v)})
#     print(v)


100%|██████████| 10499/10499 [15:52<00:00, 13.11it/s]

In [199]:
# Takes a while to run: est. 50 min on Mac OS X
# with 2.9 GHz dual-core Intel i7 and 8 GB DDR3 RAM.
fileGroupedDict = {}
for v in tqdm.tqdm(list(fileGroupedDF.groups.keys())):
    fileGroupedDict.update({v: combinedDFLong.groupby(['gdc_file_uuid']).get_group(v)})


100%|██████████| 18154/18154 [46:19<00:00,  8.15it/s]

In [200]:
# Save dict/DataFrame as a pickle file
pickle.dump(caseGroupedDict, open("caseGroupedDict.p", "wb"), protocol=4)

In [201]:
# Save dict/DataFrame as a pickle file
pickle.dump(fileGroupedDict, open("fileGroupedDict.p", "wb"), protocol=4)

In [202]:
# Load pickle files to demonstrate that the Pandas formatting is not lost in the binary conversions
caseTesting = pickle.load(open("caseGroupedDict.p", "rb"))
fileNameTesting = pickle.load(open("fileGroupedDict.p", "rb"))

In [203]:
# Test for equality between saved and loaded file as a sanity check.
# Note that values saved as 'NaN' will show as 'False'.
caseTestingKey = '40C217EE-429F-41F4-ADC6-DB9186622B17'
caseTesting[caseTestingKey] == caseGroupedDict[caseTestingKey]


Out[203]:
ageAtDiagnosis aliquot_name case_name clinical_m_label clinical_n_label clinical_t_label data_format_label data_type_label disease_type_label ethnicity_label ... portion_weight primary_therapy_outcome_success_label radiation_therapy_code_label radiation_therapy_site_label radiation_therapy_type_label seq_platform_label spectrophotometer_method_label tissue_source_site_label vital_status_label year_of_diagnosis
16759 True True True True True True True True True True ... True True False False False True True True True True
16760 True True True True True True True True True True ... True True False False False True True True True True

2 rows × 59 columns


In [204]:
# Test for equality between saved and loaded file as a sanity check.
# Note that values saved as 'NaN' will show as 'False'.
fileTestingKey = '01249B8C-2E9E-4CEA-B39A-485863AB231A'
fileNameTesting[fileTestingKey] == fileGroupedDict[fileTestingKey]


Out[204]:
ageAtDiagnosis aliquot_name case_name clinical_m_label clinical_n_label clinical_t_label data_format_label data_type_label disease_type_label ethnicity_label ... portion_weight primary_therapy_outcome_success_label radiation_therapy_code_label radiation_therapy_site_label radiation_therapy_type_label seq_platform_label spectrophotometer_method_label tissue_source_site_label vital_status_label year_of_diagnosis
5070 True True True True True True True True True True ... True True False False False True True True True True

1 rows × 59 columns


In [205]:
# Lastly, we're going to make one large metadata table. Some prefer 2D tables versus
# hierarchical data, so this is for those folks. Do note, however, that no metadata field will
# have unique values. We will order the table based first on 'case_name', secondly on 'gdc_file_uuid',
# and then we'll sort the rest of the metadata fields alphabetically to try to help with the table's readability.

sortedCombinedDFLong = combinedDFLong.sort_values(by = ['case_name', 'gdc_file_uuid'])
cols = sortedCombinedDFLong.columns.tolist()
sortedCols = sorted(cols)
sortedCols.remove('case_name')
sortedCols.remove('gdc_file_uuid')
sortedCols.insert(0,'gdc_file_uuid')
sortedCols.insert(0,'case_name')
alphaSortedCombinedDFLong = sortedCombinedDFLong[sortedCols]

# Drop the previous index (rearranged due to sorting) and reset to a new integer index
alphaSortedCombinedDFLongNewIndex = alphaSortedCombinedDFLong.reset_index(drop=True)

In [206]:
# Write the DataFrame as a CSV file to your local directory
alphaSortedCombinedDFLongNewIndex.to_csv('mergedCGCMetadata_tcga_microbiome.csv')

In [ ]: