Here we download and process most of the data necessary to run this analysis pipeline. I have set up a series of scripts to do this in an automated fashion, both so that others can reproduce this study and so that the results obtained here can be updated as more TCGA data is collected and released.
Downloading this data takes a considerable amount of time (~5 hours) and disk space (~45 GB), so be prepared.
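If you want to confirm there is enough room before kicking things off, a quick free-space check along these lines can help. This is only a sketch, not part of the pipeline; the path and the ~45 GB threshold are illustrative.

import os

# Check free space on the drive where the data will be stored (illustrative path).
st = os.statvfs('.')
free_gb = st.f_bavail * st.f_frsize / float(1024 ** 3)
print('Free space: %.1f GB' % free_gb)
if free_gb < 45:
    print('Warning: less than ~45 GB free, the download may not fit.')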
We use the firehose_get script provided by the Broad to download the data; please see the firehose_get documentation for troubleshooting. Because we rely on the Broad's initial processing pipeline and data formats, we cannot promise that this code will not break with future updates on their end.
Make sure to edit the [Imports Notebook](./Imports) and change __OUT_PATH__ to a directory on your machine where you want to store the data.
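For reference, this notebook relies on __OUT_PATH__ and __RUN_DATE__, which come in below via `from Imports import *`. The values shown here are only placeholders illustrating the expected form, not the settings used for this study.

# Placeholder values -- set these in the Imports notebook, not here.
OUT_PATH = '/path/to/tcga_data'  # directory where downloaded and processed data will live
RUN_DATE = '2014_07_15'          # Firehose run date used in the analyses__/stddata__ URLs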
In [123]:
import NotebookImport
from Imports import *
In [4]:
!curl http://gdac.broadinstitute.org/runs/code/firehose_get_latest.zip -o fh_get.zip
!unzip fh_get.zip
In [6]:
d = 'http://gdac.broadinstitute.org/runs/analyses__{}/ingested_data.tsv'.format(RUN_DATE)
# First pass: read the file line by line to find the commented-out header rows
tab = pd.read_table(d, sep='\n', header=None)
skip = tab[0].dropna().apply(lambda s: s.startswith('#'))
skip = list(skip[skip == True].index)
# Second pass: read the table for real, skipping the comment rows
tab = pd.read_table(d, skiprows=skip, index_col=0).dropna()
In [8]:
cancers = tab[tab.Clinical>0].index[:-1]
cancer_string = ' '.join(cancers)
cancer_string
Out[8]:
Firehose caps downloads at roughly 1 MB/s. To speed things up, I send parallel requests, one for each cancer type.
In [52]:
analysis = ['firehose_get -b analyses {} {} > tmp_{} &'.format(RUN_DATE, c, c)
            for c in cancers]
data_types = ['miR_gene_expression', 'RSEM_genes_normalized', 'protein_normalization',
              'clinical']
#data_types += ['humanmethylation450']
stddata = ['firehose_get -b -o {} stddata {} {} > tmp_{}_{} &'.format(r, RUN_DATE, c, r, c)
           for c in cancers for r in data_types]
In [53]:
calls = analysis + stddata
script = '\nsleep 1\n'.join(calls)
f = open('script.sh', 'wb')
f.write(script)
f.close()
!bash script.sh
In [87]:
def check_file(f):
    """Return True if the download log f looks like a successful run."""
    text = open(f).read()
    # No 404 error anywhere in the log means the download went through
    if 'ERROR 404: Not Found' not in text:
        return True
    # Otherwise look at the summary line near the end of the log; if more
    # than two files still came through, count the run as a success
    summary = text.split('\n')[-5]
    n_files = int(summary.split(' files')[0].split()[1])
    return n_files > 2
In [95]:
logs = [f for f in os.listdir('.') if f.startswith('tmp')]
failed_runs = [f for f in logs if not check_file(f)]
In [108]:
recall = [c for c in calls for f in failed_runs if f + ' ' in c]
len(recall), len(calls)
Out[108]:
In [110]:
script = '\n'.join(recall)
f = open('script.sh', 'wb')
f.write(script)
f.close()
!bash script.sh
There is no going back from here, so check your data to make sure everything was downloaded correctly before deleting the logs and helper scripts.
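As a quick sanity check before the cleanup, one option (a sketch built from the variables defined above, not part of the original pipeline) is to confirm that no download logs still look failed and that the expected Firehose folders exist:

# Re-check the download logs and confirm the Firehose run folders were created.
remaining_failures = [f for f in os.listdir('.')
                      if f.startswith('tmp') and not check_file(f)]
print('Logs still failing: %s' % remaining_failures)
print('analyses folder present: %s' % os.path.isdir('analyses__' + RUN_DATE))
print('stddata folder present: %s' % os.path.isdir('stddata__' + RUN_DATE))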
In [124]:
!rm -f tmp*
!rm script.sh
In [111]:
!rm fh_get.zip
!rm firehose_get
In [112]:
if not os.path.isdir(OUT_PATH):
    os.makedirs(OUT_PATH)
In [115]:
analyses_folder = 'analyses__' + RUN_DATE
!mv $analyses_folder {OUT_PATH + '/' + analyses_folder}
In [116]:
stddata_folder = 'stddata__' + RUN_DATE
!mv $stddata_folder {OUT_PATH + '/' + stddata_folder}
In [118]:
from Processing.ProcessFirehose import process_all_cancers
process_all_cancers(OUT_PATH, RUN_DATE)
Get rid of all of the downloaded zip files now that they have been processed.
In [119]:
!rm -rf {OUT_PATH + '/' + stddata_folder}
!rm -rf {OUT_PATH + '/' + analyses_folder}