Here we download and process most of the data necessary to run this analysis pipeline. I have set up a series of scripts to do this in an automated fashion, both so that others can reproduce this study and so that the results obtained here can be updated as more TCGA data is collected and released.
Downloading this data takes a considerable amount of time (~5 hours) and disk space (~45 GB), so be prepared.
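If you want to confirm there is enough room before kicking things off, a quick free-space check along these lines can help. This is only a sketch, not part of the pipeline; the path and the ~45 GB threshold are illustrative.

import os

# Check free space on the drive where the data will be stored (illustrative path).
st = os.statvfs('.')
free_gb = st.f_bavail * st.f_frsize / float(1024 ** 3)
print('Free space: %.1f GB' % free_gb)
if free_gb < 45:
    print('Warning: less than ~45 GB free, the download may not fit.')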
We use the firehose_get script provided by the Broad to download the data; please see the firehose_get documentation for troubleshooting. Because we rely on the Broad's initial processing pipeline and data formats, we cannot promise that this code will not break with future updates on their end.
Make sure to edit the [Imports Notebook](./Imports) and change __OUT_PATH__ to a directory on your machine where you want to store the data.
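For reference, this notebook relies on __OUT_PATH__ and __RUN_DATE__, which come in below via `from Imports import *`. The values shown here are only placeholders illustrating the expected form, not the settings used for this study.

# Placeholder values -- set these in the Imports notebook, not here.
OUT_PATH = '/path/to/tcga_data'  # directory where downloaded and processed data will live
RUN_DATE = '2014_07_15'          # Firehose run date used in the analyses__/stddata__ URLs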
In [123]:
import NotebookImport
from Imports import *
In [4]:
!curl http://gdac.broadinstitute.org/runs/code/firehose_get_latest.zip -o fh_get.zip
!unzip fh_get.zip
In [6]:
d = 'http://gdac.broadinstitute.org/runs/analyses__{}/ingested_data.tsv'.format(RUN_DATE)
# First pass: read the file line by line to find the commented-out header rows
tab = pd.read_table(d, sep='\n', header=None)
skip = tab[0].dropna().apply(lambda s: s.startswith('#'))
skip = list(skip[skip == True].index)
# Second pass: read the table for real, skipping the comment rows
tab = pd.read_table(d, skiprows=skip, index_col=0).dropna()
In [8]:
cancers = tab[tab.Clinical>0].index[:-1]
cancer_string = ' '.join(cancers)
cancer_string
Out[8]:
Firehose caps downloads at roughly 1 MB/s. To speed things up, I send parallel requests, one for each cancer type.
In [52]:
analysis = ['firehose_get -b analyses {} {} > tmp_{} &'.format(RUN_DATE, c, c)
            for c in cancers]
data_types = ['miR_gene_expression', 'RSEM_genes_normalized', 'protein_normalization',
              'clinical']
#data_types += ['humanmethylation450']
stddata = ['firehose_get -b -o {} stddata {} {} > tmp_{}_{} &'.format(r, RUN_DATE, c, r, c)
           for c in cancers for r in data_types]
In [53]:
calls = analysis + stddata
script = '\nsleep 1\n'.join(calls)
f = open('script.sh', 'wb')
f.write(script)
f.close()
!bash script.sh
In [87]:
def check_file(f):
    """Return True if the download log f looks like a successful run."""
    text = open(f).read()
    # No 404 error anywhere in the log means the download went through
    if 'ERROR 404: Not Found' not in text:
        return True
    # Otherwise look at the summary line near the end of the log; if more
    # than two files still came through, count the run as a success
    summary = text.split('\n')[-5]
    n_files = int(summary.split(' files')[0].split()[1])
    return n_files > 2
In [95]:
logs = [f for f in os.listdir('.') if f.startswith('tmp')]
failed_runs = [f for f in logs if not check_file(f)]
In [108]:
recall = [c for c in calls for f in failed_runs if f + ' ' in c]
len(recall), len(calls)
Out[108]:
In [110]:
script = '\n'.join(recall)
f = open('script.sh', 'wb')
f.write(script)
f.close()
!bash script.sh
There is no going back from here, so check your data to make sure everything was downloaded correctly before deleting the logs and helper scripts.
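As a quick sanity check before the cleanup, one option (a sketch built from the variables defined above, not part of the original pipeline) is to confirm that no download logs still look failed and that the expected Firehose folders exist:

# Re-check the download logs and confirm the Firehose run folders were created.
remaining_failures = [f for f in os.listdir('.')
                      if f.startswith('tmp') and not check_file(f)]
print('Logs still failing: %s' % remaining_failures)
print('analyses folder present: %s' % os.path.isdir('analyses__' + RUN_DATE))
print('stddata folder present: %s' % os.path.isdir('stddata__' + RUN_DATE))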
In [124]:
!rm -f tmp*
!rm script.sh
In [111]:
!rm fh_get.zip
!rm firehose_get
In [112]:
if not os.path.isdir(OUT_PATH):
    os.makedirs(OUT_PATH)
In [115]:
analyses_folder = 'analyses__' + RUN_DATE
!mv $analyses_folder {OUT_PATH + '/' + analyses_folder}
In [116]:
stddata_folder = 'stddata__' + RUN_DATE
!mv $stddata_folder {OUT_PATH + '/' + stddata_folder}
In [118]:
from Processing.ProcessFirehose import process_all_cancers
process_all_cancers(OUT_PATH, RUN_DATE)
Get rid of all of the downloaded zip files now that they have been processed.
In [119]:
!rm -rf {OUT_PATH + '/' + stddata_folder}
!rm -rf {OUT_PATH + '/' + analyses_folder}