Download Data Get All Available MAF Files from TCGA Data Portal



In [1]:

    
import NotebookImport
from Imports import *









    




importing IPython notebook from Imports






    



Populating the interactive namespace from numpy and matplotlib
changing to source dirctory



In [2]:

    
from bs4 import BeautifulSoup
from urllib2 import HTTPError

GLOBAL VARIABLE WARNING

Here I download updated clinical data from the TCGA Data Portal. This is a secure site which uses HTTPS. I had to give it a path to my ca-cert for the download to work. Download a copy of a generic cacert.pem [here](http://curl.haxx.se/ca/cacert.pem).



In [3]:

    
PATH_TO_CACERT = '/cellar/users/agross/cacert.pem'

Download most recent files from MAF dashboard



In [4]:

    
out_path = OUT_PATH + '/MAFs_new_2/'
if not os.path.isdir(out_path):
    os.makedirs(out_path)



In [5]:

    
maf_dashboard = 'https://confluence.broadinstitute.org/display/GDAC/MAF+Dashboard'



In [6]:

    
!curl --cacert $PATH_TO_CACERT $maf_dashboard -o tmp.html









    



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  137k    0  137k    0     0  98014      0 --:--:--  0:00:01 --:--:-- 97991

Use BeutifulSoup to parse out all of the links in the table



In [7]:

    
f = open('tmp.html', 'rb').read()
soup = BeautifulSoup(f)



In [8]:

    
r = [l.get('href') for l in soup.find_all('a')
   if l.get('href') != None
   and '.maf' in l.get('href')]

Download all of the MAFs by following the links

This takes a while, as I'm downloading all of the data.
I read in the table first to count the number of comment lines and a second time to actuall load the data.
Yes there is likely a more efficient way to do this, but I'm waiting on https://github.com/pydata/pandas/issues/2685



In [11]:

    
t = pd.read_table(f, nrows=10, sep='not_real_term', header=None, squeeze=True,
                          engine='python')



In [54]:

    
cols = ['Hugo_Symbol', 'NCBI_Build', 'Chromosome', 'Start_position', 
        'End_position', 'Strand', 'Reference_Allele', 
        'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2',
        'Tumor_Sample_Barcode', 'Protein_Change',
        'Variant_Classification','Variant_Type']



In [55]:

    
maf = {}
for f in r:
    try:
        t = pd.read_table(f, nrows=10, sep='not_real_term', header=None, 
                          squeeze=True,
                          engine='python')
        skip = t.apply(lambda s: s.startswith('#'))
        skip = list(skip[skip==True].index)
        h = pd.read_table(f, header=0, index_col=None, skiprows=skip, 
                  engine='python', nrows=0)
        cc = list(h.columns.intersection(cols))
        maf[f] = pd.read_table(f, header=0, index_col=None,
                               skiprows=skip,
                               engine='c',
                               usecols=cc)
    except HTTPError:
        print f









    



https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/laml/gsc/genome.wustl.edu/illuminaga_dnaseq/mutations/genome.wustl.edu_LAML.IlluminaGA_DNASeq.Level_2.1.2.0/genome.wustl.edu_LAML.IlluminaGA_DNASeq.preliminary.1.maf



In [56]:

    
m2 = pd.concat(maf)
m3 = m2.dropna(axis=1, how='all')

Reduce MAF down to most usefull columns



In [57]:

    
m4 = m3[cols]
m4 = m4.reset_index()
#m4.index = map(lambda s: s.split('/')[-1], m4.index)
m4 = m4.drop_duplicates(subset=['Hugo_Symbol','Tumor_Sample_Barcode','Start_position'])
m4 = m4.reset_index()



In [58]:

    
m4.to_csv(out_path + 'mega_maf.csv')

Get gene by patient mutation count matrix and save



In [59]:

    
m5 = m4.ix[m4.Variant_Classification != 'Silent']
cc = m5.groupby(['Hugo_Symbol','Tumor_Sample_Barcode']).size()
cc = cc.reset_index()



In [60]:

    
cc.to_csv(out_path + 'meta.csv')



In [62]:

    
cc.shape









    Out[62]:





(1528429, 3)