Using the cBioPortal MS import scripts

To add flexibility to the data import, a library of classes were constructed so that one class was made for every kind of file supported. This also gives more flexibility to support importing new types of mass spectrometry (MS) data. Though all the classes have docstrings, we thought that a Jupyter notebook tutorial was the best way to demonstrate usage cases using the data that is already uploaded from the CPTAC datasets and some test data for MaxQuant files.



In [1]:

    
# import classes
import sys
sys.path.append('../')

from ms2cbioportal import (CDAPiTraqTable, 
                           CDAPPrecursorAreaTable, 
                           MaxQuantProteomeTable, 
                           MaxQuantPTMTable, 
                           MSMeta)

CDAP iTRAQ and Precursor Area

CPTAC uses the Common Data Analysis Pipeline (CDAP), which produces a common file format for MS data analyzed using it. iTRAQ and precursor area are mostly similar in their treatment, but a distinction is made because the two assay types cannot be combined together. Precursor area data needs to be log-transformed in order to fit a normal distribution of values. The sample_regex refers to a regular expression that isolates sample columns. It will probably be necessary to inspect the data to determine what a good regex should be. use_ruler allows the user to specify that they would like an estimate of protein copy number per cell instead of just the iTRAQ or precursor area values.



In [2]:

    
# ptm_type is None if it's just proteome quantification
table_prot = CDAPiTraqTable(
                 filename='data/TCGA_Breast_BI_Proteome_CDAP.r2.itraq.tsv',
                 sample_regex='([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Log Ratio',
                 ptm_type=None,
                 use_ruler=False)

# view the data
table_prot.head()









    Out[2]:






  
    
      
      A2-A0D0-01A Log Ratio
      BH-A0HK-01A Log Ratio
      C8-A12T-01A Log Ratio
      A2-A0D2-01A Log Ratio
      C8-A12U-01A Log Ratio
      AR-A1AS-01A Log Ratio
      A2-A0EV-01A Log Ratio
      AN-A0AM-01A Log Ratio
      D8-A142-01A Log Ratio
      A2-A0SW-01A Log Ratio
      ...
      BH-A0BZ-01A Log Ratio
      D8-A13Y-01A Log Ratio
      A8-A076-01A Log Ratio
      AO-A126-01A Log Ratio
      E2-A10A-01A Log Ratio
      BH-A18Q-01A Log Ratio
      C8-A130-01A Log Ratio
      E2-A159-01A Log Ratio
      A2-A0T3-01A Log Ratio
      A2-A0YD-01A Log Ratio
    
    
      Hugo_Symbol
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      A1BG|A1BG
      0.815411
      1.452402
      0.325653
      0.012612
      -0.093688
      -0.294707
      0.729201
      -0.381916
      0.472856
      0.249276
      ...
      0.912297
      0.553538
      -0.740000
      -0.004677
      0.234698
      0.078452
      0.091140
      0.804976
      0.020051
      1.009284
    
    
      A2M|A2M
      0.753806
      0.954409
      0.002935
      -0.063085
      0.078432
      -0.538026
      0.496959
      -0.822142
      0.210174
      0.315397
      ...
      0.887778
      -0.206859
      -0.722060
      -0.820011
      0.375134
      0.007955
      -0.537951
      0.608922
      -0.618751
      0.413809
    
    
      A2ML1|A2ML1
      -1.619107
      -0.568099
      -1.043894
      -0.404392
      -0.479052
      -0.790740
      0.995787
      -0.973861
      -1.576598
      -0.842791
      ...
      -1.328969
      -1.804258
      -1.079000
      0.246278
      -0.763648
      0.779338
      -0.578102
      3.306430
      -0.252158
      -0.115476
    
    
      AAAS|AAAS
      1.429531
      0.704430
      0.171455
      0.380669
      -0.140444
      0.159832
      0.194428
      -0.001685
      0.112209
      -0.393742
      ...
      0.725156
      0.659645
      0.056929
      0.168459
      0.925411
      0.252361
      -0.048593
      0.945814
      0.087038
      0.163201
    
    
      AACS|AACS
      -0.141402
      -0.529805
      -0.125071
      -0.541923
      -0.052916
      -0.158630
      -0.305420
      -0.559738
      0.082587
      -0.577913
      ...
      -0.421893
      -0.034974
      0.248760
      0.447356
      -0.764068
      -0.382885
      -0.384127
      -0.948860
      -0.547481
      -0.477441
    
  

5 rows × 102 columns



In [3]:

    
# we want to rename the table columns and write to file
table_prot.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Log Ratio', ''))
print(table_prot.columns)









    



Index(['TCGA-A2-A0D0-01A', 'TCGA-BH-A0HK-01A', 'TCGA-C8-A12T-01A',
       'TCGA-A2-A0D2-01A', 'TCGA-C8-A12U-01A', 'TCGA-AR-A1AS-01A',
       'TCGA-A2-A0EV-01A', 'TCGA-AN-A0AM-01A', 'TCGA-D8-A142-01A',
       'TCGA-A2-A0SW-01A',
       ...
       'TCGA-BH-A0BZ-01A', 'TCGA-D8-A13Y-01A', 'TCGA-A8-A076-01A',
       'TCGA-AO-A126-01A', 'TCGA-E2-A10A-01A', 'TCGA-BH-A18Q-01A',
       'TCGA-C8-A130-01A', 'TCGA-E2-A159-01A', 'TCGA-A2-A0T3-01A',
       'TCGA-A2-A0YD-01A'],
      dtype='object', length=102)



In [4]:

    
# now load in a phosphoproteome dataset that's also from breast cancer
# NOTE: ptm_type is now 'P' for phosphoprotein
table_ptm = CDAPiTraqTable(
                filename='data/TCGA_Breast_BI_Phosphoproteome.phosphosite.itraq.tsv',
                sample_regex='([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Log Ratio',
                ptm_type='P',
                use_ruler=False)

table_ptm.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Log Ratio', ''))

# combine the proteome and phosphoproteome data
table_prot.vertical_concat(table_ptm)

# write to file
table_prot.write_csv('data_breast_protein_level.txt')



In [5]:

    
# create metadata using ids supplied by cBioPortal
meta = MSMeta(cancer_id='brca_tcga', 
              prot_file='data_breast_protein_level.txt')

meta.write('meta_breast_protein_level.txt')

# take a look
meta









    Out[5]:





cancer_study_identifier: brca_tcga
genetic_alteration_type: PROTEIN_LEVEL
datatype: Z-SCORE
stable_id: protein_quantification
profile_description: Protein Quantification (Mass Spec)
show_profile_in_analysis_tab: true
profile_name: Protein levels (mass spectrometry by CPTAC)
data_filename: data_breast_protein_level.txt



In [6]:

    
# precursor area data works in almost exactly the same way, but the values happen to be log-transformed internally.
# NOTE: the sample_regex has changed to reflect the different data format
table_prot = CDAPPrecursorAreaTable(
                 filename='data/TCGA_Colon_VU_Proteome_CDAP.r2.precursor_area.tsv',
                 sample_regex='([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Area',
                 ptm_type=None,
                 use_ruler=False)

table_prot.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Area', ''))

# write to file
table_prot.write_csv('data_colorectal_protein_level.txt')

# create metadata using ids supplied by cBioPortal
meta = MSMeta(cancer_id='coadread_tcga', prot_file='data_colorectal_protein_level.txt')
meta.write('meta_colorectal_protein_level.txt')

# compare values to cell below
table_prot.head()









    Out[6]:






  
    
      
      TCGA-A6-3807-01A-22
      TCGA-A6-3808-01A-22
      TCGA-A6-3810-01A-22
      TCGA-AA-3518-01A-11
      TCGA-AA-3525-01A-12
      TCGA-AA-3526-01A-11
      TCGA-AA-3529-01A-12
      TCGA-AA-3531-01A-22
      TCGA-AA-3534-01A-22
      TCGA-AA-3552-01A-22
      ...
      TCGA-AG-A01L-01A-22
      TCGA-AG-A01N-01A-23
      TCGA-AG-A01W-01A-23
      TCGA-AG-A01Y-01A-43
      TCGA-AG-A020-01A-23
      TCGA-AG-A026-01A-71
      TCGA-AG-A02N-01A-31
      TCGA-AG-A02X-01A-32
      TCGA-AG-A032-01A-31
      TCGA-AG-A036-01A-22
    
    
      Hugo_Symbol
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      A1BG|A1BG
      25.123140
      33.397088
      26.501822
      31.858215
      25.527911
      25.654713
      27.108381
      27.227353
      27.973242
      26.450790
      ...
      25.780401
      25.376286
      26.112949
      25.863284
      28.069504
      26.078348
      25.347789
      26.881424
      23.976052
      24.682132
    
    
      A1CF|A1CF
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      26.198012
      0.000000
      26.298682
      0.000000
      0.000000
      ...
      0.000000
      24.525723
      23.518646
      0.000000
      0.000000
      23.304387
      0.000000
      0.000000
      0.000000
      20.638801
    
    
      A2M|A2M
      31.903124
      33.052010
      29.608782
      33.067037
      30.257659
      30.454480
      30.716812
      31.392235
      31.067008
      30.263586
      ...
      29.473984
      28.856287
      29.613190
      29.269172
      30.239735
      28.833491
      29.524423
      29.919089
      29.395016
      27.834534
    
    
      AAAS|AAAS
      24.048462
      0.000000
      22.112993
      23.396796
      22.496118
      24.397320
      23.586470
      26.785921
      0.000000
      24.999464
      ...
      22.917412
      24.519118
      21.476712
      0.000000
      25.563707
      23.156966
      25.031882
      24.384520
      23.286313
      20.465527
    
    
      AACS|AACS
      24.339778
      0.000000
      0.000000
      21.849554
      0.000000
      0.000000
      23.245774
      0.000000
      0.000000
      23.670237
      ...
      22.211652
      0.000000
      0.000000
      0.000000
      22.746999
      0.000000
      24.070055
      0.000000
      0.000000
      20.590727
    
  

5 rows × 95 columns



In [7]:

    
# for intensity values, we have included an optional transformation to convert values into
# estimates of copy number per cell using the proteomic ruler method published by 
# Wiśniewski et al. 2014
# for precursor area, values are only log-tranformed if `use_ruler` is False.
table_prot = CDAPPrecursorAreaTable(
                 filename='data/TCGA_Colon_VU_Proteome_CDAP.r2.precursor_area.tsv',
                 sample_regex='([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Area',
                 ptm_type=None,
                 use_ruler=True)

table_prot.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Area', ''))

table_prot.head()









    Out[7]:






  
    
      
      TCGA-A6-3807-01A-22
      TCGA-A6-3808-01A-22
      TCGA-A6-3810-01A-22
      TCGA-AA-3518-01A-11
      TCGA-AA-3525-01A-12
      TCGA-AA-3526-01A-11
      TCGA-AA-3529-01A-12
      TCGA-AA-3531-01A-22
      TCGA-AA-3534-01A-22
      TCGA-AA-3552-01A-22
      ...
      TCGA-AG-A01L-01A-22
      TCGA-AG-A01N-01A-23
      TCGA-AG-A01W-01A-23
      TCGA-AG-A01Y-01A-43
      TCGA-AG-A020-01A-23
      TCGA-AG-A026-01A-71
      TCGA-AG-A02N-01A-31
      TCGA-AG-A02X-01A-32
      TCGA-AG-A032-01A-31
      TCGA-AG-A036-01A-22
    
    
      Hugo_Symbol
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      A1BG|A1BG
      11618.293962
      5.113226e+06
      48464.401324
      765507.172445
      26470.367578
      29441.721494
      39964.052480
      29143.050170
      122291.429794
      39225.671665
      ...
      30650.355310
      15933.270771
      54353.180075
      64386.862028
      133693.451626
      29065.305541
      18868.985000
      78012.452445
      8107.327695
      36492.499234
    
    
      A1CF|A1CF
      0.000000
      0.000000e+00
      0.000000
      0.000000
      0.000000
      34328.853041
      0.000000
      12249.658328
      0.000000
      0.000000
      ...
      0.000000
      7069.793982
      7201.261324
      0.000000
      0.000000
      3399.980452
      0.000000
      0.000000
      0.000000
      1770.868334
    
    
      A2M|A2M
      412243.582815
      1.299718e+06
      134816.720316
      571312.543625
      226771.734144
      264770.733219
      157379.614328
      168780.551083
      337089.533322
      177979.275129
      ...
      128040.692953
      57401.411434
      198579.927878
      220345.712032
      194288.748919
      63356.132822
      110172.581175
      206835.544485
      111990.572481
      104762.109012
    
    
      AAAS|AAAS
      5143.527433
      0.000000e+00
      2157.159157
      2025.056577
      3018.061253
      11483.624217
      3244.123942
      20011.497495
      0.000000
      13375.393524
      ...
      3928.427080
      8201.636272
      2038.013916
      0.000000
      21949.193915
      3577.503046
      14134.471177
      12886.941830
      4686.777426
      1830.237202
    
    
      AACS|AACS
      4973.216105
      0.000000e+00
      0.000000
      547.459804
      0.000000
      0.000000
      2024.038836
      0.000000
      0.000000
      4205.821590
      ...
      1903.028462
      0.000000
      0.000000
      0.000000
      2461.423616
      0.000000
      5733.531518
      0.000000
      0.000000
      1577.167517
    
  

5 rows × 95 columns

MaxQuant proteome and PTM data

Since MaxQuant is such a popular software for analyzing quantitative MS data, we decided to include processing support for it, even though CPTAC does not currently use it for their data. This would be useful for local institutional builds of cBioPortal that want to import their own MaxQuant data. Unlike with CDAP data, separate tables are created for MaxQuant proteome and MaxQuant PTM, simply because their file formats are so divergent. Also unlike CDAP data, it makes sense to combine the proteome and PTM quantification data into one file.



In [8]:

    
table_prot = MaxQuantProteomeTable(filename='data/proteinGroups.txt',
                                   sample_regex='Intensity std[0-9]+$',
                                   ptm_type=None,
                                   use_ruler=False)

table_prot.rename_columns(renamer=lambda col: col.replace('Intensity ', ''))

# take a look
table_prot.head()









    Out[8]:






  
    
      
      std00
      std01
      std02
      std03
      std04
      std05
      std06
      std07
      std08
      std09
      std10
      std11
    
    
      Hugo_Symbol
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      PSMA2|PSMA2
      0
      0
      0
      0
      1105400
      0
      955260
      0
      661570
      195870
      0
      0
    
    
      DNAH8|DNAH8
      531360
      0
      0
      285290
      0
      0
      0
      0
      459770
      0
      0
      0
    
    
      CAPNS1|CAPNS1
      0
      0
      0
      0
      367960
      0
      305730
      0
      135990
      0
      0
      0
    
    
      DCP1A|DCP1A
      0
      0
      0
      0
      0
      0
      0
      0
      77772
      0
      0
      0
    
    
      HSPA4|HSPA4
      0
      0
      0
      0
      0
      0
      781410
      0
      0
      239030
      0
      0



In [9]:

    
table_ptm = MaxQuantPTMTable(filename='data/Oxidation (M)Sites.txt',
                             sample_regex='Intensity std[0-9]+$',
                             ptm_type='O',
                             use_ruler=False)

table_ptm.rename_columns(renamer=lambda col: col.replace('Intensity ', ''))

# combine the two datasets
table_prot.vertical_concat(table_ptm)

# NOTE: horizontal_concat exists to pool samples
# while vertical_concat pools gene and PTM IDs

# write to file
table_prot.write_csv('data_maxquant_test.txt')



In [ ]:

	A2-A0D0-01A Log Ratio	BH-A0HK-01A Log Ratio	C8-A12T-01A Log Ratio	A2-A0D2-01A Log Ratio	C8-A12U-01A Log Ratio	AR-A1AS-01A Log Ratio	A2-A0EV-01A Log Ratio	AN-A0AM-01A Log Ratio	D8-A142-01A Log Ratio	A2-A0SW-01A Log Ratio	...	BH-A0BZ-01A Log Ratio	D8-A13Y-01A Log Ratio	A8-A076-01A Log Ratio	AO-A126-01A Log Ratio	E2-A10A-01A Log Ratio	BH-A18Q-01A Log Ratio	C8-A130-01A Log Ratio	E2-A159-01A Log Ratio	A2-A0T3-01A Log Ratio	A2-A0YD-01A Log Ratio
Hugo_Symbol
A1BG\|A1BG	0.815411	1.452402	0.325653	0.012612	-0.093688	-0.294707	0.729201	-0.381916	0.472856	0.249276	...	0.912297	0.553538	-0.740000	-0.004677	0.234698	0.078452	0.091140	0.804976	0.020051	1.009284
A2M\|A2M	0.753806	0.954409	0.002935	-0.063085	0.078432	-0.538026	0.496959	-0.822142	0.210174	0.315397	...	0.887778	-0.206859	-0.722060	-0.820011	0.375134	0.007955	-0.537951	0.608922	-0.618751	0.413809
A2ML1\|A2ML1	-1.619107	-0.568099	-1.043894	-0.404392	-0.479052	-0.790740	0.995787	-0.973861	-1.576598	-0.842791	...	-1.328969	-1.804258	-1.079000	0.246278	-0.763648	0.779338	-0.578102	3.306430	-0.252158	-0.115476
AAAS\|AAAS	1.429531	0.704430	0.171455	0.380669	-0.140444	0.159832	0.194428	-0.001685	0.112209	-0.393742	...	0.725156	0.659645	0.056929	0.168459	0.925411	0.252361	-0.048593	0.945814	0.087038	0.163201
AACS\|AACS	-0.141402	-0.529805	-0.125071	-0.541923	-0.052916	-0.158630	-0.305420	-0.559738	0.082587	-0.577913	...	-0.421893	-0.034974	0.248760	0.447356	-0.764068	-0.382885	-0.384127	-0.948860	-0.547481	-0.477441

	TCGA-A6-3807-01A-22	TCGA-A6-3808-01A-22	TCGA-A6-3810-01A-22	TCGA-AA-3518-01A-11	TCGA-AA-3525-01A-12	TCGA-AA-3526-01A-11	TCGA-AA-3529-01A-12	TCGA-AA-3531-01A-22	TCGA-AA-3534-01A-22	TCGA-AA-3552-01A-22	...	TCGA-AG-A01L-01A-22	TCGA-AG-A01N-01A-23	TCGA-AG-A01W-01A-23	TCGA-AG-A01Y-01A-43	TCGA-AG-A020-01A-23	TCGA-AG-A026-01A-71	TCGA-AG-A02N-01A-31	TCGA-AG-A02X-01A-32	TCGA-AG-A032-01A-31	TCGA-AG-A036-01A-22
Hugo_Symbol
A1BG\|A1BG	25.123140	33.397088	26.501822	31.858215	25.527911	25.654713	27.108381	27.227353	27.973242	26.450790	...	25.780401	25.376286	26.112949	25.863284	28.069504	26.078348	25.347789	26.881424	23.976052	24.682132
A1CF\|A1CF	0.000000	0.000000	0.000000	0.000000	0.000000	26.198012	0.000000	26.298682	0.000000	0.000000	...	0.000000	24.525723	23.518646	0.000000	0.000000	23.304387	0.000000	0.000000	0.000000	20.638801
A2M\|A2M	31.903124	33.052010	29.608782	33.067037	30.257659	30.454480	30.716812	31.392235	31.067008	30.263586	...	29.473984	28.856287	29.613190	29.269172	30.239735	28.833491	29.524423	29.919089	29.395016	27.834534
AAAS\|AAAS	24.048462	0.000000	22.112993	23.396796	22.496118	24.397320	23.586470	26.785921	0.000000	24.999464	...	22.917412	24.519118	21.476712	0.000000	25.563707	23.156966	25.031882	24.384520	23.286313	20.465527
AACS\|AACS	24.339778	0.000000	0.000000	21.849554	0.000000	0.000000	23.245774	0.000000	0.000000	23.670237	...	22.211652	0.000000	0.000000	0.000000	22.746999	0.000000	24.070055	0.000000	0.000000	20.590727

	TCGA-A6-3807-01A-22	TCGA-A6-3808-01A-22	TCGA-A6-3810-01A-22	TCGA-AA-3518-01A-11	TCGA-AA-3525-01A-12	TCGA-AA-3526-01A-11	TCGA-AA-3529-01A-12	TCGA-AA-3531-01A-22	TCGA-AA-3534-01A-22	TCGA-AA-3552-01A-22	...	TCGA-AG-A01L-01A-22	TCGA-AG-A01N-01A-23	TCGA-AG-A01W-01A-23	TCGA-AG-A01Y-01A-43	TCGA-AG-A020-01A-23	TCGA-AG-A026-01A-71	TCGA-AG-A02N-01A-31	TCGA-AG-A02X-01A-32	TCGA-AG-A032-01A-31	TCGA-AG-A036-01A-22
Hugo_Symbol
A1BG\|A1BG	11618.293962	5.113226e+06	48464.401324	765507.172445	26470.367578	29441.721494	39964.052480	29143.050170	122291.429794	39225.671665	...	30650.355310	15933.270771	54353.180075	64386.862028	133693.451626	29065.305541	18868.985000	78012.452445	8107.327695	36492.499234
A1CF\|A1CF	0.000000	0.000000e+00	0.000000	0.000000	0.000000	34328.853041	0.000000	12249.658328	0.000000	0.000000	...	0.000000	7069.793982	7201.261324	0.000000	0.000000	3399.980452	0.000000	0.000000	0.000000	1770.868334
A2M\|A2M	412243.582815	1.299718e+06	134816.720316	571312.543625	226771.734144	264770.733219	157379.614328	168780.551083	337089.533322	177979.275129	...	128040.692953	57401.411434	198579.927878	220345.712032	194288.748919	63356.132822	110172.581175	206835.544485	111990.572481	104762.109012
AAAS\|AAAS	5143.527433	0.000000e+00	2157.159157	2025.056577	3018.061253	11483.624217	3244.123942	20011.497495	0.000000	13375.393524	...	3928.427080	8201.636272	2038.013916	0.000000	21949.193915	3577.503046	14134.471177	12886.941830	4686.777426	1830.237202
AACS\|AACS	4973.216105	0.000000e+00	0.000000	547.459804	0.000000	0.000000	2024.038836	0.000000	0.000000	4205.821590	...	1903.028462	0.000000	0.000000	0.000000	2461.423616	0.000000	5733.531518	0.000000	0.000000	1577.167517

	std00	std01	std02	std03	std04	std05	std06	std07	std08	std09	std10	std11
Hugo_Symbol
PSMA2\|PSMA2	0	0	0	0	1105400	0	955260	0	661570	195870	0	0
DNAH8\|DNAH8	531360	0	0	285290	0	0	0	0	459770	0	0	0
CAPNS1\|CAPNS1	0	0	0	0	367960	0	305730	0	135990	0	0	0
DCP1A\|DCP1A	0	0	0	0	0	0	0	0	77772	0	0	0
HSPA4\|HSPA4	0	0	0	0	0	0	781410	0	0	239030	0	0