Using the cBioPortal MS import scripts

To add flexibility to the data import, a library of classes were constructed so that one class was made for every kind of file supported. This also gives more flexibility to support importing new types of mass spectrometry (MS) data. Though all the classes have docstrings, we thought that a Jupyter notebook tutorial was the best way to demonstrate usage cases using the data that is already uploaded from the CPTAC datasets and some test data for MaxQuant files.


In [1]:
# import classes
import sys
sys.path.append('../')

from ms2cbioportal import (CDAPiTraqTable, 
                           CDAPPrecursorAreaTable, 
                           MaxQuantProteomeTable, 
                           MaxQuantPTMTable, 
                           MSMeta)

CDAP iTRAQ and Precursor Area

CPTAC uses the Common Data Analysis Pipeline (CDAP), which produces a common file format for MS data analyzed using it. iTRAQ and precursor area are mostly similar in their treatment, but a distinction is made because the two assay types cannot be combined together. Precursor area data needs to be log-transformed in order to fit a normal distribution of values. The sample_regex refers to a regular expression that isolates sample columns. It will probably be necessary to inspect the data to determine what a good regex should be. use_ruler allows the user to specify that they would like an estimate of protein copy number per cell instead of just the iTRAQ or precursor area values.


In [2]:
# ptm_type is None if it's just proteome quantification
table_prot = CDAPiTraqTable(
                 filename='data/TCGA_Breast_BI_Proteome_CDAP.r2.itraq.tsv',
                 sample_regex='([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Log Ratio',
                 ptm_type=None,
                 use_ruler=False)

# view the data
table_prot.head()


Out[2]:
A2-A0D0-01A Log Ratio BH-A0HK-01A Log Ratio C8-A12T-01A Log Ratio A2-A0D2-01A Log Ratio C8-A12U-01A Log Ratio AR-A1AS-01A Log Ratio A2-A0EV-01A Log Ratio AN-A0AM-01A Log Ratio D8-A142-01A Log Ratio A2-A0SW-01A Log Ratio ... BH-A0BZ-01A Log Ratio D8-A13Y-01A Log Ratio A8-A076-01A Log Ratio AO-A126-01A Log Ratio E2-A10A-01A Log Ratio BH-A18Q-01A Log Ratio C8-A130-01A Log Ratio E2-A159-01A Log Ratio A2-A0T3-01A Log Ratio A2-A0YD-01A Log Ratio
Hugo_Symbol
A1BG|A1BG 0.815411 1.452402 0.325653 0.012612 -0.093688 -0.294707 0.729201 -0.381916 0.472856 0.249276 ... 0.912297 0.553538 -0.740000 -0.004677 0.234698 0.078452 0.091140 0.804976 0.020051 1.009284
A2M|A2M 0.753806 0.954409 0.002935 -0.063085 0.078432 -0.538026 0.496959 -0.822142 0.210174 0.315397 ... 0.887778 -0.206859 -0.722060 -0.820011 0.375134 0.007955 -0.537951 0.608922 -0.618751 0.413809
A2ML1|A2ML1 -1.619107 -0.568099 -1.043894 -0.404392 -0.479052 -0.790740 0.995787 -0.973861 -1.576598 -0.842791 ... -1.328969 -1.804258 -1.079000 0.246278 -0.763648 0.779338 -0.578102 3.306430 -0.252158 -0.115476
AAAS|AAAS 1.429531 0.704430 0.171455 0.380669 -0.140444 0.159832 0.194428 -0.001685 0.112209 -0.393742 ... 0.725156 0.659645 0.056929 0.168459 0.925411 0.252361 -0.048593 0.945814 0.087038 0.163201
AACS|AACS -0.141402 -0.529805 -0.125071 -0.541923 -0.052916 -0.158630 -0.305420 -0.559738 0.082587 -0.577913 ... -0.421893 -0.034974 0.248760 0.447356 -0.764068 -0.382885 -0.384127 -0.948860 -0.547481 -0.477441

5 rows × 102 columns


In [3]:
# we want to rename the table columns and write to file
table_prot.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Log Ratio', ''))
print(table_prot.columns)


Index(['TCGA-A2-A0D0-01A', 'TCGA-BH-A0HK-01A', 'TCGA-C8-A12T-01A',
       'TCGA-A2-A0D2-01A', 'TCGA-C8-A12U-01A', 'TCGA-AR-A1AS-01A',
       'TCGA-A2-A0EV-01A', 'TCGA-AN-A0AM-01A', 'TCGA-D8-A142-01A',
       'TCGA-A2-A0SW-01A',
       ...
       'TCGA-BH-A0BZ-01A', 'TCGA-D8-A13Y-01A', 'TCGA-A8-A076-01A',
       'TCGA-AO-A126-01A', 'TCGA-E2-A10A-01A', 'TCGA-BH-A18Q-01A',
       'TCGA-C8-A130-01A', 'TCGA-E2-A159-01A', 'TCGA-A2-A0T3-01A',
       'TCGA-A2-A0YD-01A'],
      dtype='object', length=102)

In [4]:
# now load in a phosphoproteome dataset that's also from breast cancer
# NOTE: ptm_type is now 'P' for phosphoprotein
table_ptm = CDAPiTraqTable(
                filename='data/TCGA_Breast_BI_Phosphoproteome.phosphosite.itraq.tsv',
                sample_regex='([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Log Ratio',
                ptm_type='P',
                use_ruler=False)

table_ptm.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Log Ratio', ''))

# combine the proteome and phosphoproteome data
table_prot.vertical_concat(table_ptm)

# write to file
table_prot.write_csv('data_breast_protein_level.txt')

In [5]:
# create metadata using ids supplied by cBioPortal
meta = MSMeta(cancer_id='brca_tcga', 
              prot_file='data_breast_protein_level.txt')

meta.write('meta_breast_protein_level.txt')

# take a look
meta


Out[5]:
cancer_study_identifier: brca_tcga
genetic_alteration_type: PROTEIN_LEVEL
datatype: Z-SCORE
stable_id: protein_quantification
profile_description: Protein Quantification (Mass Spec)
show_profile_in_analysis_tab: true
profile_name: Protein levels (mass spectrometry by CPTAC)
data_filename: data_breast_protein_level.txt

In [6]:
# precursor area data works in almost exactly the same way, but the values happen to be log-transformed internally.
# NOTE: the sample_regex has changed to reflect the different data format
table_prot = CDAPPrecursorAreaTable(
                 filename='data/TCGA_Colon_VU_Proteome_CDAP.r2.precursor_area.tsv',
                 sample_regex='([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Area',
                 ptm_type=None,
                 use_ruler=False)

table_prot.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Area', ''))

# write to file
table_prot.write_csv('data_colorectal_protein_level.txt')

# create metadata using ids supplied by cBioPortal
meta = MSMeta(cancer_id='coadread_tcga', prot_file='data_colorectal_protein_level.txt')
meta.write('meta_colorectal_protein_level.txt')

# compare values to cell below
table_prot.head()


Out[6]:
TCGA-A6-3807-01A-22 TCGA-A6-3808-01A-22 TCGA-A6-3810-01A-22 TCGA-AA-3518-01A-11 TCGA-AA-3525-01A-12 TCGA-AA-3526-01A-11 TCGA-AA-3529-01A-12 TCGA-AA-3531-01A-22 TCGA-AA-3534-01A-22 TCGA-AA-3552-01A-22 ... TCGA-AG-A01L-01A-22 TCGA-AG-A01N-01A-23 TCGA-AG-A01W-01A-23 TCGA-AG-A01Y-01A-43 TCGA-AG-A020-01A-23 TCGA-AG-A026-01A-71 TCGA-AG-A02N-01A-31 TCGA-AG-A02X-01A-32 TCGA-AG-A032-01A-31 TCGA-AG-A036-01A-22
Hugo_Symbol
A1BG|A1BG 25.123140 33.397088 26.501822 31.858215 25.527911 25.654713 27.108381 27.227353 27.973242 26.450790 ... 25.780401 25.376286 26.112949 25.863284 28.069504 26.078348 25.347789 26.881424 23.976052 24.682132
A1CF|A1CF 0.000000 0.000000 0.000000 0.000000 0.000000 26.198012 0.000000 26.298682 0.000000 0.000000 ... 0.000000 24.525723 23.518646 0.000000 0.000000 23.304387 0.000000 0.000000 0.000000 20.638801
A2M|A2M 31.903124 33.052010 29.608782 33.067037 30.257659 30.454480 30.716812 31.392235 31.067008 30.263586 ... 29.473984 28.856287 29.613190 29.269172 30.239735 28.833491 29.524423 29.919089 29.395016 27.834534
AAAS|AAAS 24.048462 0.000000 22.112993 23.396796 22.496118 24.397320 23.586470 26.785921 0.000000 24.999464 ... 22.917412 24.519118 21.476712 0.000000 25.563707 23.156966 25.031882 24.384520 23.286313 20.465527
AACS|AACS 24.339778 0.000000 0.000000 21.849554 0.000000 0.000000 23.245774 0.000000 0.000000 23.670237 ... 22.211652 0.000000 0.000000 0.000000 22.746999 0.000000 24.070055 0.000000 0.000000 20.590727

5 rows × 95 columns


In [7]:
# for intensity values, we have included an optional transformation to convert values into
# estimates of copy number per cell using the proteomic ruler method published by 
# Wiśniewski et al. 2014
# for precursor area, values are only log-tranformed if `use_ruler` is False.
table_prot = CDAPPrecursorAreaTable(
                 filename='data/TCGA_Colon_VU_Proteome_CDAP.r2.precursor_area.tsv',
                 sample_regex='([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Area',
                 ptm_type=None,
                 use_ruler=True)

table_prot.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Area', ''))

table_prot.head()


Out[7]:
TCGA-A6-3807-01A-22 TCGA-A6-3808-01A-22 TCGA-A6-3810-01A-22 TCGA-AA-3518-01A-11 TCGA-AA-3525-01A-12 TCGA-AA-3526-01A-11 TCGA-AA-3529-01A-12 TCGA-AA-3531-01A-22 TCGA-AA-3534-01A-22 TCGA-AA-3552-01A-22 ... TCGA-AG-A01L-01A-22 TCGA-AG-A01N-01A-23 TCGA-AG-A01W-01A-23 TCGA-AG-A01Y-01A-43 TCGA-AG-A020-01A-23 TCGA-AG-A026-01A-71 TCGA-AG-A02N-01A-31 TCGA-AG-A02X-01A-32 TCGA-AG-A032-01A-31 TCGA-AG-A036-01A-22
Hugo_Symbol
A1BG|A1BG 11618.293962 5.113226e+06 48464.401324 765507.172445 26470.367578 29441.721494 39964.052480 29143.050170 122291.429794 39225.671665 ... 30650.355310 15933.270771 54353.180075 64386.862028 133693.451626 29065.305541 18868.985000 78012.452445 8107.327695 36492.499234
A1CF|A1CF 0.000000 0.000000e+00 0.000000 0.000000 0.000000 34328.853041 0.000000 12249.658328 0.000000 0.000000 ... 0.000000 7069.793982 7201.261324 0.000000 0.000000 3399.980452 0.000000 0.000000 0.000000 1770.868334
A2M|A2M 412243.582815 1.299718e+06 134816.720316 571312.543625 226771.734144 264770.733219 157379.614328 168780.551083 337089.533322 177979.275129 ... 128040.692953 57401.411434 198579.927878 220345.712032 194288.748919 63356.132822 110172.581175 206835.544485 111990.572481 104762.109012
AAAS|AAAS 5143.527433 0.000000e+00 2157.159157 2025.056577 3018.061253 11483.624217 3244.123942 20011.497495 0.000000 13375.393524 ... 3928.427080 8201.636272 2038.013916 0.000000 21949.193915 3577.503046 14134.471177 12886.941830 4686.777426 1830.237202
AACS|AACS 4973.216105 0.000000e+00 0.000000 547.459804 0.000000 0.000000 2024.038836 0.000000 0.000000 4205.821590 ... 1903.028462 0.000000 0.000000 0.000000 2461.423616 0.000000 5733.531518 0.000000 0.000000 1577.167517

5 rows × 95 columns

MaxQuant proteome and PTM data

Since MaxQuant is such a popular software for analyzing quantitative MS data, we decided to include processing support for it, even though CPTAC does not currently use it for their data. This would be useful for local institutional builds of cBioPortal that want to import their own MaxQuant data. Unlike with CDAP data, separate tables are created for MaxQuant proteome and MaxQuant PTM, simply because their file formats are so divergent. Also unlike CDAP data, it makes sense to combine the proteome and PTM quantification data into one file.


In [8]:
table_prot = MaxQuantProteomeTable(filename='data/proteinGroups.txt',
                                   sample_regex='Intensity std[0-9]+$',
                                   ptm_type=None,
                                   use_ruler=False)

table_prot.rename_columns(renamer=lambda col: col.replace('Intensity ', ''))

# take a look
table_prot.head()


Out[8]:
std00 std01 std02 std03 std04 std05 std06 std07 std08 std09 std10 std11
Hugo_Symbol
PSMA2|PSMA2 0 0 0 0 1105400 0 955260 0 661570 195870 0 0
DNAH8|DNAH8 531360 0 0 285290 0 0 0 0 459770 0 0 0
CAPNS1|CAPNS1 0 0 0 0 367960 0 305730 0 135990 0 0 0
DCP1A|DCP1A 0 0 0 0 0 0 0 0 77772 0 0 0
HSPA4|HSPA4 0 0 0 0 0 0 781410 0 0 239030 0 0

In [9]:
table_ptm = MaxQuantPTMTable(filename='data/Oxidation (M)Sites.txt',
                             sample_regex='Intensity std[0-9]+$',
                             ptm_type='O',
                             use_ruler=False)

table_ptm.rename_columns(renamer=lambda col: col.replace('Intensity ', ''))

# combine the two datasets
table_prot.vertical_concat(table_ptm)

# NOTE: horizontal_concat exists to pool samples
# while vertical_concat pools gene and PTM IDs

# write to file
table_prot.write_csv('data_maxquant_test.txt')

In [ ]: