To make data import flexible, we built a library of classes, one for each supported file type. This design also makes it easier to add support for new types of mass spectrometry (MS) data. Although all the classes have docstrings, we thought a Jupyter notebook tutorial was the best way to demonstrate use cases, using data already uploaded from the CPTAC datasets and some test data for MaxQuant files.
In [1]:
# import classes
import sys
sys.path.append('../')
from ms2cbioportal import (CDAPiTraqTable,
                           CDAPPrecursorAreaTable,
                           MaxQuantProteomeTable,
                           MaxQuantPTMTable,
                           MSMeta)
CPTAC uses the Common Data Analysis Pipeline (CDAP), which produces a common file format for the MS data it analyzes. iTRAQ and precursor area data are handled in mostly the same way, but they get separate classes because the two assay types cannot be combined. Precursor area data needs to be log-transformed so that the values approximate a normal distribution. The sample_regex argument is a regular expression that picks out the sample columns; you will probably need to inspect the data to decide on a good regex. The use_ruler argument lets the user request an estimate of protein copy number per cell instead of just the iTRAQ or precursor area values.
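Before passing a sample_regex to a table class, it can help to try it against a handful of column headers. The snippet below is a minimal sketch using only the standard `re` module. The column names are hypothetical, loosely modeled on a CDAP iTRAQ report, and the use of `re.match` is an assumption: how ms2cbioportal applies the pattern internally is not shown in this tutorial.

```python
import re

# Hypothetical column headers (not copied from a real CDAP file).
columns = [
    "Gene",
    "A2-A0D0-01A Log Ratio",
    "A2-A0D0-01A Unshared Log Ratio",
    "BH-A0HK-01A Log Ratio",
    "NCBIGeneID",
]

# Same pattern as in the iTRAQ cells, written as a raw string.
sample_regex = r"([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Log Ratio"
pattern = re.compile(sample_regex)

# Assumption: a column counts as a sample column if the pattern matches
# from the start of the header.
sample_columns = [col for col in columns if pattern.match(col)]

# The capture group isolates the leading part of the sample barcode.
sample_ids = [pattern.match(col).group(1) for col in sample_columns]

print(sample_columns)
print(sample_ids)
```

If a column you expected is missing from the printed list, or an annotation column sneaks in, adjust the regex before loading the full table.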
In [2]:
# ptm_type is None if it's just proteome quantification
# NOTE: the regex is a raw string so that '\-' is not treated as a string escape
table_prot = CDAPiTraqTable(
    filename='data/TCGA_Breast_BI_Proteome_CDAP.r2.itraq.tsv',
    sample_regex=r'([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Log Ratio',
    ptm_type=None,
    use_ruler=False)
# view the data
table_prot.head()
Out[2]:
In [3]:
# we want to rename the table columns and write to file
table_prot.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Log Ratio', ''))
print(table_prot.columns)
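To make the effect of the renamer concrete, here is a small standalone sketch that applies the same function to a plain list of names. The column names are hypothetical, and it assumes the sample columns still carry the ' Log Ratio' suffix at this point, which is what the `replace` call in the cell above suggests.

```python
# Hypothetical sample column names as they might appear after loading.
columns = ["A2-A0D0-01A Log Ratio", "BH-A0HK-01A Log Ratio"]


def renamer(col):
    """Strip the ' Log Ratio' suffix and add the 'TCGA-' prefix."""
    return "TCGA-" + col.replace(" Log Ratio", "")


renamed = [renamer(col) for col in columns]
print(renamed)
```

Any callable that maps one column name to another should work the same way, so a named function like this one can replace the lambda when the renaming logic grows.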
In [4]:
# now load in a phosphoproteome dataset that's also from breast cancer
# NOTE: ptm_type is now 'P' for phosphoprotein
table_ptm = CDAPiTraqTable(
    filename='data/TCGA_Breast_BI_Phosphoproteome.phosphosite.itraq.tsv',
    sample_regex=r'([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Log Ratio',
    ptm_type='P',
    use_ruler=False)
table_ptm.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Log Ratio', ''))
# combine the proteome and phosphoproteome data
table_prot.vertical_concat(table_ptm)
# write to file
table_prot.write_csv('data_breast_protein_level.txt')
In [5]:
# create metadata using ids supplied by cBioPortal
meta = MSMeta(cancer_id='brca_tcga',
              prot_file='data_breast_protein_level.txt')
meta.write('meta_breast_protein_level.txt')
# take a look
meta
Out[5]:
In [6]:
# precursor area data works in almost exactly the same way, except that the values are log-transformed internally.
# NOTE: the sample_regex has changed to reflect the different data format
table_prot = CDAPPrecursorAreaTable(
    filename='data/TCGA_Colon_VU_Proteome_CDAP.r2.precursor_area.tsv',
    sample_regex=r'([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Area',
    ptm_type=None,
    use_ruler=False)
table_prot.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Area', ''))
# write to file
table_prot.write_csv('data_colorectal_protein_level.txt')
# create metadata using ids supplied by cBioPortal
meta = MSMeta(cancer_id='coadread_tcga', prot_file='data_colorectal_protein_level.txt')
meta.write('meta_colorectal_protein_level.txt')
# compare values to cell below
table_prot.head()
Out[6]:
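As an illustration of the log transformation mentioned above, here is a minimal sketch in plain Python. The base (log2) and the treatment of zero or missing areas as NaN are assumptions made for this example; the tutorial does not state which base or missing-value rule CDAPPrecursorAreaTable uses. The area values are made up.

```python
import math


def log2_or_nan(area):
    """Return log2(area), or NaN when the area is missing, zero, or negative."""
    if area is None or area <= 0:
        return float("nan")
    return math.log2(area)


# Made-up precursor areas for one protein across four samples.
areas = [1.0e6, 4.0e6, 0.0, 1.6e7]
log_areas = [log2_or_nan(a) for a in areas]

# A 4-fold difference in area becomes a difference of 2 on the log2 scale.
print(log_areas)
```

On the log scale, fold changes become additive differences, which is why skewed area values tend to look closer to a normal distribution after the transformation.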
In [7]:
# for intensity values, we have included an optional transformation to convert values into
# estimates of copy number per cell using the proteomic ruler method published by
# Wiśniewski et al. 2014.
# for precursor area, values are only log-transformed if `use_ruler` is False.
table_prot = CDAPPrecursorAreaTable(
    filename='data/TCGA_Colon_VU_Proteome_CDAP.r2.precursor_area.tsv',
    sample_regex=r'([A-Z0-9]{2}\-[A-Z0-9]{4}\-[A-Z0-9]{2})[A-Z0-9\-]+ Area',
    ptm_type=None,
    use_ruler=True)
table_prot.rename_columns(renamer=lambda col: 'TCGA-' + col.replace(' Area', ''))
table_prot.head()
Out[7]:
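For readers unfamiliar with the proteomic ruler, the idea can be sketched in a few lines. The method scales each protein's MS signal by the summed histone signal, using the roughly constant DNA mass per cell as the reference. The function name, the 6.5 pg DNA mass for a diploid human cell, and the input numbers below are assumptions for illustration; this is not the code that `use_ruler=True` runs, which may differ in detail.

```python
AVOGADRO = 6.02214076e23        # molecules per mole
DNA_MASS_PER_CELL_G = 6.5e-12   # assumption: ~6.5 pg DNA per diploid human cell


def ruler_copy_number(intensity, mol_weight_da, histone_intensity_sum,
                      dna_mass_g=DNA_MASS_PER_CELL_G):
    """Estimate protein copies per cell from linear-scale MS intensities.

    Hypothetical helper: protein mass per cell is taken as the DNA mass scaled
    by the ratio of the protein's signal to the summed histone signal, then
    converted to a molecule count using the molecular weight.
    """
    protein_mass_g = dna_mass_g * intensity / histone_intensity_sum
    return protein_mass_g * AVOGADRO / mol_weight_da


# Made-up inputs: a 50 kDa protein whose signal is 1% of the histone total.
copies = ruler_copy_number(intensity=1.0e6,
                           mol_weight_da=50000.0,
                           histone_intensity_sum=1.0e8)
print(f"{copies:.3e} copies per cell")
```

Note that the inputs must be on the linear scale, which fits the comment above that precursor area values are only log-transformed when `use_ruler` is False.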
Because MaxQuant is such popular software for analyzing quantitative MS data, we decided to support it, even though CPTAC does not currently use it for their data. This should be useful for local institutional builds of cBioPortal that want to import their own MaxQuant data. Unlike with CDAP data, MaxQuant proteome and MaxQuant PTM data get separate table classes, simply because their file formats are so divergent. Also unlike with CDAP data, it makes sense to combine the proteome and PTM quantification data into one file.
In [8]:
table_prot = MaxQuantProteomeTable(filename='data/proteinGroups.txt',
                                   sample_regex='Intensity std[0-9]+$',
                                   ptm_type=None,
                                   use_ruler=False)
table_prot.rename_columns(renamer=lambda col: col.replace('Intensity ', ''))
# take a look
table_prot.head()
Out[8]:
In [9]:
table_ptm = MaxQuantPTMTable(filename='data/Oxidation (M)Sites.txt',
                             sample_regex='Intensity std[0-9]+$',
                             ptm_type='O',
                             use_ruler=False)
table_ptm.rename_columns(renamer=lambda col: col.replace('Intensity ', ''))
# combine the two datasets
table_prot.vertical_concat(table_ptm)
# NOTE: horizontal_concat exists to pool samples
# while vertical_concat pools gene and PTM IDs
# write to file
table_prot.write_csv('data_maxquant_test.txt')
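The difference between the two concatenation directions noted in the comments above can be shown with a library-independent sketch. Here a table is modeled as a dict mapping a row ID to a dict of per-sample values. The two functions are hypothetical stand-ins written for this example, not the ms2cbioportal methods, and the row IDs and values are made up.

```python
def vertical_concat(top, bottom):
    """Stack rows: the result holds every row ID from both tables."""
    combined = {row_id: dict(values) for row_id, values in top.items()}
    for row_id, values in bottom.items():
        combined[row_id] = dict(values)
    return combined


def horizontal_concat(left, right):
    """Pool samples: each row gains the sample columns of the other table."""
    combined = {}
    row_ids = list(left) + [row_id for row_id in right if row_id not in left]
    for row_id in row_ids:
        merged = dict(left.get(row_id, {}))
        merged.update(right.get(row_id, {}))
        combined[row_id] = merged
    return combined


# Made-up tables: same samples, different kinds of rows (gene vs. PTM site).
proteome = {"TP53": {"std1": 1.2, "std2": 0.8}}
ptm = {"TP53_site": {"std1": 0.3, "std2": -0.1}}
stacked = vertical_concat(proteome, ptm)

# Made-up table: same row, additional sample.
more_samples = {"TP53": {"std3": 1.0}}
pooled = horizontal_concat(proteome, more_samples)

print(sorted(stacked))   # row IDs after stacking
print(pooled["TP53"])    # samples after pooling
```

Stacking suits the proteome-plus-PTM case in this notebook, where both tables share the same samples; pooling suits the case where the same features were measured in additional samples.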