This notebook gives an example of how to calculate protein properties for a list of proteins. The main features demonstrated are:
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>)
to specify how verbose you want the pipeline to be. Debug is most verbose.
CRITICAL
ERROR
WARNING
INFO
(default)DEBUG
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Set these three things:
ROOT_DIR
PROJECT
will be createdPROJECT
LIST_OF_GENES
A directory will be created in ROOT_DIR
with your PROJECT
name. The folders are organized like so:
ROOT_DIR
└── PROJECT
├── data # General storage for pipeline outputs
├── model # SBML and GEM-PRO models are stored here
├── genes # Per gene information
│ ├── <gene_id1> # Specific gene directory
│ │ └── protein
│ │ ├── sequences # Protein sequence files, alignments, etc.
│ │ └── structures # Protein structure files, calculations, etc.
│ └── <gene_id2>
│ └── protein
│ ├── sequences
│ └── structures
├── reactions # Per reaction information
│ └── <reaction_id1> # Specific reaction directory
│ └── complex
│ └── structures # Protein complex files
└── metabolites # Per metabolite information
└── <metabolite_id1> # Specific metabolite directory
└── chemical
└── structures # Metabolite 2D and 3D structure files
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROJECT = 'ssbio_protein_properties'
LIST_OF_GENES = ['b1276', 'b0118']
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type='pdb')
First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.
In [8]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
Out[8]:
In [9]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
Out[9]:
These are the ways to map sequence to structure:
You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you'll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.
In [10]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
Out[10]:
In [11]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.7, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
Out[11]:
In [15]:
import pandas as pd
import os.path as op
In [16]:
# Creating manual mapping dictionary for ECOLI I-TASSER models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/zhang/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/zhang_data/160804-ZHANG_INFO.csv')
tmp = homology_models_df[['zhang_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]
homology_model_dict = {}
for i,r in tmp.iterrows():
homology_model_dict[r['m_gene']] = {r['zhang_id']: {'model_file':op.join(homology_models, r['model_file']),
'file_type':'pdb'}}
my_gempro.get_manual_homology_models(homology_model_dict)
In [17]:
# Creating manual mapping dictionary for ECOLI SUNPRO models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/sunpro/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/sunpro_data/160609-SUNPRO_INFO.csv')
tmp = homology_models_df[['sunpro_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]
homology_model_dict = {}
for i,r in tmp.iterrows():
homology_model_dict[r['m_gene']] = {r['sunpro_id']: {'model_file':op.join(homology_models, r['model_file']),
'file_type':'pdb'}}
my_gempro.get_manual_homology_models(homology_model_dict)
In [18]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
Out[18]:
In [19]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
Out[19]:
In [20]:
# Requires EMBOSS "pepstats" program
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
# Install using:
# sudo apt-get install emboss
my_gempro.get_sequence_properties()
In [ ]:
# Requires SCRATCH installation, replace path_to_scratch with own path to script
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_scratch_predictions(path_to_scratch='scratch',
results_dir=my_gempro.data_dir,
num_cores=4)
In [22]:
my_gempro.find_disulfide_bridges(representatives_only=False)
In [23]:
# Requires DSSP installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_dssp_annotations()
In [24]:
# Requires MSMS installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_msms_annotations()
In [25]:
# for g in my_gempro.genes_with_a_representative_sequence:
# g.protein.representative_sequence.feature_path = '/path/to/new/feature/file.gff'
In [26]:
# Kyte-Doolittle scale for hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }
In [27]:
# Use Biopython to calculated hydrophobicity using a set sliding window length
from Bio.SeqUtils.ProtParam import ProteinAnalysis
window = 7
for g in my_gempro.genes_with_a_representative_sequence:
# Create a ProteinAnalysis object -- see http://biopython.org/wiki/ProtParam
my_seq = g.protein.representative_sequence.seq_str
analysed_seq = ProteinAnalysis(my_seq)
# Calculate scale
hydrophobicity = analysed_seq.protein_scale(param_dict=kd, window=window)
# Correct list length by prepending and appending "inf" (result needs to be same length as sequence)
for i in range(window//2):
hydrophobicity.insert(0, float("Inf"))
hydrophobicity.append(float("Inf"))
# Add new annotation to the representative sequence's "letter_annotations" dictionary
g.protein.representative_sequence.letter_annotations['hydrophobicity-kd'] = hydrophobicity
Properties of the entire protein sequence/structure are stored in:
representative_sequence
annotations
fieldrepresentative_structure
's representative chain SeqRecordThese properties describe aspects of the entire protein, such as its molecular weight, the percentage of amino acids in a particular secondary structure, etc.
In [28]:
# Printing all global protein properties
from pprint import pprint
# Only looking at 2 genes for now, remove [:2] to gather properties for all
for g in my_gempro.genes_with_a_representative_sequence[:2]:
repseq = g.protein.representative_sequence
repstruct = g.protein.representative_structure
repchain = g.protein.representative_chain
print('Gene: {}'.format(g.id))
print('Number of structures: {}'.format(g.protein.num_structures))
print('Representative sequence: {}'.format(repseq.id))
print('Representative structure: {}'.format(repstruct.id))
print('----------------------------------------------------------------')
print('Global properties of the representative sequence:')
pprint(repseq.annotations)
print('----------------------------------------------------------------')
print('Global properties of the representative structure:')
pprint(repstruct.chains.get_by_id(repchain).seq_record.annotations)
print('****************************************************************')
print('****************************************************************')
print('****************************************************************')
Properties of specific residues are stored in:
representative_sequence
's letter_annotations
attributerepresentative_structure
's representative chain SeqRecordSpecific sites, like metal or metabolite binding sites, can be found in the representative_sequence
's features
attribute. This information is retrieved from UniProt. The below examples extract features for the metal binding sites.
The properties related to those sites can be retrieved using the function get_residue_annotations
.
UniProt contains more information than just "sites"
In [29]:
# Looking at all features
for g in my_gempro.genes_with_a_representative_sequence[:2]:
g.id
# UniProt features
[x for x in g.protein.representative_sequence.features]
# Catalytic site atlas features
for s in g.protein.structures:
if s.structure_file:
for c in s.mapped_chains:
if s.chains.get_by_id(c).seq_record:
if s.chains.get_by_id(c).seq_record.features:
[x for x in s.chains.get_by_id(c).seq_record.features]
Out[29]:
Out[29]:
Out[29]:
Out[29]:
In [30]:
metal_info = []
for g in my_gempro.genes:
for f in g.protein.representative_sequence.features:
if 'metal' in f.type.lower():
res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
res_info['gene_id'] = g.id
res_info['seq_id'] = g.protein.representative_sequence.id
res_info['struct_id'] = g.protein.representative_structure.id
res_info['chain_id'] = g.protein.representative_chain
metal_info.append(res_info)
cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
'seq_residue', 'seq_resnum', 'struct_residue','struct_resnum',
'seq_SS-sspro','seq_SS-sspro8','seq_RSA-accpro','seq_RSA-accpro20',
'struct_SS-dssp','struct_RSA-dssp', 'struct_ASA-dssp',
'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']
pd.DataFrame.from_records(metal_info, columns=cols).set_index(['gene_id', 'seq_id', 'struct_id', 'chain_id', 'seq_resnum'])
Out[30]:
gene_id
: Gene ID used in GEM-PRO projectseq_id
: Representative protein sequence IDstruct_id
: Representative protein structure ID, with REP-
prepended to it. 4 letter structure IDs are experimental structures from the PDB, others are homology modelschain_id
: Representative chain ID in the representative structureseq_resnum
: Residue number of the amino acid in the representative sequencesite_name
: Name of the feature as defined in UniProtseq_residue
: Amino acid in the representative sequence at the residue numberstruct_residue
: Amino acid in the representative structure at the residue numberstruct_resnum
: Residue number of the amino acid in the representative structureseq_SS-sspro
: Predicted secondary structure, 3 definitions (from the SCRATCH program)seq_SS-sspro8
: Predicted secondary structure, 8 definitions (SCRATCH)seq_RSA-accpro
: Predicted exposed (e) or buried (-) residue (SCRATCH)seq_RSA-accpro20
: Predicted exposed/buried, 0 to 100 scale (SCRATCH)struct_SS-dssp
: Secondary structure (DSSP program)struct_RSA-dssp
: Relative solvent accessibility (DSSP) struct_ASA-dssp
: Solvent accessibility, absolute value (DSSP)struct_PHI-dssp
: Phi angle measure (DSSP)struct_PSI-dssp
: Psi angle measure (DSSP)struct_RES_DEPTH-msms
: Calculated residue depth averaged for all atoms in the residue (MSMS program)struct_CA_DEPTH-msms
: Calculated residue depth for the carbon alpha atom (MSMS)
In [31]:
for g in my_gempro.genes:
# Gather residue numbers
metal_binding_structure_residues = []
for f in g.protein.representative_sequence.features:
if 'metal' in f.type.lower():
res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
metal_binding_structure_residues.append(res_info['struct_resnum'])
print(metal_binding_structure_residues)
# Display structure
view = g.protein.representative_structure.view_structure()
g.protein.representative_structure.add_residues_highlight_to_nglview(view=view, structure_resnums=metal_binding_structure_residues)
view
In [32]:
# Run all sequence to structure alignments
for g in my_gempro.genes:
for s in g.protein.structures:
g.protein.align_seqprop_to_structprop(seqprop=g.protein.representative_sequence, structprop=s)
In [33]:
metal_info_compared = []
for g in my_gempro.genes:
for f in g.protein.representative_sequence.features:
if 'metal' in f.type.lower():
for s in g.protein.structures:
for c in s.mapped_chains:
res_info = g.protein.get_residue_annotations(seq_resnum=f.location.end,
seqprop=g.protein.representative_sequence,
structprop=s, chain_id=c,
use_representatives=False)
res_info['gene_id'] = g.id
res_info['seq_id'] = g.protein.representative_sequence.id
res_info['struct_id'] = s.id
res_info['chain_id'] = c
metal_info_compared.append(res_info)
cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
'seq_residue', 'seq_resnum', 'struct_residue','struct_resnum',
'seq_SS-sspro','seq_SS-sspro8','seq_RSA-accpro','seq_RSA-accpro20',
'struct_SS-dssp','struct_RSA-dssp', 'struct_ASA-dssp',
'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']
pd.DataFrame.from_records(metal_info_compared, columns=cols).sort_values(by=['seq_resnum','struct_id','chain_id']).set_index(['gene_id','seq_id','seq_resnum','seq_residue','struct_id'])
Out[33]: