Protein - Structure Mapping, Alignments, and Visualization

This notebook shows how to map a single protein sequence to a representative structure, align mutated sequences against it, and visualize the resulting mutations.

**Input:** Protein ID + amino acid sequence + mutated sequence(s)
**Output:** Representative protein structure, sequence alignments, and visualization of mutations

Imports


In [1]:
import sys
import logging

In [2]:
# Import the Protein class
from ssbio.core.protein import Protein

In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level with logger.setLevel(logging.<LEVEL_HERE>) to control how verbose the pipeline is. DEBUG is the most verbose level.

  • CRITICAL
    • Only the most important messages are shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that do not stop the pipeline from running
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Very detailed information; produces a large volume of output

**Warning:** `DEBUG` mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
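
To see how the level acts as a threshold, here is a minimal sketch using only the standard library. The logger name `demo`, the `ListHandler` class, and the `visible_levels` helper are made up for illustration and are not part of ssbio.

```python
import logging


class ListHandler(logging.Handler):
    """Collect the level names of emitted records in a list."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def emit(self, record):
        self.seen.append(record.levelname)


def visible_levels(level):
    """Return the level names of the messages that pass a logger set to `level`."""
    demo_logger = logging.getLogger("demo")  # hypothetical logger name
    demo_logger.propagate = False            # keep these messages away from the root logger
    handler = ListHandler()
    demo_logger.handlers = [handler]
    demo_logger.setLevel(level)

    # One message per level, from least to most severe
    demo_logger.debug("very detailed message")
    demo_logger.info("number of structures mapped")
    demo_logger.warning("non-fatal warning")
    demo_logger.error("major error")
    demo_logger.critical("critical message")
    return handler.seen


print(visible_levels(logging.INFO))   # DEBUG messages are dropped
print(visible_levels(logging.DEBUG))  # everything is shown
```

Each level shows its own messages plus those of every more severe level, which is why DEBUG produces the most output.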


In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

In [5]:
# Send log messages to stderr with a timestamped format (for Jupyter notebooks)
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROTEIN_ID will be created
  • PROTEIN_ID
    • Your protein ID
  • PROTEIN_SEQ
    • Your protein sequence

A directory will be created in ROOT_DIR with your PROTEIN_ID name. The folders are organized like so:

    ROOT_DIR
    └── PROTEIN_ID
        ├── sequences  # Protein sequence files, alignments, etc.
        └── structures  # Protein structure files, calculations, etc.
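
ssbio creates these folders itself, so nothing extra is needed here. Purely to illustrate the layout, the sketch below builds the same tree with the standard library inside a throwaway directory. The `make_layout` helper is hypothetical and is not an ssbio function.

```python
import os
import tempfile


def make_layout(root_dir, protein_id):
    """Create ROOT_DIR/PROTEIN_ID/{sequences,structures} and return the paths."""
    protein_dir = os.path.join(root_dir, protein_id)
    paths = {
        "protein_dir": protein_dir,
        "sequences": os.path.join(protein_dir, "sequences"),
        "structures": os.path.join(protein_dir, "structures"),
    }
    for path in paths.values():
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return paths


# Use a temporary directory so nothing outside it is touched
with tempfile.TemporaryDirectory() as demo_root:
    layout = make_layout(demo_root, "SRR1753782_00918")
    print(sorted(os.listdir(layout["protein_dir"])))
```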

In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROTEIN_ID = 'SRR1753782_00918'
PROTEIN_SEQ = 'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

In [7]:
# Create the Protein object
my_protein = Protein(ident=PROTEIN_ID, root_dir=ROOT_DIR, pdb_file_type='mmtf')

In [8]:
# Load the protein sequence
# This sets the loaded sequence as the representative one
my_protein.load_manual_sequence(seq=PROTEIN_SEQ, ident='WT', write_fasta_file=True, set_as_representative=True)


Out[8]:
<SeqProp WT at 0x7f5f79cbd908>

Mapping sequence --> structure

Since the sequence has been provided, we only need to BLAST it against the PDB.

**Note:** These methods do not download any 3D structure files.

Methods


In [9]:
# Mapping using BLAST
my_protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0.9, evalue=0.00001)
my_protein.df_pdb_blast.head()


Out[9]:
['2zyd', '2zya', '3fwn', '2zyg']
Out[9]:
pdb_id  pdb_chain_id  hit_score  hit_evalue  hit_percent_similar  hit_percent_ident  hit_num_ident  hit_num_similar
2zya    A             2319.0     0.0         0.987179             0.963675           451            462
2zya    B             2319.0     0.0         0.987179             0.963675           451            462
2zyd    A             2319.0     0.0         0.987179             0.963675           451            462
2zyd    B             2319.0     0.0         0.987179             0.963675           451            462
2zyg    A             2284.0     0.0         0.982906             0.950855           445            460
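
The two cutoffs passed to the BLAST call can be read as a simple filter over the hits. The sketch below is an illustration, not ssbio's implementation. It assumes that `seq_ident_cutoff` is compared against `hit_percent_ident`, that `evalue` is an upper bound, and that 468 (the alignment length reported elsewhere in this notebook) is the query length; that last assumption reproduces the `hit_percent_ident` values shown above. The `passes_cutoffs` helper is hypothetical.

```python
# Assumption: 468 is the alignment length reported in this notebook's output;
# using it as the query length reproduces the hit_percent_ident column.
QUERY_LENGTH = 468

# (pdb_id, chain, hit_evalue, hit_num_ident) copied from the BLAST output above
hits = [
    ("2zya", "A", 0.0, 451),
    ("2zya", "B", 0.0, 451),
    ("2zyd", "A", 0.0, 451),
    ("2zyd", "B", 0.0, 451),
    ("2zyg", "A", 0.0, 445),
]


def passes_cutoffs(num_ident, evalue, seq_ident_cutoff=0.9, evalue_cutoff=0.00001):
    """Return True if a hit meets both the identity and the E-value cutoff."""
    identity = num_ident / QUERY_LENGTH
    return identity >= seq_ident_cutoff and evalue <= evalue_cutoff


# Keep each PDB ID once, even if several of its chains pass
kept = sorted({pdb_id for pdb_id, _chain, evalue, num_ident in hits
               if passes_cutoffs(num_ident, evalue)})
print(kept)
print(round(451 / QUERY_LENGTH, 6))  # identity fraction for the top hits
```

Under these assumptions, a hit with only 400 identical residues (about 85% identity) would be rejected by the 0.9 cutoff.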

Downloading and ranking structures

Methods

**Warning:** Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.

In [10]:
# Download all mapped PDBs and gather the metadata
my_protein.pdb_downloader_and_metadata()
my_protein.df_pdb_metadata.head(2)


Out[10]:
['2zyd', '2zya', '3fwn', '2zyg']
Out[10]:
pdb_id  pdb_title                                          description                                        experimental_method  mapped_chains  resolution  chemicals  taxonomy_name     structure_file
2zya    Dimeric 6-phosphogluconate dehydrogenase compl...  6-phosphogluconate dehydrogenase, decarboxylat...  X-RAY DIFFRACTION    A;B            1.6         6PG        Escherichia coli  2zya.mmtf
2zyd    Dimeric 6-phosphogluconate dehydrogenase compl...  6-phosphogluconate dehydrogenase, decarboxylat...  X-RAY DIFFRACTION    A;B            1.5         GLO        Escherichia coli  2zyd.mmtf

In [11]:
# Set representative structures
my_protein.set_representative_structure()


Out[11]:
<StructProp REP-2zyd at 0x7f5f79bea1d0>
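
This notebook does not describe how set_representative_structure ranks the candidates, so the following is only one plausible heuristic and not ssbio's actual logic: prefer the structure with the best (lowest) resolution. The `best_by_resolution` helper is hypothetical; the data come from the metadata table above.

```python
# Resolution values (in angstroms) copied from the metadata table above
candidates = [
    {"pdb_id": "2zya", "resolution": 1.6},
    {"pdb_id": "2zyd", "resolution": 1.5},
]


def best_by_resolution(structures):
    """Return the entry with the lowest resolution value (hypothetical heuristic)."""
    return min(structures, key=lambda entry: entry["resolution"])


print(best_by_resolution(candidates)["pdb_id"])
```

A real ranking would likely also weigh how much of the sequence each chain covers, since the Protein object stores a representative_chain_seq_coverage value.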

Loading and aligning new sequences

You can load additional sequences into this protein object and align them to the representative sequence.


In [12]:
# Inspect everything currently stored on the Protein object
my_protein.__dict__


Out[12]:
{'_root_dir': '/tmp',
 'description': None,
 'id': 'SRR1753782_00918',
 'notes': {},
 'pdb_file_type': 'mmtf',
 'representative_chain': 'A',
 'representative_chain_seq_coverage': 95.7,
 'representative_sequence': <SeqProp WT at 0x7f5f79cbd908>,
 'representative_structure': <StructProp REP-2zyd at 0x7f5f79bea1d0>,
 'sequence_alignments': [<<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 7f5f79be7a90>,
  <<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 7f5f79299080>],
 'sequences': [<SeqProp WT at 0x7f5f79cbd908>],
 'structure_alignments': [],
 'structures': [<PDBProp 2zyd at 0x7f5fdb48e080>,
  <PDBProp 2zya at 0x7f5fd538a550>,
  <PDBProp 3fwn at 0x7f5f79cbde48>,
  <PDBProp 2zyg at 0x7f5f79cbdda0>,
  <StructProp REP-2zyd at 0x7f5f79bea1d0>]}

Methods


In [ ]:
# Input your mutated sequence and load it
mutated_protein1_id = 'N17P_SNP'
mutated_protein1_seq = 'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein1_id, seq=mutated_protein1_seq)

In [ ]:
# Input another mutated sequence and load it
mutated_protein2_id = 'Q4S_N17P_SNP'
mutated_protein2_seq = 'MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein2_id, seq=mutated_protein2_seq)

In [ ]:
# Conduct pairwise sequence alignments
my_protein.pairwise_align_sequences_to_representative()

In [ ]:
# View IDs of all sequence alignments
[x.id for x in my_protein.sequence_alignments]

# View the stored information for one of the alignments
my_alignment = my_protein.sequence_alignments.get_by_id('SRR1753782_00918_N17P_SNP')
my_alignment.annotations
str(my_alignment[0].seq)
str(my_alignment[1].seq)
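
The two aligned rows printed above are all that is needed to call substitutions by hand. As a minimal sketch (the `list_mutations` helper is hypothetical, not an ssbio function), the code below walks two aligned strings and reports each substitution with its 1-based position in the reference. Only the first 20 residues of each sequence from this notebook are used to keep the example short.

```python
def list_mutations(aligned_ref, aligned_other):
    """Return substitutions as (ref_residue, ref_position, new_residue) tuples.

    Both inputs are rows of a pairwise alignment, so they have equal length
    and may contain '-' gap characters. Positions are 1-based and count
    residues of the reference sequence only.
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have the same length")

    mutations = []
    ref_position = 0
    for ref_res, other_res in zip(aligned_ref, aligned_other):
        if ref_res == "-":
            continue  # insertion relative to the reference: no reference position
        ref_position += 1
        if other_res != "-" and other_res != ref_res:
            mutations.append((ref_res, ref_position, other_res))
    return mutations


wild_type = "MSKQQIGVVGMAVMGRNLAL"  # first 20 residues of PROTEIN_SEQ
n17p = "MSKQQIGVVGMAVMGRPLAL"       # first 20 residues of N17P_SNP
q4s_n17p = "MSKSQIGVVGMAVMGRPLAL"   # first 20 residues of Q4S_N17P_SNP

print(list_mutations(wild_type, n17p))      # [('N', 17, 'P')]
print(list_mutations(wild_type, q4s_n17p))  # [('Q', 4, 'S'), ('N', 17, 'P')]
```

Deletions and insertions are skipped in this sketch; only substitutions are reported.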

In [ ]:
# Summarize all the mutations in all sequence alignments
s,f = my_protein.sequence_mutation_summary(alignment_type='seqalign')
print('Single mutations:')
s
print('---------------------')
print('Mutation fingerprints')
f
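
The exact structure returned by sequence_mutation_summary is not shown in this notebook. Going only by the two printed labels, one reading is that "single mutations" counts each substitution across all loaded sequences, while a "fingerprint" is the full combination of substitutions found in one sequence. The sketch below illustrates that reading with the standard library; the variable names are made up.

```python
from collections import Counter

# Substitutions per loaded sequence, written as <wild type><position><mutant>
mutations_by_sequence = {
    "N17P_SNP": ["N17P"],
    "Q4S_N17P_SNP": ["Q4S", "N17P"],
}

# Single mutations: how many sequences carry each substitution
single_counts = Counter(
    mutation
    for mutations in mutations_by_sequence.values()
    for mutation in mutations
)

# Fingerprints: the complete set of substitutions seen together in one sequence
fingerprint_counts = Counter(
    tuple(mutations) for mutations in mutations_by_sequence.values()
)

print(single_counts)
print(fingerprint_counts)
```

Here N17P is counted twice because both mutated sequences carry it, while each fingerprint occurs once.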

Some additional methods

Getting binding site/other information from UniProt


In [ ]:
import ssbio.databases.uniprot

In [ ]:
this_examples_uniprot = 'P14062'
sites = ssbio.databases.uniprot.uniprot_sites(this_examples_uniprot)
my_protein.representative_sequence.features = sites
my_protein.representative_sequence.features

Mapping sequence residue numbers to structure residue numbers

Methods


In [ ]:
# Returns a dictionary mapping sequence residue numbers to structure residue identifiers
# Will warn you if residues are not present in the structure
structure_sites = my_protein.map_seqprop_resnums_to_structprop_resnums(resnums=[1,3,45], 
                                                                       use_representatives=True)
structure_sites
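
Conceptually, this mapping follows from an alignment between the sequence and the residues actually present in the structure. The sketch below shows the idea on toy data and is not ssbio's implementation. The `map_resnums` helper, the short aligned strings, and the structure numbering starting at 10 are all hypothetical.

```python
def map_resnums(aligned_seq, aligned_struct, struct_resnums, wanted):
    """Map 1-based sequence positions to structure residue numbers.

    aligned_seq / aligned_struct: rows of a pairwise alignment ('-' = gap).
    struct_resnums: residue numbers of the structure chain, in chain order.
    Returns {sequence_position: structure_resnum}; positions with no
    counterpart in the structure are left out.
    """
    mapping = {}
    seq_pos = 0
    struct_idx = 0
    for seq_res, struct_res in zip(aligned_seq, aligned_struct):
        if seq_res != "-":
            seq_pos += 1
        if struct_res != "-":
            if seq_res != "-" and seq_pos in wanted:
                mapping[seq_pos] = struct_resnums[struct_idx]
            struct_idx += 1
    return mapping


# Hypothetical toy data: the structure lacks the first two residues and
# its own numbering starts at 10.
aligned_seq = "MSKQQIG"
aligned_struct = "--KQQIG"
struct_resnums = [10, 11, 12, 13, 14]

print(map_resnums(aligned_seq, aligned_struct, struct_resnums, wanted={1, 3, 5}))
# {3: 10, 5: 12}
```

Position 1 is absent from the result because that residue is not resolved in the toy structure, which is the situation the warning mentioned above refers to.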

Viewing structures

The nglview package is used as a backend for viewing structures within a Jupyter notebook. ssbio view functions either return an NGLWidget object, just as calling nglview directly does in the example below, or act on an existing widget object.

# This is how nglview usually works: it loads a structure file and returns an NGLWidget "view" object.
import nglview
view = nglview.show_structure_file(my_protein.representative_structure.structure_path)
view

Methods


In [ ]:
# View just the structure
view = my_protein.representative_structure.view_structure()
view

In [ ]:
# Map the mutations onto the visualization (with an increased scale); they appear in the view above
my_protein.add_mutations_to_nglview(view=view, alignment_type='seqalign', scale_range=(4,7), 
                                    use_representatives=True)
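
The notebook does not say how scale_range is applied. One plausible reading, shown here purely as an assumption, is that the marker size for each mutation is scaled linearly between the two values according to how many sequences carry it. The `scale_size` helper is hypothetical and is not an ssbio function.

```python
def scale_size(count, max_count, scale_range=(4, 7)):
    """Linearly map a mutation count in [1, max_count] onto scale_range."""
    low, high = scale_range
    if max_count <= 1:
        return float(low)  # a single count cannot be spread over a range
    fraction = (count - 1) / (max_count - 1)
    return low + fraction * (high - low)


# Counts for the two mutated sequences loaded in this notebook
counts = {"Q4S": 1, "N17P": 2}
sizes = {mutation: scale_size(count, max(counts.values()))
         for mutation, count in counts.items()}
print(sizes)  # {'Q4S': 4.0, 'N17P': 7.0}
```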

In [ ]:
# Add the sites from the features table above to the view
my_protein.add_features_to_nglview(view=view, use_representatives=True)

Saving


In [ ]:
import os.path as op
# Save the Protein object as <PROTEIN_ID>.json inside its project directory
my_protein.save_json(op.join(my_protein.protein_dir, '{}.json'.format(my_protein.id)))