Python Basics

Python is an interpreted language originally designed for general-purpose programming, but it also became a full-fledged tool for scentific computing thanks to its highly modularized and extensible design, and a growing scentific user/developer community. The first part of this tutorial is to show or remind you some basic syntax of Python.


In [ ]:
print('hello world!')

The real power of Python is with the packages. There are many built-in ones, just like the one we will see below, and also many open-source packages developed by the Python community. There are several ways to import a package (module). For starters, you can import the module by its name:


In [ ]:
import numpy as np
np.pi

Another common way to import is using a wildcard (namely *):


In [ ]:
from numpy import *
pi

Data types

Just like other languages, Python has several basic data types: int, float, bool, str, to define a variable of a certain type, one can simply write out the initial value:


In [ ]:
i = 1
f = 3.1415
b = True or False
s = 'hello world!'

To create a list:


In [ ]:
a = [0, 1, 2, 3, 4]
a

Notably the array in Python starts from 0. A list can be also generated by range function, minimally the only input is the length of the list. So to generate a list like above, we can write:


In [ ]:
a = range(5)
a

Note that in Python 3K, this will generate an iterator (a Range class to be precise). Following code will convert it to a list instance:


In [ ]:
a = list(range(5))
a

Optionally, the most robust and universal way to convert a iterable object to a list is through list comprehension:


In [ ]:
a = [_ for _ in range(5)]
a

Control flow

The list comprehension mentioned above is essentially an one liner of a for loop. The full version of the for loop looks like the following:


In [ ]:
for i in range(5):
    print(i)

Which iterates through numbers from 0 to 4. if block is used for making choices. Let's first initialize two integers:


In [ ]:
a = 3; b = -3

Then write out a control block as follows:


In [ ]:
if a > b:
    print('a is bigger than b')
elif a < b:
    print('a is smaller than b')
else:
    print('a equals to b')

Numpy

Numpy is a Python package for numerical computing and linear algebra. It is also a required installation for ProDy.


In [ ]:
from numpy import *

The most fundamental data type in numpy is ndarray (n-dimensional array). It can be used to represent vectors or matrices. There are various ways to initialize an vector in numpy:


In [ ]:
v = arange(10, dtype=float)
v

In [ ]:
v = zeros(5)
v

In [ ]:
v.shape

Or a matrix:


In [ ]:
V = zeros([5, 5])
V

In [ ]:
V.shape

In [ ]:
V = ones([5, 5])
V

In [ ]:
I = eye(5)
I

Matplotlib

Matplotlib is a package used by ProDy to visualize the data. It is compatible with Python objects as well as numpy objects and provides MATLAB-like syntax for plotting data points, matrices, 3D coordinates, etc.


In [ ]:
from matplotlib.pyplot import *

In [ ]:
x = arange(0, 2*pi, 0.1)
y = sin(x)
y

In [ ]:
plot(x, y)
xlabel('x')
ylabel('y')
title('sin')

In [ ]:
y2 = cos(x)
plot(x, y, x, y2)
xlabel('x')
ylabel('y')
legend(['sin', 'cos'])

In [ ]:
plot(y, y2)

Prody Basics

This tutorial aims to teach basic data structures and functions in Prody. First we need to import required packages:


In [ ]:
from prody import *
from numpy import *
from matplotlib.pyplot import *
%matplotlib inline
confProDy(auto_show=False)
confProDy(auto_secondary=True)

These import commands will load numpy, matplotlib, and ProDy into the memory. confProDy is used to modify the default behaviors of ProDy. Here we turned off auto_show so that the plots can be made in the same figure, and we turn on auto_secondary to parse the secondary structure information whenever we load a PDB into ProDy. See here for a complete list of behaviors that can be changed by this function. This function only needs to be called once, and the setting will be remembered by ProDy.

Load PDB files and visualization

ProDy comes with many functions that can be used to fetch data from Protein Data Bank. Let's first parse a structure:


In [ ]:
p38 = parsePDB('1p38')
p38

parsePDB will download the PDB file and load it into the memory. Let's inspect the variable p38:


In [ ]:
p38

To visualize the structure:


In [ ]:
showProtein(p38);
legend();

If you would like to display the 3D structure using other packages or your own code, you can get the 3D coordinates via:


In [ ]:
p38.getCoords()

In [ ]:
p38.getCoords().shape

In [ ]:
showContactMap(p38.ca);

AtomGroup is essentially a collection of protein atoms. Each atom can be indexed/queried/found by the following way:


In [ ]:
p38[10]

This will give you the 11-th atom from p38, noting that Python index starts from 0. We can also examine the spatial location of this atom by querying the coordinate:


In [ ]:
p38[10].getCoords()

In [ ]:
showProtein(p38);
ax3d = gca()
x, y, z = p38[10].getCoords()
ax3d.plot([x], [y], [z], 'r*', markersize=30);

We could select a chain, e.g. chain A, of the protein by indexing using its identifier, like the following:


In [ ]:
p38['A']

In [ ]:
p38['A'].getSequence()

In many cases, it is more convenient to examine the structure with residue numbers, and AtomGroup supports indexing with a chain ID and a residue number:


In [ ]:
p38[10].getResnum()

In [ ]:
p38['A', 5]

This will give you the residue with the residue number of 10, which is an arginine in p38. Please note the difference between this line and the previous one.


In [ ]:
p38['A', 5].getNames()

Note that some ProDy objects may not support indexing using a chain identifier or a residue number. In such cases, we can first obtain a hierarchical view of the object:


In [ ]:
hv = p38.getHierView()

And then use HierView to index using a chain identifier and residue number as it will always be supported:


In [ ]:
hv['A', 5]

Retrive data from AtomGroup

Many properties of the protein can be acquired by functions named like "getxxx". For instance, we can obtain the B-factors by:


In [ ]:
betas = p38.getBetas()
betas.shape

In this way, we can obtain the B-factor for every single atom. However, in some cases, we only need to know the B-factors of alpha-carbons,


In [ ]:
p38.ca

In [ ]:
betas = p38.ca.getBetas()
betas.shape

In [ ]:
plot(betas);
ylabel('B-factor');
xlabel('Residue index');

If we would like to use residue numbers in the PDB, instead of the indices, as the x-axis of the plot, it would be much more convenient to use the ProDy plotting function, showAtomicLines.


In [ ]:
showAtomicLines(betas, atoms=p38.ca);
ylabel('B-factor');
xlabel('Residue number');

We can also obtain the secondary structure information as an array:


In [ ]:
p38.ca.getSecstrs()

To make it easier to read, we can convert the array into a string using the Python's built-in function, join :


In [ ]:
''.join(p38.ca.getSecstrs())

C is for coil, H for alpha helix, I for pi helix, G for 3-10 helix, and E for beta strand (sheet). To get a complete list of "get" function, you can type p38.get<TAB>.


In [ ]:
p38

Speaking of which, in measure module, you can find various functions for calculations for structural properties. For example, you can calculate the phi angle of 11th residue:


In [ ]:
calcPhi(p38['A', 10])

In [ ]:
round(calcPhi(p38['A', 10]), 3)

A dihedral angle is the angle between two intersecting planes. In chemistry it is the angle between planes through two sets of three atoms, having two atoms in common. In proteins, there are two most interested dihedral angles, namely Phi and Psi, and they are illustrated as follows.

Note that the residue at N-terminus or C-terminus does not have a Phi or Psi angle, respectively. If we calculate the Phi and Psi angle for every non-terminal residue, we can obtain a Ramachandran plot for a protein. An example of Ramachandran plot for human PCNA is shown as follows:

Three favored regions (in red)--upper left: beta sheet; center left: alpha helix; center right: left-handed helix. Each blue data point corresponds to the two dihedrals of a residue. We will reproduce this plot for ubiquitin (we will only reproduce the points).


In [ ]:
chain = p38['A']
Phi = []; Psi = []; c = []
for res in chain.iterResidues():
    try:
        phi = calcPhi(res)
        psi = calcPsi(res)
    except:
        continue
    else:
        Phi.append(phi)
        Psi.append(psi)
        if res.getResname() == 'GLY':
            c.append('black')
        else:
            secstr = res.getSecstrs()[0]
            if secstr == 'H':
                c.append('red')
            elif secstr == 'G':
                c.append('darkred')
            elif secstr == 'E':
                c.append('blue')
            else:
                c.append('grey')

In the above code, we use an exception handler to exclude the terminal residues from the calculation.


In [ ]:
scatter(Phi, Psi, c=c, s=10);
xlabel('Phi (degree)');
ylabel('Psi (degree)');

Selection

In theory you could retrieve any set of atoms by indexing the AtomGroup, but it would be cumbersome to do so. To make it more convienient, ProDy provides VMD-like syntax for selecting atoms. Here lists a few common selection strings, and for a more complete tutorial on selection, please see here.


In [ ]:
ca = p38.select('calpha')
ca

In [ ]:
p38.ca

In [ ]:
bb = p38.select('backbone')
p38.bb

We could also perform some simple selections right when the structure is being parsed. For example, we can specify that we would like to obtain only alpha-carbons of chain A of the p38 as follows:


In [ ]:
chainA_ca = parsePDB('1p38', chain='A', subset='ca')

We could find the chain A using selection (as an alternative to the indexing method shown above):


In [ ]:
chA = p38.select('calpha and chain A')
chA

Selection also works for finding a single residue or multiple residues:


In [ ]:
res = p38.ca.select('chain A and resnum 10')
res.getResnums()

In [ ]:
res = p38.ca.select('chain A and resnum 10 11 12')
res.getResnums()

In [ ]:
head = p38.ca.select('resnum < 50')
head.numAtoms()

We can also select a range of residues by:


In [ ]:
fragment = p38.ca.select('resnum 50 to 100')

If we have data associated to the full length of the protein, we can slice the data using the sliceAtomicData:


In [ ]:
subbetas = sliceAtomicData(betas, atoms=p38.ca, select=fragment)

We can visualize the data of this range using showAtomicLines:


In [ ]:
showAtomicLines(subbetas, atoms=fragment);
xlabel('Residue number');
ylabel('B factor');

Or highlight the subset in the plot of the whole protein:


In [ ]:
showAtomicLines(betas, atoms=p38.ca, overlay=True);
showAtomicLines(subbetas, atoms=fragment, overlay=True);
xlabel('Residue number');
ylabel('B factor');

Selection also allows us to extract particular amino acid types:


In [ ]:
args = p38.ca.select('resname ARG')
args

Again, combined with sliceAtomicData and showAtomicLines, we can highlight these residues in the plot of the whole protein:


In [ ]:
argbetas = sliceAtomicData(betas, atoms=p38.ca, select=args)
showAtomicLines(betas, atoms=p38.ca, overlay=True);
showAtomicLines(argbetas, atoms=args, linespec='r*', overlay=True);
xlabel('Residue number');
ylabel('B factor');

Compare and align structures

You can also compare different structures using some of the methods in proteins module. Let’s parse another p38 MAP kinase structure.


In [ ]:
bound = parsePDB('1zz2')

You can find similar chains in structure 1p38 and 1zz2 using matchChains function


In [ ]:
results = matchChains(p38, bound)
results[0]

In Python, a tuple (or any indexable objects) can be unpacked by:


In [ ]:
apo_chA, bnd_chA, seqid, overlap = results[0]
apo_chA

In [ ]:
bnd_chA

The first two terms are the mapping of the proteins to each other. Then the third term is the sequence identity:


In [ ]:
seqid

and the forth term is the sequence coverage:


In [ ]:
overlap

If we calculate RMSD right now, we will obtain the value for the unsuperposed proteins:


In [ ]:
calcRMSD(bnd_chA, apo_chA)

After superposition, the RMSD will be much improved,


In [ ]:
bnd_chA, transformation = superpose(bnd_chA, apo_chA)
calcRMSD(bnd_chA, apo_chA)

In [ ]:
confProDy(auto_show=False)
showProtein(bnd_chA);
showProtein(apo_chA);
legend();

To visualize the superposition of the full proteins, we need to apply transform matrix to the entire structure:


In [ ]:
showProtein(p38);
showProtein(bound);
legend();

Advanced Visualization

Using matplotlib, we only obtained a very simple linear representation of proteins. ProDy also support a more sophisticated way of visualizing proteins in 3D via py3Dmol:


In [ ]:
import py3Dmol
showProtein(p38)

The limitation is that py3Dmol only works in an iPython notebook. You can always write out the protein to a PDB file and visualize it in an external program:


In [ ]:
writePDB('bound_aligned.pdb', bnd_chA)