Python Basics
Python is an interpreted language originally designed for general-purpose programming, but it also became a full-fledged tool for scentific computing thanks to its highly modularized and extensible design, and a growing scentific user/developer community. The first part of this tutorial is to show or remind you some basic syntax of Python.
In [ ]:
print('hello world!')
The real power of Python is with the packages. There are many built-in ones, just like the one we will see below, and also many open-source packages developed by the Python community. There are several ways to import a package (module). For starters, you can import the module by its name:
In [ ]:
import numpy as np
np.pi
Another common way to import is using a wildcard (namely *
):
In [ ]:
from numpy import *
pi
Just like other languages, Python has several basic data types: int
, float
, bool
, str
, to define a variable of a certain type, one can simply write out the initial value:
In [ ]:
i = 1
f = 3.1415
b = True or False
s = 'hello world!'
To create a list:
In [ ]:
a = [0, 1, 2, 3, 4]
a
Notably the array in Python starts from 0. A list can be also generated by range
function, minimally the only input is the length of the list. So to generate a list like above, we can write:
In [ ]:
a = range(5)
a
Note that in Python 3K, this will generate an iterator (a Range class to be precise). Following code will convert it to a list
instance:
In [ ]:
a = list(range(5))
a
Optionally, the most robust and universal way to convert a iterable object to a list is through list comprehension:
In [ ]:
a = [_ for _ in range(5)]
a
The list comprehension mentioned above is essentially an one liner of a for
loop. The full version of the for
loop looks like the following:
In [ ]:
for i in range(5):
print(i)
Which iterates through numbers from 0 to 4. if
block is used for making choices. Let's first initialize two integers:
In [ ]:
a = 3; b = -3
Then write out a control block as follows:
In [ ]:
if a > b:
print('a is bigger than b')
elif a < b:
print('a is smaller than b')
else:
print('a equals to b')
Numpy is a Python package for numerical computing and linear algebra. It is also a required installation for ProDy.
In [ ]:
from numpy import *
The most fundamental data type in numpy is ndarray
(n-dimensional array). It can be used to represent vectors or matrices. There are various ways to initialize an vector in numpy:
In [ ]:
v = arange(10, dtype=float)
v
In [ ]:
v = zeros(5)
v
In [ ]:
v.shape
Or a matrix:
In [ ]:
V = zeros([5, 5])
V
In [ ]:
V.shape
In [ ]:
V = ones([5, 5])
V
In [ ]:
I = eye(5)
I
Matplotlib is a package used by ProDy to visualize the data. It is compatible with Python objects as well as numpy objects and provides MATLAB-like syntax for plotting data points, matrices, 3D coordinates, etc.
In [ ]:
from matplotlib.pyplot import *
In [ ]:
x = arange(0, 2*pi, 0.1)
y = sin(x)
y
In [ ]:
plot(x, y)
xlabel('x')
ylabel('y')
title('sin')
In [ ]:
y2 = cos(x)
plot(x, y, x, y2)
xlabel('x')
ylabel('y')
legend(['sin', 'cos'])
In [ ]:
plot(y, y2)
Prody Basics
This tutorial aims to teach basic data structures and functions in Prody. First we need to import required packages:
In [ ]:
from prody import *
from numpy import *
from matplotlib.pyplot import *
%matplotlib inline
confProDy(auto_show=False)
confProDy(auto_secondary=True)
These import commands will load numpy, matplotlib, and ProDy into the memory. confProDy
is used to modify the default behaviors of ProDy. Here we turned off auto_show
so that the plots can be made in the same figure, and we turn on auto_secondary
to parse the secondary structure information whenever we load a PDB into ProDy. See here for a complete list of behaviors that can be changed by this function. This function only needs to be called once, and the setting will be remembered by ProDy.
ProDy comes with many functions that can be used to fetch data from Protein Data Bank. Let's first parse a structure:
In [ ]:
p38 = parsePDB('1p38')
p38
parsePDB
will download the PDB file and load it into the memory. Let's inspect the variable p38
:
In [ ]:
p38
To visualize the structure:
In [ ]:
showProtein(p38);
legend();
If you would like to display the 3D structure using other packages or your own code, you can get the 3D coordinates via:
In [ ]:
p38.getCoords()
In [ ]:
p38.getCoords().shape
In [ ]:
showContactMap(p38.ca);
AtomGroup
is essentially a collection of protein atoms. Each atom can be indexed/queried/found by the following way:
In [ ]:
p38[10]
This will give you the 11-th atom from p38
, noting that Python index starts from 0. We can also examine the spatial location of this atom by querying the coordinate:
In [ ]:
p38[10].getCoords()
In [ ]:
showProtein(p38);
ax3d = gca()
x, y, z = p38[10].getCoords()
ax3d.plot([x], [y], [z], 'r*', markersize=30);
We could select a chain, e.g. chain A, of the protein by indexing using its identifier, like the following:
In [ ]:
p38['A']
In [ ]:
p38['A'].getSequence()
In many cases, it is more convenient to examine the structure with residue numbers, and AtomGroup
supports indexing with a chain ID and a residue number:
In [ ]:
p38[10].getResnum()
In [ ]:
p38['A', 5]
This will give you the residue with the residue number of 10, which is an arginine in p38
. Please note the difference between this line and the previous one.
In [ ]:
p38['A', 5].getNames()
Note that some ProDy objects may not support indexing using a chain identifier or a residue number. In such cases, we can first obtain a hierarchical view of the object:
In [ ]:
hv = p38.getHierView()
And then use HierView
to index using a chain identifier and residue number as it will always be supported:
In [ ]:
hv['A', 5]
Many properties of the protein can be acquired by functions named like "getxxx". For instance, we can obtain the B-factors by:
In [ ]:
betas = p38.getBetas()
betas.shape
In this way, we can obtain the B-factor for every single atom. However, in some cases, we only need to know the B-factors of alpha-carbons,
In [ ]:
p38.ca
In [ ]:
betas = p38.ca.getBetas()
betas.shape
In [ ]:
plot(betas);
ylabel('B-factor');
xlabel('Residue index');
If we would like to use residue numbers in the PDB, instead of the indices, as the x-axis of the plot, it would be much more convenient to use the ProDy plotting function, showAtomicLines
.
In [ ]:
showAtomicLines(betas, atoms=p38.ca);
ylabel('B-factor');
xlabel('Residue number');
We can also obtain the secondary structure information as an array:
In [ ]:
p38.ca.getSecstrs()
To make it easier to read, we can convert the array into a string using the Python's built-in function, join
:
In [ ]:
''.join(p38.ca.getSecstrs())
C
is for coil, H
for alpha helix, I
for pi helix, G
for 3-10 helix, and E
for beta strand (sheet). To get a complete list of "get" function, you can type p38.get<TAB>
.
In [ ]:
p38
Speaking of which, in measure module, you can find various functions for calculations for structural properties. For example, you can calculate the phi angle of 11th residue:
In [ ]:
calcPhi(p38['A', 10])
In [ ]:
round(calcPhi(p38['A', 10]), 3)
A dihedral angle is the angle between two intersecting planes. In chemistry it is the angle between planes through two sets of three atoms, having two atoms in common. In proteins, there are two most interested dihedral angles, namely Phi and Psi, and they are illustrated as follows.
Note that the residue at N-terminus or C-terminus does not have a Phi or Psi angle, respectively. If we calculate the Phi and Psi angle for every non-terminal residue, we can obtain a Ramachandran plot for a protein. An example of Ramachandran plot for human PCNA is shown as follows:
Three favored regions (in red)--upper left: beta sheet; center left: alpha helix; center right: left-handed helix. Each blue data point corresponds to the two dihedrals of a residue. We will reproduce this plot for ubiquitin (we will only reproduce the points).
In [ ]:
chain = p38['A']
Phi = []; Psi = []; c = []
for res in chain.iterResidues():
try:
phi = calcPhi(res)
psi = calcPsi(res)
except:
continue
else:
Phi.append(phi)
Psi.append(psi)
if res.getResname() == 'GLY':
c.append('black')
else:
secstr = res.getSecstrs()[0]
if secstr == 'H':
c.append('red')
elif secstr == 'G':
c.append('darkred')
elif secstr == 'E':
c.append('blue')
else:
c.append('grey')
In the above code, we use an exception handler to exclude the terminal residues from the calculation.
In [ ]:
scatter(Phi, Psi, c=c, s=10);
xlabel('Phi (degree)');
ylabel('Psi (degree)');
In theory you could retrieve any set of atoms by indexing the AtomGroup
, but it would be cumbersome to do so. To make it more convienient, ProDy provides VMD-like syntax for selecting atoms. Here lists a few common selection strings, and for a more complete tutorial on selection, please see here.
In [ ]:
ca = p38.select('calpha')
ca
In [ ]:
p38.ca
In [ ]:
bb = p38.select('backbone')
p38.bb
We could also perform some simple selections right when the structure is being parsed. For example, we can specify that we would like to obtain only alpha-carbons of chain A of the p38 as follows:
In [ ]:
chainA_ca = parsePDB('1p38', chain='A', subset='ca')
We could find the chain A using selection (as an alternative to the indexing method shown above):
In [ ]:
chA = p38.select('calpha and chain A')
chA
Selection also works for finding a single residue or multiple residues:
In [ ]:
res = p38.ca.select('chain A and resnum 10')
res.getResnums()
In [ ]:
res = p38.ca.select('chain A and resnum 10 11 12')
res.getResnums()
In [ ]:
head = p38.ca.select('resnum < 50')
head.numAtoms()
We can also select a range of residues by:
In [ ]:
fragment = p38.ca.select('resnum 50 to 100')
If we have data associated to the full length of the protein, we can slice the data using the sliceAtomicData
:
In [ ]:
subbetas = sliceAtomicData(betas, atoms=p38.ca, select=fragment)
We can visualize the data of this range using showAtomicLines
:
In [ ]:
showAtomicLines(subbetas, atoms=fragment);
xlabel('Residue number');
ylabel('B factor');
Or highlight the subset in the plot of the whole protein:
In [ ]:
showAtomicLines(betas, atoms=p38.ca, overlay=True);
showAtomicLines(subbetas, atoms=fragment, overlay=True);
xlabel('Residue number');
ylabel('B factor');
Selection also allows us to extract particular amino acid types:
In [ ]:
args = p38.ca.select('resname ARG')
args
Again, combined with sliceAtomicData
and showAtomicLines
, we can highlight these residues in the plot of the whole protein:
In [ ]:
argbetas = sliceAtomicData(betas, atoms=p38.ca, select=args)
showAtomicLines(betas, atoms=p38.ca, overlay=True);
showAtomicLines(argbetas, atoms=args, linespec='r*', overlay=True);
xlabel('Residue number');
ylabel('B factor');
You can also compare different structures using some of the methods in proteins module. Let’s parse another p38 MAP kinase structure.
In [ ]:
bound = parsePDB('1zz2')
You can find similar chains in structure 1p38 and 1zz2 using matchChains
function
In [ ]:
results = matchChains(p38, bound)
results[0]
In Python, a tuple (or any indexable objects) can be unpacked by:
In [ ]:
apo_chA, bnd_chA, seqid, overlap = results[0]
apo_chA
In [ ]:
bnd_chA
The first two terms are the mapping of the proteins to each other. Then the third term is the sequence identity:
In [ ]:
seqid
and the forth term is the sequence coverage:
In [ ]:
overlap
If we calculate RMSD right now, we will obtain the value for the unsuperposed proteins:
In [ ]:
calcRMSD(bnd_chA, apo_chA)
After superposition, the RMSD will be much improved,
In [ ]:
bnd_chA, transformation = superpose(bnd_chA, apo_chA)
calcRMSD(bnd_chA, apo_chA)
In [ ]:
confProDy(auto_show=False)
showProtein(bnd_chA);
showProtein(apo_chA);
legend();
To visualize the superposition of the full proteins, we need to apply transform matrix to the entire structure:
In [ ]:
showProtein(p38);
showProtein(bound);
legend();
Using matplotlib
, we only obtained a very simple linear representation of proteins. ProDy also support a more sophisticated way of visualizing proteins in 3D via py3Dmol:
In [ ]:
import py3Dmol
showProtein(p38)
The limitation is that py3Dmol
only works in an iPython notebook. You can always write out the protein to a PDB file and visualize it in an external program:
In [ ]:
writePDB('bound_aligned.pdb', bnd_chA)