Parsing and manipulating PDB files


In [ ]:
%matplotlib inline
from matplotlib import pyplot as plt

import numpy as np
import pandas as pd
  • Two important pieces of metadata for a crystal structure are $R$ and $R_{free}$. These measure how well the crystallographic model fits experimental data used to generate the model. The location of these numbers are indicated in pdb files by the following text.

    REMARK   3   R VALUE     (WORKING + TEST SET)
    
    and 
    
    REMARK   3   FREE R VALUE

    Write a function called get_R that takes a string specifying a pdb file as an argument, extracts $R$ and $R_{free}$ from the file, and then returns them as a tuple of floats (R,R_free).


In [ ]:

  • Create a histogram of all $C_{\alpha,i} \rightarrow C_{\alpha,i+1}$ distances in the pdb file 1stn.pdb, where $i$ counts along residues in the sequence. (This means the distance between $C_{\alpha}$ for residue 1 and 2, then for 2 and 3, etc.)

In [ ]:

Write a function called center_of_mass that calculates the center of mass for a pdb file. It should take a string specifying the pdb file as an argument and return a tuple with the center of mass in (x, y, z). The center of mass is the average of the coordinates of the protein atoms, weighted by their mass. The type of each atom is given on the far right of each line of the pdb. You should use the following masses for the atoms:

atom mass
C 12.0107
N 14.0067
O 15.9994
S 32.065

In [ ]:

The HN hydrogen atom attached to the N is often not in crystal structures because it is invisible in the diffraction experiment used to make the model. Unfortunately, the HN atom coordinates are necessary to calculate things like structural biologists care about like Ramachandran plots. The missing atom is indicated with the red arrow in the figure below.

The function below (calc_hn) calculates the position of the HN atom for the $i^{th}$ residue given the coordinates of the $(i-1)^{th}$ C atom (red sphere in picture), the $i^{th}$ N atom (cyan sphere in picture) and the $i^{th}$ CA atom (cyan sphere in picture). Write a program that takes a pdb file as input, calculates the position of each HN atom, and then writes out a new pdb file with the HN atoms written out as lines just after the N atoms. This means the line encoding the position of N for residue 46 would be followed by a new line encoding the position of HN for residue 46. You do not have to renumber the atoms in the file (but bonus points if you do).


In [ ]:
def calc_hn(CO_i_minus_one,N_i,CA_i):
    """
    Calculate the position of the HN proton. 
    
    CO_i_minus_one: array of x,y,z coordinates for *previous* 
                    "C" atom
    N_i: array of x,y,z coordinates for current "N" atom.
    CA_i: array of x,y,z coordinates for current "CA" atom.
    
    Returns: x,y,z array for HN proton.
    """
    
    # Center on N
    Ca = CA_i           - N_i
    Co = CO_i_minus_one - N_i

    # Get length of relevant vectors
    length_HN = 1.02
    length_Ca = np.sqrt(np.sum(Ca**2))
    length_Co = np.sqrt(np.sum(Co**2))

    # Dot product of H and C
    H_dot_C = length_HN*length_Co*np.cos(119.0*np.pi/180.0)

    xo = Co[0]
    yo = Co[1]
    zo = Co[2]

    xa = Ca[0]
    ya = Ca[1]
    za = Ca[2]
    
    Q = length_Ca/length_Co
    A = (xo + Q*xa)
    B = (yo + Q*ya)
    C = (zo + Q*za)

    xh = H_dot_C/(xo + yo*B/A + zo*C/A)
    yh = xh*B/A
    zh = xh*C/A

    # Translate HN back to original coordinates
    HN_i = np.array((xh,yh,zh)) + N_i
    
    return HN_i

In [ ]:


In [ ]: