Parsing a Protein Databank File

In this notebook we will do some simple parsings and analysis on a Protein Databank (PDB) file. You don't need to care about proteins to follow this example. The idea is that some (downloaded) files may need some changes before they can be fed into your favourite program or that you may want to perform some simple analisis on them.

For a regular work on PDB files I recommend the use of the Biopython extension Bio.PDB

In this tutorial we will:

  1. Rename HIS residues to HIE residues
  2. Get the total charge of the protein and the number of charged residues.
  3. Translate the protein to its center of mass.

The data directory contains some pdbs.

The PDB is a fixed format. Its format is detailed here. What is relevant here is that the name of the protein residue occupies columns 18-20 and that $x, y, z$ coordinates occupy 31-38, 39-46, 47-54. The numbering start from 1. Atomic positions are in lines starting with ATOM or HETATM records.

HIS to HIE


In [2]:
filein = open('../data/protein.pdb', 'r')
fileout = open('../data/protein_hie.pdb', 'w')
#Finish...
        
filein.close()
fileout.close()

Charge calculations

We will only consider the standard protonation state at pH=7. Of course, we need to define the negative residues and the positive residues. You can find the charged residues here

We could use a list, but as the order here is irrelevant, we will use a set (for large collections, checking the presence of an object in a set is much faster than checking in a list. In our case, this is irrelevant).


In [6]:
negative = set(['ARG', 'LYS']) #We consider histidines (HIS) neutral and epsilon protonated (thus the HIE name)
positive = #Finish
charged = #Finish using union method

Now we need to count the number of residues. The problem is that if we count the number of GLU, ASP,... occurrences, we will be counting atoms, not residues. And as each residue has a different number of atoms, the translation from atoms to residues is not trivial. But residues are numbered in a PDB. They correspond to columns 23-26.

We need to find a change in that column. As a starter, let's see if we can count all the residues this way: by counting the times there is residue change. I can tell you that the protein as 903 residues.


In [3]:
total_res = 0
filein = open('data/protein.pdb', 'r')

# Finish...


print("Total number of residues: ", total_res)


Total number of residues:  903

Good! So now that we know how to detect a change in residue, we need to count how many of these residues are in the charged set.


In [1]:
total_charged = 0
charge = 0
filein = open('data/protein.pdb', 'r')

# Finish

print("Total number of charged residues: ", total_charged)
print("Net charge of the protein: ", charge)


Total number of charged residues:  0
Net charge of the protein:  0