In this notebook we will do some simple parsings and analysis on a Protein Databank (PDB) file. You don't need to care about proteins to follow this example. The idea is that some (downloaded) files may need some changes before they can be fed into your favourite program or that you may want to perform some simple analisis on them.
For a regular work on PDB files I recommend the use of the Biopython extension Bio.PDB
In this tutorial we will:
The data directory contains some pdbs.
The PDB is a fixed format. Its format is detailed here. What is relevant here is that the name of the protein residue occupies columns 18-20 and that $x, y, z$ coordinates occupy 31-38, 39-46, 47-54. The numbering start from 1. Atomic positions are in lines starting with ATOM or HETATM records.
In [2]:
filein = open('../data/protein.pdb', 'r')
fileout = open('../data/protein_hie.pdb', 'w')
#Finish...
filein.close()
fileout.close()
We will only consider the standard protonation state at pH=7. Of course, we need to define the negative residues and the positive residues. You can find the charged residues here
We could use a list, but as the order here is irrelevant, we will use a set (for large collections, checking the presence of an object in a set is much faster than checking in a list. In our case, this is irrelevant).
In [6]:
negative = set(['ARG', 'LYS']) #We consider histidines (HIS) neutral and epsilon protonated (thus the HIE name)
positive = #Finish
charged = #Finish using union method
Now we need to count the number of residues. The problem is that if we count the number of GLU, ASP,... occurrences, we will be counting atoms, not residues. And as each residue has a different number of atoms, the translation from atoms to residues is not trivial. But residues are numbered in a PDB. They correspond to columns 23-26.
We need to find a change in that column. As a starter, let's see if we can count all the residues this way: by counting the times there is residue change. I can tell you that the protein as 903 residues.
In [3]:
total_res = 0
filein = open('data/protein.pdb', 'r')
# Finish...
print("Total number of residues: ", total_res)
Good! So now that we know how to detect a change in residue, we need to count how many of these residues are in the charged
set.
In [1]:
total_charged = 0
charge = 0
filein = open('data/protein.pdb', 'r')
# Finish
print("Total number of charged residues: ", total_charged)
print("Net charge of the protein: ", charge)