BioPandas

Author: Sebastian Raschka mail@sebastianraschka.com
License: BSD 3 clause
Project Website: http://rasbt.github.io/biopandas/
Code Repository: https://github.com/rasbt/biopandas



In [1]:

    
%load_ext watermark
%watermark -d -u -p pandas,biopandas









    



last updated: 2019-02-05 

pandas 0.23.4
biopandas 0.2.4



In [2]:

    
from biopandas.pdb import PandasPdb
import pandas as pd
pd.set_option('display.width', 600)
pd.set_option('display.max_columns', 8)

Working with PDB Structures in DataFrames

Loading PDB Files

There are 2 1/2 ways to load a PDB structure into a PandasPdb object.

1

PDB files can be directly fetched from The Protein Data Bank at http://www.rcsb.org via its unique 4-letter after initializing a new PandasPdb object and calling the fetch_pdb method:



In [3]:

    
from biopandas.pdb import PandasPdb

# Initialize a new PandasPdb object
# and fetch the PDB file from rcsb.org
ppdb = PandasPdb().fetch_pdb('3eiy')

2 a)

Alternatively, we can load PDB files from local directories as regular PDB files using read_pdb:



In [4]:

    
ppdb.read_pdb('./data/3eiy.pdb')









    Out[4]:





<biopandas.pdb.pandas_pdb.PandasPdb at 0x11912b710>

[File link: 3eiy.pdb]

2 b)

Or, we can load them from gzip archives like so (note that the file must end with a '.gz' suffix in order to be recognized as a gzip file):



In [5]:

    
ppdb.read_pdb('./data/3eiy.pdb.gz')









    Out[5]:





<biopandas.pdb.pandas_pdb.PandasPdb at 0x11912b710>

[File link: 3eiy.pdb.gz]

After the file was succesfully loaded, we have access to the following attributes:



In [6]:

    
print('PDB Code: %s' % ppdb.code)
print('PDB Header Line: %s' % ppdb.header)
print('\nRaw PDB file contents:\n\n%s\n...' % ppdb.pdb_text[:1000])









    



PDB Code: 3eiy
PDB Header Line:     HYDROLASE                               17-SEP-08   3EIY

Raw PDB file contents:

HEADER    HYDROLASE                               17-SEP-08   3EIY              
TITLE     CRYSTAL STRUCTURE OF INORGANIC PYROPHOSPHATASE FROM BURKHOLDERIA      
TITLE    2 PSEUDOMALLEI WITH BOUND PYROPHOSPHATE                                
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: INORGANIC PYROPHOSPHATASE;                                 
COMPND   3 CHAIN: A;                                                            
COMPND   4 EC: 3.6.1.1;                                                         
COMPND   5 ENGINEERED: YES                                                      
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: BURKHOLDERIA PSEUDOMALLEI 1710B;                
SOURCE   3 ORGANISM_TAXID: 320372;                                              
SOURCE   4 GENE: PPA, BURPS1710B_1237;                                          
SOURCE   5 EXPRESSION_SYSTEM
...

The most interesting / useful attribute is the PandasPdb.df DataFrame dictionary though, which gives us access to the PDB files as pandas DataFrames. Let's print the first 3 lines from the ATOM coordinate section to see how it looks like:



In [7]:

    
ppdb.df['ATOM'].head(3)









    Out[7]:







  
    
      
      record_name
      atom_number
      blank_1
      atom_name
      ...
      segment_id
      element_symbol
      charge
      line_idx
    
  
  
    
      0
      ATOM
      1
      
      N
      ...
      
      N
      NaN
      609
    
    
      1
      ATOM
      2
      
      CA
      ...
      
      C
      NaN
      610
    
    
      2
      ATOM
      3
      
      C
      ...
      
      C
      NaN
      611
    
  

3 rows × 21 columns

But more on that in the next section.

Looking at PDBs in DataFrames

PDB files are parsed according to the PDB file format description. More specifically, BioPandas reads the columns of the ATOM and HETATM sections as shown in the following excerpt from http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM.

COLUMNS	DATA TYPE	CONTENTS	biopandas column name
1 - 6	Record name	"ATOM"	record_name
7 - 11	Integer	Atom serial number.	atom_number
12			blank_1
13 - 16	Atom	Atom name.	atom_name
17	Character	Alternate location indicator.	alt_loc
18 - 20	Residue name	Residue name.	residue_name
21			blank_2
22	Character	Chain identifier.	chain_id
23 - 26	Integer	Residue sequence number.	residue_number
27	AChar	Code for insertion of residues.	insertion
28 - 30			blank_3
31 - 38	Real(8.3)	Orthogonal coordinates for X in Angstroms.	x_coord
39 - 46	Real(8.3)	Orthogonal coordinates for Y in Angstroms.	y_coord
47 - 54	Real(8.3)	Orthogonal coordinates for Z in Angstroms.	z_coord
55 - 60	Real(6.2)	Occupancy.	occupancy
61 - 66	Real(6.2)	Temperature factor (Default = 0.0).	bfactor
67-72			blank_4
73 - 76	LString(4)	Segment identifier, left-justified.	segment_id
77 - 78	LString(2)	Element symbol, right-justified.	element_symbol
79 - 80	LString(2)	Charge on the atom.	charge

Below is an example of how this would look like in an actual PDB file:

Example: 
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N
ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85      A1   C
ATOM    147  C   VAL A  25      30.447  15.105  58.363  1.00 12.34      A1   C
ATOM    148  O   VAL A  25      29.520  15.059  59.174  1.00 15.65      A1   O
ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C
ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C
ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64      A1   C
ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11      A1   C
ATOM    153  CG2AVAL A  25      30.835  18.826  57.661  0.28 13.58      A1   C
ATOM    154  CG2BVAL A  25      29.909  16.996  55.922  0.72 13.25      A1   C

After loading a PDB file from rcsb.org or our local drive, the PandasPdb.df attribute should contain the following 4 DataFrame objects:



In [8]:

    
from biopandas.pdb import PandasPdb
ppdb = PandasPdb()
ppdb.read_pdb('./data/3eiy.pdb')
ppdb.df.keys()









    Out[8]:





dict_keys(['ATOM', 'HETATM', 'ANISOU', 'OTHERS'])

[File link: 3eiy.pdb]

'ATOM': contains the entries from the ATOM coordinate section
'ATOM': ... entries from the "HETATM" coordinate section
'ANISOU': ... entries from the "ANISOU" coordinate section
'OTHERS': Everything else that is not a 'ATOM', 'HETATM', or 'ANISOU' entry

The columns of the 'HETATM' DataFrame are indentical to the 'ATOM' DataFrame that we've seen earlier:



In [9]:

    
ppdb.df['HETATM'].head(2)









    Out[9]:







  
    
      
      record_name
      atom_number
      blank_1
      atom_name
      ...
      segment_id
      element_symbol
      charge
      line_idx
    
  
  
    
      0
      HETATM
      1332
      
      K
      ...
      
      K
      NaN
      1940
    
    
      1
      HETATM
      1333
      
      NA
      ...
      
      NA
      NaN
      1941
    
  

2 rows × 21 columns

Note that "ANISOU" entries are handled a bit differently as specified at http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM.



In [10]:

    
ppdb.df['ANISOU'].head(2)









    Out[10]:







  
    
      
      record_name
      atom_number
      blank_1
      atom_name
      ...
      blank_4
      element_symbol
      charge
      line_idx
    
  
  
  

0 rows × 21 columns

Not every PDB file contains ANISOU entries (similarly, some PDB files may only contain HETATM or ATOM entries). If records are basent, the DataFrame will be empty as show above.



In [11]:

    
ppdb.df['ANISOU'].empty









    Out[11]:





True

Since the DataFrames are fairly wide, let's us take a look at the columns by accessing the DataFrame's column attribute:



In [12]:

    
ppdb.df['ANISOU'].columns









    Out[12]:





Index(['record_name', 'atom_number', 'blank_1', 'atom_name', 'alt_loc', 'residue_name', 'blank_2', 'chain_id', 'residue_number', 'insertion', 'blank_3', 'U(1,1)', 'U(2,2)', 'U(3,3)', 'U(1,2)', 'U(1,3)', 'U(2,3)', 'blank_4', 'element_symbol', 'charge', 'line_idx'], dtype='object')

ANISOU records are very similar to ATOM/HETATM records. In fact, the columns 7 - 27 and 73 - 80 are identical to their corresponding ATOM/HETATM records, which means that the 'ANISOU' DataFrame doesn't have the following entries:



In [13]:

    
set(ppdb.df['ATOM'].columns).difference(set(ppdb.df['ANISOU'].columns))









    Out[13]:





{'b_factor', 'occupancy', 'segment_id', 'x_coord', 'y_coord', 'z_coord'}

Instead, the "ANISOU" DataFrame contains the anisotropic temperature factors "U(-,-)" -- note that these are scaled by a factor of $10^4$ ($\text{Angstroms}^2$) by convention.



In [14]:

    
set(ppdb.df['ANISOU'].columns).difference(set(ppdb.df['ATOM'].columns))









    Out[14]:





{'U(1,1)', 'U(1,2)', 'U(1,3)', 'U(2,2)', 'U(2,3)', 'U(3,3)'}

Ah, another interesting thing to mention is that the columns already come with the types you'd expect (where object essentially "means" str here):



In [15]:

    
ppdb.df['ATOM'].dtypes









    Out[15]:





record_name        object
atom_number         int64
blank_1            object
atom_name          object
alt_loc            object
residue_name       object
blank_2            object
chain_id           object
residue_number      int64
insertion          object
blank_3            object
x_coord           float64
y_coord           float64
z_coord           float64
occupancy         float64
b_factor          float64
blank_4            object
segment_id         object
element_symbol     object
charge            float64
line_idx            int64
dtype: object

Typically, all good things come in threes, however, there is a 4th DataFrame, an'OTHER' DataFrame, which contains everything that wasn't parsed as 'ATOM', 'HETATM', or 'ANISOU' coordinate section:



In [16]:

    
ppdb.df['OTHERS'].head(5)









    Out[16]:







  
    
      
      record_name
      entry
      line_idx
    
  
  
    
      0
      HEADER
      HYDROLASE                               17...
      0
    
    
      1
      TITLE
      CRYSTAL STRUCTURE OF INORGANIC PYROPHOSPHA...
      1
    
    
      2
      TITLE
      2 PSEUDOMALLEI WITH BOUND PYROPHOSPHATE
      2
    
    
      3
      COMPND
      MOL_ID: 1;
      3
    
    
      4
      COMPND
      2 MOLECULE: INORGANIC PYROPHOSPHATASE;
      4

Although these 'OTHER' entries are typically less useful for structure-related computations, you may still want to take a look at them to get a short summary of the PDB structure and learn about it's potential quirks and gotchas (typically listed in the REMARKs section). Lastly, the "OTHERS" DataFrame comes in handy if we want to reconstruct the structure as PDB file as we will see later (note the line_idx columns in all of the DataFrames).

Working with PDB DataFrames

In the previous sections, we've seen how to load PDB structures into DataFrames, and how to access them. Now, let's talk about manipulating PDB files in DataFrames.



In [17]:

    
from biopandas.pdb import PandasPdb
ppdb = PandasPdb()
ppdb.read_pdb('./data/3eiy.pdb.gz')
ppdb.df['ATOM'].head()









    Out[17]:







  
    
      
      record_name
      atom_number
      blank_1
      atom_name
      ...
      segment_id
      element_symbol
      charge
      line_idx
    
  
  
    
      0
      ATOM
      1
      
      N
      ...
      
      N
      NaN
      609
    
    
      1
      ATOM
      2
      
      CA
      ...
      
      C
      NaN
      610
    
    
      2
      ATOM
      3
      
      C
      ...
      
      C
      NaN
      611
    
    
      3
      ATOM
      4
      
      O
      ...
      
      O
      NaN
      612
    
    
      4
      ATOM
      5
      
      CB
      ...
      
      C
      NaN
      613
    
  

5 rows × 21 columns

[File link: 3eiy.pdb.gz]

Okay, there's actually not that much to say ...
Once we have our PDB file in the DataFrame format, we have the whole convenience of pandas right there at our fingertips.

For example, let's get all Proline residues:



In [18]:

    
ppdb.df['ATOM'][ppdb.df['ATOM']['residue_name'] == 'PRO'].head()









    Out[18]:







  
    
      
      record_name
      atom_number
      blank_1
      atom_name
      ...
      segment_id
      element_symbol
      charge
      line_idx
    
  
  
    
      38
      ATOM
      39
      
      N
      ...
      
      N
      NaN
      647
    
    
      39
      ATOM
      40
      
      CA
      ...
      
      C
      NaN
      648
    
    
      40
      ATOM
      41
      
      C
      ...
      
      C
      NaN
      649
    
    
      41
      ATOM
      42
      
      O
      ...
      
      O
      NaN
      650
    
    
      42
      ATOM
      43
      
      CB
      ...
      
      C
      NaN
      651
    
  

5 rows × 21 columns

Or main chain atoms:



In [19]:

    
ppdb.df['ATOM'][ppdb.df['ATOM']['atom_name'] == 'C'].head()









    Out[19]:







  
    
      
      record_name
      atom_number
      blank_1
      atom_name
      ...
      segment_id
      element_symbol
      charge
      line_idx
    
  
  
    
      2
      ATOM
      3
      
      C
      ...
      
      C
      NaN
      611
    
    
      8
      ATOM
      9
      
      C
      ...
      
      C
      NaN
      617
    
    
      19
      ATOM
      20
      
      C
      ...
      
      C
      NaN
      628
    
    
      25
      ATOM
      26
      
      C
      ...
      
      C
      NaN
      634
    
    
      33
      ATOM
      34
      
      C
      ...
      
      C
      NaN
      642
    
  

5 rows × 21 columns

It's also easy to strip our coordinate section from hydrogen atoms if there are any ...



In [20]:

    
ppdb.df['ATOM'][ppdb.df['ATOM']['element_symbol'] != 'H'].head()









    Out[20]:







  
    
      
      record_name
      atom_number
      blank_1
      atom_name
      ...
      segment_id
      element_symbol
      charge
      line_idx
    
  
  
    
      0
      ATOM
      1
      
      N
      ...
      
      N
      NaN
      609
    
    
      1
      ATOM
      2
      
      CA
      ...
      
      C
      NaN
      610
    
    
      2
      ATOM
      3
      
      C
      ...
      
      C
      NaN
      611
    
    
      3
      ATOM
      4
      
      O
      ...
      
      O
      NaN
      612
    
    
      4
      ATOM
      5
      
      CB
      ...
      
      C
      NaN
      613
    
  

5 rows × 21 columns

Or, let's compute the average temperature factor of our protein main chain:



In [21]:

    
mainchain = ppdb.df['ATOM'][(ppdb.df['ATOM']['atom_name'] == 'C') | 
                            (ppdb.df['ATOM']['atom_name'] == 'O') | 
                            (ppdb.df['ATOM']['atom_name'] == 'N') | 
                            (ppdb.df['ATOM']['atom_name'] == 'CA')]

bfact_mc_avg = mainchain['b_factor'].mean()
print('Average B-Factor [Main Chain]: %.2f' % bfact_mc_avg)









    



Average B-Factor [Main Chain]: 28.83

Plotting

Since we are using pandas under the hood, which in turns uses matplotlib under the hood, we can produce quick summary plots of our PDB structures relatively conveniently:



In [22]:

    
from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/3eiy.pdb.gz')

[File link: 3eiy.pdb.gz]



In [23]:

    
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')



In [24]:

    
ppdb.df['ATOM']['b_factor'].plot(kind='hist')
plt.title('Distribution of B-Factors')
plt.xlabel('B-factor')
plt.ylabel('count')
plt.show()



In [25]:

    
ppdb.df['ATOM']['b_factor'].plot(kind='line')
plt.title('B-Factors Along the Amino Acid Chain')
plt.xlabel('Residue Number')
plt.ylabel('B-factor in $A^2$')
plt.show()



In [26]:

    
ppdb.df['ATOM']['element_symbol'].value_counts().plot(kind='bar')
plt.title('Distribution of Atom Types')
plt.xlabel('elements')
plt.ylabel('count')
plt.show()

Computing the Root Mean Square Deviation

BioPandas also comes with certain convenience functions, for example, ...

The Root-mean-square deviation (RMSD) is simply a measure of the average distance between atoms of 2 protein or ligand structures. This calculation of the Cartesian error follows the equation:

$$RMSD(a, b) = \sqrt{\frac{1}{n} \sum^{n}_{i=1} \big((a_{ix})^2 + (a_{iy})^2 + (a_{iz})^2 \big)} \\ = \sqrt{\frac{1}{n} \sum^{n}_{i=1} || a_i + b_i||_2^2}$$

So, assuming that the we have the following 2 conformations of a ligand molecule

we can compute the RMSD as follows:



In [27]:

    
from biopandas.pdb import PandasPdb

l_1 = PandasPdb().read_pdb('./data/lig_conf_1.pdb')
l_2 = PandasPdb().read_pdb('./data/lig_conf_2.pdb')
r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'],
                   s=None) # all atoms, including hydrogens
print('RMSD: %.4f Angstrom' % r)









    



RMSD: 2.6444 Angstrom

[File links: lig_conf_1.pdb, lig_conf_2.pdb]



In [28]:

    
r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'], 
                   s='carbon') # carbon atoms only
print('RMSD: %.4f Angstrom' % r)









    



RMSD: 3.1405 Angstrom



In [29]:

    
r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'], 
                   s='heavy') # heavy atoms only
print('RMSD: %.4f Angstrom' % r)









    



RMSD: 1.9959 Angstrom

Similarly, we can compute the RMSD between 2 related protein structures:

The hydrogen-free RMSD:



In [30]:

    
p_1 = PandasPdb().read_pdb('./data/1t48_995.pdb')
p_2 = PandasPdb().read_pdb('./data/1t49_995.pdb')
r = PandasPdb.rmsd(p_1.df['ATOM'], p_2.df['ATOM'], s='heavy')
print('RMSD: %.4f Angstrom' % r)









    



RMSD: 0.7377 Angstrom

Or the RMSD between the main chains only:



In [31]:

    
p_1 = PandasPdb().read_pdb('./data/1t48_995.pdb')
p_2 = PandasPdb().read_pdb('./data/1t49_995.pdb')
r = PandasPdb.rmsd(p_1.df['ATOM'], p_2.df['ATOM'], s='main chain')
print('RMSD: %.4f Angstrom' % r)









    



RMSD: 0.4781 Angstrom

Filtering PDBs by Distance

We can use the distance method to compute the distance between each atom (or a subset of atoms) in our data frame and a three-dimensional reference point. For example:



In [32]:

    
p_1 = PandasPdb().read_pdb('./data/3eiy.pdb')

reference_point = (9.362, 41.410, 10.542)
distances = p_1.distance(xyz=reference_point, records=('ATOM',))

[File link: 3eiy.pdb]

The distance method returns a Pandas Series object:



In [33]:

    
distances.head()









    Out[33]:





0    19.267419
1    18.306060
2    16.976934
3    16.902897
4    18.124171
dtype: float64

And we can use this Series object, for instance, to select certain atoms in our DataFrame that fall within a desired distance threshold. For example, let's select all atoms that are within 7A of our reference point:



In [34]:

    
all_within_7A = p_1.df['ATOM'][distances < 7.0]
all_within_7A.tail()









    Out[34]:







  
    
      
      record_name
      atom_number
      blank_1
      atom_name
      ...
      segment_id
      element_symbol
      charge
      line_idx
    
  
  
    
      786
      ATOM
      787
      
      CB
      ...
      
      C
      NaN
      1395
    
    
      787
      ATOM
      788
      
      CG
      ...
      
      C
      NaN
      1396
    
    
      788
      ATOM
      789
      
      CD1
      ...
      
      C
      NaN
      1397
    
    
      789
      ATOM
      790
      
      CD2
      ...
      
      C
      NaN
      1398
    
    
      790
      ATOM
      791
      
      N
      ...
      
      N
      NaN
      1399
    
  

5 rows × 21 columns

Visualized in PyMOL, this subset (yellow surface) would look as follows:

Converting Amino Acid codes from 3- to 1-letter codes

Residues in the residue_name field can be converted into 1-letter amino acid codes, which may be useful for further sequence analysis, for example, pair-wise or multiple sequence alignments:



In [35]:

    
from biopandas.pdb import PandasPdb
ppdb = PandasPdb().fetch_pdb('5mtn')
sequence = ppdb.amino3to1()
sequence.tail()









    Out[35]:







  
    
      
      chain_id
      residue_name
    
  
  
    
      1378
      B
      I
    
    
      1386
      B
      N
    
    
      1394
      B
      Y
    
    
      1406
      B
      R
    
    
      1417
      B
      T

As shown above, the amino3to1 method returns a DataFrame containing the chain_id and residue_name of the translated 1-letter amino acids. If you like to work with the sequence as a Python list of string characters, you could do the following:



In [36]:

    
sequence_list = list(sequence.loc[sequence['chain_id'] == 'A', 'residue_name'])
sequence_list[-5:] # last 5 residues of chain A









    Out[36]:





['V', 'R', 'H', 'Y', 'T']

And if you prefer to work with the sequence as a string, you can use the join method:



In [37]:

    
''.join(sequence.loc[sequence['chain_id'] == 'A', 'residue_name'])









    Out[37]:





'SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYT'

To iterate over the sequences of multi-chain proteins, you can use the unique method as shown below:



In [38]:

    
for chain_id in sequence['chain_id'].unique():
    print('\nChain ID: %s' % chain_id)
    print(''.join(sequence.loc[sequence['chain_id'] == chain_id, 'residue_name']))









    



Chain ID: A
SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYT

Chain ID: B
SVSSVPTKLEVVAATPTSLLISWDAPAVTVVYYLITYGETGSPWPGGQAFEVPGSKSTATISGLKPGVDYTITVYAHRSSYGYSENPISINYRT

Wrapping it up - Saving PDB structures

Finally, let's talk about how to get the PDB structures out of the DataFrame format back into the beloved .pdb format.

Let's say we loaded a PDB structure, removed it from it's hydrogens:



In [39]:

    
from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/3eiy.pdb.gz')
ppdb.df['ATOM'] = ppdb.df['ATOM'][ppdb.df['ATOM']['element_symbol'] != 'H']

[File link: 3eiy.pdb.gz]

We can save the file using the PandasPdb.to_pdb method:



In [40]:

    
ppdb.to_pdb(path='./data/3eiy_stripped.pdb', 
            records=None, 
            gz=False, 
            append_newline=True)

[File link: 3eiy_stripped.pdb]

By default, all records (that is, 'ATOM', 'HETATM', 'OTHERS', 'ANISOU') are written if we set records=None. Alternatively, let's say we want to get rid of the 'ANISOU' entries and produce a compressed gzip archive of our PDB structure:



In [41]:

    
ppdb.to_pdb(path='./data/3eiy_stripped.pdb.gz', 
            records=['ATOM', 'HETATM', 'OTHERS'], 
            gz=True, 
            append_newline=True)

[File link: 3eiy_stripped.pdb.gz]

	record_name	entry	line_idx
0	HEADER	HYDROLASE 17...	0
1	TITLE	CRYSTAL STRUCTURE OF INORGANIC PYROPHOSPHA...	1
2	TITLE	2 PSEUDOMALLEI WITH BOUND PYROPHOSPHATE	2
3	COMPND	MOL_ID: 1;	3
4	COMPND	2 MOLECULE: INORGANIC PYROPHOSPHATASE;	4

	record_name	atom_number	atom_name	...	element_symbol	charge	line_idx
786	ATOM	787	CB	...	C	NaN	1395
787	ATOM	788	CG	...	C	NaN	1396
788	ATOM	789	CD1	...	C	NaN	1397
789	ATOM	790	CD2	...	C	NaN	1398
790	ATOM	791	N	...	N	NaN	1399