Unit 2: Programming Design

Lesson 13: File I/O - Pre-activity

Scientific Context: Biomolecule Databases

The sequencing of the human genome as well as emerging interest in proteomics and molecular structure have produced an explosion in the amount of biological information that's available. This flood of data has necessitated development of databases and associated software designed to update, query retrieve, and analyze components of the data stored within these systems. There are thousands of public and commercial biological databases containing huge amounts of nucleotide sequences of genes or amino acid sequences of proteins. These databases include both public repositories of gene data like GenBank (as of Feb 2017 contains 228,719,437,638 nucleotides in 199,341,377 sequences in the main database and 1,892,966,308,635 nucleotides in 409,490,397 in the whole genome shotgun database), EMBL's UniProt protein databases (as of Feb 2017 Swiss-Prot contains 553,655 proteins and TrEMBL contains 77,483,538 protein sequences), or the Protein Data Bank (PDB) which contains 3D structures, and private databases like those used by research groups involved in gene mapping projects or chemical compound databases held by biotech and pharmaceutical companies.

Science Fundamentals: Molecular data files

When acquiring details of molecular composition from an electronic database, you will find the data presented in a wide variety of formats. For example, one of the simplest sequence file formats used in many biological databases is the FASTA file. This text-based format represents either nucleic acid or protein sequences using the single-letter code introduced previously. An example of a twenty-residue peptide Trp-cage (the fastest known folding peptide) in FASTA format is given in Figure 1 below:

>1L2Y:A|PDBID|CHAIN|SEQUENCE
NLYIQWLKDGGPSSGRPPPS

Figure 1: FASTA file for the 20 amino acid Trp-cage protein

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is labeled by a greater-than (">") symbol in the first column. The words following the ">" symbol is the identifier of the sequence (e.g., 1L2Y), and the rest of the line is the description (e.g., A:PDBID). The next line (until a hard return) is the sequence information, in this case it is in the single letter amino acid code. If another line starting with a ">" appears, multiple sequences are represented, so it is possible for a single FASTA file to contain many sequences, each starting with an information line starting with “>”.

For more detailed structural information, atomic coordinate files are available identifying an x-y-z coordinate for each atom. These structures are typically derived from experimental studies of molecules (for example, x-ray diffraction or nuclear magnetic resonance (NMR) analyses).

One of the most popular atomic coordinate files for macromolecules is called a "PDB file". In short, a PDB file is broken into two sections:

  1. a header that contains a variety of background information such as authors and experimental conditions
  2. 3D Cartesian coordinate data including atom labels and experimental parameters.

An excerpt from the coordinate section of a PDB file is given in Figure 2 detailing the positions of all atoms (except hydrogen) for the first amino acid in the protein Trp-cage protein structure 1L2Y (PDB #). The line labeled "MODEL" signifies the beginning of the coordinate section below:

MODEL 1                                                                  
ATOM      1  N   ASN A   1      -8.901   4.127  -0.555  1.00  0.00     N  
ATOM      2  CA  ASN A   1      -8.608   3.135  -1.618  1.00  0.00     C  
ATOM      3  C   ASN A   1      -7.117   2.964  -1.897  1.00  0.00     C  
ATOM      4  O   ASN A   1      -6.634   1.849  -1.758  1.00  0.00     O  
ATOM      5  CB  ASN A   1      -9.437   3.396  -2.889  1.00  0.00     C  
ATOM      6  CG  ASN A   1     -10.915   3.130  -2.611  1.00  0.00     C  
ATOM      7  OD1 ASN A   1     -11.269   2.700  -1.524  1.00  0.00     O  
ATOM      8  ND2 ASN A   1     -11.806   3.406  -3.543  1.00  0.00     N

Figure 2: Example atomic coordinates from the 1L2Y PDB file

If there is only 1 "MODEL" then the word need not appear, the beginning of the coordinates is the first line that starts with "ATOM". If there are additional lines starting with "MODEL", multiple structural representations of the same molecule are contained in the file. The information in the coordinate section of a PDB file is often used as input for visualization and simulation computer programs used in molecular modeling applications. For more information on the file format, see: Introduction to Protein Data Bank Format (especially the section "Examples of PDB Format" toward the bottom.)

Pre-Activity Questions

1. Go to the RSCB protein data bank (http://www.rcsb.org) and download or view the pdb file: PDB Format (Text) corresponding to the NMR structure of Trp-cage protein structure 2LDJ (PDB #). Identify the first amino acid and the XYZ coordinates of the first atom.