Unit 2: Programming Design

Lesson 10: Strings

Scientific Context: Bioinformatics & Computational Biology

Bioinformatics and computational biology involve the use of computers to characterize the molecular composition of living things. Together they form the field of science in which biology, computer science, and information technology merge into a single discipline. Most biologists talk about "doing bioinformatics" when they use computers to store, retrieve, analyze, or predict the composition or structure of anything from a single biomolecule up to multiple genomes (all of the DNA in an organism) and proteomes (all of the proteins of an organism). Currently, one of the most important tasks in this field is the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The results of these analyses are often new hypotheses in evolutionary biology, infectious disease, and cancer biology.

Scientific Fundamentals: Protein & Nucleic Acid Structure

Proteins are polymers of amino acids linked by peptide bonds. The 20 amino acids found in proteins are similar in molecular structure and differ only in a side chain (R group) that gives each amino acid its name and chemical properties. The specific sequence of amino acids in a protein (referred to as its primary structure) determines the ultimate 3D structure of the protein.

DNA is a long polymer of molecular units called nucleotides held together by a backbone made of alternating sugars and phosphate groups. DNA strands have distinct ends and a directionality related to the polarity of this sugar-phosphate backbone. One end is called the 5' (five prime) end and the other the 3' (three prime) end. Attached to each sugar is one of four nitrogenous bases: adenine (A), cytosine (C), guanine (G), or thymine (T). In living organisms, DNA is composed of two strands of these nucleotides wound around each other in the form of a double helix. The two strands are referred to as complementary because each base of one strand forms hydrogen bonds with a base of the other strand, with A bonding only to T and C bonding only to G. Because of this base pairing, the complementary strand of any single-stranded DNA sequence can be deduced. In addition, the two strands run antiparallel, that is, with opposite chemical polarity: one strand runs in the 5'→3' direction while the complementary strand runs in the 3'→5' direction, as shown in Figure 1, below.
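Since this lesson treats sequences as strings, the base-pairing and antiparallel rules above can be sketched in a few lines of Python. This is only an illustration: the function names `complement` and `reverse_complement` and the example sequence are made up for this sketch, and the code assumes the input contains only the letters A, C, G, and T.

```python
# Pairing rules from the text: A pairs with T, and C pairs with G.
# str.maketrans builds a character-for-character translation table.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Return the paired base for each position, in the same left-to-right order.

    If the input is written 5'->3', the result reads 3'->5'.
    """
    return strand.upper().translate(PAIRING)


def reverse_complement(strand):
    """Return the complementary strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]


top_strand = "ATGGCA"  # hypothetical sequence, written 5'->3'
print(complement(top_strand))          # TACCGT (reads 3'->5')
print(reverse_complement(top_strand))  # TGCCAT (reads 5'->3')
```

Reversing the string with the slice `[::-1]` is what accounts for the strands being antiparallel: the complement alone pairs base for base, but it has to be read backwards to be written in the conventional 5'→3' direction.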

DNA1

Figure 1: Two complementary, antiparallel strands of DNA. The arrows indicate the direction of polarity.

Portions of the DNA base sequence, referred to as genes, specify the sequence of the amino acids within proteins. More specifically, the genes that code for proteins are composed of tri-nucleotide units called codons, each corresponding to a single amino acid. The genetic code consists of 64 nucleotide triplets, listed in Table 1. Protein-coding DNA sequences all begin with the three bases ATG, which code for the amino acid methionine (ATG is also called the start codon when it is the first codon), and end with one or more stop codons: TAA, TAG, or TGA (which do not add an amino acid). This code is read by copying stretches of DNA into the related messenger RNA (mRNA), which is built in the 5' → 3' direction, in a process called transcription. RNA is a nucleic acid similar to DNA; the difference that matters here is that the base uracil (U) appears in place of thymine (T). Once transcribed, the single mRNA strand can be translated into protein. This process is summarized in Figure 2.
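As a string operation, transcription and the grouping of bases into codons might look like the sketch below. It assumes the input is the coding strand written 5'→3', so the mRNA has the same letters with U in place of T. The function names `transcribe` and `split_codons` and the short example gene are hypothetical and chosen only for illustration.

```python
def transcribe(coding_strand):
    """Return the mRNA for a coding strand: the same sequence with U instead of T."""
    return coding_strand.upper().replace("T", "U")


def split_codons(sequence):
    """Split a sequence into consecutive three-base codons.

    Any one or two leftover bases at the end are dropped.
    """
    usable_length = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable_length, 3)]


gene = "ATGGCATTTTAA"  # hypothetical gene: start codon, two codons, stop codon
mrna = transcribe(gene)

print(mrna)                # AUGGCAUUUUAA
print(split_codons(gene))  # ['ATG', 'GCA', 'TTT', 'TAA']
print(split_codons(mrna))  # ['AUG', 'GCA', 'UUU', 'UAA']
```

Slicing with a step of 3 keeps the codons in reading order, which makes it easy to check that the first codon is ATG and the last is one of TAA, TAG, or TGA.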

DNA2

Table 1: The 64 DNA codons. ATG (corresponding to the amino acid methionine) signals the start of translation. The three codons TAA, TAG, and TGA signal the termination of the peptide chain. This table also often appears with the mRNA codons (U instead of T).

Figure 2: Summary of the two-step gene expression process (transcription, then translation).
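The second step, translation, amounts to a table lookup on each codon. The sketch below builds the 64-entry standard genetic code as a Python dictionary and reads a coding-strand DNA string from the first ATG until a stop codon. The names `CODON_TABLE` and `translate` and the example gene are hypothetical. The sketch also assumes the input contains only A, C, G, and T; any other character would raise a `KeyError`.

```python
from itertools import product

# Standard genetic code written compactly: the codons are generated in the order
# TTT, TTC, TTA, TTG, TCT, ... and each one is paired with a one-letter
# amino acid code. "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna):
    """Translate from the first ATG up to (not including) the first stop codon."""
    dna = coding_dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon, so nothing is translated

    protein = []
    # Step through the sequence three bases at a time, stopping before any
    # incomplete codon at the end.
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))           # 64
print(translate("ATGGCATTTTAA"))  # MAF
```

A dictionary is a natural fit here because every codon is a fixed three-character key. The same table can be used for mRNA by replacing U with T before the lookup.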

Pre-Activity Questions

1. Identify one possible double-stranded DNA sequence and the corresponding mRNA sequence that would result in the following protein sequence: CRGLVMF.

2. Why is there more than one possible answer? Explain and provide an example.