Session 01 - GC Content and Chromosome Size

Learning Outcomes

  • Read, examine and manipulate prokaryotic genome sequences using Biopython
  • Extract bulk genome properties from a genome sequence
  • Basic visualisation of genome properties using Python
  • Use of bulk genome properties to discriminate and identify organisms

Introduction

Bacterial genomes

Bacterial genomes are relatively simple (especially compared to many eukaryotic genomes), and typically comprise only a single (usually circular) chromosome, and possibly a small number of plasmids.

Bacterial chromosome properties such as length and GC content vary, usually reliably, by bacterial species and genus (histogram from http://www.sci.sdsu.edu/~smaloy/MicrobialGenetics/topics/chroms-genes-prots/genomes.html).

Python code

We can visualise the similarities and differences among and between genomes by plotting summary statistics using Python.

We will use the Biopython libraries to interact with and manipulate sequence data, and the Pandas data analysis libraries to manipulate numerical data.

Some code is imported from the local bs32010 module in this directory, to avoid clutter in this notebook. You can inspect this module if you are interested in how it works.


In [ ]:
%matplotlib inline
from Bio import SeqIO       # Biopython libraries for working with sequence data
from bs32010 import ex01    # Functions and data specific for this exercise

1. Calculating nucleotide frequency and genome size using Biopython

In the lecture slides, we saw an example of loading in a genome, and calculating two values: GC content, and GC skew. These values are defined as below:

  • GC content = $\frac{(G + C)}{\textrm{length}(S)}$
  • GC skew = $\frac{(G - C)}{(G + C)}$

where $G$ and $C$ are the counts of G bases and C bases on the forward strand of the genome sequence $S$, respectively. The lecture slide showed code that calculates both values.

For these exercises, we will mostly work with helper functions in the Python module bs32010, but to revise some Python programming, we will reproduce the code in the slide below for a different genome.
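Before that, the two definitions above can be checked by hand on a very short sequence. The sketch below is only an illustration: the sequence is made up, it is held in a plain Python string rather than a Biopython object, and the function names gc_content and gc_skew are hypothetical (they are not part of Biopython or bs32010).

```python
def gc_content(sequence):
    """Return (G + C) / length for a sequence given as a string."""
    g = sequence.count("G")
    c = sequence.count("C")
    return (g + c) / len(sequence)


def gc_skew(sequence):
    """Return (G - C) / (G + C) for a sequence given as a string."""
    g = sequence.count("G")
    c = sequence.count("C")
    return (g - c) / (g + c)


toy = "GGGCAT"  # made-up sequence: 3 G, 1 C, 1 A, 1 T
print("GC content: %.2f" % gc_content(toy))  # (3 + 1) / 6
print("GC skew: %.2f" % gc_skew(toy))        # (3 - 1) / (3 + 1)
```

Note that GC skew is undefined for a sequence with no G or C bases; this sketch would raise a ZeroDivisionError in that case.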

Firstly though, we will look at how to load a genome (or other biological) sequence using Biopython.

Sequence data in Biopython

To load a single sequence, you can use the function SeqIO.read(). This takes two arguments: the name of the file that contains the sequence information; and the type of sequence format - the way the information is arranged in the file.

As an example, we will load information from a Pectobacterium genome. The data is in the file genome_data/Pectobacterium/GCF_000024645.1.fasta and, as the file extension suggests, the data is in FASTA format.

So, the statement we use to load the information is:

myseq = SeqIO.read("genome_data/Pectobacterium/GCF_000024645.1.fasta", "fasta")

Use this code in the cell below, to read the sequence data.

NOTE: both the filename and the format need to be enclosed in quotation marks ("), because they are strings.

NOTE: we are loading the information into a variable called myseq - you can think of this as a container that holds the sequence information. When we refer to the sequence data from now on, we can refer to it as myseq.


In [ ]:
# Enter code here

The sequence information in myseq is organised to be helpful. There are separate components for sequence and metadata information. You can look at this information using attributes of myseq: id, description, and seq.

myseq.id
myseq.description
myseq.seq

Use these attributes in the cell below to see information about the sequence in myseq.

NOTE: if you type more than one line of attributes, the Jupyter notebook will only show you output for the last attribute.


In [ ]:
# Enter code here
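To see where these three attributes come from, it can help to look at how a FASTA file is laid out: a header line starting with ">" followed by lines of sequence. For FASTA input, Biopython takes the first word of the header as the id and the whole header line as the description. The sketch below mimics that layout in plain Python; it is not Biopython's implementation, and ToyRecord and read_single_fasta are hypothetical names invented for this illustration. Like SeqIO.read(), it raises a ValueError unless the input holds exactly one record.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for a sequence record: three named fields only.
ToyRecord = namedtuple("ToyRecord", ["id", "description", "seq"])


def read_single_fasta(handle):
    """Read exactly one FASTA record from an open text handle."""
    header = None
    chunks = []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                raise ValueError("more than one record found")
            header = line[1:].strip()
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is None:
        raise ValueError("no record found")
    words = header.split()
    record_id = words[0] if words else ""
    return ToyRecord(id=record_id, description=header, seq="".join(chunks))


# Made-up FASTA text, wrapped in StringIO so that no file is needed.
toy_fasta = io.StringIO(">toy_1 Made-up example organism\nATGC\nGGTA\n")
record = read_single_fasta(toy_fasta)

# Using print() for each attribute shows all three, not just the last one.
print(record.id)
print(record.description)
print(record.seq)
```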

You can get more information about the sequence using functions in Python. For example, to obtain the sequence length of the sequence in myseq, you can use the len() function, as follows:

len(myseq.seq)

To get a count of the number of adenines (A), you can use the .count() method, as follows:

myseq.seq.count('A')
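Two details of counting are worth knowing before relying on the numbers. The sketch below uses a made-up plain Python string and assumes that .count() on a Biopython sequence behaves like str.count(), i.e. that it is case-sensitive. It also shows that symbols other than A, C, G and T (such as N for an unknown base) mean the four base counts need not add up to the sequence length.

```python
from collections import Counter

toy = "ATGCNNatgcGG"  # made-up sequence with ambiguous (N) and lower-case bases

# str.count() is case-sensitive, so lower-case bases are not counted here.
upper_only_g = toy.count("G")
any_case_g = toy.upper().count("G")
print(upper_only_g, any_case_g)

# Counter tallies every symbol in one pass over the upper-cased sequence.
counts = Counter(toy.upper())
acgt_total = sum(counts[base] for base in "ACGT")
print(counts)

# Symbols that are not A, C, G or T (here, the two N characters).
other_symbols = len(toy) - acgt_total
print(other_symbols)
```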

Exercise 1: Use the len() function and the .count() method in the cell below to find the length of myseq, and the numbers of A, C, G, and T bases.


In [ ]:
# Enter code here

Example Exercise (5min): Enter the following code in the cell below. What is the organism, and what are the GC content and skew for this genome?

s = SeqIO.read("genome_data/Pectobacterium/GCA_000769535.1.fasta", "fasta")
print(s.description)
a, c, g, t = s.seq.count("A"), s.seq.count("C"), s.seq.count("G"), s.seq.count("T")
print("Genome length: %d" % len(s))
print("GC content: %.2g" % ((g + c)/len(s)))
print("GC skew: %.2g" % ((g - c)/(g + c)))

In [ ]:
# Enter code here
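The format codes in the print statements above control how each number is displayed: %d formats an integer, and %.2g formats a number to two significant figures. The sketch below compares a few format codes using made-up values (not results from any genome), and includes f-strings as an alternative way to write the same thing.

```python
gc_fraction = 0.5123  # made-up value
skew = -0.0123        # made-up value
length = 4862913      # made-up value

two_sig_fraction = "%.2g" % gc_fraction  # two significant figures
two_sig_skew = "%.2g" % skew             # two significant figures
two_dp_skew = "%.2f" % skew              # two decimal places
as_integer = "%d" % length               # integer
as_percent = f"{gc_fraction:.1%}"        # f-string: percentage, one decimal place
with_commas = f"{length:,}"              # f-string: thousands separators

for text in (two_sig_fraction, two_sig_skew, two_dp_skew,
             as_integer, as_percent, with_commas):
    print(text)
```

Small values such as GC skew can lose detail with a fixed number of decimal places, which is one reason to prefer significant figures here.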

Exercise 2 (5min): Adapt the above code to calculate the same quantities for the genome in the file GCF_000011605.1.fasta, and discover what the organism is.

Enter your code in the cell below.


In [ ]:
# Enter code here

Exercise 2a - stretch goal (5min): Adapt the above code to calculate the AT content and AT skew for the genome in the file GCF_000011605.1.fasta

Enter your code in the cell below.


In [ ]:
# Enter code here

2. Calculating nucleotide frequency and genome size using helper functions

For convenience, the Python package bs32010 is provided for this workshop. For this worksheet, functions are found in the module ex01. We can see what functions are present using the dir() function.

dir(ex01)

Run this code in the cell, below.


In [ ]:
# Enter code here

You will shortly be using the ex01.calc_size_gc() function. You can find out useful information about any function in Python with the help() function.

help(ex01.calc_size_gc)

Run this code in the cell below.

NOTE: You need only provide the function name, not the parentheses that follow it.


In [ ]:
# Enter code here
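The same two tools work on any Python module or function, so you can practise with them away from bs32010. The sketch below uses the standard library math module and a hypothetical function, toy_gc, written only for this illustration. It shows that dir() returns names as strings, and that the text printed by help() comes from the function's docstring.

```python
import math


def toy_gc(sequence):
    """Return the fraction of G and C bases in a sequence string."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# dir() returns names as strings; names starting with "_" are usually internal.
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names[:5])

# help() prints the docstring, which is also stored in the __doc__ attribute.
help(toy_gc)
print(toy_gc.__doc__)
```

Writing help(toy_gc("GC")) instead would first call the function, and then show help for the number it returns rather than for the function itself.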

The help text refers to the variable ex01.bact_files. You can examine this by typing the variable name in the cell below.

ex01.bact_files

This will show you the names of several bacteria that can be used with the calc_size_gc() function.

  • Nostoc punctiforme
  • Mycoplasma pneumoniae
  • Mycobacterium tuberculosis
  • Mycoplasma genitalium
  • Escherichia coli

In [ ]:
# Enter code here

Test the calc_size_gc() function by running it on the organisms Mycoplasma genitalium and Nostoc punctiforme using the code below.

ex01.calc_size_gc('Mycoplasma genitalium', 'Nostoc punctiforme')

Run this code in the cell below.


In [ ]:
# Enter code here

The function ex01.plot_data() will plot genome length against GC content, with coloured points for each genome sequence. Test this using the code below.

gc_data = ex01.calc_size_gc('Mycoplasma genitalium', 'Nostoc punctiforme')
ex01.plot_data(gc_data)

In [ ]:
# Enter code here

You should now see a plot of chromosome length and GC content that looks something like the one shown below.

You should see that M. genitalium has much smaller chromosomes, and a lower GC content than Nostoc punctiforme. There is not a great deal of variation in genome size and GC content for either organism.

Exercise 3 (5min): Produce a scatterplot of all the example chromosomes in bact_files. Which organism has the largest/smallest chromosome? Which has the largest/smallest GC content?

  • HINT: You can use list(ex01.bacteria) to obtain a list of all the bacteria names you need

Exercise 3a - stretch goal (5min): Use the help() function to find out how to write your plot to a file, and produce the file "all_chromosomes.pdf"


In [ ]:
# Enter code here

3. Using chromosome length and GC content to identify an organism

Because summary statistics such as chromosome length and GC content can be characteristic of a bacterial species, it is possible to use them to help infer the species of bacterium from which an "unknown" chromosome sequence may originate.

We can add new data representing a chromosome of unknown origin to the data for all example chromosomes, and produce a scatterplot. The proximity of the unknown genome to points from a named species may indicate the origin of the unknown chromosome.
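"Proximity" on a scatterplot can also be expressed as a number. The sketch below is one possible way to do this, not the method used by bs32010: it rescales each axis to its observed range (so that length in Mbp does not swamp GC fraction) and then finds the labelled point with the smallest straight-line distance to the query. All labels and values are made up for illustration, and the function name nearest_label is hypothetical.

```python
import math

# Made-up reference points: (chromosome length in Mbp, GC fraction).
# The labels and numbers are illustrative only, not real measurements.
references = {
    "species A": [(1.0, 0.30), (1.2, 0.33)],
    "species B": [(2.0, 0.60), (2.3, 0.57)],
    "species C": [(6.0, 0.45), (6.4, 0.47)],
}
query = (2.1, 0.58)  # made-up "unknown" point


def nearest_label(point, labelled_points):
    """Return the label of the reference point closest to the given point."""
    everything = [p for pts in labelled_points.values() for p in pts] + [point]
    # Range of each axis, used to put both axes on a comparable scale.
    spans = []
    for axis in (0, 1):
        values = [p[axis] for p in everything]
        spans.append((max(values) - min(values)) or 1.0)

    best_label, best_distance = None, float("inf")
    for label, pts in labelled_points.items():
        for ref in pts:
            distance = math.hypot((point[0] - ref[0]) / spans[0],
                                  (point[1] - ref[1]) / spans[1])
            if distance < best_distance:
                best_label, best_distance = label, distance
    return best_label


print(nearest_label(query, references))
```

A nearest point is only a hint: if the unknown lies far from every group, or between two groups, visual inspection of the scatterplot is a better guide.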

The bs32010.ex01 module provides GC content and genome size information about an unknown organism, in the variable unknown (a Pandas dataframe). We can look at it with

ex01.unknown

Enter this code in the cell below.


In [ ]:
# Enter code here

The unknown organism has a genome length of 4.4 Mbp, and GC content of around 66%. You could determine a likely identity for this organism from a scatterplot, by visual inspection. To generate the input data for the scatterplot, enter the following code in the cell below:

import pandas as pd
all_data = ex01.calc_size_gc(*ex01.bacteria)
all_data = pd.concat([all_data, ex01.unknown])  # DataFrame.append() was removed in pandas 2.0

In [ ]:
# Enter code here
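The * in front of ex01.bacteria in the code above is Python's argument unpacking: it passes each item of a collection as a separate argument, as if you had typed the names out one by one. The sketch below illustrates this with a hypothetical stand-in function, describe, which is not part of bs32010.

```python
def describe(*names):
    """Hypothetical stand-in that accepts any number of organism names."""
    return ["%d: %s" % (index, name) for index, name in enumerate(names, start=1)]


organisms = ["Mycoplasma genitalium", "Nostoc punctiforme"]

# These two calls are equivalent: * spreads the list into separate arguments.
typed_out = describe("Mycoplasma genitalium", "Nostoc punctiforme")
unpacked = describe(*organisms)
print(typed_out)
print(unpacked)

# Without *, the whole list arrives as one single argument.
not_unpacked = describe(organisms)
print(not_unpacked)
```

If the collection is a dictionary, * passes its keys.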

Exercise 4 (5min): Render a scatterplot of all_data, and identify the unknown organism.


In [ ]:
# Enter code here