Empirical frequencies of DNA $k$-mers in whole genome sequences provide an interesting perspective on genomic complexity, and the availability of large segments of genomic sequence from many organisms means that analysis of $k$-mers with non-trivial lengths is now possible, as can be seen in Chor et al. (2009) Genome Biol. 10:R108.
You will visualise the distribution of $k$-mer counts as spectra, as in the image above, using Python.
We will use the Biopython
libraries to interact with and manipulate sequence data, and the Pandas
data analysis libraries to manipulate numerical data.
Some code is imported from the local bs32010
module in this directory, to avoid clutter in this notebook. You can inspect this module if you are interested.
In [ ]:
%matplotlib inline
from Bio import SeqIO # For working with sequence data files
from Bio.Seq import Seq # Seq object, needed for the last activity
from Bio.Alphabet import generic_dna # sequence alphabet, for the last activity
from bs32010 import ex02 # Local functions and data
Like Session 01, we will be dealing with sequence data directly, but there are again helper functions for this exercise, in the module ex02
.
There is a dictionary stored in the variable ex02.bact_files
. This provides a tuple of sequence file names, for any organism name in the list stored in ex02.bacteria
.
You can see the contents of this list and dictionary with
print(list(ex02.bacteria))
print(ex02.bact_files)
In [ ]:
# Enter code here
To choose a particular organism, you can use the square bracket notation for dictionaries:
print(ex02.bact_files['Mycobacterium tuberculosis'])
In [ ]:
# Enter code here
A function is provided in the ex02
module to help you:
count_seq_kmers(inseq, k)
: this counts all subsequences of size $k$ in the sequence inseq
Test the function using the code below, which conducts the analysis for a Pectobacterium chromosome:
inseq = SeqIO.read('genome_data/Pectobacterium/GCA_000769535.1.fasta', 'fasta')
kmer_count = ex02.count_seq_kmers(inseq, 6)
kmer_count.head()
The Pandas
dataframe that is returned lets us use the .head()
method to view the first few rows of the dataframe. This shows a column of six-character strings (the $k$-mers), with a second column showing the number of times that $k$-mer occurs in the genome.
In [ ]:
# Enter code here
We can also inspect the .shape
attribute to find out how large the returned results are, as this returns a (rows, columns)
tuple, using the code below:
kmer_count.shape
This tells us that there are 4096 distinct 6-mers in the sequence.
In [ ]:
# Enter code here
You can use the built-in .hist()
method of Pandas
dataframes, that will plot a histogram directly in this notebook. By default, this has quite a wide bin width, but this can be overridden with the bins=n
argument, as with:
kmer_count.hist(column='frequency', bins=100)
By default, the .hist()
method will display the full range of data, but by specifying maximum and minimum values with the range=(min, max)
argument, the extent of data displayed can be controlled.
Use the code below to visualise the 6-mer spectrum
kmer_count.hist(column='frequency', bins=100, range=(0, 1000))
In [ ]:
# Enter code here
Exercise 1 (10min): Recreate the plot in the upper left corner of the figure in the introduction, for one of the E. coli genomes.
print(ex02.bact_files['Escherichia coli'])
to get a list of E.coli chromosome files.
In [ ]:
# Enter code here
Exercise 2 (5min): The E.coli spectrum is unimodal, but how many modes does the Platypus chromosome 01 have?
genome_data/Platypus/oan_ref_Ornithorhynchus_anatinus_5.0.1_chr1.fa
In [ ]:
# Enter code here