Session 02 - Chromosome $k$-mers

Learning Outcomes

Read and manipulate prokaryotic genome sequences using Biopython.
Extract bulk genome properties from a genome sequence
Visualisation of bulk genome properties using Python

Introduction

$k$-mers

Empirical frequencies of DNA $k$-mers in whole genome sequences provide an interesting perspective on genomic complexity, and the availability of large segments of genomic sequence from many organisms means that analysis of $k$-mers with non-trivial lengths is now possible, as can be seen in Chor et al. (2009) Genome Biol. 10:R108.

k-mer spectra for a range of organisms

You will visualise the distribution of $k$-mer counts as spectra, as in the image above, using Python.

Python code

We will use the Biopython libraries to interact with and manipulate sequence data, and the Pandas data analysis libraries to manipulate numerical data.

Some code is imported from the local bs32010 module in this directory, to avoid clutter in this notebook. You can inspect this module if you are interested.



In [ ]:

    
%matplotlib inline
from Bio import SeqIO                 # For working with sequence data files
from Bio.Seq import Seq               # Seq object, needed for the last activity
from Bio.Alphabet import generic_dna  # sequence alphabet, for the last activity

from bs32010 import ex02              # Local functions and data

Sequence data

Like Session 01, we will be dealing with sequence data directly, but there are again helper functions for this exercise, in the module ex02.

There is a dictionary stored in the variable ex02.bact_files. This provides a tuple of sequence file names, for any organism name in the list stored in ex02.bacteria.

You can see the contents of this list and dictionary with

print(list(ex02.bacteria))
print(ex02.bact_files)



In [ ]:

    
# Enter code here

To choose a particular organism, you can use the square bracket notation for dictionaries:

print(ex02.bact_files['Mycobacterium tuberculosis'])



In [ ]:

    
# Enter code here

1. Counting $k$-mers

A function is provided in the ex02 module to help you:

count_seq_kmers(inseq, k): this counts all subsequences of size $k$ in the sequence inseq

Test the function using the code below, which conducts the analysis for a Pectobacterium chromosome:

inseq = SeqIO.read('genome_data/Pectobacterium/GCA_000769535.1.fasta', 'fasta')
kmer_count = ex02.count_seq_kmers(inseq, 6)
kmer_count.head()

The Pandas dataframe that is returned lets us use the .head() method to view the first few rows of the dataframe. This shows a column of six-character strings (the $k$-mers), with a second column showing the number of times that $k$-mer occurs in the genome.



In [ ]:

    
# Enter code here

We can also inspect the .shape attribute to find out how large the returned results are, as this returns a (rows, columns) tuple, using the code below:

kmer_count.shape

This tells us that there are 4096 distinct 6-mers in the sequence.



In [ ]:

    
# Enter code here

2. Plotting $k$-mer spectra

You can use the built-in .hist() method of Pandas dataframes, that will plot a histogram directly in this notebook. By default, this has quite a wide bin width, but this can be overridden with the bins=n argument, as with:

kmer_count.hist(column='frequency', bins=100)

By default, the .hist() method will display the full range of data, but by specifying maximum and minimum values with the range=(min, max) argument, the extent of data displayed can be controlled.

Use the code below to visualise the 6-mer spectrum

kmer_count.hist(column='frequency', bins=100, range=(0, 1000))



In [ ]:

    
# Enter code here

Exercise 1 (10min): Recreate the plot in the upper left corner of the figure in the introduction, for one of the E. coli genomes.

HINT: Use print(ex02.bact_files['Escherichia coli']) to get a list of E.coli chromosome files.



In [ ]:

    
# Enter code here

Exercise 2 (5min): The E.coli spectrum is unimodal, but how many modes does the Platypus chromosome 01 have?

The platypus chromosome 01 file can be found in the file genome_data/Platypus/oan_ref_Ornithorhynchus_anatinus_5.0.1_chr1.fa



In [ ]:

    
# Enter code here