scikit-bio: core bioinformatics data structures and algorithms in Python

J Gregory Caporaso

caporasolab.us

Northern Arizona University

scikit-bio.org

GitHub/Twitter: @gregcaporaso


In [1]:
from __future__ import division, print_function
from IPython.core import page
page.page = print

import warnings
warnings.simplefilter('always')

The pre-history of scikit-bio...

Quantitative Insights Into Microbial Ecology

Cited 1596 times since publication in 2010

scikit-bio: framework to make building tools like QIIME easier

Integration with the python scientific computing stack

  • scipy, numpy, IPython, matplotlib, pandas

Modern community standards

  • numpy API documentation standards
  • Full PEP8 compliance
  • 99% test coverage (via coverage.py)
  • Native py2/py3 compatibility
  • Hosted on GitHub
  • Continuous Integration testing with Travis
  • Peer-reviewed code via pull requests
  • BSD-licensed

Core objects and algorithms for bioinformatics

scikit-bio.org/docs/latest/

<img src="skbio-docs.png" style="width: 600px;"/ align="left">

scikit-bio: education-ready and production-ready toolkit

An Introduction To Applied Bioinformatics

applied-bioinformatics.org

Bioinformatics education in the context of production-ready implementations

Bioinformatics, as I see it, is the application of the tools of computer science (things like programming languages, algorithms, and databases) to address biological problems (for example, inferring the evolutionary relationship between a group of organisms based on fragments of their genomes, or understanding if or how the community of microorganisms that live in my gut changes if I modify my diet). Bioinformatics is a rapidly growing field, largely in response to the vast increase in the quantity of data that biologists now grapple with. Students from varied disciplines (e.g., biology, computer science, statistics, and biochemistry) and stages of their educational careers (undergraduate, graduate, or postdoctoral) are becoming interested in bioinformatics.

I teach bioinformatics at the undergraduate and graduate levels at Northern Arizona University. This repository contains some of the materials that I've developed in these courses, and represents an initial attempt to organize these materials in a standalone way. If you'd like to read a little more about the project, see my blog post on microbe.net.

Disclaimer

This project is in very early development stage. It's not ready for prime-time by any means, but I fall firmly into the "publish early, publish often" mindset, hence its public availability. I am very interested in feedback in the form of email (gregcaporaso@gmail.com) or pull requests.

The code in the iab module is not sufficiently tested, documented, or optimized for production use. As code reaches those quality standards it will be ported to scikit-bio. I do not recommend using the code in the iab module outside of these notebooks. In other words, don't import iab outside of the notebooks - if you want access to the functionality in your own code, you should import skbio.

Currently, the best example of where I'm hoping to go with these materials is the multiple sequence alignment chapter.

An Introduction To Applied Bioinformatics

applied-bioinformatics.org

Detailed API documentation, so accesible to new users...


In [2]:
from skbio.core.alignment.pairwise import local_pairwise_align_nucleotide

local_pairwise_align_nucleotide?


Type:        function
String form: <function local_pairwise_align_nucleotide at 0x1045585f0>
File:        /Users/caporaso/code/skbio/skbio/core/alignment/pairwise.py
Definition:  local_pairwise_align_nucleotide(seq1, seq2, gap_open_penalty=5, gap_extend_penalty=2, match_score=2, mismatch_score=-3, substitution_matrix=None)
Docstring:
Locally align exactly two nucleotide seqs with Smith-Waterman

Parameters
----------
seq1 : str or BiologicalSequence
    The first unaligned sequence.
seq2 : str or BiologicalSequence
    The second unaligned sequence.
gap_open_penalty : int or float, optional
    Penalty for opening a gap (this is substracted from previous best
    alignment score, so is typically positive).
gap_extend_penalty : int or float, optional
    Penalty for extending a gap (this is substracted from previous best
    alignment score, so is typically positive).
match_score : int or float, optional
    The score to add for a match between a pair of bases (this is added
    to the previous best alignment score, so is typically positive).
mismatch_score : int or float, optional
    The score to add for a mismatch between a pair of bases (this is
    added to the previous best alignment score, so is typically
    negative).
substitution_matrix: 2D dict (or similar)
    Lookup for substitution scores (these values are added to the
    previous best alignment score). If provided, this overrides
    ``match_score`` and ``mismatch_score``.

Returns
-------
skbio.Alignment
    ``Alignment`` object containing the aligned sequences as well as
    details about the alignment.

See Also
--------
local_pairwise_align
local_pairwise_align_protein
skbio.core.alignment.local_pairwise_align_ssw
global_pairwise_align
global_pairwise_align_protein
global_pairwise_align_nucelotide

Notes
-----
Default ``match_score``, ``mismatch_score``, ``gap_open_penalty`` and
``gap_extend_penalty`` parameters are derived from the NCBI BLAST
Server [1]_.

References
----------
.. [1] http://blast.ncbi.nlm.nih.gov/Blast.cgi

... and that documentation is also of course available on the web

We can easily apply these python aligners


In [3]:
from skbio.core.alignment.pairwise import local_pairwise_align_nucleotide

s1 = "ACTAAGGCTCTCTACCCCTCTCAGAGA"
s2 = "AAAAAACTCTCTAAACTCACTAAGGCTCTCTACCCCTCTTCAGAGAAGTCGA"
r = local_pairwise_align_nucleotide(s1, s2)
print(type(r))
print(r)


<class 'skbio.core.alignment.alignment.Alignment'>
>0
ACTAAGGCTCTCTACCCCTC-TCAGAGA
>1
ACTAAGGCTCTCTACCCCTCTTCAGAGA

/Users/caporaso/code/skbio/skbio/core/alignment/pairwise.py:300: EfficiencyWarning: You're using skbio's python implementation of Smith-Waterman alignment. This will be very slow (e.g., thousands of times slower) than skbio.core.alignment.local_pairwise_align_ssw.
  EfficiencyWarning)

In [4]:
from skbio import DNA

s1 = DNA("ACTAAGGCTCTCTACCCCTCTCAGAGA", "query")
s2 = DNA("AAAAAACTCTCTAAACTCACTAAGGCTCTCTACCCCTCTTCAGAGAAGTCGA", "target")
r = local_pairwise_align_nucleotide(s1, s2)
print(type(r))
print(r)


<class 'skbio.core.alignment.alignment.Alignment'>
>query
ACTAAGGCTCTCTACCCCTC-TCAGAGA
>target
ACTAAGGCTCTCTACCCCTCTTCAGAGA

/Users/caporaso/code/skbio/skbio/core/alignment/pairwise.py:300: EfficiencyWarning: You're using skbio's python implementation of Smith-Waterman alignment. This will be very slow (e.g., thousands of times slower) than skbio.core.alignment.local_pairwise_align_ssw.
  EfficiencyWarning)

But, can't use python for implementing alignment for production code, so provide Cython wrappers with matching interfaces


In [5]:
from skbio.core.alignment import local_pairwise_align_ssw
local_pairwise_align_ssw?


Type:        builtin_function_or_method
String form: <built-in function local_pairwise_align_ssw>
Docstring:
Align query and target sequences with Striped Smith-Waterman.

Parameters
----------
sequence1 : str or BiologicalSequence
    The first unaligned sequence
sequence2 : str or BiologicalSequence
    The second unaligned sequence

Returns
-------
``skbio.core.alignment.Alignment``
    The resulting alignment as an Alignment object

Notes
-----
For a complete list of optional keyword-arguments that can be provided,
see ``skbio.core.alignment.StripedSmithWaterman``.

The following kwargs will not have any effect: `suppress_sequences` and
`zero_index`

If an alignment does not meet a provided filter, `None` will be returned.

See Also
--------
skbio.core.alignment.StripedSmithWaterman

In [6]:
r = local_pairwise_align_ssw(s1, s2)
print(type(r))
print(r)


<class 'skbio.core.alignment.alignment.Alignment'>
>query
ACTAAGGCTCTCTACCCCTC-TCAGAGA
>target
ACTAAGGCTCTCTACCCCTCTTCAGAGA

As expected, the C/Cython code is much faster...


In [7]:
from skbio.parse.sequences import parse_fasta
from skbio import SequenceCollection
from random import choice
gg_path = "/Users/caporaso/data/gg_13_8_otus/rep_set/73_otus.fasta"

s = SequenceCollection.from_fasta_records([(i, s) for i, s in parse_fasta(gg_path) if set(s) == set('ACGT')], DNA)

In [8]:
%timeit local_pairwise_align_ssw(choice(s), choice(s), gap_open_penalty=5,\
                                 gap_extend_penalty=2, match_score=2, mismatch_score=-3)


100 loops, best of 3: 4.25 ms per loop

In [9]:
warnings.simplefilter('ignore')
%timeit local_pairwise_align_nucleotide(choice(s), choice(s), gap_open_penalty=5,\
                                        gap_extend_penalty=2, match_score=2, mismatch_score=-3)


1 loops, best of 3: 18.8 s per loop

scikit-bio: simpler bioinformatics pipeline development

We'll re-create QIIME's beta_diversity_through_plots.py workflow.

For conceptual discussion, see the Studying Biological Diversity chapter of An Introduction to Applied Bioinformatics.

Six samples of the human microbiome (two subjects and three body sites)


In [10]:
import pandas as pd
sample_md = {
   'A': {'body_site': 'gut', 'subject': '1'},
   'B': {'body_site': 'skin', 'subject': '1'},
   'C': {'body_site': 'tongue', 'subject': '1'},
   'D': {'body_site': 'gut', 'subject': '2'},
   'E': {'body_site': 'tongue', 'subject': '2'},
   'F': {'body_site': 'skin', 'subject': '2'}}
sample_md = pd.DataFrame.from_dict(sample_md, orient='index')
sample_md


Out[10]:
subject body_site
A 1 gut
B 1 skin
C 1 tongue
D 2 gut
E 2 tongue
F 2 skin

6 rows × 2 columns

Are samples derived from the same subject more similar than samples from different subjects?

(Here we have 6 samples and 7 taxa, but our current record is 15,000 samples and 5.6 million taxa.)


In [11]:
data = [[23, 64, 14, 0, 0, 3, 1],
        [0, 3, 35, 42, 0, 12, 1],
        [0, 5, 5, 0, 40, 40, 0],
        [44, 35, 9, 0, 1, 0, 0],
        [0, 2, 8, 0, 35, 45, 1],
        [0, 0, 25, 35, 0, 19, 0]]
table = pd.DataFrame(data,
                     columns=['Species 1', 'Species 2', 'Species 3', 'Species 4', 'Species 5', 'Species 6', 'Species 7'],
                     index=list('ABCDEF'))
table


Out[11]:
Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7
A 23 64 14 0 0 3 1
B 0 3 35 42 0 12 1
C 0 5 5 0 40 40 0
D 44 35 9 0 1 0 0
E 0 2 8 0 35 45 1
F 0 0 25 35 0 19 0

6 rows × 7 columns

Using scikit-bio and scipy, we can then compute pairwise distances between the samples


In [12]:
from skbio.math.diversity.beta import pw_distances

bc_dm = pw_distances(table, table.index, "braycurtis")
print(bc_dm)


6x6 distance matrix
IDs:
A, B, C, D, E, F
Data:
[[ 0.          0.78787879  0.86666667  0.30927835  0.85714286  0.81521739]
 [ 0.78787879  0.          0.78142077  0.86813187  0.75        0.1627907 ]
 [ 0.86666667  0.78142077  0.          0.87709497  0.09392265  0.71597633]
 [ 0.30927835  0.86813187  0.87709497  0.          0.87777778  0.89285714]
 [ 0.85714286  0.75        0.09392265  0.87777778  0.          0.68235294]
 [ 0.81521739  0.1627907   0.71597633  0.89285714  0.68235294  0.        ]]

And create ordination plots using scikit-bio and matplotlib


In [13]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def scatter_3d(ord_results, df, column, color_map, title='', axis1=0,
               axis2=1, axis3=2):
   """Adapted from Matplotlib Gallery:
   http://matplotlib.org/examples/mplot3d/scatter3d_demo.html
   """
   coord_matrix = ord_results.site.T
   ids = ord_results.site_ids
   colors = [color_map[df[column][id_]] for id_ in ord_results.site_ids]

   fig = plt.figure()
   ax = fig.add_subplot(111, projection='3d')

   xs = coord_matrix[axis1]
   ys = coord_matrix[axis2]
   zs = coord_matrix[axis3]
   plot = ax.scatter(xs, ys, zs, c=colors, s=150)

   ax.set_xlabel('PC %d' % (axis1 + 1))
   ax.set_ylabel('PC %d' % (axis2 + 1))
   ax.set_zlabel('PC %d' % (axis3 + 1))
   ax.set_xticklabels([])
   ax.set_yticklabels([])
   ax.set_zticklabels([])
   ax.set_title(title)
   return fig

In [14]:
from skbio.math.stats.ordination import PCoA
bc_pc = PCoA(bc_dm).scores()

In [15]:
# This function adapted from Matplotlib Gallery's:
# http://matplotlib.org/examples/mplot3d/scatter3d_demo.html
fig = scatter_3d(bc_pc, sample_md, 'subject', {'1': 'yellow', '2': 'purple'},
                 'Samples colored by subject')



In [16]:
# This function adapted from Matplotlib Gallery's:
# http://matplotlib.org/examples/mplot3d/scatter3d_demo.html
fig = scatter_3d(bc_pc, sample_md, 'body_site',
                 {'gut': 'b', 'skin': 'r', 'tongue': 'g'},
                 'Samples colored by body site')


And finally run stats to determine if clustering patterns are signficiant


In [17]:
from skbio.math.stats.distance import ANOSIM
anosim = ANOSIM(bc_dm, sample_md, column='subject')
results = anosim(999)
print(results.statistic)

print(results.p_value < 0.05)


-0.407407407407
False

In [18]:
anosim = ANOSIM(bc_dm, sample_md, column='body_site')
results = anosim(999)
print(results.statistic)

print(results.p_value < 0.1)


1.0
True

And these are the types of examples that we document:

Finally, for fun (and to show the generalizability of scikit-bio):

Can we use the same tools to model the "evolutionary" relationships between human languages?

Let's start with tuples of:

   (name of language, 
    spelling of phrase for ordering a beer,
    phonetic phrase for ordering a beer)

[Source]


In [19]:
languages = [("Afrikaans", "'n Bier, asseblief", "A beer ah-suh-bleef"),
 ("Basque", "Garagardo bat, mesedez", "Gara-gardo bat mese-des"),
 ("Breton", "Ur banne bier am bo, mar plij", "Oor bah-ne beer am boh mar pleezh"),
 ("Catalan", "Una cervesa, si us plau", "Oona servayzeh see oos plow"),
 ("Croatian", "Jedno pivo, molim", "Yed-no pee-vo, mo-lim"),
 ("Czech / Slovak", "Pivo, prosím", "Pee-vo, pro-seem"),
 ("Danish", "Jeg vil gerne have en øl", "Yay vil geh-neh heh en url"),
 ("Dutch", "Een bier, alsjeblieft", "Un beer, ahls-yer-bleeft"),
 ("English", "One beer, please", "Wun beer, pleez"),
 ("Esperanto", "Unu bieron, mi petas", "Oo-noo bee-airon, mee peh-tahs"),
 ("Estonian", "Üks õlu, palun", "Ooks ur-loo, pah-lun"),
 ("Finnish", "Olut mulle, kiitos", "O-loot moolek kee-tos"),
 ("French", "Une bière, s'il vous plaît", "Oon bee-air, seel voo pleh"),
 ("German", "Ein Bier, bitte", "Ine beer, bitt-uh"),
 ("Hungarian", "Egy pohár sört kérek", "Edj pohar shurt kayrek"),
 ("Icelandic", "Einn bjór, takk", "Ay-dn byohr tahk"),
 ("Irish", "Beoir amháin, le do thoil", "Byohr awoyn, lyeh doh hull"),
 ("Italian", "Una birra, per favore", "Oo-na beer-ra, pair fa-vo-re"),
 ("Latin", "Cervisiam, sodes", "Ker-wi-see-am, soh-dehs"),
 ("Latvian", "Vienu alu, lū-dzu", "Vyeh-noo ah-loo, loo dzoo"),
 ("Lithuanian", "Prašau viena alaus", "Pra-shau vie-na al-lows"),
 ("Maltese", "Wiehed birra, jekk jghogbok", "Wee-het bir-ra yek yoh-dzbok"),
 ("Norwegian", "En øl, takk", "Ehn url tahk"),
 ("Occitan", "Una cervesa, se vos plai", "Oo-no serbeh-zo se bus ply"),
 ("Polish", "Jedno piwo, proszę", "Yed-no peevo proshe"),
 ("Portuguese", "Uma cerveja, por favor", "Oo-ma ser-vay-ja, poor fa-vohr"),
 ("Romansch Ladina", "Üna biera, per plaschair.", "Oo-nuh bee-air-uh per plah-chair"),
 ("Sardinian", "Una birra, po piaghere", "Oo-na beer-ra po pia-gehre"),
 ("Scots Gaelic", "Leann, mas e do thoil e", "Lyawn mahs eh doh hawl eh"),
 ("Slovene", "Eno pivo, prosim", "Eno pee-vo pro-seem"),
 ("Spanish (Lat. Am.)", "Una cerveza, por favor", "Oo-na ser-veh-sa, por fa-vor"),
 ("Spanish (Spain)", "Una cerveza, por favor", "Oo-na thair-veh-tha, por fa-vor"),
 ("Strine", "Foster's, mate", "Faw-stuhz, mayt"),
 ("Swedish", "En öl, tack", "Ehn irl, tahk"),
 ("Twi", "Mame beer baako, mi pawokyew", "Mah-me bee-ye bah-ko mee pow-che-oo"),
 ("Turkish", "Bir bira, lütfen", "Beer beer-ah luht-fen"),
 ("Welsh", "Cwrw os gwelwch in dda", "Koo-roh ohs gwel-ookh-un-thah")]

We'll build a basic nucleotide-like substitution matrix


In [20]:
language_to_pron = {e[0]: e[2] for e in languages}

all_pron_chars = []
for e in language_to_pron.values():
    all_pron_chars.extend(e)
all_pron_chars = set(all_pron_chars)

pron_substitution_matrix = {}
for c in all_pron_chars:
    row = {}.fromkeys(all_pron_chars, -2.0)
    row[c] = 5.0
    pron_substitution_matrix[c] = row

We can then globally align the phrases and compute distances between them


In [21]:
from skbio.core.alignment.pairwise import global_pairwise_align

alignment = global_pairwise_align(language_to_pron["Swedish"],
                                  language_to_pron["Norwegian"], 
                                  gap_open_penalty=5, gap_extend_penalty=2,
                                  substitution_matrix=pron_substitution_matrix)
print(alignment.to_fasta())
print("Hamming distance: %1.3f" % alignment.distances()[0,1])


>0
Ehn irl, tahk
>1
Ehn url- tahk

Hamming distance: 0.154

In [22]:
alignment = global_pairwise_align(language_to_pron["Swedish"],
                                  language_to_pron["Icelandic"], 
                                  gap_open_penalty=5, gap_extend_penalty=2,
                                  substitution_matrix=pron_substitution_matrix)
print(alignment.to_fasta())
print("Hamming distance: %1.3f" % alignment.distances()[0,1])


>0
--Ehn ---irl, tahk
>1
Ay-dn byohr-- tahk

Hamming distance: 0.556

In [23]:
alignment = global_pairwise_align(language_to_pron["Spanish (Spain)"],
                                  language_to_pron["Italian"], 
                                  gap_open_penalty=5, gap_extend_penalty=2,
                                  substitution_matrix=pron_substitution_matrix)
print(alignment.to_fasta())
print("Hamming distance: %1.3f" % alignment.distances()[0,1])


>0
Oo-na thair-veh-tha, p-or fa-vo-r-
>1
Oo-na be-----er-r-a, pair fa-vo-re

Hamming distance: 0.353

We can go further, and compute all pairwise alignments and distances


In [24]:
warnings.simplefilter('ignore')
from skbio.core.alignment.pairwise import global_pairwise_align
from skbio import DistanceMatrix

languages = language_to_pron.keys()
distances = np.zeros((len(languages), len(languages)))
for i, language1 in enumerate(languages):
    language1_phrase = language_to_pron[language1]
    for j in range(i):
        language2 = languages[j]
        language2_phrase = language_to_pron[language2]
        alignment = global_pairwise_align(language1_phrase, language2_phrase,
                                          gap_open_penalty=5, gap_extend_penalty=2,
                                          substitution_matrix=pron_substitution_matrix)
        distances[i, j] = distances[j, i] = alignment.distances()[0,1]

dm = DistanceMatrix(distances, languages)
print(dm)


37x37 distance matrix
IDs:
Swedish, Icelandic, Estonian, Turkish, Twi, Sardinian, Romansch Ladina, Dutch, ...
Data:
[[ 0.          0.5         0.6        ...,  0.6         0.76923077
   0.76470588]
 [ 0.5         0.          0.8        ...,  0.66666667  0.76923077
   0.76470588]
 [ 0.6         0.8         0.         ...,  0.66666667  0.73076923
   0.72727273]
 ..., 
 [ 0.6         0.66666667  0.66666667 ...,  0.          0.59375     0.61111111]
 [ 0.76923077  0.76923077  0.73076923 ...,  0.59375     0.          0.65714286]
 [ 0.76470588  0.76470588  0.72727273 ...,  0.61111111  0.65714286  0.        ]]

And build a tree to visualize relationships


In [25]:
from scipy.cluster.hierarchy import average, dendrogram, to_tree

lm = average(dm.condensed_form())

def format_dendrogram(tip_count):
    import matplotlib.pylab as plt
    ax = plt.gca()
    fig = plt.gcf()
    height = tip_count * 0.4
    if height < 3:
        height = 3
    fig.set_size_inches(7, height)
    font = {'family' : 'normal',
        'weight' : 'normal',
        'size'   : 18}

    matplotlib.rc('font', **font)
    return ax

format_dendrogram(dm.shape[0])
d = dendrogram(lm, labels=dm.ids, orientation='right',
               link_color_func=lambda x: 'black')


Currently in pre-alpha release stage, and working on defining our scope.

Contributors:

Adam Robbins-Pianka (@adamrp) | Antonio Gonzalez (@antgonza) | Daniel McDonald (@wasade) | Evan Bolyen (@ebolyen) | Greg Caporaso (@gregcaporaso) | Jai Ram Rideout (@ElBrogrammer) | Jens Reeder (@jensreeder) | Jorge Cañardo Alastuey (@Jorge-C) | Jose Antonio Navas Molina (@josenavas) | Joshua Shorenstein (@squirrelo) | Yoshiki Vázquez Baeza (@ElDeveloper) | @charudatta-navare | John Chase (@johnchase) | Karen Schwarzberg (@karenschwarzberg) | Emily TerAvest (@teravest) | Will Van Treuren (@wdwvt1) | Zech Xu (@RNAer)

Rob Knight (@rob-knight) | Gavin Huttley (@gavin-huttley) | Micah Hamady | Sandra Smit | Cathy Lozupone (@clozupone) | Mike Robeson (@mikerobeson) | Marcin Cieslik | Peter Maxwell | Jeremy Widmann | Zongzhi Liu | Michael Dwan | Logan Knecht (@loganknecht) | Andrew Cochran | Jose Carlos Clemente (@cleme) | Damien Coy | Levi McCracken | Andrew Butterfield | Justin Kuczynski (@justin212k) | Matthew Wakefield (@genomematt)


In [26]:
import skbio
print(skbio.title)
print(skbio.art)


               _ _    _ _          _     _
              (_) |  (_) |        | |   (_)
      ___  ___ _| | ___| |_ ______| |__  _  ___
     / __|/ __| | |/ / | __|______| '_ \| |/ _ \
     \__ \ (__| |   <| | |_       | |_) | | (_) |
     |___/\___|_|_|\_\_|\__|      |_.__/|_|\___/




           Opisthokonta
                   \  Amoebozoa
                    \ /
                     *    Euryarchaeota
                      \     |_ Crenarchaeota
                       \   *
                        \ /
                         *
                        /
                       /
                      /
                     *
                    / \
                   /   \
        Proteobacteria  \
                       Cyanobacteria