scikit-bio: core bioinformatics data structures and algorithms in Python

J Gregory Caporaso

GitHub/Twitter: @gregcaporaso



In [1]:

    
from __future__ import division, print_function
from IPython.core import page
page.page = print

import warnings
warnings.simplefilter('always')

The pre-history of scikit-bio...

Quantitative Insights Into Microbial Ecology

Cited 1596 times since publication in 2010

scikit-bio: framework to make building tools like QIIME easier

Integration with the python scientific computing stack

scipy, numpy, IPython, matplotlib, pandas

Modern community standards

numpy API documentation standards
Full PEP8 compliance
99% test coverage (via coverage.py)
Native py2/py3 compatibility
Hosted on GitHub
Continuous Integration testing with Travis
Peer-reviewed code via pull requests
BSD-licensed

Core objects and algorithms for bioinformatics

scikit-bio.org/docs/latest/

scikit-bio: education-ready and production-ready toolkit

An Introduction To Applied Bioinformatics

Bioinformatics education in the context of production-ready implementations

Bioinformatics, as I see it, is the application of the tools of computer science (things like programming languages, algorithms, and databases) to address biological problems (for example, inferring the evolutionary relationship between a group of organisms based on fragments of their genomes, or understanding if or how the community of microorganisms that live in my gut changes if I modify my diet). Bioinformatics is a rapidly growing field, largely in response to the vast increase in the quantity of data that biologists now grapple with. Students from varied disciplines (e.g., biology, computer science, statistics, and biochemistry) and stages of their educational careers (undergraduate, graduate, or postdoctoral) are becoming interested in bioinformatics.

I teach bioinformatics at the undergraduate and graduate levels at Northern Arizona University. This repository contains some of the materials that I've developed in these courses, and represents an initial attempt to organize these materials in a standalone way. If you'd like to read a little more about the project, see my blog post on microbe.net.

Disclaimer

This project is in very early development stage. It's not ready for prime-time by any means, but I fall firmly into the "publish early, publish often" mindset, hence its public availability. I am very interested in feedback in the form of email (gregcaporaso@gmail.com) or pull requests.

The code in the iab module is not sufficiently tested, documented, or optimized for production use. As code reaches those quality standards it will be ported to scikit-bio. I do not recommend using the code in the iab module outside of these notebooks. In other words, don't import iab outside of the notebooks - if you want access to the functionality in your own code, you should import skbio.

Currently, the best example of where I'm hoping to go with these materials is the multiple sequence alignment chapter.

An Introduction To Applied Bioinformatics

applied-bioinformatics.org

Detailed API documentation, so accesible to new users...



In [2]:

    
from skbio.core.alignment.pairwise import local_pairwise_align_nucleotide

local_pairwise_align_nucleotide?









    



Type:        function
String form: <function local_pairwise_align_nucleotide at 0x1045585f0>
File:        /Users/caporaso/code/skbio/skbio/core/alignment/pairwise.py
Definition:  local_pairwise_align_nucleotide(seq1, seq2, gap_open_penalty=5, gap_extend_penalty=2, match_score=2, mismatch_score=-3, substitution_matrix=None)
Docstring:
Locally align exactly two nucleotide seqs with Smith-Waterman

Parameters
----------
seq1 : str or BiologicalSequence
    The first unaligned sequence.
seq2 : str or BiologicalSequence
    The second unaligned sequence.
gap_open_penalty : int or float, optional
    Penalty for opening a gap (this is substracted from previous best
    alignment score, so is typically positive).
gap_extend_penalty : int or float, optional
    Penalty for extending a gap (this is substracted from previous best
    alignment score, so is typically positive).
match_score : int or float, optional
    The score to add for a match between a pair of bases (this is added
    to the previous best alignment score, so is typically positive).
mismatch_score : int or float, optional
    The score to add for a mismatch between a pair of bases (this is
    added to the previous best alignment score, so is typically
    negative).
substitution_matrix: 2D dict (or similar)
    Lookup for substitution scores (these values are added to the
    previous best alignment score). If provided, this overrides
    ``match_score`` and ``mismatch_score``.

Returns
-------
skbio.Alignment
    ``Alignment`` object containing the aligned sequences as well as
    details about the alignment.

See Also
--------
local_pairwise_align
local_pairwise_align_protein
skbio.core.alignment.local_pairwise_align_ssw
global_pairwise_align
global_pairwise_align_protein
global_pairwise_align_nucelotide

Notes
-----
Default ``match_score``, ``mismatch_score``, ``gap_open_penalty`` and
``gap_extend_penalty`` parameters are derived from the NCBI BLAST
Server [1]_.

References
----------
.. [1] http://blast.ncbi.nlm.nih.gov/Blast.cgi

... and that documentation is also of course available on the web

We can easily apply these python aligners



In [3]:

    
from skbio.core.alignment.pairwise import local_pairwise_align_nucleotide

s1 = "ACTAAGGCTCTCTACCCCTCTCAGAGA"
s2 = "AAAAAACTCTCTAAACTCACTAAGGCTCTCTACCCCTCTTCAGAGAAGTCGA"
r = local_pairwise_align_nucleotide(s1, s2)
print(type(r))
print(r)









    



<class 'skbio.core.alignment.alignment.Alignment'>
>0
ACTAAGGCTCTCTACCCCTC-TCAGAGA
>1
ACTAAGGCTCTCTACCCCTCTTCAGAGA







    



/Users/caporaso/code/skbio/skbio/core/alignment/pairwise.py:300: EfficiencyWarning: You're using skbio's python implementation of Smith-Waterman alignment. This will be very slow (e.g., thousands of times slower) than skbio.core.alignment.local_pairwise_align_ssw.
  EfficiencyWarning)



In [4]:

    
from skbio import DNA

s1 = DNA("ACTAAGGCTCTCTACCCCTCTCAGAGA", "query")
s2 = DNA("AAAAAACTCTCTAAACTCACTAAGGCTCTCTACCCCTCTTCAGAGAAGTCGA", "target")
r = local_pairwise_align_nucleotide(s1, s2)
print(type(r))
print(r)









    



<class 'skbio.core.alignment.alignment.Alignment'>
>query
ACTAAGGCTCTCTACCCCTC-TCAGAGA
>target
ACTAAGGCTCTCTACCCCTCTTCAGAGA







    



/Users/caporaso/code/skbio/skbio/core/alignment/pairwise.py:300: EfficiencyWarning: You're using skbio's python implementation of Smith-Waterman alignment. This will be very slow (e.g., thousands of times slower) than skbio.core.alignment.local_pairwise_align_ssw.
  EfficiencyWarning)

But, can't use python for implementing alignment for production code, so provide Cython wrappers with matching interfaces



In [5]:

    
from skbio.core.alignment import local_pairwise_align_ssw
local_pairwise_align_ssw?









    



Type:        builtin_function_or_method
String form: <built-in function local_pairwise_align_ssw>
Docstring:
Align query and target sequences with Striped Smith-Waterman.

Parameters
----------
sequence1 : str or BiologicalSequence
    The first unaligned sequence
sequence2 : str or BiologicalSequence
    The second unaligned sequence

Returns
-------
``skbio.core.alignment.Alignment``
    The resulting alignment as an Alignment object

Notes
-----
For a complete list of optional keyword-arguments that can be provided,
see ``skbio.core.alignment.StripedSmithWaterman``.

The following kwargs will not have any effect: `suppress_sequences` and
`zero_index`

If an alignment does not meet a provided filter, `None` will be returned.

See Also
--------
skbio.core.alignment.StripedSmithWaterman



In [6]:

    
r = local_pairwise_align_ssw(s1, s2)
print(type(r))
print(r)









    



<class 'skbio.core.alignment.alignment.Alignment'>
>query
ACTAAGGCTCTCTACCCCTC-TCAGAGA
>target
ACTAAGGCTCTCTACCCCTCTTCAGAGA

As expected, the C/Cython code is much faster...



In [7]:

    
from skbio.parse.sequences import parse_fasta
from skbio import SequenceCollection
from random import choice
gg_path = "/Users/caporaso/data/gg_13_8_otus/rep_set/73_otus.fasta"

s = SequenceCollection.from_fasta_records([(i, s) for i, s in parse_fasta(gg_path) if set(s) == set('ACGT')], DNA)



In [8]:

    
%timeit local_pairwise_align_ssw(choice(s), choice(s), gap_open_penalty=5,\
                                 gap_extend_penalty=2, match_score=2, mismatch_score=-3)









    



100 loops, best of 3: 4.25 ms per loop



In [9]:

    
warnings.simplefilter('ignore')
%timeit local_pairwise_align_nucleotide(choice(s), choice(s), gap_open_penalty=5,\
                                        gap_extend_penalty=2, match_score=2, mismatch_score=-3)









    



1 loops, best of 3: 18.8 s per loop

scikit-bio: simpler bioinformatics pipeline development

We'll re-create QIIME's beta_diversity_through_plots.py workflow.

For conceptual discussion, see the Studying Biological Diversity chapter of An Introduction to Applied Bioinformatics.

Six samples of the human microbiome (two subjects and three body sites)



In [10]:

    
import pandas as pd
sample_md = {
   'A': {'body_site': 'gut', 'subject': '1'},
   'B': {'body_site': 'skin', 'subject': '1'},
   'C': {'body_site': 'tongue', 'subject': '1'},
   'D': {'body_site': 'gut', 'subject': '2'},
   'E': {'body_site': 'tongue', 'subject': '2'},
   'F': {'body_site': 'skin', 'subject': '2'}}
sample_md = pd.DataFrame.from_dict(sample_md, orient='index')
sample_md









    Out[10]:






  
    
      
      subject
      body_site
    
  
  
    
      A
       1
          gut
    
    
      B
       1
         skin
    
    
      C
       1
       tongue
    
    
      D
       2
          gut
    
    
      E
       2
       tongue
    
    
      F
       2
         skin
    
  

6 rows × 2 columns

Are samples derived from the same subject more similar than samples from different subjects?

(Here we have 6 samples and 7 taxa, but our current record is 15,000 samples and 5.6 million taxa.)



In [11]:

    
data = [[23, 64, 14, 0, 0, 3, 1],
        [0, 3, 35, 42, 0, 12, 1],
        [0, 5, 5, 0, 40, 40, 0],
        [44, 35, 9, 0, 1, 0, 0],
        [0, 2, 8, 0, 35, 45, 1],
        [0, 0, 25, 35, 0, 19, 0]]
table = pd.DataFrame(data,
                     columns=['Species 1', 'Species 2', 'Species 3', 'Species 4', 'Species 5', 'Species 6', 'Species 7'],
                     index=list('ABCDEF'))
table









    Out[11]:






  
    
      
      Species 1
      Species 2
      Species 3
      Species 4
      Species 5
      Species 6
      Species 7
    
  
  
    
      A
       23
       64
       14
        0
        0
        3
       1
    
    
      B
        0
        3
       35
       42
        0
       12
       1
    
    
      C
        0
        5
        5
        0
       40
       40
       0
    
    
      D
       44
       35
        9
        0
        1
        0
       0
    
    
      E
        0
        2
        8
        0
       35
       45
       1
    
    
      F
        0
        0
       25
       35
        0
       19
       0
    
  

6 rows × 7 columns

Using scikit-bio and scipy, we can then compute pairwise distances between the samples



In [12]:

    
from skbio.math.diversity.beta import pw_distances

bc_dm = pw_distances(table, table.index, "braycurtis")
print(bc_dm)









    



6x6 distance matrix
IDs:
A, B, C, D, E, F
Data:
[[ 0.          0.78787879  0.86666667  0.30927835  0.85714286  0.81521739]
 [ 0.78787879  0.          0.78142077  0.86813187  0.75        0.1627907 ]
 [ 0.86666667  0.78142077  0.          0.87709497  0.09392265  0.71597633]
 [ 0.30927835  0.86813187  0.87709497  0.          0.87777778  0.89285714]
 [ 0.85714286  0.75        0.09392265  0.87777778  0.          0.68235294]
 [ 0.81521739  0.1627907   0.71597633  0.89285714  0.68235294  0.        ]]

And create ordination plots using scikit-bio and matplotlib



In [13]:

    
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def scatter_3d(ord_results, df, column, color_map, title='', axis1=0,
               axis2=1, axis3=2):
   """Adapted from Matplotlib Gallery:
   http://matplotlib.org/examples/mplot3d/scatter3d_demo.html
   """
   coord_matrix = ord_results.site.T
   ids = ord_results.site_ids
   colors = [color_map[df[column][id_]] for id_ in ord_results.site_ids]

   fig = plt.figure()
   ax = fig.add_subplot(111, projection='3d')

   xs = coord_matrix[axis1]
   ys = coord_matrix[axis2]
   zs = coord_matrix[axis3]
   plot = ax.scatter(xs, ys, zs, c=colors, s=150)

   ax.set_xlabel('PC %d' % (axis1 + 1))
   ax.set_ylabel('PC %d' % (axis2 + 1))
   ax.set_zlabel('PC %d' % (axis3 + 1))
   ax.set_xticklabels([])
   ax.set_yticklabels([])
   ax.set_zticklabels([])
   ax.set_title(title)
   return fig



In [14]:

    
from skbio.math.stats.ordination import PCoA
bc_pc = PCoA(bc_dm).scores()



In [15]:

    
# This function adapted from Matplotlib Gallery's:
# http://matplotlib.org/examples/mplot3d/scatter3d_demo.html
fig = scatter_3d(bc_pc, sample_md, 'subject', {'1': 'yellow', '2': 'purple'},
                 'Samples colored by subject')



In [16]:

    
# This function adapted from Matplotlib Gallery's:
# http://matplotlib.org/examples/mplot3d/scatter3d_demo.html
fig = scatter_3d(bc_pc, sample_md, 'body_site',
                 {'gut': 'b', 'skin': 'r', 'tongue': 'g'},
                 'Samples colored by body site')

And finally run stats to determine if clustering patterns are signficiant



In [17]:

    
from skbio.math.stats.distance import ANOSIM
anosim = ANOSIM(bc_dm, sample_md, column='subject')
results = anosim(999)
print(results.statistic)

print(results.p_value < 0.05)









    



-0.407407407407
False



In [18]:

    
anosim = ANOSIM(bc_dm, sample_md, column='body_site')
results = anosim(999)
print(results.statistic)

print(results.p_value < 0.1)









    



1.0
True

And these are the types of examples that we document:

Finally, for fun (and to show the generalizability of scikit-bio):

Can we use the same tools to model the "evolutionary" relationships between human languages?

Let's start with tuples of:

   (name of language, 
    spelling of phrase for ordering a beer,
    phonetic phrase for ordering a beer)

[Source]



In [19]:

    
languages = [("Afrikaans", "'n Bier, asseblief", "A beer ah-suh-bleef"),
 ("Basque", "Garagardo bat, mesedez", "Gara-gardo bat mese-des"),
 ("Breton", "Ur banne bier am bo, mar plij", "Oor bah-ne beer am boh mar pleezh"),
 ("Catalan", "Una cervesa, si us plau", "Oona servayzeh see oos plow"),
 ("Croatian", "Jedno pivo, molim", "Yed-no pee-vo, mo-lim"),
 ("Czech / Slovak", "Pivo, prosím", "Pee-vo, pro-seem"),
 ("Danish", "Jeg vil gerne have en øl", "Yay vil geh-neh heh en url"),
 ("Dutch", "Een bier, alsjeblieft", "Un beer, ahls-yer-bleeft"),
 ("English", "One beer, please", "Wun beer, pleez"),
 ("Esperanto", "Unu bieron, mi petas", "Oo-noo bee-airon, mee peh-tahs"),
 ("Estonian", "Üks õlu, palun", "Ooks ur-loo, pah-lun"),
 ("Finnish", "Olut mulle, kiitos", "O-loot moolek kee-tos"),
 ("French", "Une bière, s'il vous plaît", "Oon bee-air, seel voo pleh"),
 ("German", "Ein Bier, bitte", "Ine beer, bitt-uh"),
 ("Hungarian", "Egy pohár sört kérek", "Edj pohar shurt kayrek"),
 ("Icelandic", "Einn bjór, takk", "Ay-dn byohr tahk"),
 ("Irish", "Beoir amháin, le do thoil", "Byohr awoyn, lyeh doh hull"),
 ("Italian", "Una birra, per favore", "Oo-na beer-ra, pair fa-vo-re"),
 ("Latin", "Cervisiam, sodes", "Ker-wi-see-am, soh-dehs"),
 ("Latvian", "Vienu alu, lū-dzu", "Vyeh-noo ah-loo, loo dzoo"),
 ("Lithuanian", "Prašau viena alaus", "Pra-shau vie-na al-lows"),
 ("Maltese", "Wiehed birra, jekk jghogbok", "Wee-het bir-ra yek yoh-dzbok"),
 ("Norwegian", "En øl, takk", "Ehn url tahk"),
 ("Occitan", "Una cervesa, se vos plai", "Oo-no serbeh-zo se bus ply"),
 ("Polish", "Jedno piwo, proszę", "Yed-no peevo proshe"),
 ("Portuguese", "Uma cerveja, por favor", "Oo-ma ser-vay-ja, poor fa-vohr"),
 ("Romansch Ladina", "Üna biera, per plaschair.", "Oo-nuh bee-air-uh per plah-chair"),
 ("Sardinian", "Una birra, po piaghere", "Oo-na beer-ra po pia-gehre"),
 ("Scots Gaelic", "Leann, mas e do thoil e", "Lyawn mahs eh doh hawl eh"),
 ("Slovene", "Eno pivo, prosim", "Eno pee-vo pro-seem"),
 ("Spanish (Lat. Am.)", "Una cerveza, por favor", "Oo-na ser-veh-sa, por fa-vor"),
 ("Spanish (Spain)", "Una cerveza, por favor", "Oo-na thair-veh-tha, por fa-vor"),
 ("Strine", "Foster's, mate", "Faw-stuhz, mayt"),
 ("Swedish", "En öl, tack", "Ehn irl, tahk"),
 ("Twi", "Mame beer baako, mi pawokyew", "Mah-me bee-ye bah-ko mee pow-che-oo"),
 ("Turkish", "Bir bira, lütfen", "Beer beer-ah luht-fen"),
 ("Welsh", "Cwrw os gwelwch in dda", "Koo-roh ohs gwel-ookh-un-thah")]

We'll build a basic nucleotide-like substitution matrix



In [20]:

    
language_to_pron = {e[0]: e[2] for e in languages}

all_pron_chars = []
for e in language_to_pron.values():
    all_pron_chars.extend(e)
all_pron_chars = set(all_pron_chars)

pron_substitution_matrix = {}
for c in all_pron_chars:
    row = {}.fromkeys(all_pron_chars, -2.0)
    row[c] = 5.0
    pron_substitution_matrix[c] = row

We can then globally align the phrases and compute distances between them



In [21]:

    
from skbio.core.alignment.pairwise import global_pairwise_align

alignment = global_pairwise_align(language_to_pron["Swedish"],
                                  language_to_pron["Norwegian"], 
                                  gap_open_penalty=5, gap_extend_penalty=2,
                                  substitution_matrix=pron_substitution_matrix)
print(alignment.to_fasta())
print("Hamming distance: %1.3f" % alignment.distances()[0,1])









    



>0
Ehn irl, tahk
>1
Ehn url- tahk

Hamming distance: 0.154



In [22]:

    
alignment = global_pairwise_align(language_to_pron["Swedish"],
                                  language_to_pron["Icelandic"], 
                                  gap_open_penalty=5, gap_extend_penalty=2,
                                  substitution_matrix=pron_substitution_matrix)
print(alignment.to_fasta())
print("Hamming distance: %1.3f" % alignment.distances()[0,1])









    



>0
--Ehn ---irl, tahk
>1
Ay-dn byohr-- tahk

Hamming distance: 0.556



In [23]:

    
alignment = global_pairwise_align(language_to_pron["Spanish (Spain)"],
                                  language_to_pron["Italian"], 
                                  gap_open_penalty=5, gap_extend_penalty=2,
                                  substitution_matrix=pron_substitution_matrix)
print(alignment.to_fasta())
print("Hamming distance: %1.3f" % alignment.distances()[0,1])









    



>0
Oo-na thair-veh-tha, p-or fa-vo-r-
>1
Oo-na be-----er-r-a, pair fa-vo-re

Hamming distance: 0.353

We can go further, and compute all pairwise alignments and distances



In [24]:

    
warnings.simplefilter('ignore')
from skbio.core.alignment.pairwise import global_pairwise_align
from skbio import DistanceMatrix

languages = language_to_pron.keys()
distances = np.zeros((len(languages), len(languages)))
for i, language1 in enumerate(languages):
    language1_phrase = language_to_pron[language1]
    for j in range(i):
        language2 = languages[j]
        language2_phrase = language_to_pron[language2]
        alignment = global_pairwise_align(language1_phrase, language2_phrase,
                                          gap_open_penalty=5, gap_extend_penalty=2,
                                          substitution_matrix=pron_substitution_matrix)
        distances[i, j] = distances[j, i] = alignment.distances()[0,1]

dm = DistanceMatrix(distances, languages)
print(dm)









    



37x37 distance matrix
IDs:
Swedish, Icelandic, Estonian, Turkish, Twi, Sardinian, Romansch Ladina, Dutch, ...
Data:
[[ 0.          0.5         0.6        ...,  0.6         0.76923077
   0.76470588]
 [ 0.5         0.          0.8        ...,  0.66666667  0.76923077
   0.76470588]
 [ 0.6         0.8         0.         ...,  0.66666667  0.73076923
   0.72727273]
 ..., 
 [ 0.6         0.66666667  0.66666667 ...,  0.          0.59375     0.61111111]
 [ 0.76923077  0.76923077  0.73076923 ...,  0.59375     0.          0.65714286]
 [ 0.76470588  0.76470588  0.72727273 ...,  0.61111111  0.65714286  0.        ]]

And build a tree to visualize relationships



In [25]:

    
from scipy.cluster.hierarchy import average, dendrogram, to_tree

lm = average(dm.condensed_form())

def format_dendrogram(tip_count):
    import matplotlib.pylab as plt
    ax = plt.gca()
    fig = plt.gcf()
    height = tip_count * 0.4
    if height < 3:
        height = 3
    fig.set_size_inches(7, height)
    font = {'family' : 'normal',
        'weight' : 'normal',
        'size'   : 18}

    matplotlib.rc('font', **font)
    return ax

format_dendrogram(dm.shape[0])
d = dendrogram(lm, labels=dm.ids, orientation='right',
               link_color_func=lambda x: 'black')

Currently in pre-alpha release stage, and working on defining our scope.

Contributors:



In [26]:

    
import skbio
print(skbio.title)
print(skbio.art)









    



               _ _    _ _          _     _
              (_) |  (_) |        | |   (_)
      ___  ___ _| | ___| |_ ______| |__  _  ___
     / __|/ __| | |/ / | __|______| '_ \| |/ _ \
     \__ \ (__| |   <| | |_       | |_) | | (_) |
     |___/\___|_|_|\_\_|\__|      |_.__/|_|\___/




           Opisthokonta
                   \  Amoebozoa
                    \ /
                     *    Euryarchaeota
                      \     |_ Crenarchaeota
                       \   *
                        \ /
                         *
                        /
                       /
                      /
                     *
                    / \
                   /   \
        Proteobacteria  \
                       Cyanobacteria

	Species 1	Species 2	Species 3	Species 4	Species 5	Species 6	Species 7
A	23	64	14	0	0	3	1
B	0	3	35	42	0	12	1
C	0	5	5	0	40	40	0
D	44	35	9	0	1	0	0
E	0	2	8	0	35	45	1
F	0	0	25	35	0	19	0

	Species 1	Species 2	Species 3	Species 4	Species 5	Species 6	Species 7
A	23	64	14	0	0	3	1
B	0	3	35	42	0	12	1
C	0	5	5	0	40	40	0
D	44	35	9	0	1	0	0
E	0	2	8	0	35	45	1
F	0	0	25	35	0	19	0

scikit-bio: core bioinformatics data structures and algorithms in Python

The pre-history of scikit-bio...

github.com/biocore

Quantitative Insights Into Microbial Ecology

Cited 1596 times since publication in 2010

scikit-bio: framework to make building tools like QIIME easier

Integration with the python scientific computing stack

Modern community standards

Core objects and algorithms for bioinformatics

scikit-bio.org/docs/latest/

scikit-bio: education-ready and production-ready toolkit

An Introduction To Applied Bioinformatics

applied-bioinformatics.org

Bioinformatics education in the context of production-ready implementations

Disclaimer

An Introduction To Applied Bioinformatics

applied-bioinformatics.org

Detailed API documentation, so accesible to new users...

... and that documentation is also of course available on the web

We can easily apply these python aligners

But, can't use python for implementing alignment for production code, so provide Cython wrappers with matching interfaces

As expected, the C/Cython code is much faster...

scikit-bio: simpler bioinformatics pipeline development

We'll re-create QIIME's beta_diversity_through_plots.py workflow.

For conceptual discussion, see the Studying Biological Diversity chapter of An Introduction to Applied Bioinformatics.

Six samples of the human microbiome (two subjects and three body sites)

Are samples derived from the same subject more similar than samples from different subjects?

(Here we have 6 samples and 7 taxa, but our current record is 15,000 samples and 5.6 million taxa.)

Using scikit-bio and scipy, we can then compute pairwise distances between the samples

And create ordination plots using scikit-bio and matplotlib

And finally run stats to determine if clustering patterns are signficiant

And these are the types of examples that we document:

Finally, for fun (and to show the generalizability of scikit-bio):

Can we use the same tools to model the "evolutionary" relationships between human languages?

Let's start with tuples of:

We'll build a basic nucleotide-like substitution matrix

We can then globally align the phrases and compute distances between them

We can go further, and compute all pairwise alignments and distances

And build a tree to visualize relationships

Currently in pre-alpha release stage, and working on defining our scope.

Contributors:

	Species 1	Species 2	Species 3	Species 4	Species 5	Species 6	Species 7
A	23	64	14	0	0	3	1
B	0	3	35	42	0	12	1
C	0	5	5	0	40	40	0
D	44	35	9	0	1	0	0
E	0	2	8	0	35	45	1
F	0	0	25	35	0	19	0