In [1]:
# I keep this as a cell in my title slide so I can rerun 
# it easily if I make changes, but it's low enough it won't
# be visible in presentation mode.
%run talktools


scikit-bio:

interactive bioinformatics in Python

[scikit-bio.org](http://scikit-bio.org)

Jai Ram Rideout

[@ElBrogrammer](https://github.com/ElBrogrammer)

[Caporaso Lab](http://caporasolab.us)

What's scikit-bio?

An open source Python bioinformatics library designed to be:

  • collaborative
  • interactive
  • well-documented
  • an educational resource
  • performant
  • well-tested

It's under active development and is pre-alpha: the API may change!

What's with the name?

scikit-bio:

It's a scikit: a toolkit built using SciPy that provides functionality used in bioinformatics.

It's the first scikit focused on bioinformatics.

Sometimes abbreviated skbio.

scikits


In [2]:
website('scikits.appspot.com/scikits', 'List of scikits')



Who should use it?

  • Researchers
    • Directly use skbio to analyze data, test hypotheses, and reach conclusions
    • Example: this presentation :)
  • Software developers
    • Use skbio to build larger systems that answer biological questions
    • Example: QIIME, EMPeror
  • Students
    • Use skbio as a tool for learning bioinformatics
    • Example: Introduction to Applied Bioinformatics (IAB)

Interactive computing: flexible API


In [3]:
from skbio.core.distance import DistanceMatrix

# Load using file path
dm1 = DistanceMatrix.from_file('smalldm.txt')

# Load using file object
with open('smalldm.txt', 'U') as dm_f:
    dm2 = DistanceMatrix.from_file(dm_f)

# They should be equal
dm1 == dm2


Out[3]:
True
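
The path-or-file-object pattern shown above can be sketched roughly as follows. This is a minimal illustration of the idea, not scikit-bio's implementation: `read_lines` is a hypothetical helper, and the sketch is written for Python 3.

```python
import io


def read_lines(path_or_file):
    """Return the non-blank lines from a file path or a file-like object.

    Hypothetical helper for illustration only; not part of scikit-bio.
    """
    if isinstance(path_or_file, str):
        # A string is treated as a path: open it here and close it when done.
        with open(path_or_file) as handle:
            return [line.rstrip("\n") for line in handle if line.strip()]
    # Anything else is assumed to be iterable over lines. The caller owns
    # the object, so it is not closed here.
    return [line.rstrip("\n") for line in path_or_file if line.strip()]


# Toy 2x2 matrix in tab-delimited form, passed as a file-like object.
text = "\ta\tb\na\t0.0\t1.0\nb\t1.0\t0.0\n"
lines = read_lines(io.StringIO(text))
```

Checking the argument type once at the boundary keeps the parsing code identical for both kinds of input, which is what lets `dm1 == dm2` hold in the cell above.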

Interactive computing: ASCII art!


In [1]:
from skbio.core.tree import TreeNode
tree = TreeNode.from_newick('((A, B)C, D)root;')
print(tree.ascii_art())


                    /-A
          /C-------|
-root----|          \-B
         |
          \-D

Documentation

Docstrings

  • numpydoc standard
  • Readable by humans and machines
  • Readable from within:
    • the code
    • an interactive session (Python/IPython/IPython Notebook)
    • website (HTML)
    • PDF
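
As a rough illustration of the numpydoc layout, here is a small function documented in that style. The function `mean_distance` is a made-up example and is not part of scikit-bio.

```python
def mean_distance(distances):
    """Compute the mean of a sequence of distances.

    Parameters
    ----------
    distances : iterable of float
        Distances to average. Must contain at least one value.

    Returns
    -------
    float
        Arithmetic mean of `distances`.

    Raises
    ------
    ValueError
        If `distances` is empty.

    Examples
    --------
    >>> mean_distance([0.0, 1.0, 2.0])
    1.0

    """
    values = list(distances)
    if not values:
        raise ValueError("distances must not be empty")
    return sum(values) / len(values)


# The docstring is available at runtime; this is the text that help()
# and IPython's `?` display.
summary = mean_distance.__doc__.splitlines()[0]
```

The underlined section headers (Parameters, Returns, Raises, Examples) are what make the same text readable both in a terminal and when rendered to HTML or PDF.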

Getting help


In [6]:
help(DistanceMatrix)


Help on class DistanceMatrix in module skbio.core.distance:

class DistanceMatrix(DissimilarityMatrix)
 |  Store distances between objects.
 |  
 |  A `DistanceMatrix` is a `DissimilarityMatrix` with the additional
 |  requirement that the matrix data is symmetric. There are additional methods
 |  made available that take advantage of this symmetry.
 |  
 |  See Also
 |  --------
 |  DissimilarityMatrix
 |  
 |  Notes
 |  -----
 |  The distances are stored in redundant (square-form) format [1]_. To
 |  facilitate use with other scientific Python routines (e.g., scipy), the
 |  distances can be retrieved in condensed (vector-form) format using
 |  `condensed_form`.
 |  
 |  `DistanceMatrix` only requires that the distances it stores are symmetric.
 |  Checks are *not* performed to ensure the other three metric properties
 |  hold (non-negativity, identity of indiscernibles, and triangle inequality)
 |  [2]_. Thus, a `DistanceMatrix` instance can store distances that are not
 |  metric.
 |  
 |  References
 |  ----------
 |  .. [1] http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
 |  .. [2] http://planetmath.org/metricspace
 |  
 |  Method resolution order:
 |      DistanceMatrix
 |      DissimilarityMatrix
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  condensed_form(self)
 |      Return an array of distances in condensed format.
 |      
 |      Returns
 |      -------
 |      ndarray
 |          One-dimensional ``numpy.ndarray`` of distances in condensed format.
 |      
 |      Notes
 |      -----
 |      Condensed format is described in [1]_.
 |      
 |      The conversion is not a constant-time operation, though it should be
 |      relatively quick to perform.
 |      
 |      References
 |      ----------
 |      .. [1] http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from DissimilarityMatrix:
 |  
 |  __eq__(self, other)
 |      Compare this dissimilarity matrix to another for equality.
 |      
 |      Two dissimilarity matrices are equal if they have the same shape, IDs
 |      (in the same order!), and have data arrays that are equal.
 |      
 |      Checks are *not* performed to ensure that `other` is a
 |      `DissimilarityMatrix` instance.
 |      
 |      Parameters
 |      ----------
 |      other : DissimilarityMatrix
 |          Dissimilarity matrix to compare to for equality.
 |      
 |      Returns
 |      -------
 |      bool
 |          ``True`` if `self` is equal to `other`, ``False`` otherwise.
 |      
 |      .. shownumpydoc
 |  
 |  __getitem__(self, index)
 |      Slice into dissimilarity data by object ID or numpy indexing.
 |      
 |      Extracts data from the dissimilarity matrix by object ID, a pair of
 |      IDs, or numpy indexing/slicing.
 |      
 |      Parameters
 |      ----------
 |      index : str, two-tuple of str, or numpy index
 |          `index` can be one of the following forms: an ID, a pair of IDs, or
 |          a numpy index.
 |      
 |          If `index` is a string, it is assumed to be an ID and a
 |          ``numpy.ndarray`` row vector is returned for the corresponding ID.
 |          Note that the ID's row of dissimilarities is returned, *not* its
 |          column. If the matrix is symmetric, the two will be identical, but
 |          this makes a difference if the matrix is asymmetric.
 |      
 |          If `index` is a two-tuple of strings, each string is assumed to be
 |          an ID and the corresponding matrix element is returned that
 |          represents the dissimilarity between the two IDs. Note that the
 |          order of lookup by ID pair matters if the matrix is asymmetric: the
 |          first ID will be used to look up the row, and the second ID will be
 |          used to look up the column. Thus, ``dm['a', 'b']`` may not be the
 |          same as ``dm['b', 'a']`` if the matrix is asymmetric.
 |      
 |          Otherwise, `index` will be passed through to
 |          ``DissimilarityMatrix.data.__getitem__``, allowing for standard
 |          indexing of a ``numpy.ndarray`` (e.g., slicing).
 |      
 |      Returns
 |      -------
 |      ndarray or scalar
 |          Indexed data, where return type depends on the form of `index` (see
 |          description of `index` for more details).
 |      
 |      Raises
 |      ------
 |      MissingIDError
 |          If the ID(s) specified in `index` are not in the dissimilarity
 |          matrix.
 |      
 |      Notes
 |      -----
 |      The lookup based on ID(s) is quick.
 |      
 |      .. shownumpydoc
 |  
 |  __init__(self, data, ids)
 |  
 |  __ne__(self, other)
 |      Determine whether two dissimilarity matrices are not equal.
 |      
 |      Parameters
 |      ----------
 |      other : DissimilarityMatrix
 |          Dissimilarity matrix to compare to.
 |      
 |      Returns
 |      -------
 |      bool
 |          ``True`` if `self` is not equal to `other`, ``False`` otherwise.
 |      
 |      See Also
 |      --------
 |      __eq__
 |      
 |      .. shownumpydoc
 |  
 |  __str__(self)
 |      Return a string representation of the dissimilarity matrix.
 |      
 |      Summary includes matrix dimensions, a (truncated) list of IDs, and
 |      (truncated) array of dissimilarities.
 |      
 |      Returns
 |      -------
 |      str
 |          String representation of the dissimilarity matrix.
 |      
 |      .. shownumpydoc
 |  
 |  copy(self)
 |      Return a deep copy of the dissimilarity matrix.
 |      
 |      Returns
 |      -------
 |      DissimilarityMatrix
 |          Deep copy of the dissimilarity matrix. Will be the same type as
 |          `self`.
 |  
 |  redundant_form(self)
 |      Return an array of dissimilarities in redundant format.
 |      
 |      As this is the native format that the dissimilarities are stored in,
 |      this is simply an alias for `data`.
 |      
 |      Returns
 |      -------
 |      ndarray
 |          Two-dimensional ``numpy.ndarray`` of dissimilarities in redundant
 |          format.
 |      
 |      Notes
 |      -----
 |      Redundant format is described in [1]_.
 |      
 |      Does *not* return a copy of the data.
 |      
 |      References
 |      ----------
 |      .. [1] http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
 |  
 |  to_file(self, out_f, delimiter='\t')
 |      Save the dissimilarity matrix to file in delimited text format.
 |      
 |      See Also
 |      --------
 |      from_file
 |      
 |      Parameters
 |      ----------
 |      out_f : file-like object or filename
 |          File-like object to write serialized data to, or name of
 |          file. If it's a file-like object, it must have a ``write``
 |          method, and it won't be closed. Else, it is opened and
 |          closed after writing.
 |      delimiter : str, optional
 |          Delimiter used to separate elements in output format.
 |  
 |  transpose(self)
 |      Return the transpose of the dissimilarity matrix.
 |      
 |      Notes
 |      -----
 |      A deep copy is returned.
 |      
 |      Returns
 |      -------
 |      DissimilarityMatrix
 |          Transpose of the dissimilarity matrix. Will be the same type as
 |          `self`.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods inherited from DissimilarityMatrix:
 |  
 |  from_file(cls, dm_f, delimiter='\t') from __builtin__.type
 |      Load dissimilarity matrix from a delimited text file or file path.
 |      
 |      Creates a `DissimilarityMatrix` instance from a serialized
 |      dissimilarity matrix stored as delimited text.
 |      
 |      `dm_f` can be a file-like or a file path object containing delimited
 |      text. The first line (header) must contain the IDs of each object. The
 |      subsequent lines must contain an ID followed by each dissimilarity
 |      (float) between the current object and all other objects, where the
 |      order of objects is determined by the header line.  For example, a 2x2
 |      dissimilarity matrix with IDs ``'a'`` and ``'b'`` might look like::
 |      
 |          <del>a<del>b
 |          a<del>0.0<del>1.0
 |          b<del>1.0<del>0.0
 |      
 |      where ``<del>`` is the delimiter between elements.
 |      
 |      Parameters
 |      ----------
 |      dm_f : iterable of str or str
 |          Iterable of strings (e.g., open file handle, file-like object, list
 |          of strings, etc.) or a file path (a string) containing a serialized
 |          dissimilarity matrix.
 |      delimiter : str, optional
 |          String delimiting elements in `dm_f`.
 |      
 |      Returns
 |      -------
 |      DissimilarityMatrix
 |          Instance of type `cls` containing the parsed contents of `dm_f`.
 |      
 |      Notes
 |      -----
 |      Whitespace-only lines can occur anywhere throughout the "file" and are
 |      ignored. Lines starting with ``#`` are treated as comments and ignored.
 |      These comments can only occur *before* the ID header.
 |      
 |      IDs will have any leading/trailing whitespace removed when they are
 |      parsed.
 |      
 |      .. note::
 |          File-like objects passed to this method will not be closed upon the
 |          completion of the parsing, it is responsibility of the owner of the
 |          object to perform this operation.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from DissimilarityMatrix:
 |  
 |  T
 |      Transpose of the dissimilarity matrix.
 |      
 |      See Also
 |      --------
 |      transpose
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  data
 |      Array of dissimilarities.
 |      
 |      A square, hollow, two-dimensional ``numpy.ndarray`` of dissimilarities
 |      (floats). A copy is *not* returned.
 |      
 |      Notes
 |      -----
 |      This property is not writeable.
 |  
 |  dtype
 |      Data type of the dissimilarities.
 |  
 |  ids
 |      Tuple of object IDs.
 |      
 |      A tuple of strings, one for each object in the dissimilarity matrix.
 |      
 |      Notes
 |      -----
 |      This property is writeable, but the number of new IDs must match the
 |      number of objects in `data`.
 |  
 |  shape
 |      Two-element tuple containing the dissimilarity matrix dimensions.
 |      
 |      Notes
 |      -----
 |      As the dissimilarity matrix is guaranteed to be square, both tuple
 |      entries will always be equal.
 |  
 |  size
 |      Total number of elements in the dissimilarity matrix.
 |      
 |      Notes
 |      -----
 |      Equivalent to ``self.shape[0] * self.shape[1]``.


In [7]:
DistanceMatrix?
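
The help text above mentions redundant (square-form) and condensed (vector-form) storage. The relationship between the two can be sketched in plain Python as below. The helpers `to_condensed` and `to_redundant` are hypothetical names for illustration; in practice `scipy.spatial.distance.squareform` performs this conversion, and the element order here is assumed to match it (upper triangle, row by row).

```python
def to_condensed(square):
    """Flatten the upper triangle (excluding the diagonal) row by row."""
    n = len(square)
    return [square[i][j] for i in range(n) for j in range(i + 1, n)]


def to_redundant(condensed, n):
    """Rebuild a symmetric, hollow n x n matrix from condensed form."""
    square = [[0.0] * n for _ in range(n)]
    values = iter(condensed)
    for i in range(n):
        for j in range(i + 1, n):
            # Each distance is stored twice in redundant form.
            square[i][j] = square[j][i] = next(values)
    return square


square = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 3.0, 0.0]]
condensed = to_condensed(square)
round_trip = to_redundant(condensed, 3)
```

The round trip only works because the matrix is symmetric and hollow, which is why the condensed form is offered on `DistanceMatrix` and not on the more general `DissimilarityMatrix`.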

In [8]:
website('scikit-bio.org', 'scikit-bio website')




Testing/validation

Wide variety of tests:

  • Unit tests
  • Documentation builds
  • Dead link checking
  • Code style checking (PEP8)
  • Doctests
  • Code coverage (94%)
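
To make the doctest item concrete, here is a minimal sketch of a docstring whose examples double as tests, run with the standard library's `doctest` module. The function `hamming` is a made-up example, not a scikit-bio function, and this says nothing about how scikit-bio's own test runner is configured.

```python
import doctest


def hamming(a, b):
    """Count positions at which two equal-length sequences differ.

    >>> hamming("ACGT", "ACGA")
    1
    >>> hamming("AAAA", "AAAA")
    0
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Parse the examples out of the docstring and execute them.
parser = doctest.DocTestParser()
test = parser.get_doctest(hamming.__doc__, {"hamming": hamming},
                          "hamming", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
```

`results.attempted` counts the examples that were run and `results.failed` counts those whose output did not match, so documentation that drifts out of date shows up as a test failure.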

Continuous Integration (CI) via Travis-CI

  • Every pull request is tested
  • Every push to master branch is tested
  • Tested against multiple versions (Python and dependencies)

Live demo: distance-based statistics

Let's work through an analysis that uses distance-based statistics to determine whether two or more groups of samples are significantly different (either in center or spread).

We'll use ANOSIM and PERMANOVA to perform the hypothesis tests.

Background: The Data

We'll use the data from Costello et al., "Bacterial Community Variation in Human Body Habitats Across Space and Time," Science (2009).

Figure 1 of that paper shows several different approaches for comparing the resulting UniFrac distance matrix (the figure is linked from the Science journal website; copyright belongs to Science).

We'll start with an unweighted UniFrac distance matrix generated by QIIME and a mapping file.

Load the data


In [9]:
# Import the functionality we'll need to perform the analyses.
import pandas as pd
from skbio.core.distance import DistanceMatrix
from skbio.math.stats.distance import ANOSIM, PERMANOVA

In [10]:
# Load the distance matrix
dm = DistanceMatrix.from_file('dm.txt')
print(dm)


439x439 distance matrix
IDs:
M12Aptr.140800, M41Kner.140735, F24Plmr.140433, M53Tong.140327, F31Indl.140679, ...
Data:
[[ 0.          0.8261686   0.80939057 ...,  0.76901199  0.56819613
   0.67845042]
 [ 0.8261686   0.          0.6563376  ...,  0.58830727  0.71583148
   0.72233134]
 [ 0.80939057  0.6563376   0.         ...,  0.63909922  0.71900128
   0.71307195]
 ..., 
 [ 0.76901199  0.58830727  0.63909922 ...,  0.          0.6172266
   0.64172663]
 [ 0.56819613  0.71583148  0.71900128 ...,  0.6172266   0.          0.478762  ]
 [ 0.67845042  0.72233134  0.71307195 ...,  0.64172663  0.478762    0.        ]]

In [11]:
# Load the mapping file
mf = pd.read_csv('map.txt', sep='\t', index_col=0)
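
Before a test can run, each sample ID in the distance matrix has to be matched to its category value in the mapping file. The sketch below shows that alignment step using only the standard library. The sample IDs and values are made up, and the `#SampleID` column name is an assumption about the mapping file layout.

```python
import csv
import io

# Toy stand-in for map.txt; the sample IDs and values here are made up.
mapping_text = (
    "#SampleID\tSEX\tBODY_SITE_COARSE\n"
    "S1\tmale\tgut\n"
    "S2\tfemale\tskin\n"
    "S3\tmale\toral\n"
)

reader = csv.DictReader(io.StringIO(mapping_text), delimiter="\t")
metadata = {row["#SampleID"]: row for row in reader}

# Order the labels to match the distance matrix IDs, which may differ
# from the order of the rows in the mapping file.
dm_ids = ["S3", "S1", "S2"]
grouping = [metadata[sample_id]["SEX"] for sample_id in dm_ids]
```

Looking labels up by ID, instead of relying on row order, avoids silently assigning samples to the wrong group.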

Run ANOSIM


In [12]:
# Run ANOSIM with 999 permutations using the body site category.
anosim = ANOSIM(dm, mf, 'BODY_SITE_COARSE')
anosim(999)


Out[12]:
        Sample size  Number of groups     R statistic  p-value  Number of permutations
ANOSIM          439                 3  0.643089761317    0.001                     999

1 rows × 5 columns


In [13]:
# Run ANOSIM with 999 permutations using the sex category.
anosim = ANOSIM(dm, mf, 'SEX')
anosim(999)


Out[13]:
        Sample size  Number of groups      R statistic  p-value  Number of permutations
ANOSIM          439                 2  0.0221749413225    0.030                     999

1 rows × 5 columns
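
For intuition about the numbers above, here is a simplified sketch of the ANOSIM R statistic and a permutation p-value in plain Python. It is not scikit-bio's implementation. It assumes the common definition R = (mean between-group rank - mean within-group rank) / (n(n-1)/4) and a p-value of (hits + 1) / (permutations + 1); all function names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def anosim_r(dm, groups):
    """Compute R from a square distance matrix and one label per sample."""
    n = len(groups)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([dm[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (n * (n - 1) / 4.0)


def permutation_p_value(dm, groups, statistic, permutations, seed=0):
    """Shuffle the labels and count statistics at least as large as observed."""
    rng = random.Random(seed)
    observed = statistic(dm, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if statistic(dm, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1.0)


# Toy data: two tight groups that are far from each other.
dm = [[0.0, 0.1, 0.9, 0.8],
      [0.1, 0.0, 0.7, 0.6],
      [0.9, 0.7, 0.0, 0.2],
      [0.8, 0.6, 0.2, 0.0]]
groups = ["a", "a", "b", "b"]
observed_r, p_value = permutation_p_value(dm, groups, anosim_r, 99)
```

Under this p-value definition, 0.001 is the smallest value attainable with 999 permutations, since (0 + 1) / (999 + 1) = 0.001. That would be consistent with the body site result, where no permuted labeling matched the observed R.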

Run PERMANOVA


In [14]:
# Run PERMANOVA with 999 permutations using the body site category.
permanova = PERMANOVA(dm, mf, 'BODY_SITE_COARSE')
permanova(999)


Out[14]:
           Sample size  Number of groups  pseudo-F statistic  p-value  Number of permutations
PERMANOVA          439                 3       51.2561893345    0.001                     999

1 rows × 5 columns


In [15]:
# Run PERMANOVA with 999 permutations using the sex category.
permanova = PERMANOVA(dm, mf, 'SEX')
permanova(999)


Out[15]:
           Sample size  Number of groups  pseudo-F statistic  p-value  Number of permutations
PERMANOVA          439                 2       3.41228470844    0.001                     999

1 rows × 5 columns
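
The pseudo-F statistic reported above can be sketched from a distance matrix alone. The version below assumes the usual sums-of-squared-distances formulation (total and within-group sums of squares computed from pairwise distances); it is an illustration, not scikit-bio's code, and `pseudo_f` and the toy data are hypothetical.

```python
from itertools import combinations


def pseudo_f(dm, groups):
    """Compute a PERMANOVA-style pseudo-F from distances and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = sum(dm[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: the same quantity computed per group.
    ss_within = 0.0
    for label in labels:
        members = [i for i in range(n) if groups[i] == label]
        ss_within += sum(dm[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    a = len(labels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


# Toy data: four points on a line, forming two well-separated pairs.
positions = [0.0, 1.0, 10.0, 11.0]
dm = [[abs(x - y) for y in positions] for x in positions]
groups = ["a", "a", "b", "b"]
f_stat = pseudo_f(dm, groups)
```

For these Euclidean toy distances the result equals the classical one-way ANOVA F ratio. Significance would be assessed by shuffling the labels and recomputing the statistic, the same permutation scheme the 999 permutations above refer to.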

Next steps

  • Initial 0.1.0 release this week
  • Present at SciPy 2014 (July)
  • Stabilize API
  • Automated performance testing
  • New features
    • alpha/beta diversity metrics
    • more distance-based stats

Acknowledgments

Special thanks to:

  • scikit-bio developers
  • Knight Lab
  • IPython developers

This notebook is based on the template/tools provided in Fernando Perez's nb-slideshow-template repository, and is licensed under the new BSD license.

Thanks for listening!


In [16]:
import skbio
print(skbio.title)
print(skbio.art)
print(skbio.motto)


               _ _    _ _          _     _
              (_) |  (_) |        | |   (_)
      ___  ___ _| | ___| |_ ______| |__  _  ___
     / __|/ __| | |/ / | __|______| '_ \| |/ _ \
     \__ \ (__| |   <| | |_       | |_) | | (_) |
     |___/\___|_|_|\_\_|\__|      |_.__/|_|\___/




           Opisthokonta
                   \  Amoebozoa
                    \ /
                     *    Euryarchaeota
                      \     |_ Crenarchaeota
                       \   *
                        \ /
                         *
                        /
                       /
                      /
                     *
                    / \
                   /   \
        Proteobacteria  \
                       Cyanobacteria

It's gonna get weird, bro.