Useful third-party libraries: numpy, scipy, biopython and networkx

How to install third-party libraries

There are basically four ways to install third-party libraries:

  1. using the provided setup.py file
  2. using Python package managers like pip
  3. using scientific software managers like anaconda (binaries) or linuxbrew/homebrew
  4. using your operating system's package manager (e.g. apt-get for Debian-based Linux distributions)

In [ ]:
# setup.py example
# %%bash
# wget https://github.com/biopython/biopython/archive/biopython-168.tar.gz
# tar -xvf biopython-168.tar.gz
# cd biopython-biopython-168  # cd into the extracted directory, not the archive
# sudo python setup.py install

In [ ]:
# using pip
# !pip install biopython

In [ ]:
# using anaconda
# !conda install biopython

In [ ]:
# using apt-get
# !sudo apt-get install python-biopython python3-biopython

NumPy: array operations in python

You already encountered numpy as the "engine" that powers pandas. It is also used by many other third-party libraries to allow fast computation on arrays (e.g. scipy or scikit-learn).

The most important feature of NumPy is the ndarray (n-dimensional array), a multidimensional matrix with a fixed type. This allows faster computations and more intuitive array manipulations.


In [ ]:
import numpy as np

In [ ]:
# simple array creation
a = np.array([2,3,4])
a

In [ ]:
# what type is my array?
a.dtype

In [ ]:
b = np.array([1.2, 3.5, 5.1])
b

In [ ]:
b.dtype
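
Since the dtype is fixed, values assigned to an existing array are coerced to it; a minimal sketch of this behaviour:

In [ ]:
# assigning a float to an integer array silently truncates it,
# because the array's dtype is fixed at creation
c = np.array([1, 2, 3])
c[0] = 9.7
c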

We can check the dimensions of the array by using the shape attribute


In [ ]:
b.shape

Numpy offers several array constructors that can be quite handy.


In [ ]:
# like range, but returns an array
np.arange(10, step=0.2)

In [ ]:
# evenly-spaced numbers
np.linspace(0, 10, num=20)

In [ ]:
# evenly spaced numbers on a logarithmic scale (base 2)
np.logspace(0, 10, num=20, base=2)

In [ ]:
np.zeros(10)

In [ ]:
# multidimensional
z = np.zeros((10, 5))
z

In [ ]:
z.shape

In [ ]:
np.ones((10, 5))

Some of the operations you applied to dataframes apply to numpy arrays too


In [ ]:
a = np.array((np.linspace(0, 10, num=10),
              np.logspace(0, 10, num=10),))
a

In [ ]:
a.shape

In [ ]:
# equivalent to a.transpose(), but more concise
a.T

In [ ]:
a.T.shape

In [ ]:
a.mean()

In [ ]:
np.std(a)

In [ ]:
np.median(a)

We can also change the dimensions of the array


In [ ]:
# rows/columns
a.reshape(4, 5)

In [ ]:
# rows/columns (this will fail: 6x5 does not match the 20 elements of a)
a.reshape(6, 5)

Mathematical operations on arrays are quite simple


In [ ]:
b = np.array((np.linspace(0, 10, num=5),
              np.logspace(0, 10, num=5),))

In [ ]:
# a has shape (2, 10) while b has shape (2, 5): this raises an error
a + b

In [ ]:
b = np.array((np.linspace(0, 10, num=10),
              np.logspace(0, 10, num=10),))
# with matching shapes, the sum is element-wise
a + b
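
When shapes differ, numpy tries to broadcast the smaller operand across the larger one; this only works when the trailing dimensions are compatible. A minimal sketch:

In [ ]:
# the (10,) vector is "stretched" across both rows of the (2, 10) array
a + np.arange(10)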

In [ ]:
a - b

In [ ]:
a**2

In [ ]:
a > 5

In [ ]:
a[a > 5]

In [ ]:
# element-wise product
a * b

In [ ]:
a.shape

In [ ]:
# matrix product
a = np.random.random((3, 3))
b = np.random.random((3, 3))
np.dot(a, b)

Indexing and slicing


In [ ]:
a = np.random.random((3, 5))

In [ ]:
# get row 1 (2nd row)
a[1]

In [ ]:
# get column 1 (2nd column)
a[:,1]

In [ ]:
# three-dimensional array
a = np.random.random((3, 5, 2))
a

In [ ]:
a[1, 1:4, 1]

SciPy: scientific python

SciPy is a very large library of scientific calculations and statistics to be performed on numpy arrays.

The modules contained in this library are the following:

  • Special functions (scipy.special)
  • Integration (scipy.integrate)
  • Optimization (scipy.optimize)
  • Interpolation (scipy.interpolate)
  • Fourier Transforms (scipy.fftpack)
  • Signal Processing (scipy.signal)
  • Linear Algebra (scipy.linalg)
  • Sparse Eigenvalue Problems with ARPACK
  • Compressed Sparse Graph Routines (scipy.sparse.csgraph)
  • Spatial data structures and algorithms (scipy.spatial)
  • Statistics (scipy.stats)
  • Multidimensional image processing (scipy.ndimage)
  • File IO (scipy.io)
  • Weave (scipy.weave)

There are clearly too many to go through them all, but it is worth highlighting the statistics (stats) and spatial modules.


In [ ]:
from scipy import stats

In [ ]:
# get samples from a normal distribution
# loc: mean
# scale: std
n = stats.norm.rvs(loc=0, scale=1, size=100)
n

In [ ]:
stats.normaltest(n)

In [ ]:
n1 = stats.norm.rvs(loc=0, scale=1, size=100)
n2 = stats.norm.rvs(loc=0.5, scale=1, size=100)

In [ ]:
# ttest
stats.ttest_ind(n1, n2)

In [ ]:
# Kolmogorov-Smirnov test
stats.ks_2samp(n1, n2)
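
The stats module also covers correlation measures; a quick sketch using the two samples above:

In [ ]:
# Pearson correlation coefficient and its p-value
# (stats.spearmanr works in the same way)
stats.pearsonr(n1, n2)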

In [ ]:
table = [[1, 15],
         [10, 20]]
stats.fisher_exact(table)

In [ ]:
table = [[1, 15],
         [10, 20]]
stats.fisher_exact(table, alternative='less')

In [ ]:
from scipy.spatial import distance

In [ ]:
a = np.random.random((3, 5))
a

In [ ]:
distance.pdist(a, metric='canberra')

In [ ]:
distance.squareform(distance.pdist(a, metric='canberra'))

Biopython: the swiss-army-knife library for bioinformatics

Biopython (http://biopython.org/wiki/Biopython) is a collection of libraries to manipulate files related to computational biology, from sequence data to pdb files. It allows the conversion between formats and even the interrogation of commonly used databases, such as NCBI and KEGG.

Sequence manipulations

Biopython uses a complex series of objects to represent biological sequences: SeqRecord, Seq and so on. In most cases the user is not expected to create a sequence from scratch but to read it from a file, so learning how to manipulate sequences is relatively easy.

When a sequence is read from a file it comes as a SeqRecord object, which can handle annotations on top of a sequence.

As in many Biopython modules, parsing can be done through either the parse or the read function. The former acts as an iterator, which means that it can be used in a for loop to access one sequence at a time. The latter is used when the file contains one and only one record.


In [ ]:
from Bio import SeqIO

# read expects exactly one record: this fails on a multi-record file
s = SeqIO.read('../data/proteome.faa', 'fasta')

In [ ]:
sequences = SeqIO.parse('../data/proteome.faa', 'fasta')
sequences

In [ ]:
for s in sequences:
    print(s.id)
    break
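
Since parse returns an iterator, it is exhausted once consumed; to count the records we can simply create a fresh one. A minimal sketch:

In [ ]:
# count the records by iterating over a new generator
sum(1 for record in SeqIO.parse('../data/proteome.faa', 'fasta'))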

In [ ]:
type(s)

In [ ]:
dir(s)

In [ ]:
s.description

Sequence objects can be sliced as strings; the actual sequence can be found under the attribute seq.


In [ ]:
# first 10 amino acids
s[:10]

In [ ]:
s[:10].seq

In [ ]:
str(s[10:20].seq)

Sequence formats conversion

We are going to take a genome in genbank format (https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) and convert it to the much simpler Fasta format.


In [ ]:
# a quick look at how a GenBank file looks
!head ../data/ecoli.gbk

In [ ]:
# a quick look at how a GenBank file looks
!tail ../data/ecoli.gbk

In [ ]:
s = SeqIO.read('../data/ecoli.gbk', 'genbank')
SeqIO.write(s, 'ecoli.fasta', 'fasta')
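
The same conversion can be done in a single call with SeqIO.convert, which also handles multi-record files:

In [ ]:
# one-line equivalent of the read/write combination above
SeqIO.convert('../data/ecoli.gbk', 'genbank', 'ecoli.fasta', 'fasta')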

In [ ]:
# a quick check of the result
!head ecoli.fasta

In [ ]:
!tail ecoli.fasta

Sequence manipulation example


In [ ]:
# forward
str(s[1000:2000].seq)

In [ ]:
# reverse complement
str(s[1000:2000].reverse_complement().seq)

GenBank format features extraction


In [ ]:
# first five features of this genbank file
for feat in s.features[:5]:
    print(feat)

In [ ]:
type(feat)

Features (more properly, SeqFeature objects) contain all the information related to an annotation that belongs to a sequence. The most notable attributes are location, strand and qualifiers.


In [ ]:
feat.location.start, feat.location.end, feat.strand

In [ ]:
feat.qualifiers

In [ ]:
# we can also translate the original sequence
s[feat.location.start:feat.location.end].seq.translate()

In [ ]:
# we can also translate the original sequence (let's remove the last codon)
s[feat.location.start:feat.location.end-3].seq.translate()
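
Slicing the parent sequence by hand only works for features on the forward strand; the extract method of the feature should take care of the strand for us (a minimal sketch):

In [ ]:
# extract the feature's sequence (reverse-complementing if needed) and translate it
feat.extract(s.seq).translate()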

SeqRecord as a prime example of OOP in python

As you might have noticed, parsing sequence data in python involves storing a variety of information, not only the sequence name and the sequence itself. This is especially true for annotation-rich formats such as the GenBank format (or its cousin, the GFF format).

In order to flexibly expose those useful annotations whenever a sequence file is parsed, Biopython uses the SeqRecord object. As you have seen, it has several attributes with "standard" types (id, description, ...), a series of methods (reverse_complement, ...) and, more interestingly, a series of attributes that are instances of other Biopython objects.


In [ ]:
s.seq, type(s.seq)

In [ ]:
s.seq.alphabet, type(s.seq.alphabet)

The Bio.Alphabet submodule contains a series of alphabets to build biological sequences. This ensures that no forbidden characters are used in making a sequence. Since the alphabet is an abstract concept, it is designed to be inherited and extended by subclasses.


In [ ]:
from Bio import Alphabet

dir(Alphabet)

In [ ]:
help(Alphabet.Alphabet)

In [ ]:
help(Alphabet.SingleLetterAlphabet)

In [ ]:
help(Alphabet.NucleotideAlphabet)

In [ ]:
help(Alphabet.ThreeLetterProtein)
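
An alphabet can be attached to a sequence built by hand; a minimal sketch (note that recent Biopython releases have removed the Alphabet module altogether):

In [ ]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

# build a DNA sequence with an explicit alphabet
Seq('ATGGCCATT', IUPAC.unambiguous_dna)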

The Bio.Seq.Seq object is a lower-level representation of the biological sequence, meant to represent the sequence itself (and its alphabet) only. It also features several utility functions, such as reverse_complement, transcribe, translate, and so on...


In [ ]:
(s.seq, type(s.seq))

In [ ]:
dir(s.seq)

As you can see, the Seq class seems to be an extension of the string class, with which it shares several methods and attributes.
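
Some familiar string methods should therefore work directly on the sequence; a minimal sketch:

In [ ]:
# count and find behave as they do on plain strings
s.seq.count('ATG'), s.seq.find('ATG')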


In [ ]:
s.features[:10]

In [ ]:
feat = s.features[2]
feat

In [ ]:
dir(feat)

The SeqFeature class is also a very interesting example: apart from having a series of "regular" attributes (id, qualifiers, ...), it also contains instances of biopython-defined classes. The most interesting one is probably the location attribute, which contains a complex representation of the position of the feature inside the parent sequence. This is because locations are not always exactly defined (biology is messier than informatics!).


In [ ]:
feat.location.start, type(feat.location.start)

In [ ]:
from Bio import SeqFeature
help(SeqFeature.ExactPosition)

In [ ]:
feat.location
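
Fuzzy positions can also be built by hand, which makes it clearer why locations need their own classes; a minimal sketch:

In [ ]:
from Bio.SeqFeature import FeatureLocation, BeforePosition, AfterPosition

# a feature starting somewhere before position 5
# and ending somewhere after position 100
FeatureLocation(BeforePosition(5), AfterPosition(100), strand=1)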

If we had to draw an extremely simplified version of the SeqRecord hierarchy, it would look like this:

SeqRecord
  |
  +---- Alphabet
  |
  +---- Seq
  |
  +---- features
        |
        +---- SeqFeature
              |
              +---- FeatureLocation
                    |
                    +---- position
[...]

Reading PDB files

BioPython also contains a useful module to parse protein 3D structures, again in a variety of formats.


In [ ]:
# fetch pdb file
!wget http://www.rcsb.org/pdb/files/1g59.pdb

In [ ]:
from Bio.PDB.PDBParser import PDBParser

parser = PDBParser()
structure = parser.get_structure('1g59', '1g59.pdb')
header = parser.get_header()
# fetch the structural method and the resolution
print('Method: {0}'.format(header['structure_method']))
print('Resolution: {0}'.format(header['resolution']))

The returned object (structure) follows the structure, model, chain, residue, atom hierarchy (SMCRA).

Structure['1g59']
  |
  +---- Model[0]
          |
          +---- Chain['A']
          |       |
          |       +---- Residue[' ', 1, ' ']
          |       |       |
          |       |       +---- Atom['N']
          |       |       |
          |       |       +---- [...]
          |       |       |
          |       |       +---- Atom['CE']
          |       |
          |       +---- [...]
          |       |
          |       +---- Residue[' ', 468, ' '] [...]
          |
          +---- Chain['B'] [...]
          |
          +---- Chain['C'] [...]
          |
          +---- Chain['D'] [...]
          |
          +---- Chain[' ']
                  |
                  +---- Residue['W', 1, ' ']
                  |       |
                  |       +---- Atom['O']
                  |
                  +---- [...]
                  |
                  +---- Residue['W', 283, ' '] [...]

Q: why do you think that there can be more than one model inside a single PDB file?
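
Since every level of the hierarchy is iterable, we can also walk the whole structure with nested loops instead of indexing; a minimal sketch:

In [ ]:
# count chains and residues by iterating over the SMCRA levels
for model in structure:
    print('Model {0}: {1} chains'.format(model.id, len(model)))
    for chain in model:
        print('  Chain {0}: {1} residues'.format(chain.id, len(chain)))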


In [ ]:
model = structure[0]
chain = model['A']
residue = chain[(' ', 1, ' ')]
atom = residue['CE']

In [ ]:
chain.id

In [ ]:
residue.id[1], residue.resname

In [ ]:
atom.name, atom.occupancy, atom.bfactor, atom.coord

Read/manipulate phylogenetic trees

The Bio.Phylo module allows reading/writing/manipulating phylogenetic trees, as well as running complex evolutionary analysis software like codeml.

Note: even though Bio.Phylo can be used to draw phylogenetic trees, other libraries such as ete3 are recommended for their greater power and versatility.


In [ ]:
from Bio import Phylo

In [ ]:
# wrong format: this raises an error, since the file is in newick format
tree = Phylo.read('../data/tree.nwk', 'nexus')

In [ ]:
tree = Phylo.read('../data/tree.nwk', 'newick')

In [ ]:
dir(tree)

In [ ]:
# very simple visualization of a tree
Phylo.draw_ascii(tree)

In [ ]:
# get a list of terminal nodes
tree.get_terminals()

Each bifurcation and terminal node in the tree is a Clade object, with several network-like properties. Most of the attributes and methods are shared with the Tree object.
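
For instance, find_clades traverses the whole tree, visiting internal and terminal clades alike; a minimal sketch:

In [ ]:
# walk the tree, printing each clade's name and branch length
for clade in tree.find_clades():
    print(clade.name, clade.branch_length)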


In [ ]:
node = tree.get_terminals()[0]

In [ ]:
dir(node)

In [ ]:
# distance between root and our node
print('Distance between root and "{0}": {1}'.format(node.name,
                                                    tree.distance(tree.root, node)))

In [ ]:
# the root can be changed too
tree.root_at_midpoint()

In [ ]:
Phylo.draw_ascii(tree)

Interrogate the NCBI database using Bio.Entrez

The NCBI has a very useful programmatic interface for data retrieval, for which Biopython provides a very comprehensive module. Find more information about Entrez here: https://www.ncbi.nlm.nih.gov/books/NBK3837/


In [ ]:
from Bio import Entrez
Entrez.email = 'your@email.org'

In this minimal example we are going to link a BioProject ID to an NCBI taxonomy record. The possibilities of the interface are numerous, given that PubMed and its rich metadata can also be queried through Entrez.


In [ ]:
r = Entrez.esearch(db='bioproject',
                   term='PRJNA57779')
h = Entrez.read(r)

In [ ]:
h

In [ ]:
bioproject_id = h['IdList'][0]
bioproject_id

In [ ]:
r = Entrez.elink(dbfrom='bioproject', id=bioproject_id, linkname='bioproject_taxonomy')
h = Entrez.read(r)

In [ ]:
h

In [ ]:
taxonomy_id = h[0]['LinkSetDb'][0]['Link'][0]['Id']
taxonomy_id

In [ ]:
r = Entrez.efetch(db='taxonomy', id=taxonomy_id)

In [ ]:
h = Entrez.read(r)

In [ ]:
h

If you have to deal with complex analyses involving taxonomy, a better way to go is to use the ete3 library. Have a look at this page for more details.

The Entrez module is also useful for doing literature searches through PubMed.


In [ ]:
handle = Entrez.esearch(db='pubmed', 
                        sort='relevance', 
                        retmax='20',
                        retmode='xml', 
                        term='Escherichia coli')
results = Entrez.read(handle)
results

In [ ]:
handle = Entrez.efetch(db='pubmed',
                       retmode='xml',
                       id=results['IdList'][0])
results = Entrez.read(handle)
results

NetworkX: a Cytoscape-like library

This library collects many well-known algorithms to inspect graphs and network properties.

Graphs are encoded in a dictionary-like way, allowing easy and intuitive parsing. Simple plotting functions are available as well.


In [ ]:
import networkx as nx

In [ ]:
# undirected graph
g = nx.Graph()

# add nodes
g.add_node('eggs', price=2.5)
g.add_node('spam', price=3.1, rating=1)

# add edges
g.add_edge('eggs', 'spam', rating=3)
g.add_edge('steak', 'spam')

# add edges (and implicitly new nodes)
g.add_edge('eggs', 'omelette')
g.add_edge('vanilla', 'fudge')

In [ ]:
g.nodes()

In [ ]:
g.edges()

In [ ]:
# access nodes and edges with a dictionary-like syntax
# (note: in networkx >= 2.0 this is g.nodes['eggs'])
g.node['eggs']

In [ ]:
g['eggs']

In [ ]:
g['eggs']['spam']

In [ ]:
g['spam']['eggs']

All the obvious properties can be easily computed.


In [ ]:
nx.degree(g)

In [ ]:
nx.betweenness_centrality(g)

In [ ]:
nx.edge_betweenness_centrality(g)

In [ ]:
for component in nx.connected_components(g):
    print(component)
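
Shortest paths are equally straightforward; a minimal sketch between two of the nodes defined above:

In [ ]:
# list of nodes along a shortest path
nx.shortest_path(g, 'eggs', 'steak')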

Graph visualization example


In [ ]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('white')
import random

In [ ]:
nx.draw_networkx_nodes?

In [ ]:
# Generate a series of random graphs
gs = [nx.powerlaw_cluster_graph(n=random.randint(10, 20),
                                m=random.randint(1, 3),
                                p=random.random()*0.05)
      for x in range(7)]

# Concatenate them in a single graph
# (there might be a more efficient way)
g = gs[0]
for g1 in gs[1:]:
    i = max(g.nodes()) + 1
    g.add_edges_from([(x+i, y+i) for (x, y) in g1.edges()])

# Calculate nodes and edge properties
# to have something to plot
betw_cent = nx.betweenness_centrality(g).values()
edge_betw_cent = nx.edge_betweenness_centrality(g).values()

# Graph layout
graph_pos = nx.fruchterman_reingold_layout(g)

plt.figure(figsize=(9, 9))

# Draw nodes
nx.draw_networkx_nodes(g, graph_pos,
                       # Node size depends on node degree
                       node_size=[x*15 for x in dict(nx.degree(g)).values()],
                       # Node color depends on node centrality
                       node_color=list(betw_cent),
                       cmap=plt.get_cmap('Blues'),
                       vmax=max(betw_cent),
                       vmin=0)
# Draw edges
nx.draw_networkx_edges(g, graph_pos,
                       # Width depends on edge centrality
                       width=[x*250 for x in edge_betw_cent],
                       edge_color='k')
sns.despine(bottom=True, left=True)
plt.xticks([])
plt.yticks([]);

Other useful libraries

  • GOAtools: GO terms enrichment analysis in python
  • statsmodels: advanced statistics
  • rpy2: useful interface to R, when your favorite library doesn't have a python alternative
  • pysam: read and manipulate SAM files
  • pyvcf: read and manipulate VCF files

...and many more