Useful third-party libraries: numpy, scipy, biopython and networkx

How to install third-party libraries

There are basically four ways to install third-party libraries:

  1. using the provided setup.py file
  2. using Python package managers like pip
  3. using scientific software managers like anaconda (binaries) or linuxbrew/homebrew
  4. using your operating system's package manager (e.g. apt-get for Debian-based Linux distributions)

In [ ]:
# setup.py example
# %%bash
# wget https://github.com/biopython/biopython/archive/biopython-168.tar.gz
# tar -xvf biopython-168.tar.gz
# cd biopython-biopython-168  # cd into the extracted directory, not the archive
# sudo python setup.py install

In [ ]:
# using pip
# !pip install biopython

In [ ]:
# using anaconda
# !conda install biopython

In [ ]:
# using apt-get
# !sudo apt-get install python-biopython python3-biopython

NumPy: array operations in python

You already encountered numpy as the "engine" that powers pandas. It is also used by many other third-party libraries to allow fast computation on arrays (e.g. scipy or scikit-learn).

The most important feature of NumPy is the ndarray (n-dimensional array), a multidimensional matrix with a fixed type. This allows faster computations and more intuitive array manipulations.


In [ ]:
import numpy as np

In [ ]:
# simple array creation
a = np.array([2,3,4])
a

In [ ]:
# what type is my array?
a.dtype

In [ ]:
b = np.array([1.2, 3.5, 5.1])
b

In [ ]:
b.dtype
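
Since the dtype is fixed, values assigned to an existing array are coerced to it; a minimal sketch of this behaviour:

In [ ]:
# assigning a float to an integer array silently truncates it,
# because the array's dtype is fixed at creation
c = np.array([1, 2, 3])
c[0] = 9.7
c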

We can check the dimensions of the array by using the shape attribute


In [ ]:
b.shape

Numpy offers several array constructors that can be quite handy.


In [ ]:
# like range, but returns an array
np.arange(10, step=0.2)

In [ ]:
# evenly-spaced numbers
np.linspace(0, 10, num=20)

In [ ]:
# evenly spaced numbers on a logarithmic scale (base 2)
np.logspace(0, 10, num=20, base=2)

In [ ]:
np.zeros(10)

In [ ]:
# multidimensional
z = np.zeros((10, 5))
z

In [ ]:
z.shape

In [ ]:
np.ones((10, 5))

Some of the operations you applied to dataframes apply to numpy arrays too


In [ ]:
a = np.array((np.linspace(0, 10, num=10),
              np.logspace(0, 10, num=10),))
a

In [ ]:
a.shape

In [ ]:
# equivalent to a.transpose(), but more concise
a.T

In [ ]:
a.T.shape

In [ ]:
a.mean()

In [ ]:
np.std(a)

In [ ]:
np.median(a)

We can also change the dimensions of the array


In [ ]:
# rows/columns
a.reshape(4, 5)

In [ ]:
# rows/columns (this will fail: 6x5 does not match the 20 elements of a)
a.reshape(6, 5)

Mathematical operations on arrays are quite simple


In [ ]:
b = np.array((np.linspace(0, 10, num=5),
              np.logspace(0, 10, num=5),))

In [ ]:
# a has shape (2, 10) while b has shape (2, 5): this raises an error
a + b

In [ ]:
b = np.array((np.linspace(0, 10, num=10),
              np.logspace(0, 10, num=10),))
# with matching shapes, the sum is element-wise
a + b
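
When shapes differ, numpy tries to broadcast the smaller operand across the larger one; this only works when the trailing dimensions are compatible. A minimal sketch:

In [ ]:
# the (10,) vector is "stretched" across both rows of the (2, 10) array
a + np.arange(10)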

In [ ]:
a - b

In [ ]:
a**2

In [ ]:
a > 5

In [ ]:
a[a > 5]

In [ ]:
# element-wise product
a * b

In [ ]:
a.shape

In [ ]:
# matrix product
a = np.random.random((3, 3))
b = np.random.random((3, 3))
np.dot(a, b)

Indexing and slicing


In [ ]:
a = np.random.random((3, 5))

In [ ]:
# get row 1 (2nd row)
a[1]

In [ ]:
# get column 1 (2nd column)
a[:,1]

In [ ]:
# three-dimensional array
a = np.random.random((3, 5, 2))
a

In [ ]:
a[1, 1:4, 1]

SciPy: scientific python

SciPy is a very large library of scientific calculations and statistics to be performed on numpy arrays.

The modules contained in this library are the following:

  • Special functions (scipy.special)
  • Integration (scipy.integrate)
  • Optimization (scipy.optimize)
  • Interpolation (scipy.interpolate)
  • Fourier Transforms (scipy.fftpack)
  • Signal Processing (scipy.signal)
  • Linear Algebra (scipy.linalg)
  • Sparse Eigenvalue Problems with ARPACK
  • Compressed Sparse Graph Routines (scipy.sparse.csgraph)
  • Spatial data structures and algorithms (scipy.spatial)
  • Statistics (scipy.stats)
  • Multidimensional image processing (scipy.ndimage)
  • File IO (scipy.io)
  • Weave (scipy.weave)

There are clearly too many to go through them all, but it is worth highlighting the statistics (stats) and spatial modules.


In [ ]:
from scipy import stats

In [ ]:
# get samples from a normal distribution
# loc: mean
# scale: std
n = stats.norm.rvs(loc=0, scale=1, size=100)
n

In [ ]:
stats.normaltest(n)

In [ ]:
n1 = stats.norm.rvs(loc=0, scale=1, size=100)
n2 = stats.norm.rvs(loc=0.5, scale=1, size=100)

In [ ]:
# ttest
stats.ttest_ind(n1, n2)

In [ ]:
# Kolmogorov-Smirnov test
stats.ks_2samp(n1, n2)
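
The stats module also covers correlation measures; a quick sketch using the two samples above:

In [ ]:
# Pearson correlation coefficient and its p-value
# (stats.spearmanr works in the same way)
stats.pearsonr(n1, n2)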

In [ ]:
table = [[1, 15],
         [10, 20]]
stats.fisher_exact(table)

In [ ]:
table = [[1, 15],
         [10, 20]]
stats.fisher_exact(table, alternative='less')

In [ ]:
from scipy.spatial import distance

In [ ]:
a = np.random.random((3, 5))
a

In [ ]:
distance.pdist(a, metric='canberra')

In [ ]:
distance.squareform(distance.pdist(a, metric='canberra'))

Biopython: the swiss-army-knife library for bioinformatics

Biopython (http://biopython.org/wiki/Biopython) is a collection of libraries to manipulate files related to computational biology, from sequence data to pdb files. It allows the conversion between formats and even the interrogation of commonly used databases, such as NCBI and KEGG.

Sequence manipulations

Biopython uses a complex series of objects to represent biological sequences: SeqRecord, Seq and so on. In most cases the user is not expected to create a sequence from scratch but to read it from a file, so learning how to manipulate sequences is relatively easy.

When a sequence is read from a file it comes as a SeqRecord object, which can handle annotations on top of a sequence.

As in many Biopython modules, parsing can be done through either the parse or the read function. The former acts as an iterator, which means that it can be used in a for loop to access one sequence at a time. The latter is used when the file contains one and only one record.


In [ ]:
from Bio import SeqIO

# read expects exactly one record: this fails on a multi-record file
s = SeqIO.read('../data/proteome.faa', 'fasta')

In [ ]:
sequences = SeqIO.parse('../data/proteome.faa', 'fasta')
sequences

In [ ]:
for s in sequences:
    print(s.id)
    break
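
Since parse returns an iterator, it is exhausted once consumed; to count the records we can simply create a fresh one. A minimal sketch:

In [ ]:
# count the records by iterating over a new generator
sum(1 for record in SeqIO.parse('../data/proteome.faa', 'fasta'))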

In [ ]:
type(s)

In [ ]:
dir(s)

In [ ]:
s.description

Sequence objects can be sliced as strings; the actual sequence can be found under the attribute seq.


In [ ]:
# first 10 amino acids
s[:10]

In [ ]:
s[:10].seq

In [ ]:
str(s[10:20].seq)

Sequence formats conversion

We are going to take a genome in genbank format (https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) and convert it to the much simpler Fasta format.


In [ ]:
# a quick look at how a GenBank file looks
!head ../data/ecoli.gbk

In [ ]:
# a quick look at how a GenBank file looks
!tail ../data/ecoli.gbk

In [ ]:
s = SeqIO.read('../data/ecoli.gbk', 'genbank')
SeqIO.write(s, 'ecoli.fasta', 'fasta')
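
The same conversion can be done in a single call with SeqIO.convert, which also handles multi-record files:

In [ ]:
# one-line equivalent of the read/write combination above
SeqIO.convert('../data/ecoli.gbk', 'genbank', 'ecoli.fasta', 'fasta')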

In [ ]:
# a quick check of the result
!head ecoli.fasta

In [ ]:
!tail ecoli.fasta

Sequence manipulation example


In [ ]:
# forward
str(s[1000:2000].seq)

In [ ]:
# reverse complement
str(s[1000:2000].reverse_complement().seq)

GenBank format features extraction


In [ ]:
# first five features of this genbank file
for feat in s.features[:5]:
    print(feat)

In [ ]:
type(feat)

Features (more properly, SeqFeature objects) contain all the information related to an annotation that belongs to a sequence. The most notable attributes are location, strand and qualifiers.


In [ ]:
feat.location.start, feat.location.end, feat.strand

In [ ]:
feat.qualifiers

In [ ]:
# we can also translate the original sequence
s[feat.location.start:feat.location.end].seq.translate()

In [ ]:
# we can also translate the original sequence (let's remove the last codon)
s[feat.location.start:feat.location.end-3].seq.translate()
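
Slicing the parent sequence by hand only works for features on the forward strand; the extract method of the feature should take care of the strand for us (a minimal sketch):

In [ ]:
# extract the feature's sequence (reverse-complementing if needed) and translate it
feat.extract(s.seq).translate()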

SeqRecord as a prime example of OOP in python

As you might have noticed, parsing sequence data in python involves storing a variety of information, not only the sequence name and the sequence itself. This is especially true for annotation-rich formats such as the GenBank format (or its cousin, the GFF format).

In order to flexibly expose those useful annotations whenever a sequence file is parsed, Biopython uses the SeqRecord object. As you have seen, it has several attributes with "standard" types (id, description, ...), a series of methods (reverse_complement, ...) and, more interestingly, a series of attributes that are instances of other Biopython objects.


In [ ]:
s.seq, type(s.seq)

In [ ]:
s.seq.alphabet, type(s.seq.alphabet)

The Bio.Alphabet submodule contains a series of alphabets to build biological sequences. This ensures that no forbidden characters are used in making a sequence. Since the alphabet is an abstract concept, it is designed to be inherited and extended by subclasses.


In [ ]:
from Bio import Alphabet

dir(Alphabet)

In [ ]:
help(Alphabet.Alphabet)

In [ ]:
help(Alphabet.SingleLetterAlphabet)

In [ ]:
help(Alphabet.NucleotideAlphabet)

In [ ]:
help(Alphabet.ThreeLetterProtein)
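
An alphabet can be attached to a sequence built by hand; a minimal sketch (note that recent Biopython releases have removed the Alphabet module altogether):

In [ ]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

# build a DNA sequence with an explicit alphabet
Seq('ATGGCCATT', IUPAC.unambiguous_dna)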

The Bio.Seq.Seq object is a lower-level representation of the biological sequence, meant to represent the sequence itself (and its alphabet) only. It also features several utility functions, such as reverse_complement, transcribe, translate, and so on...


In [ ]:
(s.seq, type(s.seq))

In [ ]:
dir(s.seq)

As you can see, the Seq class seems to be an extension of the string class, with which it shares several methods and attributes.
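
Some familiar string methods should therefore work directly on the sequence; a minimal sketch:

In [ ]:
# count and find behave as they do on plain strings
s.seq.count('ATG'), s.seq.find('ATG')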


In [ ]:
s.features[:10]

In [ ]:
feat = s.features[2]
feat

In [ ]:
dir(feat)

The SeqFeature class is also a very interesting example: apart from having a series of "regular" attributes (id, qualifiers, ...), it also contains instances of biopython-defined classes. The most interesting one is probably the location attribute, which contains a complex representation of the position of the feature inside the parent sequence. This is because locations are not always exactly defined (biology is messier than informatics!).


In [ ]:
feat.location.start, type(feat.location.start)

In [ ]:
from Bio import SeqFeature
help(SeqFeature.ExactPosition)

In [ ]:
feat.location
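
Fuzzy positions can also be built by hand, which makes it clearer why locations need their own classes; a minimal sketch:

In [ ]:
from Bio.SeqFeature import FeatureLocation, BeforePosition, AfterPosition

# a feature starting somewhere before position 5
# and ending somewhere after position 100
FeatureLocation(BeforePosition(5), AfterPosition(100), strand=1)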

If we had to draw an extremely simplified version of the SeqRecord hierarchy, it would look like this:

SeqRecord
  |
  +---- Alphabet
  |
  +---- Seq
  |
  +---- features
        |
        +---- SeqFeature
              |
              +---- FeatureLocation
                    |
                    +---- position
[...]

Reading PDB files

BioPython also contains a useful module to parse protein 3D structures, again in a variety of formats.


In [ ]:
# fetch pdb file
!wget http://www.rcsb.org/pdb/files/1g59.pdb

In [ ]:
from Bio.PDB.PDBParser import PDBParser

parser = PDBParser()
structure = parser.get_structure('1g59', '1g59.pdb')
header = parser.get_header()
# fetch the structural method and the resolution
print('Method: {0}'.format(header['structure_method']))
print('Resolution: {0}'.format(header['resolution']))

The returned object (structure) follows the structure, model, chain, residue, atom hierarchy (SMCRA).

Structure['1g59']
  |
  +---- Model[0]
          |
          +---- Chain['A']
          |       |
          |       +---- Residue[' ', 1, ' ']
          |       |       |
          |       |       +---- Atom['N']
          |       |       |
          |       |       +---- [...]
          |       |       |
          |       |       +---- Atom['CE']
          |       |
          |       +---- [...]
          |       |
          |       +---- Residue[' ', 468, ' '] [...]
          |
          +---- Chain['B'] [...]
          |
          +---- Chain['C'] [...]
          |
          +---- Chain['D'] [...]
          |
          +---- Chain[' ']
                  |
                  +---- Residue['W', 1, ' ']
                  |       |
                  |       +---- Atom['O']
                  |
                  +---- [...]
                  |
                  +---- Residue['W', 283, ' '] [...]

Q: why do you think that there can be more than one model inside a single PDB file?
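
Since every level of the hierarchy is iterable, we can also walk the whole structure with nested loops instead of indexing; a minimal sketch:

In [ ]:
# count chains and residues by iterating over the SMCRA levels
for model in structure:
    print('Model {0}: {1} chains'.format(model.id, len(model)))
    for chain in model:
        print('  Chain {0}: {1} residues'.format(chain.id, len(chain)))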


In [ ]:
model = structure[0]
chain = model['A']
residue = chain[(' ', 1, ' ')]
atom = residue['CE']

In [ ]:
chain.id

In [ ]:
residue.id[1], residue.resname

In [ ]:
atom.name, atom.occupancy, atom.bfactor, atom.coord

Read/manipulate phylogenetic trees

The Bio.Phylo module allows reading/writing/manipulating phylogenetic trees, as well as running complex evolutionary analysis software like codeml.

Note: even though Bio.Phylo can be used to draw phylogenetic trees, other libraries such as ete3 are recommended for their greater power and versatility.


In [ ]:
from Bio import Phylo

In [ ]:
# wrong format: this raises an error, since the file is in newick format
tree = Phylo.read('../data/tree.nwk', 'nexus')

In [ ]:
tree = Phylo.read('../data/tree.nwk', 'newick')

In [ ]:
dir(tree)

In [ ]:
# very simple visualization of a tree
Phylo.draw_ascii(tree)

In [ ]:
# get a list of terminal nodes
tree.get_terminals()

Each bifurcation and terminal node in the tree is a Clade object, with several network-like properties. Most of the attributes and methods are shared with the Tree object.
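
For instance, find_clades traverses the whole tree, visiting internal and terminal clades alike; a minimal sketch:

In [ ]:
# walk the tree, printing each clade's name and branch length
for clade in tree.find_clades():
    print(clade.name, clade.branch_length)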


In [ ]:
node = tree.get_terminals()[0]

In [ ]:
dir(node)

In [ ]:
# distance between root and our node
print('Distance between root and "{0}": {1}'.format(node.name,
                                                    tree.distance(tree.root, node)))

In [ ]:
# the root can be changed too
tree.root_at_midpoint()

In [ ]:
Phylo.draw_ascii(tree)

Interrogate the NCBI database using Bio.Entrez

The NCBI has a very useful programmatic interface for data retrieval, for which Biopython provides a very comprehensive module. Find more information about Entrez here: https://www.ncbi.nlm.nih.gov/books/NBK3837/


In [ ]:
from Bio import Entrez
Entrez.email = 'your@email.org'

In this minimal example we are going to link a BioProject ID to an NCBI taxonomy record. The possibilities of the interface are numerous, given that PubMed and its rich metadata can also be queried through Entrez.


In [ ]:
r = Entrez.esearch(db='bioproject',
                   term='PRJNA57779')
h = Entrez.read(r)

In [ ]:
h

In [ ]:
bioproject_id = h['IdList'][0]
bioproject_id

In [ ]:
r = Entrez.elink(dbfrom='bioproject', id=bioproject_id, linkname='bioproject_taxonomy')
h = Entrez.read(r)

In [ ]:
h

In [ ]:
taxonomy_id = h[0]['LinkSetDb'][0]['Link'][0]['Id']
taxonomy_id

In [ ]:
r = Entrez.efetch(db='taxonomy', id=taxonomy_id)

In [ ]:
h = Entrez.read(r)

In [ ]:
h

If you have to deal with complex analyses involving taxonomy, a better way to go is to use the ete3 library. Have a look at this page for more details.

The Entrez module is also useful for doing literature searches through PubMed.


In [ ]:
handle = Entrez.esearch(db='pubmed', 
                        sort='relevance', 
                        retmax='20',
                        retmode='xml', 
                        term='Escherichia coli')
results = Entrez.read(handle)
results

In [ ]:
handle = Entrez.efetch(db='pubmed',
                       retmode='xml',
                       id=results['IdList'][0])
results = Entrez.read(handle)
results

NetworkX: a Cytoscape-like library

This library collects many well-known algorithms to inspect graphs and network properties.

Graphs are encoded in a dictionary-like way, allowing easy and intuitive parsing. Simple plotting functions are available as well.


In [ ]:
import networkx as nx

In [ ]:
# undirected graph
g = nx.Graph()

# add nodes
g.add_node('eggs', price=2.5)
g.add_node('spam', price=3.1, rating=1)

# add edges
g.add_edge('eggs', 'spam', rating=3)
g.add_edge('steak', 'spam')

# add edges (and implicitly new nodes)
g.add_edge('eggs', 'omelette')
g.add_edge('vanilla', 'fudge')

In [ ]:
g.nodes()

In [ ]:
g.edges()

In [ ]:
# access nodes and edges with a dictionary-like syntax
# (note: in networkx >= 2.0 this is g.nodes['eggs'])
g.node['eggs']

In [ ]:
g['eggs']

In [ ]:
g['eggs']['spam']

In [ ]:
g['spam']['eggs']

All the obvious properties can be easily computed.


In [ ]:
nx.degree(g)

In [ ]:
nx.betweenness_centrality(g)

In [ ]:
nx.edge_betweenness_centrality(g)

In [ ]:
for component in nx.connected_components(g):
    print(component)
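
Shortest paths are equally straightforward; a minimal sketch between two of the nodes defined above:

In [ ]:
# list of nodes along a shortest path
nx.shortest_path(g, 'eggs', 'steak')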

Graph visualization example


In [ ]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('white')
import random

In [ ]:
nx.draw_networkx_nodes?

In [ ]:
# Generate a series of random graphs
gs = [nx.powerlaw_cluster_graph(n=random.randint(10, 20),
                                m=random.randint(1, 3),
                                p=random.random()*0.05)
      for x in range(7)]

# Concatenate them in a single graph
# (there might be a more efficient way)
g = gs[0]
for g1 in gs[1:]:
    i = max(g.nodes()) + 1
    g.add_edges_from([(x+i, y+i) for (x, y) in g1.edges()])

# Calculate nodes and edge properties
# to have something to plot
betw_cent = nx.betweenness_centrality(g).values()
edge_betw_cent = nx.edge_betweenness_centrality(g).values()

# Graph layout
graph_pos = nx.fruchterman_reingold_layout(g)

plt.figure(figsize=(9, 9))

# Draw nodes
nx.draw_networkx_nodes(g, graph_pos,
                       # Node size depends on node degree
                       node_size=[x*15 for x in dict(nx.degree(g)).values()],
                       # Node color depends on node centrality
                       node_color=list(betw_cent),
                       cmap=plt.get_cmap('Blues'),
                       vmax=max(betw_cent),
                       vmin=0)
# Draw edges
nx.draw_networkx_edges(g, graph_pos,
                       # Width depends on edge centrality
                       width=[x*250 for x in edge_betw_cent],
                       edge_color='k')
sns.despine(bottom=True, left=True)
plt.xticks([])
plt.yticks([]);

Other useful libraries

  • GOAtools: GO terms enrichment analysis in python
  • statsmodels: advanced statistics
  • rpy2: useful interface to R, when your favorite library doesn't have a python alternative
  • pysam: read and manipulate SAM files
  • pyvcf: read and manipulate VCF files

...and many more