There are basically two ways to install third party libraries:
setup.py
filepip
anaconda
(binaries) or linuxbrew/homebrew
apt-get
for debian-based linux distributions)
In [ ]:
# setup.py example
# %%bash
# wget https://github.com/biopython/biopython/archive/biopython-168.tar.gz
# tar -xvf biopython-168.tar.gz
# cd biopython-168.tar.gz
# sudo python setup.py install
In [ ]:
# using pip
# !pip install biopython
In [ ]:
# using anaconda
# !conda install biopython
In [ ]:
# using ap-get
# !sudo apt-get install python-biopython python3-biopython
You already encountered numpy
as the "engine" that powers pandas
. It is also used by many other third-party library to allow fast computation on arrays (i.e. scipy
or scikit-learn
).
The most important feature of NumPy is the ndarray
(n-dimensional-array), which is a multidimensional matrix with fixed-type. This allows faster computation and more intuitive array manipulations.
In [ ]:
import numpy as np
In [ ]:
# simple array creation
a = np.array([2,3,4])
a
In [ ]:
# what type is my array?
a.dtype
In [ ]:
b = np.array([1.2, 3.5, 5.1])
b
In [ ]:
b.dtype
We can check the dimensions of the array by using the shape
method
In [ ]:
b.shape
Numpy offers several array constructors that can be quite handy.
In [ ]:
# like range, but returns an array
np.arange(10, step=0.2)
In [ ]:
# evenly-spaced numbers
np.linspace(0, 10, num=20)
In [ ]:
# evenly spaced numbers on a logaritmic space (base 2)
np.logspace(0, 10, num=20, base=2)
In [ ]:
np.zeros(10)
In [ ]:
# multidimensional
z = np.zeros((10, 5))
z
In [ ]:
z.shape
In [ ]:
np.ones((10, 5))
Some of the operations you applied to dataframes apply to numpy arrays too
In [ ]:
a = np.array((np.linspace(0, 10, num=10),
np.logspace(0, 10, num=10),))
a
In [ ]:
a.shape
In [ ]:
# equivalent to a.transpose(), but more concise
a.T
In [ ]:
a.T.shape
In [ ]:
a.mean()
In [ ]:
np.std(a)
In [ ]:
np.median(a)
We can also cahnge the dimension of the array
In [ ]:
# rows/columns
a.reshape(4, 5)
In [ ]:
# rows/columns
a.reshape(6, 5)
Mathematical operations on arrays are quite simple
In [ ]:
b = np.array((np.linspace(0, 10, num=5),
np.logspace(0, 10, num=5),))
In [ ]:
a + b
In [ ]:
b = np.array((np.linspace(0, 10, num=10),
np.logspace(0, 10, num=10),))
a + b
In [ ]:
a - b
In [ ]:
a**2
In [ ]:
a > 5
In [ ]:
a[a > 5]
In [ ]:
# element-wise product
a * b
In [ ]:
a.shape
In [ ]:
# matrix product
a = np.random.random((3, 3))
b = np.random.random((3, 3))
np.dot(a, b)
Indexing and splicing
In [ ]:
a = np.random.random((3, 5))
In [ ]:
# get row 1 (2nd row)
a[1]
In [ ]:
# get column 1 (2nd column)
a[:,1]
In [ ]:
# three-dimensional array
a = np.random.random((3, 5, 2))
a
In [ ]:
a[1, 1:4, 1]
SciPy is a very large library of scientific calculations and statistics to be performed on numpy array.
The modules contained in this library are the following:
They are clearly too many to go through them all, but it is worth highlighting the statistical (stats
) and spatial
modules.
In [ ]:
from scipy import stats
In [ ]:
# get samples from a normal distribution
# loc: mean
# scale: std
n = stats.norm.rvs(loc=0, scale=1, size=100)
n
In [ ]:
stats.normaltest(n)
In [ ]:
n1 = stats.norm.rvs(loc=0, scale=1, size=100)
n2 = stats.norm.rvs(loc=0.5, scale=1, size=100)
In [ ]:
# ttest
stats.ttest_ind(n1, n2)
In [ ]:
# Kolmogorov-Smirnoff test
stats.ks_2samp(n1, n2)
In [ ]:
table = [[1, 15],
[10, 20]]
stats.fisher_exact(table)
In [ ]:
table = [[1, 15],
[10, 20]]
stats.fisher_exact(table, alternative='less')
In [ ]:
from scipy.spatial import distance
In [ ]:
a = np.random.random((3, 5))
a
In [ ]:
distance.pdist(a, metric='canberra')
In [ ]:
distance.squareform(distance.pdist(a, metric='canberra'))
Biopython (http://biopython.org/wiki/Biopython) is a collection of libraries to manipulate files related to computational biology, from sequence data to pdb files. It allows the conversion between formats and even the interrogation of commonly used databases, such as NCBI and KEGG.
Sequence manipulations
Biopython uses a complex series of objects to respresent biological sequences: SeqRecord
, Seq
and so on. In most cases the user is not expected to create a sequence but to read it, so learning how to manipulate sequences is relatively easy.
When a sequence is read from a file it comes as a SeqRecord
object, which can handle annotations on top of a sequence.
As in many biopython modules, parsing can be done either through the parse
or the read
method. The first one acts as an iterator
, which means that it can be used in a for
loop to access one sequence at a time. The latter is used when the file contains one and only one record.
In [ ]:
from Bio import SeqIO
s = SeqIO.read('../data/proteome.faa', 'fasta')
In [ ]:
sequences = SeqIO.parse('../data/proteome.faa', 'fasta')
sequences
In [ ]:
for s in sequences:
print(s.id)
break
In [ ]:
type(s)
In [ ]:
dir(s)
In [ ]:
s.description
Sequence objects can be sliced as strings; the actual sequence can be found under the attribute seq
.
In [ ]:
# first 10 aminoacids
s[:10]
In [ ]:
s[:10].seq
In [ ]:
str(s[10:20].seq)
Sequence formats conversion
We are going to take a genome in genbank format (https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) and convert it to the much simpler Fasta format.
In [ ]:
# a quick look at how a GenBank file looks
!head ../data/ecoli.gbk
In [ ]:
# a quick look at how a GenBank file looks
!tail ../data/ecoli.gbk
In [ ]:
s = SeqIO.read('../data/ecoli.gbk', 'genbank')
SeqIO.write(s, 'ecoli.fasta', 'fasta')
In [ ]:
# a quick check of the result
!head ecoli.fasta
In [ ]:
!tail ecoli.fasta
Sequence manipulation example
In [ ]:
# forward
str(s[1000:2000].seq)
In [ ]:
# reverse complement
str(s[1000:2000].reverse_complement().seq)
GenBank format features extraction
In [ ]:
# first four features of this genbank file
for feat in s.features[:5]:
print(feat)
In [ ]:
type(feat)
Features (more properly SeqFeature
objects), contain all the information related to an annotation that belongs to a sequence. The most notable attributes are position
, strand
and qualifiers
.
In [ ]:
feat.location.start, feat.location.end, feat.strand
In [ ]:
feat.qualifiers
In [ ]:
# we can also translate the original sequence
s[feat.location.start:feat.location.end].seq.translate()
In [ ]:
# we can also translate the original sequence (let's remove the last codon)
s[feat.location.start:feat.location.end-3].seq.translate()
SeqRecord as a prime example of OOP in python
As you might have noticed, parsing sequence data in python involves storing a variety of information, not only the sequence name and the sequence itself. This is especially true for annotation-rich formats such as the Genbank
format (or its cousing, the GFF
format).
In order to flexibly expose those useful annotations whenever a sequence file is parsed, Biopython uses the SeqRecord
object. As you have seen it has several attributes
with "standard" types (id
, description
, ...), a series of methods (reverse_complement
, ...) and more interestingly, a series of attributes that are instances
of other BioPython objects.
In [ ]:
s.seq, type(s.seq)
In [ ]:
s.seq.alphabet, type(s.seq.alphabet)
The Bio.Alphabet
submodule contains a series of alphabets to build biological sequences. This ensures that no forbidden chars are used in making a sequence. Since the alphabet is an abstract concept, it is prone to be inherited and extended by subclasses.
In [ ]:
from Bio import Alphabet
dir(Alphabet)
In [ ]:
help(Alphabet.Alphabet)
In [ ]:
help(Alphabet.SingleLetterAlphabet)
In [ ]:
help(Alphabet.NucleotideAlphabet)
In [ ]:
help(Alphabet.ThreeLetterProtein)
The Bio.Seq.Seq
object is a lower level representation of the biological sequence, meant to represent the sequence itself and its name only. It also features several utility functions, such as reverse_complement
, transcribe
, translate
, and so on...
In [ ]:
(s.seq, type(s.seq))
In [ ]:
dir(s.seq)
As you can see, the Seq
class seem to be an extension of the string class, with which it shares several methods
and attributes
.
In [ ]:
s.features[:10]
In [ ]:
feat = s.features[2]
feat
In [ ]:
dir(feat)
The SeqFeature
class is also a very interesting example: apart from having a series of "regular" attributes (id
, qualifiers
, ...), it also contains instances of biopython-defined classes. The most interesting one is probably the location
attribute, which contains a complex representation of the position of the feature inside the parent sequence. This is because locations are not always exactly defined (biology is messier than informatics!).
In [ ]:
feat.location.start, type(feat.location.start)
In [ ]:
from Bio import SeqFeature
help(SeqFeature.ExactPosition)
In [ ]:
feat.location
If we had to draw an extremely simplified version of the SeqRecord
hierarchy, it will look like this:
SeqRecord
|
+---- Alphabet
|
+---- Seq
|
+---- features
|
+---- SeqFeature
|
+---- FeatureLocation
|
+---- position
[...]
Reading PDB files
BioPython also contains a useful module to parse protein 3D structures, again in a variety of formats
In [ ]:
# fetch pdb file
!wget http://www.rcsb.org/pdb/files/1g59.pdb
In [ ]:
from Bio.PDB.PDBParser import PDBParser
parser = PDBParser()
structure = parser.get_structure('1g59', '1g59.pdb')
header = parser.get_header()
# fetch the structural method and the resolution
print('Method: {0}'.format(header['structure_method']))
print('Resolution: {0}'.format(header['resolution']))
The returned object (structure
) has a complex structure, which follows the structure, model, chain, residue, atom hierarchy (SMCRA).
Structure['1g59']
|
+---- Model[0]
|
+---- Chain['A']
| |
| +---- Residue[' ', 1, ' ']
| | |
| | +---- Atom['N']
| | |
| | +---- [...]
| | |
| | +---- Atom['CE']
| |
| +---- [...]
| |
| +---- Residue[' ', 468, ' '] [...]
|
+---- Chain['B'] [...]
|
+---- Chain['C'] [...]
|
+---- Chain['D'] [...]
|
+---- Chain[' ']
|
+---- Residue['W', 1, ' ']
| |
| +---- Atom['O']
|
+---- [...]
|
+---- Residue['W', 283, ' '] [...]
Q: why do you think that there can be more than one model inside a single PDB file?
In [ ]:
model = structure[0]
chain = model['A']
residue = chain[(' ', 1, ' ')]
atom = residue['CE']
In [ ]:
chain.id
In [ ]:
residue.id[1], residue.resname
In [ ]:
atom.name, atom.occupancy, atom.bfactor, atom.coord
Read/manipulate phylogenetic trees
The Bio.Phylo
module allow to read/write/manipulate phylogenetic treesd, as well as run complex evolutionary analysis software like codeml
.
Note: even though Bio.Phylo
can be used to draw phylogenetic trees, other libraries such as ete3
are suggested for their great power and versatility.
In [ ]:
from Bio import Phylo
In [ ]:
tree = Phylo.read('../data/tree.nwk', 'nexus')
In [ ]:
tree = Phylo.read('../data/tree.nwk', 'newick')
In [ ]:
dir(tree)
In [ ]:
# very simple visualization of a tree
Phylo.draw_ascii(tree)
In [ ]:
# get a list of terminal nodes
tree.get_terminals()
Each bifurcation and terminal node in the tree is a Clade
object, with several network-like properties. Most of the attributes and methods are shared with the Tree
object.
In [ ]:
node = tree.get_terminals()[0]
In [ ]:
dir(node)
In [ ]:
# distance between root and our node
print('Distance between root and "{0}": {1}'.format(node.name,
tree.distance(tree.root, node)))
In [ ]:
# the root can be changed too
tree.root_at_midpoint()
In [ ]:
Phylo.draw_ascii(tree)
Interrogate the NCBI database using Bio.Entrez
The NCBI has a very useful programmatic interface for data retrieval, for which BioPython has a very complex module. Find more information about Entrez here: https://www.ncbi.nlm.nih.gov/books/NBK3837/
In [ ]:
from Bio import Entrez
Entrez.email = 'your@email.org'
In this minimal example we are going to link a Bioproject ID to a NCBI taxonomy record. The possibility of the interface are numerous and complex, given that also pubmed and its reach metadata can be reached through Entrez.
In [ ]:
r = Entrez.esearch(db='bioproject',
term='PRJNA57779')
h = Entrez.read(r)
In [ ]:
h
In [ ]:
bioproject_id = h['IdList'][0]
bioproject_id
In [ ]:
r = Entrez.elink(dbfrom='bioproject', id=bioproject_id, linkname='bioproject_taxonomy')
h = Entrez.read(r)
In [ ]:
h
In [ ]:
taxonomy_id = h[0]['LinkSetDb'][0]['Link'][0]['Id']
taxonomy_id
In [ ]:
r = Entrez.efetch(db='taxonomy', id=taxonomy_id)
In [ ]:
h = Entrez.read(r)
In [ ]:
h
If you have to deal with complex analysis involving taxonomy, a better way to go is to use the ete3
library. Have a look at this page for more details.
The Entrez module is also useful for doing literature searches through PubMed.
In [ ]:
handle = Entrez.esearch(db='pubmed',
sort='relevance',
retmax='20',
retmode='xml',
term='Escherichia coli')
results = Entrez.read(handle)
results
In [ ]:
handle = Entrez.efetch(db='pubmed',
retmode='xml',
id=results['IdList'][0])
results = Entrez.read(handle)
results
In [ ]:
import networkx as nx
In [ ]:
# undirected graph
g = nx.Graph()
# add nodes
g.add_node('eggs', price=2.5)
g.add_node('spam', price=3.1, rating=1)
# add edges
g.add_edge('eggs', 'spam', rating=3)
g.add_edge('steak', 'spam')
# add edges (and implicitly new nodes)
g.add_edge('eggs', 'omelette')
g.add_edge('vanilla', 'fudge')
In [ ]:
g.nodes()
In [ ]:
g.edges()
In [ ]:
# access nodes and edges with a dictionary-like syntax
g.node['eggs']
In [ ]:
g['eggs']
In [ ]:
g['eggs']['spam']
In [ ]:
g['spam']['eggs']
All the obvious properties can be easily computed.
In [ ]:
nx.degree(g)
In [ ]:
nx.betweenness_centrality(g)
In [ ]:
nx.edge_betweenness_centrality(g)
In [ ]:
for component in nx.connected_components(g):
print(component)
Graph visualization example
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
import random
In [ ]:
nx.draw_networkx_nodes?
In [ ]:
# Generate a series of random graphs
gs = [nx.random_graphs.powerlaw_cluster_graph(n=random.randint(10, 20),
m=random.randint(1, 3),
p=random.random()*0.05)
for x in range(7)]
# Concatenate then in a single graph
# (there might be a more efficient way)
g = gs[0]
for g1 in gs[1:]:
i = max(g.nodes()) + 1
g.add_edges_from([(x+i, y+i) for (x, y) in g1.edges()])
# Calculate nodes and edge properties
# to have something to plot
betw_cent = nx.betweenness.betweenness_centrality(g).values()
edge_betw_cent = nx.edge_betweenness_centrality(g).values()
# Graph layout
graph_pos = nx.layout.fruchterman_reingold_layout(g)
plt.figure(figsize=(9, 9))
# Draw nodes
nx.draw_networkx_nodes(g, graph_pos,
# Node size depends on node degree
node_size=[x*15 for x in nx.degree(g).values()],
# Node color depends on node centrality
node_color=list(betw_cent),
cmap=plt.get_cmap('Blues'),
vmax=max(betw_cent),
vmin=0)
# Draw edges
nx.draw_networkx_edges(g, graph_pos,
# Width depends on edge centrality
width=[x*250 for x in edge_betw_cent],
color='k')
sns.despine(bottom=True, left=True)
plt.xticks([])
plt.yticks([]);