The Scientific Python Stack

Quentin CAUDRON
Ecology and Evolutionary Biology
qcaudron@princeton.edu
@QuentinCAUDRON

Quick note : this section will go fairly slowly through Numpy, and will speed up for Scipy and Matplotlib. This is to give you an idea of how things are done - just a feel for the language and syntax. Beyond that, your applications may be very different from mine, and by far the best thing to do then is to Google whatever you'd like to do. For Voronoi tessellations, for example, searching for "voronoi tessellation python" will point you to the function scipy.spatial.Voronoi, along with examples of how to use it.

Numpy

The scientific Python stack consists of Numpy, Scipy, and Matplotlib. Numpy brings to Python what Matlab is well known for : strong array manipulation and matrix operations, elementary mathematical functions, random numbers, ...


In [1]:
import numpy as np

A = np.ones(10) # an array of ones ( floating point type )
B = np.arange(5, 10, 0.1) # like range(), but allows non-integer stride
C = np.linspace(0, 2 * np.pi, 20) # c evenly spaced points between a and b, endpoints included
D = np.random.normal(0, 1, 20) # also beta, binomial, gamma, poisson, uniform, lognormal, negative binomial, geometric, ...
E = np.sin(C)

print A, "\n"
print B, "\n"
print C, "\n"
print D, "\n"
print E


[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 

[ 5.   5.1  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.   6.1  6.2  6.3  6.4
  6.5  6.6  6.7  6.8  6.9  7.   7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9
  8.   8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.   9.1  9.2  9.3  9.4
  9.5  9.6  9.7  9.8  9.9] 

[ 0.          0.33069396  0.66138793  0.99208189  1.32277585  1.65346982
  1.98416378  2.31485774  2.64555171  2.97624567  3.30693964  3.6376336
  3.96832756  4.29902153  4.62971549  4.96040945  5.29110342  5.62179738
  5.95249134  6.28318531] 

[-0.12372478  0.96574845  0.86066009 -0.22660784  1.27674623 -3.35089124
  1.11719492 -0.80448459  0.55150595 -1.74308613 -0.38478309  0.2794948
 -1.13913828  0.33799872 -1.08312201  1.19984285  0.11623732 -0.52187102
 -0.32714641 -0.34950851] 

[  0.00000000e+00   3.24699469e-01   6.14212713e-01   8.37166478e-01
   9.69400266e-01   9.96584493e-01   9.15773327e-01   7.35723911e-01
   4.75947393e-01   1.64594590e-01  -1.64594590e-01  -4.75947393e-01
  -7.35723911e-01  -9.15773327e-01  -9.96584493e-01  -9.69400266e-01
  -8.37166478e-01  -6.14212713e-01  -3.24699469e-01  -2.44929360e-16]

Before we go on, a note on namespaces. A namespace is a way to "partition" your functions so that names from different libraries don't collide. This is particularly useful because two libraries may provide functions with the same name that don't do the same thing :


In [2]:
# Another function for sin can be found in the Python math library
import math

print math.sin(0.5)
print np.sin(0.5)

math.sin(0.5) == np.sin(0.5)


0.479425538604
0.479425538604
Out[2]:
True

In [4]:
# BUT
x = np.linspace(0, 2 * np.pi, 30)
print math.sin(x)
print np.sin(x)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-24026b413017> in <module>()
      1 # BUT
      2 x = np.linspace(0, 2 * np.pi, 30)
----> 3 print math.sin(x)
      4 print np.sin(x)

TypeError: only length-1 arrays can be converted to Python scalars

Note how this function doesn't like being passed lists or arrays. It's thrown us an error and provided us with a traceback - a "chronological" trace of where the error occurred and what caused it. This one isn't particularly long, as the error is immediate ( and not buried in a subfunction ), but it does tell us which line it's on. It seems that math.sin() only accepts length-1 arrays - that is, scalars such as single floats or ints. Not particularly useful when we're trying to be array-oriented, which is what numpy excels at.
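
If you ever do need to apply a scalar-only function like math.sin() to a whole array, a list comprehension or np.vectorize will do it - though np.sin() is both simpler and much faster. A minimal sketch :

import math
import numpy as np

x = np.linspace(0, 2 * np.pi, 30)

# call the scalar function once per element
listcomp_sines = np.array([math.sin(value) for value in x])

# or wrap it so it accepts arrays ( a convenience, not a speed-up )
vectorised_sin = np.vectorize(math.sin)

print np.allclose(listcomp_sines, np.sin(x))
print np.allclose(vectorised_sin(x), np.sin(x))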

OK, back to numpy.


In [5]:
# Numpy arrays can be easily manipulated
blah = np.random.lognormal(0, 1, (4, 3))

print np.dot(blah, blah.T), "\n"
print blah.shape, "\n"
print type(blah)


[[   2.08626246    3.95342369    4.40580451    1.43341017]
 [   3.95342369  139.55782374  112.94954215   27.71783008]
 [   4.40580451  112.94954215  713.39648572   59.64179302]
 [   1.43341017   27.71783008   59.64179302    7.89929757]] 

(4, 3) 

<type 'numpy.ndarray'>

In [6]:
# The numpy array is its own type, but it's still a container. It expands on the capabilities of the standard Python list.
# Access is easier though : for multidimensional arrays, you just use a comma to separate dimensions
print type(blah[0, 0]), "\n"
print blah[:, 2::2]


<type 'numpy.float64'> 

[[  0.12590238]
 [  3.81470513]
 [ 26.68571033]
 [  2.16120495]]

In [39]:
# The : operator is still used. Standard arithmetic is ELEMENT-wise. Dimensions are broadcast.
print blah[:, 1] * blah[:, 2] - 2.5, "\n"
print blah.sum(), blah.mean(), blah.min(), blah.max(), blah.var(), blah.cumsum().std()


[-2.02886406 -2.36869705 -1.59532266 -2.25353598] 

19.015713298 1.58464277483 0.136657280535 9.18220004759 5.66212318678 7.12541093353
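
Broadcasting is worth a second look : when two arrays have compatible shapes, numpy stretches the smaller one across the larger one rather than requiring an explicit loop. A minimal sketch, using fresh arrays rather than blah :

grid = np.ones((4, 3))              # shape (4, 3)
row = np.arange(3)                  # shape (3,)
col = np.arange(4).reshape(4, 1)    # shape (4, 1)

print grid + row    # the row is added to each of the 4 rows of grid
print col * row     # (4, 1) against (3,) broadcasts to a (4, 3) result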

In [7]:
# Quick note on interactive environments. If you're in IPython, Spyder, or similar, you can tab-complete :
# typing blah. and hitting Tab will list the methods available on the array
blah.var()


Out[7]:
54.719063754811003

In [8]:
# We can manipulate the shapes of arrays
print blah.shape
print blah.ravel().shape # Unravel / flatten the array
print blah.reshape(2, 6).shape
print blah.reshape(2, 6).T


(4, 3)
(12,)
(2, 6)
[[  1.41412094   0.55541179]
 [  0.26584396   0.98023854]
 [  0.12590238  26.68571033]
 [  0.35523731   0.49659578]
 [ 11.1749566    1.72681307]
 [  3.81470513   2.16120495]]

Linear Algebra

Numpy has a module called linalg, which offers things like norms, decompositions, matrix powers, inner and outer products, trace, eigenvalues, etc...


In [9]:
# Linear algebra
A = np.random.normal(size=(3, 3))
print np.linalg.inv(A), "\n" # computational linear algebraists, look away

# Modules can be imported under a shorter name for convenience
import numpy.linalg as lin
print lin.inv(A), "\n"
print lin.eigvals(A), "\n"

# If we want to do some real linear algebra, we can skip elementwise syntax by declaring an array as a numpy matrix :
B = np.matrix(A)
print B * B, "\n" # now MATRIX multiplication
print A * A # still ELEMENTWISE multiplication


[[ 1.2694572  -0.09671629 -0.52482347]
 [ 0.65484796 -0.43529086 -0.20379716]
 [-1.074248   -0.05555358 -0.16075868]] 

[[ 1.2694572  -0.09671629 -0.52482347]
 [ 0.65484796 -0.43529086 -0.20379716]
 [-1.074248   -0.05555358 -0.16075868]] 

[ 0.63902913 -2.50099932 -2.03406634] 

[[ 1.19478192 -0.486762    0.93766928]
 [-1.97734601  6.12118527  0.41364217]
 [ 2.8910407  -2.38971614  3.48481451]] 

[[  3.63588870e-02   1.95693828e-03   4.60480807e-01]
 [  1.11077957e+00   6.23118295e+00   7.62970114e-02]
 [  2.68436546e+00   3.21508231e-01   2.52963565e+00]]

In [11]:
MyArray = np.random.randn(3, 3)
MyMatrix = np.matrix(np.random.randn(3, 3))
print type(MyArray), type(MyMatrix)


<type 'numpy.ndarray'> <class 'numpy.matrixlib.defmatrix.matrix'>
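
The linalg module goes well beyond inverses and eigenvalues - the norms, decompositions, and solvers mentioned above are all in there. A minimal sketch of a few of them, on a freshly generated matrix ( np.linalg.solve is generally preferred over forming an explicit inverse ) :

A = np.random.normal(size=(3, 3))
b = np.random.normal(size=3)

print np.linalg.norm(A)        # Frobenius norm
print np.trace(A)              # trace
print np.linalg.solve(A, b)    # solve A x = b directly

U, S, Vt = np.linalg.svd(A)    # singular value decomposition
print S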

Matplotlib

Before we continue on to Scipy, let's look at some plotting. Matplotlib aims to feel familiar to Matlab users.


In [12]:
import matplotlib.pyplot as plt

%pylab inline

plt.figure(figsize=(12, 8))

x = np.random.exponential(3, size=10000)

plt.subplot(1, 2, 1)
plt.plot(x[::50])
plt.axhline(x.mean(), color="red")
plt.subplot(1, 2, 2)
plt.hist(x, 50)
plt.axvline(x.mean(), c="r") # c="r" is shorter, both are valid

plt.show()
# Note that, in an interactive setting, you will need to call plt.show(), which I've included above for completeness
# In IPython, with %pylab inline as above, you don't need it; I'll drop it from here


Populating the interactive namespace from numpy and matplotlib
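
If you're working from a plain Python script rather than IPython, the same plotting calls apply, but you finish with plt.show() to open a window, or plt.savefig() to write straight to disk. A minimal sketch - the filename is just an example :

import numpy as np
import matplotlib.pyplot as plt

x = np.random.exponential(3, size=10000)

plt.hist(x, 50)
plt.axvline(x.mean(), c="r")

plt.savefig("exponential_hist.png")    # example filename
plt.show()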

In [13]:
# Like Matlab, we can feed style arguments to the plot function
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + np.random.normal(scale = 0.2, size = 100)

# Calling plot() several times without creating a new figure or subplot puts them on the same figure
plt.plot(x, y, "r.")
plt.plot(x, np.sin(x), "k-")


Out[13]:
[<matplotlib.lines.Line2D at 0x107e8cdd0>]

In [14]:
# 50x3 matrix on a scatter plot : 50 points, with the third column plotted as marker size
x = np.random.uniform(low = 0, high = 1, size = (50, 3))

# Transparency ? Sure.
plt.scatter(x[:, 0], x[:, 1], s=x[:, 2]*500, alpha = 0.4)

# Set axis limits
plt.xlim([0, 1])
plt.ylim([0, 1])


Out[14]:
(0, 1)

In [15]:
# Area plots ?
x = np.linspace(0, 6 * np.pi, 100)
y = np.exp(-0.2*x) * np.sin(x)

plt.plot(x, y, "k", linewidth=4)
plt.fill_between(x, y, y2=0, facecolor="red")


Out[15]:
<matplotlib.collections.PolyCollection at 0x107eef290>

In [16]:
# Error bars ? Let's say we got a hundred measurements of the above process, the decaying sine wave
print y.shape
measuredy = np.tile(y, (100, 1)) # tile y, 100x1 copies
print measuredy.shape

# Let's add some heteroscedastic noise to our observations
measuredy = np.random.normal(loc=y, scale=np.abs(y+0.001)/2., size = measuredy.shape)

# Let's assume we already know our error is Gaussian, for simplicity. Compute mean and std :
estmean = measuredy.mean(axis=0)
eststd = measuredy.std(axis=0)

# Plot the estimated mean with one standard deviation error bars, and the real signal
plt.plot(x, y, "r", linewidth=3)
plt.errorbar(x, estmean, yerr = eststd)


(100,)
(100, 100)
Out[16]:
<Container object of 3 artists>

In [17]:
# Two dimensional, or image plots ? Perlin noise is fun !

# Initial 2D noisy signal
X = np.random.normal(size=(256, 256))
plt.figure()
plt.imshow(X, cmap="gray")
plt.title("Original 2D Gaussian noise")

# We'll use a 2D Gaussian for smoothing, with identity as covariance matrix
# We'll grab a scipy function for it for now
import scipy.ndimage as nd
plt.figure()
temp = np.zeros_like(X)
temp[128, 128] = 1.
plt.imshow(nd.filters.gaussian_filter(temp, 20, mode="wrap"))
plt.title("Gaussian kernel")

# Generate the Perlin noise
perlin = np.zeros_like(X)
for i in 2**np.arange(6) :
    perlin += nd.filters.gaussian_filter(X, i, mode="wrap") * i**2
    
# and plot it several ways
plt.figure()
plt.imshow(perlin)
plt.title("Perlin Noise")

plt.figure()
plt.imshow(perlin, cmap="gray")
plt.contour(perlin, linewidths=3)
plt.title("Greyscale, with contours")
#plt.xlim([0, 256])
#plt.ylim([0, 256])
#plt.axes().set_aspect('equal', 'datalim')


Out[17]:
<matplotlib.text.Text at 0x10c59da90>

In [18]:
# And, of course ( stolen from Jake VanderPlas )

def norm(x, x0, sigma):
    return np.exp(-0.5 * (x - x0) ** 2 / sigma ** 2)

def sigmoid(x, x0, alpha):
    return 1. / (1. + np.exp(- (x - x0) / alpha))
    
# define the curves
x = np.linspace(0, 1, 100)
y1 = np.sqrt(norm(x, 0.7, 0.05)) + 0.2 * (1.5 - sigmoid(x, 0.8, 0.05))

y2 = 0.2 * norm(x, 0.5, 0.2) + np.sqrt(norm(x, 0.6, 0.05)) + 0.1 * (1 - sigmoid(x, 0.75, 0.05))

y3 = 0.05 + 1.4 * norm(x, 0.85, 0.08)
y3[x > 0.85] = 0.05 + 1.4 * norm(x[x > 0.85], 0.85, 0.3)


with plt.xkcd() :
    
    plt.plot(x, y1, c='gray')
    plt.plot(x, y2, c='blue')
    plt.plot(x, y3, c='red')

    
    plt.text(0.3, -0.1, "Yard")
    plt.text(0.5, -0.1, "Steps")
    plt.text(0.7, -0.1, "Door")
    plt.text(0.9, -0.1, "Inside")
    
    plt.text(0.05, 1.1, "fear that\nthere's\nsomething\nbehind me")
    plt.plot([0.15, 0.2], [1.0, 0.2], '-k', lw=0.5)
    
    plt.text(0.25, 0.8, "forward\nspeed")
    plt.plot([0.32, 0.35], [0.75, 0.35], '-k', lw=0.5)
    
    plt.text(0.9, 0.4, "embarrassment")
    plt.plot([1.0, 0.8], [0.55, 1.05], '-k', lw=0.5)
    
    plt.title("Walking back to my\nfront door at night:")
    
    plt.xlim([0, 1])
    plt.ylim([0, 1.5])


Scipy

Scipy does all sorts of awesome stuff : integration, interpolation, optimisation, FFTs, signal processing, more linear algebra, stats, image processing, ...


In [19]:
import scipy as sp
import scipy.spatial as spatial
from scipy.integrate import odeint
from scipy.optimize import minimize
from scipy.signal import detrend
import scipy.stats as st

# Convex hulls
ax = plt.axes()
points = np.random.normal(0.5, 0.2, size=(20,2))
plt.scatter(points[:, 0], points[:, 1], s=200, c="r")
hull = spatial.ConvexHull(points)
for i in hull.simplices :
    plt.plot(points[i, 0], points[i, 1], "k", linewidth=4)
spatial.voronoi_plot_2d(spatial.Voronoi(points), ax)
plt.title("Convex Hull and Voronoi Tessellation")


# Ordinary Differential Equations
r0 = 16.
gamma = .5
t = np.linspace(0, 8, 200)
def SIR(y, t) :
    S = - r0 * gamma * y[0] * y[1]
    I = r0 * gamma * y[0] * y[1] - gamma * y[1]
    R = gamma * y[1]
    return [S, I, R]
solution = odeint(SIR, [0.99, .01, 0.], t)

plt.figure()
plt.plot(t, solution, linewidth=3)
plt.title("Measles, Single Epidemic")
plt.xlabel("Time (weeks)")
plt.ylabel("Proportion of Population")
plt.legend(["Susceptible", "Infected", "Recovered"])


# Signal Processing - create trending signal, detrend, FFT
x = np.linspace(0, 1, 300)
trend = np.zeros_like(x)
trend[100:200] = np.linspace(0, 10, 100)
trend[200:] = np.linspace(10, -20, 100)
y = np.random.normal(loc = 3*np.sin(2*np.pi*20*x)) + trend # Signal is a sine wave with noise and trend
yt = detrend(y, bp = [100, 200]) # detrend, giving break points
Yfft = sp.fftpack.rfft(yt) # Calculate FFT
freqs = sp.fftpack.rfftfreq(len(yt), x[1] - x[0]) # Frequencies of the FFT

plt.figure()
plt.subplot(311)
plt.title("Detrending")
plt.plot(x, y, linewidth=2)
plt.plot(x, trend, linewidth=4)
plt.subplot(312)
plt.plot(x, yt, linewidth=2)
plt.subplot(313)
plt.title("FFT")
plt.plot(freqs, (np.abs(Yfft)**2), linewidth=2)
plt.tight_layout()


# Distributions
plt.figure()
plt.subplot(131)
plt.title("Normal PDF")
plt.plot(np.arange(-5, 5, 0.1), st.norm.pdf(np.arange(-5, 5, 0.1), 0, 1), linewidth=2)
plt.subplot(132)
plt.title("Beta CDF")
plt.plot(st.beta.cdf(np.linspace(0, 1, 100), 0.5, 0.5), linewidth=2)
plt.subplot(133)
plt.title("Erlang PDF")
plt.plot(np.linspace(0, 10, 100), st.erlang.pdf(np.linspace(0, 10, 100), 2, loc=0), linewidth=2)


# Statistical Tests
a = np.random.normal(0, 1.5, size=1000)
b = np.random.normal(.2, 1, size=1000)
plt.figure()
plt.hist(a, 30)
plt.hist(b, 30)
plt.title("p = " + str(st.ttest_ind(a, b)[1]))


Out[19]:
<matplotlib.text.Text at 0x110a76450>
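
One of the imports above, minimize, never got used, and interpolation was mentioned but not shown. A minimal sketch of both, on made-up examples :

from scipy.optimize import minimize
from scipy.interpolate import interp1d

# Minimise a simple quadratic ; the optimum is at (3, -1)
result = minimize(lambda p : (p[0] - 3.) ** 2 + (p[1] + 1.) ** 2, x0=[0., 0.])
print result.x

# Interpolate a coarsely sampled sine wave onto a finer grid
xcoarse = np.linspace(0, 2 * np.pi, 10)
f = interp1d(xcoarse, np.sin(xcoarse), kind="cubic")
xfine = np.linspace(0, 2 * np.pi, 100)

plt.figure()
plt.plot(xcoarse, np.sin(xcoarse), "o")
plt.plot(xfine, f(xfine))
plt.title("Cubic Interpolation")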

There's a lot more you can do with numpy, scipy, and matplotlib. The documentation's pretty good, so look around for what you're after - it's probably already been implemented.