Scientific Programming I

Reference Document

What is SciPy?

The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc.

scipy can be compared to other standard scientific-computing libraries, such as the GSL (GNU Scientific Library for C and C++), or Matlab’s toolboxes. scipy is the core package for scientific routines in Python; it is meant to operate efficiently on numpy arrays, so that numpy and scipy work hand in hand.

Before implementing a routine, it is worth checking whether the desired data processing is already implemented in SciPy. As non-professional programmers, scientists often tend to re-invent the wheel, which leads to buggy, non-optimal, difficult-to-share and unmaintainable code. By contrast, SciPy's routines are optimized and tested, and should therefore be used when possible.

Available packages

The following packages are available in SciPy:

Subpackage Description
cluster Clustering algorithms
constants Physical and mathematical constants
fftpack Fast Fourier Transform routines
integrate Integration and ordinary differential equation solvers
interpolate Interpolation and smoothing splines
io Input and Output
linalg Linear algebra
ndimage N-dimensional image processing
odr Orthogonal distance regression
optimize Optimization and root-finding routines
signal Signal processing
sparse Sparse matrices and associated routines
spatial Spatial data structures and algorithms
special Special functions
stats Statistical distributions and functions
weave C/C++ integration
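
Each submodule has to be imported explicitly (importing scipy alone does not make them available). As a quick illustration, the constants submodule provides common physical and mathematical constants:


In [ ]:
from scipy import constants
print constants.c       # speed of light in vacuum, in m/s
print constants.golden  # the golden ratio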

Getting Started


In [ ]:
%pylab inline
# Increase default figure resolution
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

Input / Output (scipy.io)

scipy has built-in functions to read and write a wide variety of data formats, including Matlab, IDL, and NetCDF files. Plain text and binary capabilities are available from numpy. HDF5 files are handled by dedicated Python modules such as h5py.


In [ ]:
from scipy import io
a = np.ones((3, 3))
io.savemat('file.mat', {'a': a}); # savemat expects a dictionary
data = io.loadmat('file.mat', struct_as_record=True)
data['a']
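
As mentioned above, plain text and binary formats are handled by numpy itself; for example:


In [ ]:
np.savetxt('file.txt', a)   # plain text
np.loadtxt('file.txt')

In [ ]:
np.save('file.npy', a)      # numpy's own binary format
np.load('file.npy')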

Integration (scipy.integrate)


In [ ]:
from scipy import integrate

In [ ]:
%%latex
We want to compute the integral:

$$ I = \int_{x=\pi}^{2\pi}\int_{y=0}^{\pi}{(y\sin x + x\cos y)dydx} $$

In [ ]:
def integrand(y, x):
    'y must be the first argument, and x the second.'
    return y * np.sin(x) + x * np.cos(y)

In [ ]:
ans, err = integrate.dblquad(integrand,
                             np.pi, 2*np.pi,    # x limits
                             lambda x: 0,       # y limits
                             lambda x: np.pi)
print ans
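
One-dimensional integrals can be computed in the same way with scipy.integrate.quad(); as a quick check on an integral with a known value:


In [ ]:
# integral of sin(x) from 0 to pi; the exact value is 2
ans, err = integrate.quad(np.sin, 0, np.pi)
print ans, err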

Linear Algebra Operations (scipy.linalg)

scipy.linalg contains all the functions in numpy.linalg plus some other more advanced ones not contained in numpy.linalg.

Another advantage of using scipy.linalg over numpy.linalg is that it is always compiled with BLAS/LAPACK support, while for numpy this is optional. Therefore, the scipy version might be faster depending on how numpy was installed.

Therefore, unless you want to avoid adding scipy as a dependency to your numpy-based code, use scipy.linalg instead of numpy.linalg.

Two of the most commonly used operations are computing the determinant of a matrix (linalg.det) and inverting a matrix (linalg.inv):


In [ ]:
from scipy import linalg

In [ ]:
arr1 = np.array([[1, 2], [3, 4]]); arr2 = np.array([[3, 2], [6, 4]]); arr3 = np.ones((3, 3))
print "arr 1 = \n",arr1
print "arr 2 = \n",arr2
print "arr 3 = \n",arr3

In [ ]:
linalg.det(arr1), linalg.det(arr2)

In [ ]:
linalg.det(arr3)

In [ ]:
linalg.inv(arr1)

In [ ]:
# np.allclose returns True if two arrays are element-wise equal within a tolerance
np.allclose(np.dot(arr1, linalg.inv(arr1)), np.eye(2))

In [ ]:
print arr3
linalg.inv(arr3)

If a matrix has a determinant of zero, linalg will raise a LinAlgError when asked to invert it. This is the definition of a singular matrix (one for which an inverse does not exist):


In [ ]:
print arr2
linalg.inv(arr2)
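
scipy.linalg can also solve linear systems of the form A x = b directly with linalg.solve(), which is generally preferable to computing the inverse explicitly; a short sketch using arr1 from above:


In [ ]:
b = np.array([1., 2.])
sol = linalg.solve(arr1, b)            # solve arr1 * sol = b
print sol
print np.allclose(np.dot(arr1, sol), b)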

Fast Fourier Transforms (scipy.fftpack)


In [ ]:
from scipy import fftpack

The scipy.fftpack module computes fast Fourier transforms. As an illustration, a (noisy) input signal may look like this:


In [ ]:
time_step = 0.02
period = 5.
time_vec = np.arange(0, 20, time_step)
sig = np.sin(2 * np.pi / period * time_vec) + 0.5 * np.random.randn(time_vec.size)

In [ ]:
plt.plot(time_vec, sig)

The observer doesn’t know the signal frequency, only the sampling time step of the signal sig. The signal is supposed to come from a real function so the Fourier transform will be symmetric. The scipy.fftpack.fftfreq() function will generate the sampling frequencies and scipy.fftpack.fft() will compute the fast Fourier transform:


In [ ]:
sample_freq = fftpack.fftfreq(sig.size, d=time_step)
sig_fft = fftpack.fft(sig)

Because the resulting power is symmetric, only the positive part of the spectrum needs to be used for finding the frequency:


In [ ]:
pidxs = np.where(sample_freq > 0)
freqs = sample_freq[pidxs]
power = np.abs(sig_fft)[pidxs]

In [ ]:
plt.figure()
plt.plot(freqs, power)
plt.xlabel('Frequency [Hz]')
plt.ylabel('Power')
axes = plt.axes([0.3, 0.3, 0.5, 0.5])
plt.title('Peak frequency')
plt.plot(freqs[:8], power[:8])
plt.setp(axes, yticks=[])

The signal frequency can be found by:


In [ ]:
freq = freqs[power.argmax()]
freq, np.allclose(freq, 1./period)

Now the high-frequency noise will be removed from the Fourier transformed signal:


In [ ]:
sig_fft[np.abs(sample_freq) > freq] = 0

The resulting filtered signal can be computed by the scipy.fftpack.ifft() function:


In [ ]:
main_sig = fftpack.ifft(sig_fft).real  # the imaginary part is numerically negligible

The result can be viewed with:


In [ ]:
plt.figure()
plt.plot(time_vec, sig)
plt.plot(time_vec, main_sig, linewidth=3)
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')

Optimization and Fitting (scipy.optimize)

The scipy.optimize module provides useful algorithms for function minimization (scalar or multi-dimensional), curve fitting and root finding.


In [ ]:
from scipy import optimize

Finding the minimum of a scalar function


In [ ]:
f = lambda x: x**2 + 10*np.sin(x)

In [ ]:
x = np.arange(-10, 10, 0.1)
plt.plot(x, f(x))

This function has a global minimum around -1.3 and a local minimum around 3.8.

The general and efficient way to find a minimum of this function is to conduct a gradient-based descent starting from a given initial point. The BFGS algorithm is a good way of doing this:


In [ ]:
optimize.fmin_bfgs(f, 0)

A possible issue with this approach is that, if the function has local minima, the algorithm may find one of them instead of the global minimum, depending on the initial point:


In [ ]:
optimize.fmin_bfgs(f, 3)

If we don’t know the neighborhood of the global minimum to choose the initial point, we need to resort to costlier global optimization. To find the global minimum, the simplest algorithm is the brute force algorithm, in which the function is evaluated on each point of a given grid:


In [ ]:
grid = (-10, 10, 0.1)
xmin_global = optimize.brute(f, (grid,))
xmin_global

For larger grid sizes, scipy.optimize.brute() becomes quite slow. scipy.optimize.anneal() provides an alternative, using simulated annealing. More efficient algorithms for different classes of global optimization problems exist, but this is out of the scope of scipy. Some useful packages for global optimization are OpenOpt, IPOPT, PyGMO and PyEvolve.

To find the local minimum, let's constrain the variable to the interval (0, 10) using scipy.optimize.fminbound():


In [ ]:
xmin_local = optimize.fminbound(f, 0, 10)
xmin_local

Finding the roots of a scalar function

To find a root, i.e. a point where f(x) = 0, of the function f above we can use for example scipy.optimize.fsolve():


In [ ]:
root = optimize.fsolve(f, 1)  # our initial guess is 1
root

Note that only one root is found. Inspecting the plot of f reveals that there is a second root around -2.5. We can find it by adjusting our initial guess:


In [ ]:
root2 = optimize.fsolve(f, -2.5)
root2

Curve Fitting

Suppose we have data sampled from f with some noise:


In [ ]:
xdata = np.linspace(-10, 10, num=20)
ydata = f(xdata) + np.random.randn(xdata.size)
plt.plot(xdata,ydata)

Now if we know the functional form of the function from which the samples were drawn (here a combination of an x**2 term and a sin(x) term) but not the amplitudes of the terms, we can find those by least-squares curve fitting. First we have to define the function to fit:


In [ ]:
f2 = lambda x, a, b: a*x**2 + b*np.sin(x)

Then we can use scipy.optimize.curve_fit() to find a and b:


In [ ]:
guess = [2, 2]
params, params_covariance = optimize.curve_fit(f2, xdata, ydata, guess)
params

In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, f(x), 'b-', label="f(x)")
ax.plot(x, f2(x, *params), 'r--', label="Curve fit result")
xmins = np.array([xmin_global[0], xmin_local])
ax.plot(xmins, f(xmins), 'go', label="Minima")
roots = np.array([root, root2])
ax.plot(roots, f(roots), 'kv', label="Roots")
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend(loc='best')

Note: In Scipy >= 0.11 unified interfaces to all minimization and root finding algorithms are available: scipy.optimize.minimize(), scipy.optimize.minimize_scalar() and scipy.optimize.root(). They allow comparing various algorithms easily through the method keyword.
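
A brief sketch of this unified interface applied to the function f from above (assuming SciPy >= 0.11 is installed):


In [ ]:
res = optimize.minimize(f, x0=0, method='BFGS')                        # as fmin_bfgs above
print res.x
res2 = optimize.minimize_scalar(f, bounds=(0, 10), method='bounded')  # as fminbound above
print res2.x
res3 = optimize.root(f, x0=1)                                          # as fsolve above
print res3.x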

Exercise: Curve fitting of temperature data

The temperature extremes in Alaska for each month, starting in January, are given below (in degrees Celsius):

Temp Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
max: 17 19 21 28 33 38 37 37 31 23 19 18
min: -62 -59 -56 -46 -32 -18 -9 -13 -25 -46 -52 -58

Plot these temperature extremes.


In [ ]:
# use temp_max / temp_min rather than shadowing the built-in max and min
temp_max = [17,  19,  21,  28,  33,  38, 37,  37,  31,  23,  19,  18]
temp_min = [-62, -59, -56, -46, -32, -18, -9, -13, -25, -46, -52, -58]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(range(12), temp_max, 'r-', label='Max')
ax.plot(range(12), temp_min, 'b-', label='Min')
ax.legend()
ax.set_xlabel('Month')
ax.set_ylabel('Temperature ($^o C$)')

Define a function that can describe min and max temperatures. Hint: this function has to have a period of 1 year. Hint: include a time offset.


In [ ]:
# t is the month index (0-11), so dividing by 12 gives a period of one year
temp = lambda t, a, b, c: b*np.sin(2.*np.pi*t/12. + a) + c

Fit this function to the data with scipy.optimize.curve_fit().


In [ ]:
guess = [6, 1, 1]
months = np.arange(12)

print "min temp"
params, params_covariance = optimize.curve_fit(temp, months, temp_min, guess)
print params

print "max temp"
params2, params_covariance2 = optimize.curve_fit(temp, months, temp_max, guess)
print params2

Plot the result. Is the fit reasonable? If not, why?

Is the time offset for min and max temperatures the same within the fit accuracy?


In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111)
x = np.linspace(-1, 12, 1000)
ax.plot(range(12), temp_min, 'b.', label="min")
ax.plot(range(12), temp_max, 'r.', label="max")
ax.plot(x, temp(x, *params), 'y--', label="min fit")
ax.plot(x, temp(x, *params2), 'g--', label="max fit")
ax.legend(loc='best')
ax.set_xlabel('Month')
ax.set_ylabel('Temperature ($^o C$)')
ax.set_xlim(-1., 11.)
ax.set_ylim(-80, 60)

Interpolation (scipy.interpolate)

The scipy.interpolate module is useful for fitting a function to experimental data and thus evaluating it at points where no measurement exists. The module is based on the FITPACK Fortran subroutines from the netlib project.

Imagine experimental data close to a sine function:


In [ ]:
measured_time = np.linspace(0, 1, 10)
noise = (np.random.random(10)*2 - 1) * 1e-1
measures = np.sin(2 * np.pi * measured_time) + noise
plt.plot(measured_time,measures)

The scipy.interpolate.interp1d class can build a linear interpolation function:


In [ ]:
from scipy.interpolate import interp1d
linear_interp = interp1d(measured_time, measures)

Then the resulting linear_interp instance (an interp1d object) needs to be evaluated at the times of interest:


In [ ]:
computed_time = np.linspace(0, 1, 50)
linear_results = linear_interp(computed_time)

A cubic interpolation can also be selected by providing the kind optional keyword argument:


In [ ]:
cubic_interp = interp1d(measured_time, measures, kind='cubic')
cubic_results = cubic_interp(computed_time)

In [ ]:
plt.plot(measured_time, measures, 'o', ms=6, label='measures')
plt.plot(computed_time, linear_results, label='linear interp')
plt.plot(computed_time, cubic_results, label='cubic interp')
plt.legend()

Statistics and Random Numbers

The module scipy.stats contains statistical tools and probabilistic descriptions of random processes. Random number generators for various random processes can be found in numpy.random.

Histogram and probability density function

Given observations of a random process, their histogram is an estimator of the random process’s PDF (probability density function):


In [ ]:
a = np.random.normal(size=1000)
bins = np.arange(-4, 5)
bins

In [ ]:
histogram = np.histogram(a, bins=bins, normed=True)[0]
bins = 0.5*(bins[1:] + bins[:-1])
bins

In [ ]:
from scipy import stats
b = stats.norm.pdf(bins)  # norm is a distribution

In [ ]:
plt.plot(bins, histogram)
plt.plot(bins, b)

If we know that the random process belongs to a given family of random processes, such as normal processes, we can do a maximum-likelihood fit of the observations to estimate the parameters of the underlying distribution. Here we fit a normal process to the observed data:


In [ ]:
loc, std = stats.norm.fit(a)
loc, std

Percentiles

The median is the value with half of the observations below, and half above:


In [ ]:
np.median(a)

It is also called the 50th percentile, because 50% of the observations are below it:


In [ ]:
stats.scoreatpercentile(a, 50)

Similarly, we can calculate the 90th percentile:


In [ ]:
stats.scoreatpercentile(a, 90)

The percentile is an estimator of the inverse of the CDF (cumulative distribution function): it maps a probability (e.g. 0.9) to the corresponding value of the variable.
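
For our normally distributed sample, the empirical percentile can be compared with the theoretical value given by the inverse CDF of the standard normal distribution (its ppf method); a quick check:


In [ ]:
stats.scoreatpercentile(a, 90), stats.norm.ppf(0.9)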

Statistical tests

A statistical test is a decision indicator. For instance, if we have two sets of observations, that we assume are generated from Gaussian processes, we can use a T-test to decide whether the two sets of observations are significantly different:


In [ ]:
a = np.random.normal(0, 1, size=100)
b = np.random.normal(1, 1, size=10)
stats.ttest_ind(a, b)

The resulting output is composed of:

1) The T statistic value: its sign corresponds to the sign of the difference between the two sample means, and its magnitude is related to the significance of this difference.

2) The p value: the probability of observing a difference at least as large as the one measured if the two processes were in fact identical. The closer it is to zero, the more likely it is that the two processes have different means.
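
As a quick sanity check, two samples drawn from the same distribution typically give a much larger p value, so the test does not report a significant difference:


In [ ]:
c = np.random.normal(0, 1, size=100)
d = np.random.normal(0, 1, size=100)
stats.ttest_ind(c, d)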

Signal Processing (scipy.signal)


In [ ]:
from scipy import signal

scipy.signal.detrend(): remove linear trend from signal:


In [ ]:
t = np.linspace(0, 5, 100)
x = t + np.random.normal(size=100)

plt.plot(t, x, linewidth=3)
plt.plot(t, signal.detrend(x), linewidth=3)

scipy.signal.resample(): resample a signal to n points using FFT.


In [ ]:
t = np.linspace(0, 5, 100)
x = np.sin(t)

plt.plot(t, x, linewidth=3)
plt.plot(t[::2], signal.resample(x, 50), 'ko')

scipy.signal has many window functions: scipy.signal.hamming(), scipy.signal.bartlett(), scipy.signal.blackman()...
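
For example, a Hamming window can be generated and plotted as follows (a minimal sketch):


In [ ]:
window = signal.hamming(51)   # 51-point Hamming window
plt.plot(window)
plt.title('Hamming window')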

scipy.signal has filtering (median filter scipy.signal.medfilt(), Wiener scipy.signal.wiener()), but we will discuss this in the image section.

Image Processing (scipy.ndimage)


In [ ]:
from scipy import ndimage

Image processing routines may be sorted according to the category of processing they perform.

Geometrical transformations on images

Changing orientation, resolution, ...


In [ ]:
from scipy import misc
ascent = misc.ascent()

In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.imshow(ascent, cmap=cm.gray)
ax.axis('off')

In [ ]:
shifted_ascent = ndimage.shift(ascent, (50, 50))
shifted_ascent2 = ndimage.shift(ascent, (50, 50), mode='nearest')
rotated_ascent = ndimage.rotate(ascent, 30)
cropped_ascent = ascent[50:-50, 50:-50]
zoomed_ascent = ndimage.zoom(ascent, 2)

In [ ]:
fig = plt.figure()
ax = fig.add_subplot(151)
ax.imshow(shifted_ascent, cmap=cm.gray)
ax.axis('off')
ax2 = fig.add_subplot(152)
ax2.imshow(shifted_ascent2, cmap=cm.gray)
ax2.axis('off')
ax3 = fig.add_subplot(153)
ax3.imshow(rotated_ascent, cmap=cm.gray)
ax3.axis('off')
ax4 = fig.add_subplot(154)
ax4.imshow(cropped_ascent, cmap=cm.gray)
ax4.axis('off')
ax5 = fig.add_subplot(155)
ax5.imshow(zoomed_ascent, cmap=cm.gray)
ax5.axis('off')

In [ ]:
noisy_ascent = np.copy(ascent).astype(np.float)
noisy_ascent += ascent.std()*0.5*np.random.standard_normal(ascent.shape)
blurred_ascent = ndimage.gaussian_filter(noisy_ascent, sigma=3)
median_ascent = ndimage.median_filter(blurred_ascent, size=5)
wiener_ascent = signal.wiener(blurred_ascent, (5,5))

In [ ]:
fig = plt.figure()
ax = fig.add_subplot(141)
ax.imshow(noisy_ascent, cmap=cm.gray)
ax.axis('off')
ax.set_title("noisy ascent")
ax2 = fig.add_subplot(142)
ax2.imshow(blurred_ascent, cmap=cm.gray)
ax2.axis('off')
ax2.set_title("Gaussian filter")
ax3 = fig.add_subplot(143)
ax3.imshow(median_ascent, cmap=cm.gray)
ax3.axis('off')
ax3.set_title("median filter")
ax4 = fig.add_subplot(144)
ax4.imshow(wiener_ascent, cmap=cm.gray)
ax4.axis('off')
ax4.set_title("Weiner filter")

Measurements on images


In [ ]:
x, y = np.indices((100, 100))
sig = np.sin(2*np.pi*x/50.)*np.sin(2*np.pi*y/50.)*(1+x*y/50.**2)**2
mask = sig > 1

Now we extract various pieces of information about the objects in the image:


In [ ]:
labels, nb = ndimage.label(mask)
nb

In [ ]:
areas = ndimage.sum(mask, labels, xrange(1, labels.max()+1))
areas

In [ ]:
maxima = ndimage.maximum(sig, labels, xrange(1, labels.max()+1))
maxima

In [ ]:
sl = ndimage.find_objects(labels==4)
sl

In [ ]:
fig = plt.figure()
ax = fig.add_subplot(131)
ax.imshow(sig)
ax.axis('off')
ax.set_title("sig")
ax2 = fig.add_subplot(132)
ax2.imshow(mask)
ax2.axis('off')
ax2.set_title("mask")
ax3 = fig.add_subplot(133)
ax3.imshow(labels)
ax3.axis('off')
ax3.set_title("labels")

Breakout Session: Image Processing

  1. Open the image file MV_HFV_012.png and display it (a starting sketch for the first steps is given after this list). This Scanning Electron Microscopy image shows a glass sample (light gray) with some bubbles (in black) and unmolten sand grains (dark gray).
  2. Crop the image to remove the lower panel with the measurement information.
  3. Slightly filter the image with a median filter in order to refine its histogram. Check how the histogram changes.
  4. Using the histogram of the filtered image, determine thresholds that allow you to define masks for sand pixels, glass pixels and bubble pixels.
  5. Display an image in which the three phases are colored with three different colors.
  6. Assign labels to all bubbles and sand grains, and remove from the sand mask grains that are smaller than 10 pixels. To do so, use ndimage.sum or np.bincount to compute the grain sizes.
  7. Compute the mean and median size of bubbles.
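
A possible starting point for steps 1-3 is sketched below; the crop boundary is only a guess and the filter size can be tuned:


In [ ]:
# Step 1: read and display the image
dat = plt.imread('MV_HFV_012.png')
plt.imshow(dat, cmap=cm.gray)

# Step 2: crop away the lower panel (the row index is only a guess)
dat = dat[:-60]

# Step 3: apply a small median filter and compare the histograms
filtdat = ndimage.median_filter(dat, size=5)
plt.figure()
plt.hist(dat.ravel(), bins=256, histtype='step', label='raw')
plt.hist(filtdat.ravel(), bins=256, histtype='step', label='median filtered')
plt.legend()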

In [ ]: