The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc.
scipy can be compared to other standard scientific-computing libraries, such as the GSL (GNU Scientific Library for C and C++), or Matlab’s toolboxes. scipy is the core package for scientific routines in Python; it is meant to operate efficiently on numpy arrays, so that numpy and scipy work hand in hand.
Before implementing a routine, it is worth checking whether the desired data processing is already implemented in SciPy. As non-professional programmers, scientists often tend to re-invent the wheel, which leads to buggy, non-optimal, difficult-to-share and unmaintainable code. By contrast, SciPy's routines are optimized and tested, and should therefore be used when possible.
The following subpackages are available in SciPy:
Subpackage | Description |
---|---|
cluster | Clustering algorithms |
constants | Physical and mathematical constants |
fftpack | Fast Fourier Transform routines |
integrate | Integration and ordinary differential equation solvers |
interpolate | Interpolation and smoothing splines |
io | Input and Output |
linalg | Linear algebra |
ndimage | N-dimensional image processing |
odr | Orthogonal distance regression |
optimize | Optimization and root-finding routines |
signal | Signal processing |
sparse | Sparse matrices and associated routines |
spatial | Spatial data structures and algorithms |
special | Special functions |
stats | Statistical distributions and functions |
weave | C/C++ integration |
In [ ]:
%pylab inline
# Increase default figure resolution
matplotlib.rcParams['figure.dpi'] = 2 * matplotlib.rcParams['figure.dpi']
scipy has built-in functions to read and write a wide variety of data formats, including Matlab, IDL, and NetCDF. Plain text and binary capabilities are available from numpy. HDF5 has its own Python module.
In [ ]:
from scipy import io
a = np.ones((3, 3))
io.savemat('file.mat', {'a': a}); # savemat expects a dictionary
data = io.loadmat('file.mat', struct_as_record=True)
data['a']
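As noted above, plain text and binary I/O is handled by numpy itself rather than scipy.io; a minimal sketch (the file names here are arbitrary):
In [ ]:
np.savetxt('file.txt', a)        # plain text, one row per line
b = np.loadtxt('file.txt')
np.save('file.npy', a)           # numpy's binary .npy format keeps shape and dtype
c = np.load('file.npy')
np.allclose(a, b), np.allclose(a, c)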
In [ ]:
from scipy import integrate
In [ ]:
%%latex
We want to compute the integral:
$$ I = \int_{x=\pi}^{2\pi}\int_{y=0}^{\pi}{(y\sin x + x\cos y)dydx} $$
In [ ]:
def integrand(y, x):
    """y must be the first argument, and x the second."""
    return y * np.sin(x) + x * np.cos(y)
In [ ]:
ans, err = integrate.dblquad(integrand,
                             np.pi, 2 * np.pi,   # x limits
                             lambda x: 0,        # lower y limit
                             lambda x: np.pi)    # upper y limit
print(ans)
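For a simple one-dimensional integral, scipy.integrate.quad() is the basic routine. As a quick sanity check (not part of the original example), the integral of sin(x) from 0 to pi should be 2:
In [ ]:
res, abserr = integrate.quad(np.sin, 0, np.pi)
print(res, abserr)   # result and an estimate of the absolute error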
scipy.linalg contains all the functions in numpy.linalg, plus some more advanced ones not contained in numpy.linalg.
Another advantage of using scipy.linalg over numpy.linalg is that it is always compiled with BLAS/LAPACK support, while for numpy this is optional. The scipy version might therefore be faster, depending on how numpy was installed. So unless you don't want to add scipy as a dependency to your numpy program, use scipy.linalg instead of numpy.linalg.
The most commonly used functions are those to calculate the determinant of a matrix (linalg.det) and to invert a matrix (linalg.inv).
In [ ]:
from scipy import linalg
In [ ]:
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[3, 2], [6, 4]])
arr3 = np.ones((3, 3))
print("arr1 =\n", arr1)
print("arr2 =\n", arr2)
print("arr3 =\n", arr3)
In [ ]:
linalg.det(arr1), linalg.det(arr2)
In [ ]:
linalg.det(arr3)
In [ ]:
linalg.inv(arr1)
In [ ]:
# np.allclose returns True if two arrays are element-wise equal within a tolerance
np.allclose(np.dot(arr1, linalg.inv(arr1)), np.eye(2))
In [ ]:
print(arr3)
linalg.inv(arr3)
If a matrix has a determinant of zero, linalg.inv will raise a LinAlgError. Such a matrix is called singular: one for which an inverse does not exist.
In [ ]:
print(arr2)
linalg.inv(arr2)   # arr2 is singular (determinant 0), so this raises LinAlgError
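Beyond det and inv, another very common task is solving a linear system A x = b, which scipy.linalg.solve() does without forming the inverse explicitly. A minimal sketch (the right-hand side b below is arbitrary):
In [ ]:
b = np.array([1, 2])
x = linalg.solve(arr1, b)            # solves arr1 @ x = b
np.allclose(np.dot(arr1, x), b)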
In [ ]:
from scipy import fftpack
The scipy.fftpack module computes fast Fourier transforms. As an illustration, a (noisy) input signal may look like:
In [ ]:
time_step = 0.02
period = 5.
time_vec = np.arange(0, 20, time_step)
sig = np.sin(2 * np.pi / period * time_vec) + 0.5 * np.random.randn(time_vec.size)
In [ ]:
plt.plot(time_vec, sig)
The observer doesn’t know the signal frequency, only the sampling time step of the signal sig. The signal is supposed to come from a real function so the Fourier transform will be symmetric. The scipy.fftpack.fftfreq() function will generate the sampling frequencies and scipy.fftpack.fft() will compute the fast Fourier transform:
In [ ]:
sample_freq = fftpack.fftfreq(sig.size, d=time_step)
sig_fft = fftpack.fft(sig)
Because the resulting power is symmetric, only the positive part of the spectrum needs to be used for finding the frequency:
In [ ]:
pidxs = np.where(sample_freq > 0)
freqs = sample_freq[pidxs]
power = np.abs(sig_fft)[pidxs]
In [ ]:
plt.figure()
plt.plot(freqs, power)
plt.xlabel('Frequency [Hz]')
plt.ylabel('Power')
axes = plt.axes([0.3, 0.3, 0.5, 0.5])
plt.title('Peak frequency')
plt.plot(freqs[:8], power[:8])
plt.setp(axes, yticks=[])
The signal frequency can be found by:
In [ ]:
freq = freqs[power.argmax()]
freq, np.allclose(freq, 1./period)
Now the high-frequency noise will be removed from the Fourier transformed signal:
In [ ]:
sig_fft[np.abs(sample_freq) > freq] = 0
The resulting filtered signal can be computed by the scipy.fftpack.ifft() function:
In [ ]:
main_sig = fftpack.ifft(sig_fft).real  # the imaginary part is numerical noise
The result can be viewed with:
In [ ]:
plt.figure()
plt.plot(time_vec, sig)
plt.plot(time_vec, main_sig, linewidth=3)
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
The scipy.optimize module provides useful algorithms for function minimization (scalar or multi-dimensional), curve fitting and root finding.
In [ ]:
from scipy import optimize
In [ ]:
f = lambda x: x**2 + 10*np.sin(x)
In [ ]:
x = np.arange(-10, 10, 0.1)
plt.plot(x, f(x))
This function has a global minimum around -1.3 and a local minimum around 3.8.
The general and efficient way to find a minimum for this function is to conduct a gradient descent starting from a given initial point. The BFGS algorithm is a good way of doing this:
In [ ]:
optimize.fmin_bfgs(f, 0)
A possible issue with this approach is that, if the function has local minima, the algorithm may find one of them instead of the global minimum, depending on the initial point:
In [ ]:
optimize.fmin_bfgs(f, 3)
If we don’t know the neighborhood of the global minimum to choose the initial point, we need to resort to costlier global optimization. To find the global minimum, the simplest algorithm is the brute force algorithm, in which the function is evaluated on each point of a given grid:
In [ ]:
grid = (-10, 10, 0.1)
xmin_global = optimize.brute(f, (grid,))
xmin_global
For larger grid sizes, scipy.optimize.brute() becomes quite slow. scipy.optimize.anneal() provides an alternative, using simulated annealing. More efficient algorithms for different classes of global optimization problems exist, but this is out of the scope of scipy. Some useful packages for global optimization are OpenOpt, IPOPT, PyGMO and PyEvolve.
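Depending on your SciPy version (0.12 or later), scipy.optimize.basinhopping() is another built-in option, which repeatedly perturbs the point randomly and re-runs a local minimization; a minimal sketch (the starting point 0 is arbitrary):
In [ ]:
result = optimize.basinhopping(f, 0)
result.x   # should be close to the global minimum near -1.3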
To find the local minimum, let's constrain the variable to the interval (0, 10) using scipy.optimize.fminbound():
In [ ]:
xmin_local = optimize.fminbound(f, 0, 10)
xmin_local
To find a root, i.e. a point where f(x) = 0, of the function f above we can use for example scipy.optimize.fsolve():
In [ ]:
root = optimize.fsolve(f, 1) # our initial guess is 1
root
Note that only one root is found. Inspecting the plot of f reveals that there is a second root around -2.5. We can find it by adjusting our initial guess:
In [ ]:
root2 = optimize.fsolve(f, -2.5)
root2
Suppose we have data sampled from f with some noise:
In [ ]:
xdata = np.linspace(-10, 10, num=20)
ydata = f(xdata) + np.random.randn(xdata.size)
plt.plot(xdata,ydata)
Now, if we know the functional form of the function from which the samples were drawn (a*x^2 + b*sin(x) in this case) but not the amplitudes of the terms, we can find those by least squares curve fitting. First we have to define the function to fit:
In [ ]:
f2 = lambda x, a, b: a*x**2 + b*np.sin(x)
Then we can use scipy.optimize.curve_fit() to find a and b:
In [ ]:
guess = [2, 2]
params, params_covariance = optimize.curve_fit(f2, xdata, ydata, guess)
params
In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, f(x), 'b-', label="f(x)")
ax.plot(x, f2(x, *params), 'r--', label="Curve fit result")
xmins = np.array([xmin_global[0], xmin_local])
ax.plot(xmins, f(xmins), 'go', label="Minima")
roots = np.array([root, root2])
ax.plot(roots, f(roots), 'kv', label="Roots")
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
Note: In Scipy >= 0.11 unified interfaces to all minimization and root finding algorithms are available: scipy.optimize.minimize(), scipy.optimize.minimize_scalar() and scipy.optimize.root(). They allow comparing various algorithms easily through the method keyword.
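A minimal sketch of these unified interfaces, assuming SciPy >= 0.11 as noted above:
In [ ]:
res_min = optimize.minimize(f, x0=0, method='BFGS')   # same as fmin_bfgs above
res_root = optimize.root(f, x0=1)                      # same as fsolve above
res_min.x, res_root.x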
The temperature extremes in Alaska for each month, starting in January, are given below (in degrees Celsius):
Temp | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
max: | 17 | 19 | 21 | 28 | 33 | 38 | 37 | 37 | 31 | 23 | 19 | 18 |
min: | -62 | -59 | -56 | -46 | -32 | -18 | -9 | -13 | -25 | -46 | -52 | -58 |
Plot these temperature extremes.
In [ ]:
# use temp_max/temp_min rather than shadowing the built-ins max and min
temp_max = [17, 19, 21, 28, 33, 38, 37, 37, 31, 23, 19, 18]
temp_min = [-62, -59, -56, -46, -32, -18, -9, -13, -25, -46, -52, -58]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(range(12), temp_max, 'r-', label='Max')
ax.plot(range(12), temp_min, 'b-', label='Min')
ax.legend()
ax.set_xlabel('Month')
ax.set_ylabel('Temperature ($^o C$)')
Define a function that can describe min and max temperatures. Hint: this function has to have a period of 1 year. Hint: include a time offset.
In [ ]:
# t is in months, so a one-year period corresponds to 12 in these units
temp = lambda t, a, b, c: b * np.sin(2. * np.pi * t / 12. + a) + c
Fit this function to the data with scipy.optimize.curve_fit()
.
In [ ]:
guess = [6, 1, 1]
print("min temp")
params, params_covariance = optimize.curve_fit(temp, range(12), temp_min, guess)
print(params)
print("max temp")
params2, params_covariance2 = optimize.curve_fit(temp, range(12), temp_max, guess)
print(params2)
Plot the result. Is the fit reasonable? If not, why?
Is the time offset for min and max temperatures the same within the fit accuracy?
In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111)
x = np.linspace(-1, 12, 1000)
ax.plot(range(12), temp_min, 'b.', label="min")
ax.plot(range(12), temp_max, 'r.', label="max")
ax.plot(x, temp(x, *params), 'y--', label="min fit")
ax.plot(x, temp(x, *params2), 'g--', label="max fit")
ax.legend()
ax.set_xlabel('Month')
ax.set_ylabel('Temperature ($^o C$)')
ax.set_xlim(-1., 11.)
ax.set_ylim(-80, 60)
The scipy.interpolate module is useful for fitting a function to experimental data and thus evaluating points where no measurement exists. The module is based on the FITPACK Fortran subroutines from the netlib project.
Imagine experimental data close to a sine function:
In [ ]:
measured_time = np.linspace(0, 1, 10)
noise = (np.random.random(10)*2 - 1) * 1e-1
measures = np.sin(2 * np.pi * measured_time) + noise
plt.plot(measured_time,measures)
The scipy.interpolate.interp1d class can build a linear interpolation function:
In [ ]:
from scipy.interpolate import interp1d
linear_interp = interp1d(measured_time, measures)
Then the linear_interp instance needs to be evaluated at the times of interest:
In [ ]:
computed_time = np.linspace(0, 1, 50)
linear_results = linear_interp(computed_time)
A cubic interpolation can also be selected by providing the kind optional keyword argument:
In [ ]:
cubic_interp = interp1d(measured_time, measures, kind='cubic')
cubic_results = cubic_interp(computed_time)
In [ ]:
plt.plot(measured_time, measures, 'o', ms=6, label='measures')
plt.plot(computed_time, linear_results, label='linear interp')
plt.plot(computed_time, cubic_results, label='cubic interp')
plt.legend()
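By default interp1d refuses to evaluate points outside the measured range. A small sketch of the bounds_error and fill_value options (the fill value 0.0 is an arbitrary choice):
In [ ]:
clipped_interp = interp1d(measured_time, measures, bounds_error=False, fill_value=0.0)
clipped_interp([-0.5, 0.5, 1.5])   # out-of-range points return 0.0 instead of raising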
The module scipy.stats contains statistical tools and probabilistic descriptions of random processes. Random number generators for various random processes can be found in numpy.random.
Given observations of a random process, their histogram is an estimator of the random process’s PDF (probability density function):
In [ ]:
a = np.random.normal(size=1000)
bins = np.arange(-4, 5)
histogram = np.histogram(a, bins=bins, density=True)[0]
bins = 0.5 * (bins[1:] + bins[:-1])   # bin centers
from scipy import stats
b = stats.norm.pdf(bins)  # norm is a distribution
In [ ]:
plt.plot(bins, histogram)
plt.plot(bins, b)
If we know that the random process belongs to a given family of random processes, such as normal processes, we can do a maximum-likelihood fit of the observations to estimate the parameters of the underlying distribution. Here we fit a normal process to the observed data:
In [ ]:
loc, std = stats.norm.fit(a)
loc, std
The median is the value with half of the observations below, and half above:
In [ ]:
np.median(a)
It is also called the 50th percentile, because 50% of the observations are below it:
In [ ]:
stats.scoreatpercentile(a, 50)
Similarly, we can calculate the 90th percentile:
In [ ]:
stats.scoreatpercentile(a, 90)
The percentile is an estimator of the quantile function, i.e. the inverse of the CDF (cumulative distribution function).
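For the normal distribution fitted above, the theoretical counterpart is stats.norm.ppf(), the inverse CDF; a quick sketch comparing it to the empirical 90th percentile (agreement is only approximate for 1000 samples):
In [ ]:
stats.scoreatpercentile(a, 90), stats.norm.ppf(0.9, loc=loc, scale=std)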
A statistical test is a decision indicator. For instance, if we have two sets of observations, that we assume are generated from Gaussian processes, we can use a T-test to decide whether the two sets of observations are significantly different:
In [ ]:
a = np.random.normal(0, 1, size=100)
b = np.random.normal(1, 1, size=10)
stats.ttest_ind(a, b)
The resulting output is composed of:
1) The T statistic value: its sign reflects which sample has the larger mean, and its magnitude is related to the significance of this difference.
2) The p value: the probability of observing such a difference if both processes actually had the same mean. If it is large, the data give no evidence that the means differ; the closer it is to zero, the more likely it is that the processes have different means.
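A small sketch unpacking the result (the 0.05 threshold below is just the conventional significance level):
In [ ]:
t_stat, p_value = stats.ttest_ind(a, b)
print("T statistic:", t_stat)
print("p value:", p_value)
p_value < 0.05   # True suggests the two means differ significantly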
In [ ]:
from scipy import signal
scipy.signal.detrend(): remove linear trend from signal:
In [ ]:
t = np.linspace(0, 5, 100)
x = t + np.random.normal(size=100)
plt.plot(t, x, linewidth=3)
plt.plot(t, signal.detrend(x), linewidth=3)
scipy.signal.resample(): resample a signal to n points using FFT.
In [ ]:
t = np.linspace(0, 5, 100)
x = np.sin(t)
plt.plot(t, x, linewidth=3)
plt.plot(t[::2], signal.resample(x, 50), 'ko')
scipy.signal has many window functions: scipy.signal.hamming(), scipy.signal.bartlett(), scipy.signal.blackman()...
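A quick sketch plotting one of these windows (the length 51 is arbitrary; in recent SciPy releases the window functions live under scipy.signal.windows):
In [ ]:
window = signal.hamming(51)
plt.plot(window)
plt.title('Hamming window (51 points)')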
scipy.signal has filtering (median filter scipy.signal.medfilt(), Wiener filter scipy.signal.wiener()), but we will discuss this in the image section.
In [ ]:
from scipy import ndimage
Image processing routines may be sorted according to the category of processing they perform.
Geometrical transformations: changing orientation, resolution, ...
In [ ]:
from scipy import misc
ascent = misc.ascent()
In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.imshow(ascent, cmap=cm.gray)
ax.axis('off')
In [ ]:
shifted_ascent = ndimage.shift(ascent, (50, 50))
shifted_ascent2 = ndimage.shift(ascent, (50, 50), mode='nearest')
rotated_ascent = ndimage.rotate(ascent, 30)
cropped_ascent = ascent[50:-50, 50:-50]
zoomed_ascent = ndimage.zoom(ascent, 2)
In [ ]:
fig = plt.figure()
ax = fig.add_subplot(151)
ax.imshow(shifted_ascent, cmap=cm.gray)
ax.axis('off')
ax2 = fig.add_subplot(152)
ax2.imshow(shifted_ascent2, cmap=cm.gray)
ax2.axis('off')
ax3 = fig.add_subplot(153)
ax3.imshow(rotated_ascent, cmap=cm.gray)
ax3.axis('off')
ax4 = fig.add_subplot(154)
ax4.imshow(cropped_ascent, cmap=cm.gray)
ax4.axis('off')
ax5 = fig.add_subplot(155)
ax5.imshow(zoomed_ascent, cmap=cm.gray)
ax5.axis('off')
In [ ]:
noisy_ascent = np.copy(ascent).astype(float)
noisy_ascent += ascent.std()*0.5*np.random.standard_normal(ascent.shape)
blurred_ascent = ndimage.gaussian_filter(noisy_ascent, sigma=3)
median_ascent = ndimage.median_filter(blurred_ascent, size=5)
wiener_ascent = signal.wiener(blurred_ascent, (5,5))
In [ ]:
fig = plt.figure()
ax = fig.add_subplot(141)
ax.imshow(noisy_ascent, cmap=cm.gray)
ax.axis('off')
ax.set_title("noisy ascent")
ax2 = fig.add_subplot(142)
ax2.imshow(blurred_ascent, cmap=cm.gray)
ax2.axis('off')
ax2.set_title("Gaussian filter")
ax3 = fig.add_subplot(143)
ax3.imshow(median_ascent, cmap=cm.gray)
ax3.axis('off')
ax3.set_title("median filter")
ax4 = fig.add_subplot(144)
ax4.imshow(wiener_ascent, cmap=cm.gray)
ax4.axis('off')
ax4.set_title("Weiner filter")
In [ ]:
x, y = np.indices((100, 100))
sig = np.sin(2*np.pi*x/50.)*np.sin(2*np.pi*y/50.)*(1+x*y/50.**2)**2
mask = sig > 1
Now we look for various information about the objects in the image:
In [ ]:
labels, nb = ndimage.label(mask)
nb
In [ ]:
areas = ndimage.sum(mask, labels, range(1, labels.max() + 1))
areas
In [ ]:
maxima = ndimage.maximum(sig, labels, range(1, labels.max() + 1))
maxima
In [ ]:
sl = ndimage.find_objects(labels == 4)
sl
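The returned slices can be used directly to crop the bounding box of that object out of an array; a small sketch:
In [ ]:
plt.figure()
plt.imshow(sig[sl[0]])   # crop of the bounding box around object 4
plt.title('object 4')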
In [ ]:
fig = plt.figure()
ax = fig.add_subplot(131)
ax.imshow(sig)
ax.axis('off')
ax.set_title("sig")
ax2 = fig.add_subplot(132)
ax2.imshow(mask)
ax2.axis('off')
ax2.set_title("mask")
ax3 = fig.add_subplot(133)
ax3.imshow(labels)
ax3.axis('off')
ax3.set_title("labels")