Numpy: advanced array operations
Scipy introduction: from linear algebra to image analysis
Sympy: symbolic math
Further reading:
Most of the time in scientific computing, speed is critical. There are several reasons why Numpy is much faster than standard Python. The most important is that Numpy enforces strong typing, while Python is a dynamically typed language; this translates into Numpy using less heap space to represent data. Because array operations are the core of scientific computing, we will look at this library in greater depth.
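A minimal sketch of the heap-space difference (the exact byte counts depend on the Python build and platform, so treat them as illustrative): a list stores full-blown Python int objects, while a Numpy array stores packed fixed-size machine integers.
In [ ]:
import sys
import numpy as np
L = list(range(1000))
a = np.arange(1000)
# size of the list's pointer table plus the 1000 int objects it references
print(sys.getsizeof(L) + sum(sys.getsizeof(i) for i in L))
# size of the packed array buffer: 1000 * 8 bytes for int64
print(a.nbytes)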
In [1]:
import numpy as np
L = range(1000)
%timeit [i**2 for i in L]
a = np.arange(1000)
%timeit a**2
print(type(L[1]))
print(a.dtype)
In [3]:
import numpy as np
##Get help!
#np.lookfor('create array')
#np.array?
#np.arr*?
In [5]:
a = np.array([0, 1, 2, 3])
b = np.array([[0, 1, 2], [3, 4, 5]])
c = np.array([[[1], [2]], [[3], [4]]])
print(a)
print(b)
print(b.ndim, b.shape, len(b))
In [10]:
##Array/Matrix creation
%matplotlib inline
import matplotlib.pyplot as plt
b = np.arange(1, 9, 2)    # start, stop (exclusive), step
c = np.linspace(0, 1, 6)  # start, stop (inclusive), number of points
print(b, c)
a = np.ones((3, 3))
b = np.zeros((2, 2))
c = np.eye(3)
print(a, b, c)
d = np.diag(np.array([1, 2, 3, 4]))
a = np.random.rand(4)  # uniform in [0, 1]
b = np.random.randn(4) # Gaussian
print(a, b, d)
#np.random.seed(1234)
image = np.random.rand(30, 30)
plt.imshow(image, cmap=plt.cm.hot)
plt.colorbar()
Out[10]:
In [14]:
# Indexing, Slicing and Selection
a = np.arange(10)
print(a[0], a[2], a[-1])
print(a[2:5], a[2:], a[:-2], a[::2], a[2::2])
a = np.diag(np.arange(3))
print(a)
print(a[1, 1])
a[2, 1] = 10  # third row, second column
print(a[1])
print(a[:, 1])
print(a[1:, 2:])
m = np.arange(6) + np.arange(0, 51, 10)[:, np.newaxis]
print(np.arange(0, 51, 10)[:, np.newaxis])
print(m)
print(m[2::2, ::2])
In [15]:
a = np.arange(10)
b = a[::2]
b[0] = 12
print(b, a)
print(np.may_share_memory(a, b))
a = np.arange(10)
c = a[::2].copy() # force a copy
c[0] = 12
print(a, c)
print(np.may_share_memory(a, c))
In [20]:
a = np.random.randint(0, 21, 15)  # random_integers is deprecated; randint's upper bound is exclusive
print(a)
print(a % 3 == 0)
print(a[a % 3 == 0])
a[a % 3 == 0] = -1
print(a)
a = np.arange(0, 100, 10)
print(a[[2, 3, 2, 4, 2]])
a[[9, 7]] = -100
print(a)
a = np.arange(0,100,10)
idx = np.array([[3,4],[9,7]])
print(a[idx])
In [21]:
m = np.arange(6) + np.arange(0, 51, 10)[:, np.newaxis]
print(m)
print(m[(0, 1, 2, 3, 4), (1, 2, 3, 4, 5)])
print(m[[2, 5], [1, 3]])
print(m[3:, [1, 3]])
In [6]:
def get_primes():
    # sieve of Eratosthenes on a boolean mask
    primes = np.ones((100,), dtype=bool)
    primes[:2] = 0
    N_max = int(np.sqrt(len(primes)))
    for j in range(2, N_max):
        primes[2*j::j] = 0
    return primes
print(get_primes())
In [24]:
a = np.triu(np.ones((3, 3)), 1)
print(a)
print(a.T)
In [40]:
#Array operations
a = np.array([1, 2, 3, 4])
print(a + 1, 2**a)
b = np.ones(4) + 1
print(a - b, a * b)
j = np.arange(5)
print(2**(j + 1) - j)
c = np.ones((3, 3))
print(2*c + 1, c * c, c + c)  # elementwise operations
print(c.dot(c))               # matrix multiplication!
a = np.arange(5)
print(np.sin(a), np.log(a), np.exp(a))  # log(0) gives -inf with a warning
a = np.triu(np.ones((3, 3)), 1)
print(a, a.T)
In [ ]:
##Reductions
x = np.array([1, 2, 3, 4])
print(np.sum(x), x.sum(), x.sum(axis=0))
x = np.array([[1, 1], [2, 2]])
print(x.sum(axis=0), x[:, 0].sum())  # columns (first dimension)
print(x.sum(axis=1), x[0, :].sum())  # rows (second dimension)
x = np.random.rand(2, 2, 2)
print(x, x.sum(axis=2)[0, 1], x[0, 1, :].sum())
x = np.array([1, 3, 2])
print(x.min(), x.max(), x.argmin(), x.argmax())
print(np.all([True, True, False]), np.any([True, True, False]))
a = np.zeros((100, 100))
print(np.any(a != 0), np.all(a == a))
a = np.array([1, 2, 3, 2])
b = np.array([2, 2, 3, 2])
c = np.array([6, 4, 4, 5])
print(((a <= b) & (b <= c)).all())
x = np.array([1, 2, 3, 1])
y = np.array([[1, 2, 3], [5, 6, 1]])
print(x.mean(), x.std(), np.median(x), np.median(y, axis=-1))
In [43]:
##Broadcasting
a = np.tile(np.arange(0, 40, 10), (3, 1)).T
b = np.array([0, 1, 2])
print(a + b)
a = np.arange(0, 40, 10)
print(a.shape)
a = a[:, np.newaxis]  # adds a new axis -> 2D array
print(a.shape)
print(a, b, a + b)  # (4, 1) + (3,) broadcasts to (4, 3)
a = np.ones((4, 5))
a[0] = 2  # the scalar 2 is broadcast over the whole first row
print(a)
In [27]:
x, y = np.ogrid[0:5, 0:5]
print(x)  # ogrid uses slicing syntax and returns "open" (broadcastable) grids
print(y)
x, y = np.mgrid[0:5, 0:5]
print(x)  # mgrid returns dense grids
print(y)
In [29]:
##Shape manipulations
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.ravel(), a.T.ravel())
b = a.ravel()
b = b.reshape((2, 3))  # reshape may return a view or a copy!
print(b)
z = np.array([1, 2, 3])
print(z, z[:, np.newaxis], z[np.newaxis, :])
a = np.arange(4*3*2).reshape(4, 3, 2)
print(a.shape, a[0, 2, 1])
b = a.transpose(1, 2, 0)
print(b.shape, b[2, 1, 0])
a = np.arange(4)
a.resize((8,))  # resizes in place (and returns None), padding with zeros
print(a)
In [30]:
##Sorting
a = np.array([[4, 3, 5], [1, 2, 1]])
b = np.sort(a, axis=1)
print(b)
a.sort(axis=1)  # sorts in place and returns None
print(a)
a = np.array([4, 3, 1, 2])
j = np.argsort(a)
print(j, a[j])
a = np.array([4, 3, 1, 2])
print(np.argmax(a), np.argmin(a))
Scipy is Python's favorite library for scientific computing. Some of its functionality overlaps with Numpy and Scikit-learn, but in general Scipy modules are viewed as the equivalent of Matlab's standard toolboxes. It is meant to be used in conjunction with the rest of the so-called "Scipy stack", so, for example, it is best to handle array functionality with Numpy and machine learning with Scikit-learn. We will quickly go through several of the more common uses of Scipy, while keeping in mind that we are barely scratching the surface of actual scientific computing. http://www.c3se.chalmers.se/common/python_course_2012/Lecture5_SciPy_2012.pdf
There are three main uses of this module: solving linear equations, solving eigenvalue problems, and computing matrix factorizations.
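The first two uses take one call each; below is a minimal sketch on a small hypothetical system (the matrices are made up for illustration) before we turn to factorizations.
In [ ]:
import numpy as np
from scipy import linalg
A = np.array([[3., 2.], [1., 4.]])
b = np.array([7., 10.])
x = linalg.solve(A, b)   # solves the linear system A x = b
print(x, np.allclose(A.dot(x), b))
w, v = linalg.eig(A)     # eigenvalues w, eigenvectors as columns of v
print(w)
print(np.allclose(A.dot(v[:, 0]), w[0] * v[:, 0]))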
Two of the most common techniques for data summarization are the Singular Value Decomposition (SVD) and Principal Component Analysis (PCA). We will solve an SVD task here, and a PCA task later in the machine learning section. Keep in mind that many machine learning algorithms in scikit-learn are wrappers around the basic functions in this scipy module. From the linear algebra perspective, PCA is a matrix decomposition method; several other such methods are available in the linalg module.
If X is a matrix with each variable in a column and each observation in a row, then the SVD is $$X = U S V^T$$ where the columns of U are orthogonal (left singular vectors), the columns of V are orthogonal (right singular vectors), and S is a diagonal matrix whose main diagonal holds the singular values s. Note that scipy.linalg.svd returns $V^T$ directly.
In the example below, most of the variation in the dataset is explained by the first two singular values, corresponding to the first two singular vectors.
Obs:
scipy.linalg.orth(A) - uses SVD to find an orthonormal basis for the range of A.
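A quick sketch of that observation on a hypothetical rank-deficient matrix:
In [ ]:
import numpy as np
from scipy import linalg
A = np.array([[1., 2., 3.], [2., 4., 6.], [1., 0., 1.]])  # rank 2: row 2 is twice row 1
Q = linalg.orth(A)
print(Q.shape)                              # (3, 2): one column per dimension of the range
print(np.allclose(Q.T.dot(Q), np.eye(2)))   # the columns are orthonormal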
In [33]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import linalg
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.feature_names, iris.target_names)
print(iris.DESCR)
#print(iris.data)
A = iris.data
U, s, V = linalg.svd(A)  # note: V is returned already transposed (rows are right singular vectors)
print(U.shape, V.shape, s.shape)
print(s)
S = linalg.diagsvd(s, 150, 4)  # rebuild the 150 x 4 diagonal matrix from s
At = np.dot(U, np.dot(S, V))
print(np.allclose(A, At))      # the factorization reconstructs A
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
ax1, ax2 = axes
colors = np.array(['blue', 'red', 'black'])
labels = np.array(['setosa', 'versicolor', 'virginica'])
for i in range(3):
    ax1.scatter(U[iris.target == i, 0], U[iris.target == i, 1],
                color=colors[i], label=labels[i])
ax1.set_xlabel("First singular vector")
ax1.set_ylabel("Second singular vector")
ax1.legend()
for i in range(3):
    ax2.scatter(A[iris.target == i, 0], A[iris.target == i, 1],
                color=colors[i], label=labels[i])
ax2.set_xlabel("First feature vector")
ax2.set_ylabel("Second feature vector")
ax2.legend()
In [15]:
#np.allclose?
Signal processing is useful for interpreting the data of many measuring instruments. We will work through a simple example; for those who want to learn more applications of Python for signal processing, I recommend a couple of online IPython courses:
http://nbviewer.ipython.org/github/calebmadrigal/FourierTalkOSCON/tree/master/: Basic entry level course focused only on FFT
http://nbviewer.ipython.org/github/unpingco/Python-for-Signal-Processing/tree/master/: A more complex and complete course of signal processing with Python.
A small example would be a noisy signal whose frequency is unknown to the observer, who only knows the sampling time step. The signal is supposed to come from a real function so the Fourier transform will be symmetric. The scipy.fftpack.fftfreq() function will generate the sampling frequencies and scipy.fftpack.fft() will compute the fast Fourier transform:
In [17]:
%matplotlib inline
import numpy as np
from scipy import fftpack
import matplotlib.pyplot as plt
time_step = 0.22
period = 5.
time_vec = np.arange(0, 20, time_step)
sig = np.sin(2 * np.pi / period * time_vec) + 0.5 * np.random.randn(time_vec.size)
#print(sig.size)
sample_freq = fftpack.fftfreq(sig.size, d=time_step)
sig_fft = fftpack.fft(sig)
pidxs = np.where(sample_freq > 0)
freqs, power = sample_freq[pidxs], np.abs(sig_fft)[pidxs]
freq = freqs[power.argmax()]
print("Determined frequency:",freq)
sig_fft[np.abs(sample_freq) > freq] = 0
main_sig = fftpack.ifft(sig_fft)#Discrete inverse Fourier transform
fig = plt.figure()
ax1 = fig.add_subplot(311)
ax1.plot(time_vec,sig)
ax1.set_title('Signal')
ax2 = fig.add_subplot(312)
ax2.plot(freqs, power)
ax2.set_xlabel('Frequency [Hz]')
ax2.set_ylabel('power')
ax2.set_title('Peak frequency')
ax3 = fig.add_subplot(313)
ax3.plot(time_vec,main_sig)
ax3.set_title('Cleaned signal')
Out[17]:
In the statistics chapter we already used this package for line fitting: we estimated the parameters of a function by performing an error minimization. The complexity of an optimization problem depends on several factors, such as whether the function is convex, whether its derivatives are available, and whether the variables are constrained.
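As a reminder of that error-minimization view, a minimal least-squares sketch on synthetic data (the line and noise level here are made up for illustration):
In [ ]:
import numpy as np
from scipy import optimize
def line(x, a, b):  # hypothetical model function
    return a * x + b
x = np.linspace(0, 10, 50)
y = 3 * x + 1 + np.random.randn(50)           # noisy samples of a known line
params, cov = optimize.curve_fit(line, x, y)  # least-squares parameter estimates
print(params)                                 # should be close to (3, 1)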
To exemplify, we use Newton's method (the Newton conjugate gradient optimizer, scipy.optimize.fmin_ncg) to find the minimum of a nonlinear function.
In [35]:
import numpy as np
import scipy.optimize as optimize
def f(x):  # a variant of the Rosenbrock function
    return .5*(1 - x[0])**2 + (x[1] - x[0]**2)**2
def fprime(x):  # analytical gradient
    return np.array((-2*.5*(1 - x[0]) - 4*x[0]*(x[1] - x[0]**2),
                     2*(x[1] - x[0]**2)))
print(optimize.fmin_ncg(f, [2, 2], fprime=fprime))
def hessian(x):  # computed with sympy (see the symbolic math section below)
    return np.array(((1 - 4*x[1] + 12*x[0]**2, -4*x[0]), (-4*x[0], 2)))
print(optimize.fmin_ncg(f, [2, 2], fprime=fprime, fhess=hessian))
%matplotlib inline
from matplotlib import cm
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
Z = .5*(1 - X)**2 + (Y - X**2)**2
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.coolwarm,
linewidth=0, antialiased=False)
ax.set_zlim(-1000.01, 1000.01)
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
Interpolation is useful when we have sampled a function but want to approximate its values at different points. A well-known class of interpolation functions are the splines: most commonly, piecewise cubic polynomials are joined smoothly at the data points, producing a smooth curved line between them.
In [13]:
%matplotlib inline
import numpy as np
from scipy.interpolate import interp1d
import pylab as pl
measured_time = np.linspace(0, 1, 10)
noise = 0.1 * np.random.randn(10)
measures = np.sin(2 * np.pi * measured_time) + noise
linear_interp = interp1d(measured_time, measures)
computed_time = np.linspace(0, 1, 50)
linear_results = linear_interp(computed_time)
cubic_interp = interp1d(measured_time, measures, kind='cubic')
cubic_results = cubic_interp(computed_time)
pl.plot(measured_time, measures, 'o', ms=6, label='measures')
pl.plot(computed_time, linear_results, label='linear interp')
pl.plot(computed_time, cubic_results, label='cubic interp')
pl.legend()
Out[13]:
This submodule is useful for summing up function values over intervals (integration) and for solving ordinary differential equations. Partial differential equations are not covered and require other Python packages. As a quick example of the latter, we solve a case of Michaelis-Menten enzyme kinetics below.
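Numerical quadrature itself is a one-liner; a minimal sketch before the kinetics example:
In [ ]:
import numpy as np
from scipy import integrate
# definite integral of sin(x) over [0, pi]; quad returns (value, error estimate)
value, error = integrate.quad(np.sin, 0, np.pi)
print(value, error)  # value should be close to 2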
In [36]:
%matplotlib inline
import numpy as np
import scipy.integrate as integrate
'''
Slightly modified from sample code generated by kinpy, a program
that formulates solvers for different cases of enzyme reactions:
http://code.google.com/p/kinpy/

## Reaction ##
#Michaelis-Menten enzyme kinetics.
E + S <-> ES
ES <-> E + P

## Mapping ##
E  0  -1*v_0(y[1], y[0], y[2]) +1*v_1(y[3], y[0], y[2])
S  1  -1*v_0(y[1], y[0], y[2])
ES 2  +1*v_0(y[1], y[0], y[2]) -1*v_1(y[3], y[0], y[2])
P  3  +1*v_1(y[3], y[0], y[2])
'''
# Right-hand side of the ODE system: d[E, S, ES, P]/dt
dy = lambda y, t: np.array([
    -1*v_0(y[1], y[0], y[2]) + 1*v_1(y[3], y[0], y[2]),
    -1*v_0(y[1], y[0], y[2]),
    +1*v_0(y[1], y[0], y[2]) - 1*v_1(y[3], y[0], y[2]),
    +1*v_1(y[3], y[0], y[2]),
])
# Initial concentrations of E, S, ES, P:
y0 = np.array([0.6, 1.2, 3.0, 0.2])
# E + S <-> ES
v_0 = lambda S, E, ES: k0 * E * S - k0r * ES
k0 = 1.2
k0r = 1.5
# ES <-> E + P
v_1 = lambda P, E, ES: k1 * ES - k1r * E * P
k1 = 0.9
k1r = 1.9
t = np.arange(0, 10, 0.01)
Y = integrate.odeint(dy, y0, t)  # integrate the system over t
import pylab as pl
pl.plot(t, Y)
pl.legend(['E', 'S', 'ES', 'P'])
Out[36]:
This module contains functions for multidimensional image manipulation, but more image processing functionality is developed in scikit-image, and there are also independent packages for image manipulation. My favourite is PIL, but it falls outside the scope of this course.
In [18]:
%matplotlib inline
import numpy as np
from scipy import ndimage
from scipy import misc
import matplotlib.pyplot as plt
from matplotlib import cm
import pylab as pl
lena = misc.lena()
shifted_lena = ndimage.shift(lena, (50, 50))
shifted_lena2 = ndimage.shift(lena, (50, 50), mode='nearest')
rotated_lena = ndimage.rotate(lena, 30)
cropped_lena = lena[50:-50, 50:-50]
zoomed_lena = ndimage.zoom(lena, 2)
print(zoomed_lena.shape)
noisy_lena = np.copy(lena).astype(float)
noisy_lena += lena.std()*0.5*np.random.standard_normal(lena.shape)
blurred_lena = ndimage.gaussian_filter(noisy_lena, sigma=3)
median_lena = ndimage.median_filter(blurred_lena, size=5)
from scipy import signal
wiener_lena = signal.wiener(blurred_lena, (5,5))
fig = plt.figure()
ax1 = fig.add_subplot(111)
pl.imshow(blurred_lena, cmap=cm.gray)
Symbolic math is sometimes important, especially if we are weak at calculus or if we need to perform automated calculus on long formulas. We will briefly go through a few test cases, to get a feel for it. Symbolic math is most developed in dedicated systems such as Mathematica, or Sage, an open-source equivalent.
In [2]:
import sympy
print(sympy.sqrt(8))  # symbolic result: 2*sqrt(2)
import math
print(math.sqrt(8))   # floating-point result
In [11]:
from sympy import symbols
x, y, z, t = symbols('x y z t')
expr = x + 2*y
print(expr)
print(x * expr)
from sympy import expand, factor, simplify
expanded_expr = expand(x*expr)
print(expanded_expr)
print(factor(expanded_expr))
subs_expr = expanded_expr.subs(x, z**t)  # substitute x by z**t
print(subs_expr)
print(simplify(subs_expr))
In the scipy.optimize paragraph we needed the Hessian matrix for a function f. Here is how you can obtain it in sympy:
In [21]:
import sympy
x, y = sympy.symbols('x y')
f = .5*(1 - x)**2 + (y - x**2)**2
h = sympy.hessian(f, [x,y])
print(h)
from IPython.display import Latex
Latex('$$%s$$' % sympy.latex(h))
Out[21]:
In [22]:
from IPython.display import HTML
HTML('<iframe src=http://en.wikipedia.org/wiki/Hessian_matrix width=700 height=350></iframe>')
Out[22]:
In [ ]: