Scientific Python

The goal of today's lab is to get familiar with data representation and manipulation in Python.

To do the lab, you should make a copy of the repository on your machine (either by downloading it directly or by cloning/forking it from https://github.com/chagaz/ma2823_2016), then open this file from Jupyter. If you don't know how to start Jupyter, try jupyter notebook from a terminal (not from a Python console).

1. Let us check that your installation is working


In [1]:
# scientific python
import numpy as np
import scipy as sp

In [2]:
# interactive plotting 
%pylab inline


Populating the interactive namespace from numpy and matplotlib

The previous command is one of Jupyter's "magics". As the message above indicates, it imports numpy and matplotlib into the interactive namespace. See http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=pylab%20inline for details.

2. Numeric Python: Numpy arrays

NumPy arrays are a fundamental structure for scientific computing. They are homogeneous (i.e. all the objects an array contains have the same type) multi-dimensional arrays, which we’ll use among other things to represent vectors and matrices.
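To make the notion of "homogeneous" concrete, here is a minimal sketch. The arrays `mixed` and `ints` are made up for illustration and are not part of the lab.

```python
import numpy as np

# Mixing integers and floats: NumPy converts every entry to one common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64

# An integer array stays an integer array: a float written into it is truncated.
ints = np.array([1, 2, 3])
ints[0] = 7.9
print(ints[0])       # 7
```

The type of an array is available through its `dtype` attribute, in the same way as `shape` and `size`.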

Let us explore some basic Numpy commands.

Creating arrays


In [3]:
# Create a random array of size 3 x 5
X = np.random.random((3, 5))

In [32]:
# Create an array of zeros of size 3 x 5
np.zeros((3, 5))


Out[32]:
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [33]:
# Create an array of ones of size 3 x 5
np.ones((3, 5))


Out[33]:
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [34]:
# Create the identity matrix of size 4 x 4
np.eye(4)


Out[34]:
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

In [4]:
# Visualize X
print(X)


[[ 0.34328135  0.76855593  0.1406508   0.81835393  0.06789015]
 [ 0.10553242  0.31894428  0.02453661  0.58056544  0.63625817]
 [ 0.63648313  0.69784094  0.9071026   0.25335043  0.28890735]]

In [5]:
# The dimensions of X are accessible via
print(X.shape)


(3, 5)

In [6]:
# The total number of elements of X is accessible via
print(X.size)


15

Accessing elements, rows, and columns of arrays

Remember, in Python indices start at 0.


In [7]:
# Get a single element: X[0,1]
print(X[0, 1])


0.768555932678

In [16]:
# Get a row
print(X[0, :])
print(X[0])
print("shape of a row vector:", X[0].shape)


[ 0.34328135  0.76855593  0.1406508   0.81835393  0.06789015]
[ 0.34328135  0.76855593  0.1406508   0.81835393  0.06789015]
shape of a row vector: (5,)

In [10]:
# Get a column
print(X[:, 3])


[ 0.81835393  0.58056544  0.25335043]
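Note that both a row and a column come back as one-dimensional arrays. The sketch below, on a made-up array `M` (not part of the lab), shows one way to keep the result two-dimensional: index with a slice instead of an integer.

```python
import numpy as np

M = np.arange(15).reshape(3, 5)   # hypothetical 3 x 5 example array

row_1d = M[0, :]     # an integer index drops the axis: shape (5,)
row_2d = M[0:1, :]   # a slice keeps the axis: shape (1, 5)
col_1d = M[:, 3]     # shape (3,)
col_2d = M[:, 3:4]   # shape (3, 1)

print(row_1d.shape, row_2d.shape, col_1d.shape, col_2d.shape)
```

This distinction matters when a function expects an explicit row or column matrix rather than a flat vector.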

Array manipulation

We use 2-dimensional arrays to represent matrices, and can do basic linear algebra operations on them.


In [11]:
# Transposing an array
print(X.T)


[[ 0.34328135  0.10553242  0.63648313]
 [ 0.76855593  0.31894428  0.69784094]
 [ 0.1406508   0.02453661  0.9071026 ]
 [ 0.81835393  0.58056544  0.25335043]
 [ 0.06789015  0.63625817  0.28890735]]

In [17]:
# Applying the same transformation to all entries in an array 
# Multiply all entries of X by 2:
print(2*X)


[[ 0.68656271  1.53711187  0.28130161  1.63670787  0.1357803 ]
 [ 0.21106483  0.63788856  0.04907323  1.16113089  1.27251634]
 [ 1.27296625  1.39568189  1.8142052   0.50670085  0.5778147 ]]

In [18]:
# Add 1 to all entries of X

In [21]:
# Compute the array that has as entries the logarithm (base 2) of the entries of X

In [19]:
# Square all entries of X

In [22]:
# Compute the array that has as entries the logarithm (base 10) of the entries of X

In [23]:
# Element-wise matrix multiplication
print(X*X)


[[  1.17842087e-01   5.90678222e-01   1.97826487e-02   6.69703162e-01
    4.60907268e-03]
 [  1.11370911e-02   1.01725453e-01   6.02045452e-04   3.37056235e-01
    4.04824461e-01]
 [  4.05110769e-01   4.86981984e-01   8.22835124e-01   6.41864391e-02
    8.34674567e-02]]

In [27]:
# Matrix multiplication 
print(np.dot(X, X.T))
print(X.dot(X.T))


[[ 1.40261519  0.8031086   1.10935158]
 [ 0.8031086   0.85534529  0.64290537]
 [ 1.10935158  0.64290537  1.86258177]]
[[ 1.40261519  0.8031086   1.10935158]
 [ 0.8031086   0.85534529  0.64290537]
 [ 1.10935158  0.64290537  1.86258177]]
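The difference between `*` and `np.dot` is easier to see on small matrices whose products can be checked by hand. The matrices `A` and `B` below are made up for this illustration.

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[0., 1.],
              [1., 0.]])

# Element-wise product: entry (i, j) is A[i, j] * B[i, j]
elementwise = A * B
print(elementwise)      # [[0. 2.] [3. 0.]]

# Matrix product: entry (i, j) is the dot product of row i of A and column j of B
matrix_product = np.dot(A, B)
print(matrix_product)   # [[2. 1.] [4. 3.]]
```

Here multiplying by `B` on the right swaps the columns of `A`, whereas the element-wise product simply zeroes out the diagonal.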

In [24]:
# Create a random array B of size 5 x 4
# Multiply X by B

In [30]:
# Get the diagonal of X. Note that X is not square.
np.diag(X)


Out[30]:
array([ 0.34328135,  0.31894428,  0.9071026 ])

In [28]:
# Compute the trace of X
np.trace(X)


Out[28]:
1.569328230319464

More complex linear algebra operations are available via numpy.linalg:
http://docs.scipy.org/doc/numpy/reference/routines.linalg.html


In [38]:
# Compute the determinant of XX' (X times its transpose, a 3 x 3 matrix)
np.linalg.det(X.dot(X.T))


Out[38]:
0.54643068174873233

In [42]:
# Compute the eigenvalues and eigenvectors of XX'
np.linalg.eig(X.dot(X.T))


Out[42]:
(array([ 3.20004721,  0.25758842,  0.66290662]),
 array([[-0.60390259, -0.67665969, -0.42122836],
        [-0.39644678,  0.71347156, -0.57774413],
        [-0.69147062,  0.18190655,  0.69912688]]))
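`np.linalg.eig` returns a pair: the array of eigenvalues, and a matrix whose column `i` is the eigenvector paired with eigenvalue `i`. The sketch below checks this on a made-up symmetric matrix `S` (seeded random data, not the `X` above) and compares with `np.linalg.eigh`, which is meant for symmetric matrices and returns the eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.RandomState(0)     # fixed seed so the example is reproducible
M = rng.random_sample((3, 5))      # hypothetical 3 x 5 array
S = M.dot(M.T)                     # symmetric 3 x 3 matrix

eigenvalues, eigenvectors = np.linalg.eig(S)

# Column i of `eigenvectors` satisfies S v = lambda_i v
for i in range(3):
    v = eigenvectors[:, i]
    print(np.allclose(S.dot(v), eigenvalues[i] * v))

# For symmetric matrices, eigh gives the same eigenvalues, sorted in ascending order
w, V = np.linalg.eigh(S)
print(np.allclose(np.sort(eigenvalues), w))
```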

In [45]:
# Compute the inverse of XX'
np.linalg.inv(X.dot(X.T))


Out[45]:
array([[ 2.15914527, -1.43229028, -0.79160234],
       [-1.43229028,  2.5288195 , -0.01979948],
       [-0.79160234, -0.01979948,  1.0152008 ]])

For more on arrays, you can refer to http://docs.scipy.org/doc/numpy/reference/arrays.html.

For more about NumPy, you can refer to:

3. Scientific Python with SciPy

SciPy is a collection of mathematical algorithms and convenience functions built on NumPy. It has specialized submodules for integration, Fourier transforms, optimization, statistics and much more:
http://docs.scipy.org/doc/scipy/reference/

It offers more linear algebra tools than NumPy:
http://docs.scipy.org/doc/scipy/reference/linalg.html#module-scipy.linalg
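As one example of a routine that `scipy.linalg` provides and `numpy.linalg` does not, here is a minimal sketch of an LU decomposition. The matrix `A` is made up for illustration.

```python
import numpy as np
from scipy import linalg

A = np.array([[4., 3., 2.],
              [2., 1., 3.],
              [3., 4., 1.]])

# Factor A as P L U: P is a permutation matrix,
# L is lower triangular, and U is upper triangular.
P, L, U = linalg.lu(A)

# Multiplying the three factors back together recovers A
print(np.allclose(A, P.dot(L).dot(U)))
```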

SciPy is particularly useful for manipulating sparse matrices. The data we manipulate often contains a lot of zeros. Storing all these zeros is wasteful, and it is much more efficient to use data structures designed for sparse matrices. The scipy.sparse submodule provides such structures, along with many useful functions that operate on them (linear algebra, sparse solvers, graph algorithms, etc.).
http://docs.scipy.org/doc/scipy/reference/sparse.html#module-scipy.sparse
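A minimal sketch of the idea, using the compressed sparse row (CSR) format, which is one of several formats available in `scipy.sparse`. The matrix below is made up for illustration: it has one million entries, only three of which are non-zero.

```python
import numpy as np
from scipy import sparse

# A 1000 x 1000 matrix that is almost entirely zeros
dense = np.zeros((1000, 1000))
dense[0, 1] = 3.0
dense[10, 20] = -1.5
dense[999, 999] = 2.0

# Convert to CSR format: only the non-zero entries are stored
S = sparse.csr_matrix(dense)
print(S.shape)   # (1000, 1000)
print(S.nnz)     # 3 stored values

# Sparse matrices support the usual linear algebra operations
v = np.ones(1000)
product = S.dot(v)
print(np.allclose(product, dense.dot(v)))
```

In practice you would build the sparse matrix directly from the non-zero entries rather than from a dense array, so that the zeros are never stored at all.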

4. Plotting with Matplotlib

Visualization is an important part of machine learning. Plotting your data gives you a better feel for it (how the features are distributed, whether there are outliers, etc.). Plotting measures of performance (whether ROC curves or single-valued performance measures, with error bars) allows you to rapidly compare methods.

matplotlib is a very flexible data visualization package, partially inspired by MATLAB.

Lines


In [77]:
# Plotting a sinusoid
# create an array of 100 equally-spaced points between 0 and 10 (to serve as x coordinates)
x = np.linspace(0, 10, 100)
# create the y coordinates
y = np.sin(x)
plt.plot(x, y)


Out[77]:
[<matplotlib.lines.Line2D at 0x7f04319e6350>]

In [78]:
# Tweak some options
plt.plot(x, y, color='orange', linestyle='--', linewidth=3)


Out[78]:
[<matplotlib.lines.Line2D at 0x7f043193fad0>]

In [92]:
# Plot the individual points
plt.plot(x, y, color='orange', marker='x', linestyle='')


Out[92]:
[<matplotlib.lines.Line2D at 0x7f043117c150>]

In [79]:
# Plot multiple lines
plt.plot(x, y, color='orange', linewidth=2, label='sine')
plt.plot(x, np.cos(x), color='blue', linewidth=2, label='cosine')
plt.legend()


Out[79]:
<matplotlib.legend.Legend at 0x7f0431865a50>

In [80]:
# Add a title and a legend, and label the axes
plt.plot(x, y, color='orange', linewidth=2, label='sine')
plt.plot(x, np.cos(x), color='blue', linewidth=2, label='cosine')
plt.legend(loc='lower left', fontsize=14)
plt.title("Sinusoids", fontsize=14)
plt.xlabel("$x$", fontsize=16)
plt.ylabel("$f(x)$", fontsize=16)


Out[80]:
<matplotlib.text.Text at 0x7f0431834350>

In [81]:
# Save the plot
plt.plot(x, y, color='orange', linewidth=2, label='sine')
plt.plot(x, np.cos(x), color='blue', linewidth=2, label='cosine')
plt.legend(loc='lower left', fontsize=14)
plt.title("Sinusoids", fontsize=14)
plt.xlabel("$x$", fontsize=16)
plt.ylabel("$f(x)$", fontsize=16)
plt.savefig("my_sinusoide.png")



In [ ]:
# Add to the previous plot a sinusoid with half the amplitude and twice the frequency of the sine curve.
# Plot the line in green and give each line a different line style.

Scatterplots


In [93]:
# Create 500 points with random (x, y) coordinates
x = np.random.normal(size=500)
y = np.random.normal(size=500)

In [94]:
# Plot them
plt.scatter(x, y)


Out[94]:
<matplotlib.collections.PathCollection at 0x7f0431115950>

In [95]:
# Use the same ranges for both axes
plt.scatter(x, y)
plt.xlim([-4, 4])
plt.ylim([-4, 4])


Out[95]:
(-4, 4)

In [96]:
# Add a title and axis labels to the previous plot. Change the marker style and color.

Heatmaps

Matplotlib will automatically assign a color to each numerical value, based on a color map. For more about color maps see:
http://matplotlib.org/users/colormaps.html
http://matplotlib.org/1.2.1/examples/pylab_examples/show_colormaps.html


In [140]:
# Create a random 50 x 100 array
X = np.random.random((50, 100))
heatmap = plt.pcolor(X, cmap=plt.cm.Blues)
plt.colorbar(heatmap)


Out[140]:
<matplotlib.colorbar.Colorbar instance at 0x7f042abe3290>

Histograms


In [108]:
# Create a random vector (normally distributed) of size 5000
X = np.random.normal(size=(5000,))

In [111]:
# Plot the histogram of its values over 50 bins
h = plt.hist(X, bins=50, color='orange', histtype='stepfilled')
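The value returned by `plt.hist` is a tuple containing the bin counts, the bin edges, and the drawn patches. If you only need the numbers and not the plot, `np.histogram` computes the counts and edges directly. A minimal sketch on seeded random data (the name `samples` is made up for this example):

```python
import numpy as np

rng = np.random.RandomState(0)       # fixed seed so the example is reproducible
samples = rng.normal(size=5000)

# 50 bins are delimited by 51 edges
counts, bin_edges = np.histogram(samples, bins=50)
print(len(counts), len(bin_edges))   # 50 51

# Every sample falls in exactly one bin
print(counts.sum())                  # 5000
```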


Images


In [134]:
# create an image
x = np.linspace(1, 12, 100)

# transform an array of shape (100,) into an array of shape (100, 1)
y = x[:, np.newaxis]
y = y * np.cos(y)

# Create an image matrix: image[i,j] = y cos(y)[i] * sin(x)[j] 
image = y * np.sin(x)
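The last line of the cell above relies on broadcasting: multiplying an array of shape (100, 1) by an array of shape (100,) yields an array of shape (100, 100). The same mechanism is easier to follow on small arrays; `a` and `b` below are made up for illustration.

```python
import numpy as np

a = np.array([1., 2., 3.])   # shape (3,)
b = np.array([10., 20.])     # shape (2,)

col = a[:, np.newaxis]       # shape (3, 1): a column vector
table = col * b              # broadcast to shape (3, 2)

# table[i, j] == a[i] * b[j]
print(table.shape)           # (3, 2)
print(table)
```

The result is the same as the outer product `np.outer(a, b)`.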

In [144]:
# show the image (the origin is, by default, at the top-left corner!)
plt.imshow(image, cmap=plt.cm.prism)


Out[144]:
<matplotlib.image.AxesImage at 0x7f042a8e3310>

In [149]:
# Contour plot - note that origin here is at the bottom-left by default!
# A contour line or isoline of a function of two variables is a curve along which the function has a constant value.
contours = plt.contour(image, cmap=plt.cm.prism)
plt.clabel(contours, inline=True, fontsize=10)


Out[149]:
<a list of 27 text.Text objects>
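Because `imshow` and `contour` place the origin differently by default, the two plots above are vertically flipped with respect to each other. One way to make them agree is to pass `origin='lower'` to `imshow`. The sketch below uses a small made-up array `img` and matplotlib's object-oriented interface, and assumes the default matplotlib settings.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical small image whose values grow with the row index
img = np.arange(12).reshape(4, 3)

fig, (ax_upper, ax_lower) = plt.subplots(1, 2)

# Default: row 0 is drawn at the top
ax_upper.imshow(img)
ax_upper.set_title("origin='upper'")

# Row 0 is drawn at the bottom, as in plt.contour
ax_lower.imshow(img, origin='lower')
ax_lower.set_title("origin='lower'")
```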

Many more types of plots and functionalities to label axes, display legends, etc. are available. The matplotlib gallery (http://matplotlib.org/gallery.html) is a good place to start to get an idea of what is possible and how to do it.

Note that there are many more plotting libraries for Python. Two of the more popular are:

