The goal of today's lab is to get familiar with data representation and manipulation in Python.
To do the lab, you should make a copy of the repository on your machine (either by downloading it directly or by cloning/forking it from https://github.com/chagaz/ma2823_2016), then open this file from Jupyter. If you don't know how to start Jupyter, try jupyter notebook from a terminal (not from a Python console).
In [1]:
# scientific python
import numpy as np
import scipy as sp
In [2]:
# interactive plotting
%pylab inline
The previous command is one of the "magics" of Jupyter. As indicated by the message it prints, it imports numpy and matplotlib. See http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=pylab%20inline for details.
NumPy arrays are a fundamental structure for scientific computing. They are homogeneous (i.e. all the objects an array contains have the same type) multi-dimensional arrays, which we’ll use among other things to represent vectors and matrices.
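As a small illustration of what "homogeneous" means in practice, the sketch below builds two arrays from plain Python lists. The names mixed and matrix are made up for this example and are not used elsewhere in the lab.

```python
import numpy as np

# Integers and a float mixed together: NumPy stores every entry as a float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# A list of lists becomes a 2-dimensional array, i.e. a matrix
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.ndim)
print(matrix.shape)
```

The dtype attribute reports the single type shared by all entries, and ndim and shape describe the dimensions.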
Let us explore some basic NumPy commands.
In [3]:
# Create a random array of size 3 x 5
X = np.random.random((3, 5))
In [32]:
# Create an array of zeros of size 3 x 5
np.zeros((3, 5))
Out[32]:
In [33]:
# Create an array of ones of size 3 x 5
np.ones((3, 5))
Out[33]:
In [34]:
# Create the identity matrix of size 4 x 4
np.eye(4)
Out[34]:
In [4]:
# Visualize X
print(X)
In [5]:
# The dimensions of X are accessible via
print(X.shape)
In [6]:
# The total number of elements of X is accessible via
print(X.size)
In [7]:
# Get a single element: X[0,1]
print(X[0, 1])
In [16]:
# Get a row
print(X[0, :])
print(X[0])
print("shape of a row vector:", X[0].shape)
In [10]:
# Get a column
print(X[:, 3])
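Rows and columns extracted with a single index are both returned as 1-dimensional arrays, which can be surprising when a column vector is expected. The sketch below uses a stand-in array A with known entries (an assumption for illustration, not the random X of the lab) to show the shapes.

```python
import numpy as np

# Stand-in for X: a 3 x 5 array holding the integers 0..14
A = np.arange(15).reshape(3, 5)

row = A[0]          # 1-dimensional, shape (5,)
col = A[:, 3]       # also 1-dimensional, shape (3,)
col_2d = A[:, 3:4]  # slicing with a range keeps two dimensions, shape (3, 1)

print(row.shape)
print(col.shape)
print(col_2d.shape)
```

Indexing with a slice such as 3:4 instead of the integer 3 is one way to keep the result 2-dimensional.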
In [11]:
# Transposing an array
print(X.T)
In [17]:
# Applying the same transformation to all entries in an array
# Multiply all entries of X by 2:
print(2*X)
In [18]:
# Add 1 to all entries of X
In [21]:
# Compute the array that has as entries the logarithm (base 2) of the entries of X
In [19]:
# Square all entries of X
In [22]:
# Compute the array that has as entries the logarithm (base 10) of the entries of X
In [23]:
# Element-wise matrix multiplication
print(X*X)
In [27]:
# Matrix multiplication
print(np.dot(X, X.T))
print(X.dot(X.T))
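The difference between the two kinds of product is easiest to see from the shapes of the results. The following sketch uses a small hypothetical 2 x 3 array A (not part of the lab) so the numbers can be checked by hand.

```python
import numpy as np

# Hypothetical 2 x 3 example: [[0, 1, 2], [3, 4, 5]]
A = np.arange(6).reshape(2, 3)

elementwise = A * A     # same shape as A: entry (i, j) is A[i, j] squared
gram_rows = A.dot(A.T)  # (2 x 3) times (3 x 2) gives a 2 x 2 matrix
gram_cols = A.T.dot(A)  # (3 x 2) times (2 x 3) gives a 3 x 3 matrix

print(elementwise.shape)
print(gram_rows.shape)
print(gram_cols.shape)
print(gram_rows)
```

For a matrix product the inner dimensions must agree, whereas the element-wise product needs both operands to have compatible shapes; A * A.T would raise a ValueError here because shapes (2, 3) and (3, 2) cannot be combined entry by entry.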
In [24]:
# Create a random array B of size 5 x 4
# Multiply X by B
In [30]:
# Get the diagonal of X. Note that X is not square.
np.diag(X)
Out[30]:
In [28]:
# Compute the trace of X
np.trace(X)
Out[28]:
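For a non-square array, np.diag and np.trace only look at the entries whose row and column indices are equal. A small sketch on a stand-in 3 x 5 array A with known entries (an assumption for illustration) makes this visible.

```python
import numpy as np

# Stand-in for X: a 3 x 5 array holding the integers 0..14
A = np.arange(15).reshape(3, 5)

d = np.diag(A)  # entries A[0, 0], A[1, 1], A[2, 2]
t = np.trace(A) # sum of those diagonal entries
D = np.diag(d)  # given a 1-dimensional array, np.diag builds a diagonal matrix

print(d)
print(t)
print(D.shape)
```

Note that np.diag works in both directions: it extracts the diagonal of a 2-dimensional array and builds a diagonal matrix from a 1-dimensional one.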
More complex linear algebra operations are available via numpy.linalg:
http://docs.scipy.org/doc/numpy/reference/routines.linalg.html
In [38]:
# Compute the determinant of XX' (X is 3 x 5, so XX' is 3 x 3)
np.linalg.det(X.dot(X.T))
Out[38]:
In [42]:
# Compute the eigenvalues and eigenvectors of XX'
np.linalg.eig(X.dot(X.T))
Out[42]:
In [45]:
# Compute the inverse of XX'
np.linalg.inv(X.dot(X.T))
Out[45]:
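The results of these routines can be checked against their definitions. The sketch below is one way to do so, assuming a seeded random 3 x 5 array A (a stand-in for X, so that the run is repeatable) and its 3 x 3 product M.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
A = rng.random_sample((3, 5))   # stand-in for X
M = A.dot(A.T)                  # 3 x 3 symmetric matrix

eigenvalues, eigenvectors = np.linalg.eig(M)

# Column j of eigenvectors, multiplied by M, equals eigenvalue j times that column
eig_ok = np.allclose(M.dot(eigenvectors), eigenvectors * eigenvalues)

# The determinant equals the product of the eigenvalues
det_ok = np.isclose(np.linalg.det(M), np.prod(eigenvalues))

# M times its inverse is the identity, up to rounding error
inv_ok = np.allclose(M.dot(np.linalg.inv(M)), np.eye(3))

print(eig_ok)
print(det_ok)
print(inv_ok)

# The 5 x 5 product in the other order has rank at most 3, so it is singular
print(np.linalg.matrix_rank(A.T.dot(A)))
```

The last line shows why the order of the product matters: with 3 rows and 5 columns, only the 3 x 3 product can be inverted. For symmetric matrices such as M, np.linalg.eigh is a possible alternative to np.linalg.eig.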
For more on arrays, you can refer to http://docs.scipy.org/doc/numpy/reference/arrays.html.
For more about NumPy, you can refer to:
SciPy is a collection of mathematical algorithms and convenience functions built on NumPy. It has specialized submodules for integration, Fourier transforms, optimization, statistics and much more:
http://docs.scipy.org/doc/scipy/reference/
It offers more linear algebra manipulation tools than NumPy:
http://docs.scipy.org/doc/scipy/reference/linalg.html#module-scipy.linalg
SciPy is particularly useful for manipulating sparse matrices. It often happens that the data we manipulate contains a lot of zeros. In this case, storing all the zeros is wasteful, and it is much more efficient to use data structures meant for sparse matrices, which only store the non-zero entries. This can be done with the scipy.sparse submodule, which stores sparse matrices efficiently and implements many useful functions (linear algebra, sparse solvers, graph algorithms, etc.).
http://docs.scipy.org/doc/scipy/reference/sparse.html#module-scipy.sparse
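As a minimal sketch of the idea, the example below converts a small, mostly-zero dense array into the compressed sparse row (CSR) format, one of several formats offered by scipy.sparse. The array dense and the vector v are made up for illustration.

```python
import numpy as np
from scipy import sparse

# A 4 x 6 array with only two non-zero entries
dense = np.zeros((4, 6))
dense[0, 1] = 3.0
dense[2, 4] = 5.0

S = sparse.csr_matrix(dense)  # compressed sparse row format
print(S.nnz)                  # number of values actually stored

v = np.ones(6)
product = S.dot(v)            # matrix-vector product, computed from stored values
print(product)

# Converting back gives the original dense array
print(np.allclose(S.toarray(), dense))
```

Here only 2 of the 24 entries are stored; on real data with millions of zeros the saving in memory and computation can be substantial.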
Visualization is an important part of machine learning. Plotting your data will allow you to have a better feel for it (how are the features distributed, are there outliers, etc.). Plotting measures of performance (whether ROC curves or single-valued performance measures, with error bars) allows you to rapidly compare methods.
matplotlib is a very flexible data visualization package, partially inspired by MATLAB.
In [77]:
# Plotting a sinusoid
# create an array of 100 equally-spaced points between 0 and 10 (to serve as x coordinates)
x = np.linspace(0, 10, 100)
# create the y coordinates
y = np.sin(x)
plt.plot(x, y)
Out[77]:
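A detail worth knowing about np.linspace is that both endpoints are included, so 100 points span 99 intervals. The following sketch checks this on the same call as above.

```python
import numpy as np

x = np.linspace(0, 10, 100)

print(len(x))  # number of points
print(x[0])    # first point is the lower bound
print(x[-1])   # last point is the upper bound

# 100 points make 99 intervals, so the spacing is 10/99 rather than 10/100
step = x[1] - x[0]
print(step)
```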
In [78]:
# Tweak some options
plt.plot(x, y, color='orange', linestyle='--', linewidth=3)
Out[78]:
In [92]:
# Plot the individual points
plt.plot(x, y, color='orange', marker='x', linestyle='')
Out[92]:
In [79]:
# Plot multiple lines
plt.plot(x, y, color='orange', linewidth=2, label='sine')
plt.plot(x, np.cos(x), color='blue', linewidth=2, label='cosine')
plt.legend()
Out[79]:
In [80]:
# Add a title and caption and label the axes
plt.plot(x, y, color='orange', linewidth=2, label='sine')
plt.plot(x, np.cos(x), color='blue', linewidth=2, label='cosine')
plt.legend(loc='lower left', fontsize=14)
plt.title("Sinusoids", fontsize=14)
plt.xlabel("$x$", fontsize=16)
plt.ylabel("$f(x)$", fontsize=16)
Out[80]:
In [81]:
# Save the plot
plt.plot(x, y, color='orange', linewidth=2, label='sine')
plt.plot(x, np.cos(x), color='blue', linewidth=2, label='cosine')
plt.legend(loc='lower left', fontsize=14)
plt.title("Sinusoids", fontsize=14)
plt.xlabel("$x$", fontsize=16)
plt.ylabel("$f(x)$", fontsize=16)
plt.savefig("my_sinusoide.png")
In [ ]:
# Add to the previous plot a sinusoid of half the amplitude and twice the frequency of the sine one.
# Plot the line in green and give each line a different line style.
In [93]:
# Create 500 points with random (x, y) coordinates
x = np.random.normal(size=500)
y = np.random.normal(size=500)
In [94]:
# Plot them
plt.scatter(x, y)
Out[94]:
In [95]:
# Use the same ranges for both axes
plt.scatter(x, y)
plt.xlim([-4, 4])
plt.ylim([-4, 4])
Out[95]:
In [96]:
# Add a title and axis captions to the previous plot. Change the marker style and color.
Matplotlib will automatically assign a color to each numerical value, based on a color map.
For more about color maps see:
http://matplotlib.org/users/colormaps.html
http://matplotlib.org/1.2.1/examples/pylab_examples/show_colormaps.html
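A color map can be thought of as a function from a number between 0 and 1 to a color. The sketch below calls the Blues color map directly to show this. It selects the non-interactive Agg backend only so that it also runs outside Jupyter, which is an assumption of this example and not needed in the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside Jupyter
import matplotlib.pyplot as plt

cmap = plt.cm.Blues

# Calling the color map with a float in [0, 1] returns an RGBA tuple
low = cmap(0.0)   # color for the smallest values (very light)
high = cmap(1.0)  # color for the largest values (dark blue)

print(low)
print(high)
```

When plotting, matplotlib by default rescales the data linearly so that its minimum maps to 0 and its maximum to 1 before looking up the color.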
In [140]:
# Create a random 50 x 100 array
X = np.random.random((50, 100))
heatmap = plt.pcolor(X, cmap=plt.cm.Blues)
plt.colorbar(heatmap)
Out[140]:
In [108]:
# Create a random vector (normally distributed) of size 5000
X = np.random.normal(size=(5000,))
In [111]:
# Plot the histogram of its values over 50 bins
h = plt.hist(X, bins=50, color='orange', histtype='stepfilled')
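The value returned by plt.hist is a tuple containing the bin counts, the bin edges, and the drawn patches. The counts and edges can also be computed without plotting, using np.histogram, as in this sketch on a seeded sample (the seed and the name values are assumptions made for repeatability).

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
values = rng.normal(size=5000)

counts, edges = np.histogram(values, bins=50)

print(len(counts))   # one count per bin
print(len(edges))    # one more edge than there are bins
print(counts.sum())  # every sample falls in exactly one bin
```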
In [134]:
# create an image
x = np.linspace(1, 12, 100)
# transform an array of shape (100,) into an array of shape (100, 1)
y = x[:, np.newaxis]
y = y * np.cos(y)
# Create an image matrix: image[i,j] = y cos(y)[i] * sin(x)[j]
image = y * np.sin(x)
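The image above relies on broadcasting: multiplying an array of shape (100, 1) by an array of shape (100,) produces an array of shape (100, 100). A minimal sketch with three points instead of one hundred (the names col and table are made up for this example) shows the mechanism.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # shape (3,)
col = x[:, np.newaxis]         # shape (3, 1): same values, as a column

# Broadcasting stretches col across columns and x across rows
table = col * x                # shape (3, 3), with table[i, j] == x[i] * x[j]

print(col.shape)
print(table.shape)
print(table)
```

The result is the same as the outer product np.outer(x, x).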
In [144]:
# show the image (the origin is, by default, at the top-left corner!)
plt.imshow(image, cmap=plt.cm.prism)
Out[144]:
In [149]:
# Contour plot - note that origin here is at the bottom-left by default!
# A contour line or isoline of a function of two variables is a curve along which the function has a constant value.
contours = plt.contour(image, cmap=plt.cm.prism)
plt.clabel(contours, inline=True, fontsize=10)
Out[149]:
Many more types of plots and functionalities to label axes, display legends, etc. are available. The matplotlib gallery (http://matplotlib.org/gallery.html) is a good place to start to get an idea of what is possible and how to do it.
Note that there are many more plotting libraries for Python. Two of the more popular are: