Python for mathematics, science and engineering

https://scipy.org/

SciPy

(pronounced "Sigh Pie")

Higher-level algorithms built on top of NumPy

  • numerical integration
  • optimization
  • interpolation
  • signal processing
  • linear algebra
    • with sparse matrices
  • statistics

In [ ]:
import numpy, scipy
%matplotlib inline
import matplotlib.pyplot

Statistics

The scipy.stats package "contains a large number of probability distributions as well as a growing library of statistical functions". Here we demonstrate how you can extract various statistics from a dataset.

One can import the Iris dataset from scikit-learn

  • collected by Ronald Fisher (1936)
  • contains 150 flower samples (50 per species)
  • 4 features for each sample
    • sepal length, sepal width, petal length, petal width
  • the samples are labeled according to species
    • setosa, versicolor, virginica
  • the raw data is a $150\times 4$ matrix
    • the rows are labeled with $\{0,1,2\}$

It is often used to test machine learning algorithms, to see whether they can guess the species from the perianth measurements.


In [ ]:
from sklearn import datasets
iris = datasets.load_iris()
print('Target names:', iris.target_names)
print('Features:', iris.feature_names)
print(iris.data)

The samples can be divided into three classes, according to the labels.


In [ ]:
first = iris.data[iris.target == 0]
second = iris.data[iris.target == 1]
third = iris.data[iris.target == 2]
print(len(first), len(second), len(third))

Tasks

  • Calculate the class-wise mean of the data points (three 4-dimensional vectors)

Use numpy.average!

  • Calculate the geometric mean. You won't find a function for that in numpy; use scipy.stats.gmean.

  • Calculate the Pearson correlation between

    • the sepal width and length
    • the petal width and length
    • all of the above but for each of the three classes separately

Use scipy.stats.pearsonr for correlation!
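A possible sketch of these tasks, using random data in place of an iris class (with the real data, substitute `first`, `second`, `third` from the cell above for `data`):

```python
import numpy
import scipy.stats

# Random positive data standing in for one iris class
# (the geometric mean requires positive values)
rng = numpy.random.RandomState(0)
data = rng.uniform(1.0, 5.0, size=(50, 4))

# Class-wise (column-wise) mean: one 4-dimensional vector
mean = numpy.average(data, axis=0)
print('mean:', mean)

# Geometric mean, also column-wise
gmean = scipy.stats.gmean(data, axis=0)
print('gmean:', gmean)

# Pearson correlation between the first two columns
# (sepal length and sepal width in the real data)
r, p = scipy.stats.pearsonr(data[:, 0], data[:, 1])
print('r = {:.3f}, p = {:.3f}'.format(r, p))
```

By the AM-GM inequality, each geometric mean should come out no larger than the corresponding arithmetic mean.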


In [ ]:

Linear algebra

The scipy.linalg module contains

  • linear (equation system) solvers
  • advanced matrix functions (pseudo inverse, etc.)
  • matrix decomposition functions (eigen-, singular value-, etc.)
  • special matrix generators
  • matrix equations solvers
  • etc.

A few examples follow below.

Linear equation systems

Solves $A\cdot x = b$, where $A\in\mathbb{R}^{n\times n}, b\in \mathbb{R}^n$ for $x\in\mathbb{R}^n$.


In [ ]:
import scipy.linalg
A = 0.5*(numpy.diag(numpy.ones(7), k=1) - numpy.diag(numpy.ones(7), k=-1))
b = numpy.ones(len(A))

print('[A|b]:\n{}'.format(numpy.concatenate((A, b.reshape(-1,1)), axis=1)))

x = scipy.linalg.solve(A, b)
print('x:', x)

# Let's test if the solution is correct
assert numpy.allclose(A.dot(x), b)

Least square problem

Finds an optimal (least-squares) solution even when the coefficient matrix is not invertible. Used when the equation system is overdetermined, i.e. there are more equations than variables: $A\in\mathbb{R}^{n\times k}, b\in \mathbb{R}^n$ and $x\in\mathbb{R}^k, \quad n > k$.


In [ ]:
A2 = (numpy.diag(numpy.ones(9), k=1) - numpy.diag(numpy.ones(10), k=0))[:-1, :].T
print('A:\n{}'.format(A2))
b2 = numpy.linspace(-1, 1, num=len(A2))
x2 = scipy.linalg.lstsq(A2, b2)[0]
print('b:', b2)
print('x:', x2)
# matplotlib.pyplot.plot(range(1, len(b2)+1), b2)
# matplotlib.pyplot.plot(range(len(x2)), x2)

# In this case, the solution is exact.
assert numpy.allclose(A2.dot(x2), b2)

Sparse matrices

Many matrices in practice have nonzero values in only a few of their cells; i.e. they are sparse. Storing large sparse matrices densely wastes memory unnecessarily. The scipy.sparse module implements memory-efficient sparse matrix classes.

However, memory-efficiency comes at a price. There are several types of sparse matrices, all with specific advantages and disadvantages. A few examples:

  • Optimized for compact storage: csc_matrix, csr_matrix
  • Aimed at incrementally building sparse matrices: lil_matrix, dok_matrix
  • Optimized for arithmetic operations: csc_matrix, csr_matrix

For further gotchas, see the package and matrix descriptions.
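A minimal sketch of the usual workflow: build incrementally with a lil_matrix (cheap element insertion), then convert to csr_matrix for arithmetic:

```python
import scipy.sparse

# lil_matrix supports efficient incremental element assignment
m = scipy.sparse.lil_matrix((4, 4))
m[0, 1] = 1.0
m[2, 3] = 2.0

# Convert to CSR before doing arithmetic on it
csr = m.tocsr()
print((csr + csr).toarray())
```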

Example

A csc_matrix (or csr_matrix) is created from three lists: values, row indices and column indices.

  • below:
    • matrix values: [1, 2, 3, 4]
    • row indices: [0, 1, 1, 2]
    • col indices: [1, 0, 2, 1]
  • meaning:
    • $1$ is at position $(0,1)$
    • $2$ is at position $(1,0)$
    • $3$ is at position $(1,2)$
    • $4$ is at position $(2,1)$

Printing a sparse matrix only lists its stored entries, not the full matrix; use

  • .todense() to convert it into a dense matrix
  • .toarray() to convert it into an array

first, although this is not recommended if the matrix is huge (why?)


In [ ]:
import scipy.sparse
import scipy.sparse.linalg
csc = scipy.sparse.csc_matrix(([1, 2, 3, 4], ([0, 1, 1, 2], [1, 0, 2, 1])), shape=(3,3), dtype=float)
print("csc:\n{}".format(csc))
print("csc.toarray():\n{}".format(csc.toarray()))
csc

Sparse linear algebra

The scipy.linalg package has a sparse equivalent: scipy.sparse.linalg. Use the latter for sparse matrices.

Task

  • Solve the linear equation system you did above with numpy.
    • call the variables As, bs, xs to avoid accidentally overwriting the originals
    • use scipy.sparse.diags instead of numpy.diag
    • note that the signatures of the sparse functions might be different!
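One possible sketch of this task (note that scipy.sparse.diags takes the offset as its second argument, much like numpy.diag's k):

```python
import numpy
import scipy.sparse
import scipy.sparse.linalg

# Same matrix as in the dense example, built sparsely
As = 0.5 * (scipy.sparse.diags(numpy.ones(7), 1)
            - scipy.sparse.diags(numpy.ones(7), -1))
As = scipy.sparse.csc_matrix(As)  # spsolve works best on CSC/CSR
bs = numpy.ones(As.shape[0])

# The sparse counterpart of scipy.linalg.solve
xs = scipy.sparse.linalg.spsolve(As, bs)
print('xs:', xs)
```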

In [ ]:
import scipy.sparse
import scipy.sparse.linalg

# Create As and bs
print('As:\n{}'.format(As.toarray()))
print('bs:', bs)

# Solve the equation system!
print('xs:', xs)

Decomposition example

Below we run sparse singular value decomposition on the matrix A from the linear-equation example:

  • first, we obtain the component matrices $U, d, V^*$ (note that these won't be sparse)
  • then, we reconstruct the original matrix from them

In [ ]:
def reconstruct_svd(M, k=None):
    if k is None:
        # svds requires k < min(M.shape); use the largest admissible k
        k = min(M.shape) - 1
    U, d, Vh = scipy.sparse.linalg.svds(M, k=k)
    M_rec = U.dot(numpy.diag(d).dot(Vh))
    # Set numerically tiny elements to zero
    M_rec[numpy.abs(M_rec) < 1e-15] = 0
    return M_rec

print("Sparse SVD, maximal k:\n{}\n".format(reconstruct_svd(A)))
print("Sparse SVD, first 2 singular values:\n{}".format(reconstruct_svd(A, 2)))

Document-term matrix decomposition

The following file contains (preprocessed) movie descriptions from the CMU Movie Summary Corpus.

  • One movie per line
  • "title\tdescription\n" format
  • the description is a space-separated list of tokenized words
  • Some of its UTF-8 characters are broken, so we have to read it in binary mode (as byte strings)

Download the file and put it in the same folder as your notebook!

Task 1.

  • Make a vocabulary of titles (dict keyed by titles)
    • Each movie title should get a unique ID, from 0 to the number of movies (~39k)
    • call this movie_to_id
  • Make a vocabulary of words (dict keyed by words)
    • Each word, which occurs in any of the descriptions, should get a unique ID, from 0 to the number of unique words (~182k)
  • Also make reverse vocabularies (movie id to movie title, word id to the word itself)
    • for movies, call it id_to_movie

In [ ]:
movie_descriptions = {}
with open("movies.txt", "rb") as f:
    for i, line in enumerate(f):
        title, description = line.strip().split(b'\t')
        movie_descriptions[title] = description.split()
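A possible sketch of Task 1; a tiny toy dict stands in for the movie_descriptions loaded above:

```python
# Toy stand-in for the real movie_descriptions dict
movie_descriptions = {b"The Matrix": [b"neo", b"wakes", b"up"],
                      b"Popeye": [b"a", b"sailor"]}

# Titles get IDs in insertion order
movie_to_id = {title: i for i, title in enumerate(movie_descriptions)}
id_to_movie = {i: title for title, i in movie_to_id.items()}

# Each word gets an ID the first time it is seen
word_to_id = {}
for words in movie_descriptions.values():
    for w in words:
        if w not in word_to_id:
            word_to_id[w] = len(word_to_id)
id_to_word = {i: w for w, i in word_to_id.items()}

print(movie_to_id, len(word_to_id))
```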

In [ ]:
print(len(movie_descriptions))
print(b" ".join(movie_descriptions[b"The Matrix"]))

Task 2.

Make a sparse matrix defined in the following way:

  • $a_{i,j} = $ number of times the word with ID $j$ was mentioned in the movie with ID $i$
  • the rows of the matrix are movies
  • columns are words
  • use float32 representation (dtype)
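A possible sketch of Task 2, again with toy vocabularies standing in for the real ones; it relies on the fact that duplicate (row, col) pairs are summed when the coordinate lists are converted to CSR, which yields the counts directly:

```python
import scipy.sparse

# Toy stand-ins for movie_descriptions, movie_to_id and word_to_id
movie_descriptions = {b"A": [b"x", b"y", b"x"], b"B": [b"y"]}
movie_to_id = {b"A": 0, b"B": 1}
word_to_id = {b"x": 0, b"y": 1}

rows, cols, vals = [], [], []
for title, words in movie_descriptions.items():
    for w in words:
        rows.append(movie_to_id[title])
        cols.append(word_to_id[w])
        vals.append(1.0)

# Duplicate (row, col) entries are summed on conversion
M = scipy.sparse.csr_matrix((vals, (rows, cols)),
                            shape=(len(movie_to_id), len(word_to_id)),
                            dtype='float32')
print(M.toarray())
```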

In [ ]:

Task 3.

  • Perform a sparse SVD with k=40 and store the left singular vectors as rows (U) and the right singular vectors as columns (Vh).
  • normalize the vectors (rows of U and columns of Vh) to unit length.
  • $U\in\mathbb{R}^{\text{~39k}\times 40}$
  • $Vh\in\mathbb{R}^{40\times \text{~182k}}$
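A possible sketch of Task 3 on a small random sparse matrix standing in for the real document-term matrix (k would be 40 on the real data):

```python
import numpy
import scipy.sparse
import scipy.sparse.linalg

# Random sparse stand-in for the ~39k x ~182k matrix
M = scipy.sparse.random(60, 100, density=0.1, format='csr', random_state=0)
k = 5

U, d, Vh = scipy.sparse.linalg.svds(M, k=k)

# Normalize the rows of U and the columns of Vh to unit length
U = U / numpy.linalg.norm(U, axis=1, keepdims=True)
Vh = Vh / numpy.linalg.norm(Vh, axis=0, keepdims=True)

print(U.shape, Vh.shape)
```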

In [ ]:

Task 4.

Write a function that finds the vectors closest to a given vector.

  • Use the global variable U
  • the input is a vector $v$ and a number $k\in\mathbb{N}$
  • return the row indices of the $k$ vectors closest to $v$ in $U$!

Try to use vectorization and numpy.argpartition.


In [ ]:
def closests(v, k=1):
    # TODO: implement; this placeholder just returns the first k indices
    return list(range(k))
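One possible vectorized solution, sketched here with a random toy U standing in for the normalized matrix from Task 3:

```python
import numpy

# Toy stand-in for U: 100 unit-length row vectors of dimension 40
rng = numpy.random.RandomState(0)
U = rng.randn(100, 40)
U /= numpy.linalg.norm(U, axis=1, keepdims=True)

def closests(v, k=1):
    # Euclidean distance from v to every row of U, fully vectorized
    dists = numpy.linalg.norm(U - v, axis=1)
    # argpartition finds the k smallest distances without a full sort
    idx = numpy.argpartition(dists, k)[:k]
    # order those k indices by actual distance
    return idx[numpy.argsort(dists[idx])].tolist()

print(closests(U[3], 3))  # the first hit is row 3 itself
```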

In [ ]:
closests(numpy.ones(len(Vh)), 3)

Now you can search similar movies!


In [ ]:
print([id_to_movie[i] for i in closests(U[movie_to_id[b"Monsters, Inc."]], 5)])
print([id_to_movie[i] for i in closests(U[movie_to_id[b"Popeye"]], 5)])

Or even a mixture of movies, by adding "movie vectors"!


In [ ]:
[id_to_movie[i] for i in closests(U[movie_to_id[b"Popeye"]] + U[movie_to_id[b"Monsters, Inc."]], 10)]

In [ ]: