In [ ]:
import numpy, scipy
%matplotlib inline
import matplotlib.pyplot
The scipy.stats
package "contains a large number of probability distributions as well as a growing library of statistical functions". Here we demonstrate how you can extract various statistics from a dataset.
The Iris dataset can be imported from scikit-learn.
It is often used to test machine learning algorithms, to see if they can guess the species from the sepal and petal measurements.
In [ ]:
from sklearn import datasets
iris = datasets.load_iris()
print('Target names:', iris.target_names)
print('Features:', iris.feature_names)
print(iris.data)
The samples can be divided into three classes, according to the labels.
In [ ]:
first = iris.data[iris.target == 0]
second = iris.data[iris.target == 1]
third = iris.data[iris.target == 2]
print(len(first), len(second), len(third))
- Calculate the averages of the features. Use numpy.average!
- Calculate the geometric mean. You won't find a function for that in numpy; use scipy.stats.gmean.
- Calculate the Pearson correlation between two features. Use scipy.stats.pearsonr for correlation!
In [ ]:
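A possible solution sketch for the exercises above. The choice of the first class and of the feature pair (sepal length vs. petal length) is arbitrary, just for illustration:

```python
import numpy
import scipy.stats
from sklearn import datasets

iris = datasets.load_iris()
first = iris.data[iris.target == 0]

# Per-feature arithmetic mean of the first class (axis=0 averages over samples)
print('average:', numpy.average(first, axis=0))

# Per-feature geometric mean -- numpy has no such function, scipy.stats does
print('geometric mean:', scipy.stats.gmean(first, axis=0))

# Pearson correlation between two features: sepal length and petal length
r, p = scipy.stats.pearsonr(iris.data[:, 0], iris.data[:, 2])
print('Pearson r: {:.3f} (p={:.3g})'.format(r, p))
```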
The scipy.linalg
module contains linear algebra routines: equation solvers, matrix decompositions, matrix functions, and so on. A few examples follow below.
In [ ]:
import scipy.linalg
A = 0.5*(numpy.diag(numpy.ones(7), k=1) - numpy.diag(numpy.ones(7), k=-1))
b = numpy.ones(len(A))
print('[A|b]:\n{}'.format(numpy.concatenate((A, b.reshape(-1,1)), axis=1)))
x = scipy.linalg.solve(A, b)
print('x:', x)
# Let's test if the solution is correct
assert numpy.allclose(A.dot(x), b)
In [ ]:
A2 = (numpy.diag(numpy.ones(9), k=1) - numpy.diag(numpy.ones(10), k=0))[:-1, :].T
print('A:\n{}'.format(A2))
b2 = numpy.linspace(-1, 1, num=len(A2))
x2 = scipy.linalg.lstsq(A2, b2)[0]
print('b:', b2)
print('x:', x2)
# matplotlib.pyplot.plot(range(1, len(b2)+1), b2)
# matplotlib.pyplot.plot(range(0, len(x2)), x2)
# In this case, the solution is exact.
assert numpy.allclose(A2.dot(x2), b2)
Many matrices in practice have nonzero values in only some of their cells; i.e., they are sparse. Storing large sparse matrices densely wastes memory unnecessarily. The scipy.sparse
module implements memory-efficient sparse matrix classes.
However, memory efficiency comes at a price. There are several types of sparse matrices, each with specific advantages and disadvantages. A few examples:
- coo_matrix: coordinate-data tuples
- lil_matrix: based on a linked list
- dok_matrix: based on a dict of dicts
- csr_matrix: fast row operations
- csc_matrix: fast column operations

For further gotchas, see the package and matrix descriptions.
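A small illustration of these trade-offs (a sketch; the matrix and its values are arbitrary): build a matrix incrementally in LIL form, then convert it to CSR for fast arithmetic.

```python
import numpy
import scipy.sparse

# lil_matrix supports cheap incremental assignment
lil = scipy.sparse.lil_matrix((4, 4))
lil[0, 1] = 1
lil[2, 3] = 5

# Convert to CSR for fast row slicing and matrix-vector products
csr = lil.tocsr()
print(csr.dot(numpy.ones(4)))   # row sums: [1. 0. 5. 0.]

# COO is convenient for construction from (values, (rows, cols)) triplets
coo = scipy.sparse.coo_matrix(([1, 5], ([0, 2], [1, 3])), shape=(4, 4))
assert (coo.tocsr() != csr).nnz == 0   # the two matrices are equal
```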
A csc_matrix
(or csr_matrix
) is created from three lists:
- values: [1, 2, 3, 4]
- row indices: [0, 1, 1, 2]
- column indices: [1, 0, 2, 1]
We cannot print the whole sparse matrix as-is; use
- .todense() to convert it into a dense matrix, or
- .toarray() to convert it into an array

first; although this is not recommended if the matrix is huge (why?).
In [ ]:
import scipy.sparse
import scipy.sparse.linalg
csc = scipy.sparse.csc_matrix(([1, 2, 3, 4], ([0, 1, 1, 2], [1, 0, 2, 1])), shape=(3,3), dtype=float)
print("csc:\n{}".format(csc))
print("csc.toarray():\n{}".format(csc.toarray()))
csc
The scipy.linalg
package has a sparse equivalent: scipy.sparse.linalg
. Use the latter for sparse matrices.
In [ ]:
import scipy.sparse
import scipy.sparse.linalg
# Create As and bs
print('As:\n{}'.format(As.toarray()))
print('bs:', bs)
# Solve the equation system!
print('xs:', xs)
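One possible way to fill in the cell above (a sketch, assuming As is meant to be a sparse version of the antisymmetric matrix A from earlier): build it with scipy.sparse.diags and solve with scipy.sparse.linalg.spsolve.

```python
import numpy
import scipy.sparse
import scipy.sparse.linalg

# Sparse version of the earlier 8x8 antisymmetric matrix A
n = 8
As = scipy.sparse.diags([0.5 * numpy.ones(n - 1), -0.5 * numpy.ones(n - 1)],
                        offsets=[1, -1], format='csc')
bs = numpy.ones(n)
print('As:\n{}'.format(As.toarray()))
print('bs:', bs)

# spsolve is the sparse counterpart of scipy.linalg.solve
xs = scipy.sparse.linalg.spsolve(As, bs)
print('xs:', xs)
assert numpy.allclose(As.dot(xs), bs)
```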
Below we run sparse singular value decomposition on the matrix A
defined earlier (scipy.sparse.linalg.svds also accepts dense arrays):
In [ ]:
def reconstruct_svd(M, k=None):
    # svds computes only k < min(M.shape) singular values/vectors,
    # so it can never recover the complete SVD (the default is k=6)
    if k is None:
        U, d, Vh = scipy.sparse.linalg.svds(M)
    else:
        U, d, Vh = scipy.sparse.linalg.svds(M, k)
    # Reconstruct M from the truncated factors
    M_rec = U.dot(numpy.diag(d).dot(Vh))
    # Set numerically small elements to zero
    M_rec[numpy.abs(M_rec) < 1e-15] = 0
    return M_rec
print("Sparse SVD, default k:\n{}\n".format(reconstruct_svd(A)))
print("Sparse SVD, first 2 singular values:\n{}".format(reconstruct_svd(A, 2)))
The following file contains (preprocessed) movie descriptions from the CMU Movie Summary Corpus, in
"title\tdescription\n"
format. Download the file and put it in the same folder as your notebook!

Build the following mappings:
- movie_to_id: a dict keyed by titles
- word_to_id: a dict keyed by words
- id_to_movie: the inverse of movie_to_id (ids back to titles)
In [ ]:
movie_descriptions = {}
with open("movies.txt", "rb") as f:
    for i, line in enumerate(f):
        title, description = line.strip().split(b'\t')
        movie_descriptions[title] = description.split()
In [ ]:
print(len(movie_descriptions))
print(b" ".join(movie_descriptions[b"The Matrix"]))
In [ ]:
In [ ]:
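A sketch of how the mappings and a term-document matrix could be built. To keep the cell self-contained, a tiny hypothetical movie_descriptions dict stands in for the one read from movies.txt; in the notebook, drop that stand-in and use the real dict:

```python
import numpy
import scipy.sparse
import scipy.sparse.linalg

# Hypothetical stand-in for the dict read from movies.txt
movie_descriptions = {
    b"Movie A": [b"robot", b"future", b"fight"],
    b"Movie B": [b"love", b"future"],
    b"Movie C": [b"robot", b"love", b"fight", b"fight"],
    b"Movie D": [b"ocean", b"ship"],
}

movie_to_id = {title: i for i, title in enumerate(movie_descriptions)}
id_to_movie = {i: title for title, i in movie_to_id.items()}

# Collect (value, (row, col)) triplets; COO sums duplicate coordinates
word_to_id = {}
values, rows, cols = [], [], []
for title, words in movie_descriptions.items():
    for word in words:
        word_to_id.setdefault(word, len(word_to_id))
        rows.append(movie_to_id[title])
        cols.append(word_to_id[word])
        values.append(1.0)

M = scipy.sparse.coo_matrix((values, (rows, cols)),
                            shape=(len(movie_to_id), len(word_to_id))).tocsr()

# Truncated SVD: k must be < min(M.shape)
U, d, Vh = scipy.sparse.linalg.svds(M, k=2)
print('U shape:', U.shape)   # one row ("movie vector") per movie
```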
Write a function which finds the closest vectors in U
to a given vector. Try to use vectorization and numpy.argpartition
.
In [ ]:
def closests(v, k=1):
    return list(range(k))
In [ ]:
closests(numpy.ones(len(Vh)), 3)
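One possible vectorized implementation (a sketch, measuring similarity by Euclidean distance; a small random matrix stands in for U here so the cell is self-contained; in the notebook, drop that line and use the U from the SVD):

```python
import numpy

U = numpy.random.RandomState(0).rand(10, 3)   # stand-in for the SVD's U

def closests(v, k=1):
    # Squared Euclidean distance of every row of U to v, fully vectorized
    dists = numpy.sum((U - v) ** 2, axis=1)
    # argpartition finds the k smallest in O(n); only those k are then sorted
    idx = numpy.argpartition(dists, k)[:k]
    return idx[numpy.argsort(dists[idx])].tolist()

print(closests(U[0], 3))   # the first hit is row 0 itself (distance 0)
```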
Now you can search similar movies!
In [ ]:
print([id_to_movie[i] for i in closests(U[movie_to_id[b"Monsters, Inc."]], 5)])
print([id_to_movie[i] for i in closests(U[movie_to_id[b"Popeye"]], 5)])
Or even mixtures of movies, by adding "movie vectors" together!
In [ ]:
[id_to_movie[i] for i in closests(U[movie_to_id[b"Popeye"]] + U[movie_to_id[b"Monsters, Inc."]], 10)]
In [ ]: