Exercises about Numpy and MLlib Data Types

Notebook version: 1.0 (Mar 15, 2016)

Author: Jerónimo Arenas García (jarenas@tsc.uc3m.es)

Changes: v.1.0 - First version - UTAD version

Pending changes: *


In [1]:
# Import some libraries that will be necessary for working with data and displaying plots

# To visualize plots in the notebook
%matplotlib inline 

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.io       # To read matlab files
import pylab
from test_helper import Test

1. Objectives

This notebook reviews some of the Python modules that make it possible to work with data structures in an easy and efficient manner. We will start by reviewing Numpy arrays and matrices, and some of the common operations that are needed when working with these data structures in Machine Learning. The second part of the notebook will present some of the data types inherent to MLlib, and explain the basics of distributing data sets for parallel optimization of models.

2. Numpy exercises

2.1. Create numpy arrays of different types

The following code fragment defines variable x as a list of 4 integers. You can check that by printing the type of any element of x. Use the Python built-in function map() to create a new list with the same elements as x, but where each element of the list is a float.


In [2]:
x = [5, 4, 3, 4]
print type(x[0])

# Create a list of floats containing the same elements as in x
# x_f = <FILL IN>
x_f = map(float, x)


<type 'int'>

In [3]:
Test.assertTrue(np.all(x == x_f), 'Elements of both lists are not the same')
Test.assertTrue(((type(x[-2])==int) & (type(x_f[-2])==float)),'Type conversion incorrect')


1 test passed.
1 test passed.

Numpy arrays can be defined directly using methods such as np.arange(), np.ones(), np.zeros(), as well as random number generators. Alternatively, you can easily generate them from python lists (or lists of lists) containing elements of numeric type.
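The creation methods just mentioned can be sketched as follows. This is only an illustration: the variable names and the sizes are arbitrary choices, not part of the exercises.

```python
import numpy as np

a = np.arange(5)                 # integers 0..4, shape (5,)
b = np.ones((2, 3))              # 2x3 array of float ones
c = np.zeros(4)                  # 1-D array with four float zeros
d = np.array([[1, 2], [3, 4]])   # 2x2 integer array built from a list of lists
e = np.random.rand(3, 2)         # 3x2 array of uniform samples in [0, 1)

for name, arr in [('a', a), ('b', b), ('c', c), ('d', d), ('e', e)]:
    print('%s: shape=%s, dtype=%s' % (name, arr.shape, arr.dtype))
```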

You can easily check the shape of any numpy array with the attribute .shape, and reshape it with the method reshape(). Note the difference between 1-D and N-D numpy arrays (ndarrays). You should also be aware of the existence of another numpy data type: Numpy matrices (http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.matrix.html) are inherently 2-D structures where operators * and ** have the meaning of matrix multiplication and matrix power.

In the code below, you can check the types and shapes of different numpy arrays. Complete also the exercises where you are asked to convert a unidimensional array into a column vector of size $4\times1$, and to reshape another one into a matrix of size $4\times2$.


In [4]:
# Numpy arrays can be created from numeric lists or using different numpy methods
y = np.arange(8)+1
x = np.array(x_f)

# Check the different data types involved
print 'The type of variable x_f is ', type(x_f)
print 'The type of variable x is ', type(x)
print 'The type of variable y is ', type(y)

# Print the shapes of the numpy arrays
print 'Variable y has shape ', y.shape
print 'Variable x has shape ', x.shape

#Complete the following exercises
# Convert x into a variable x_matrix, of type `numpy.matrixlib.defmatrix.matrix` using command
# np.matrix(). The resulting matrix should be of dimensions 4x1
x_matrix = np.matrix(x).T
#x_matrix = <FILL IN>
# Convert x into a variable x_array, of type `ndarray`, and dimensions 4x1
x_array = x[:,np.newaxis]
#x_array = <FILL IN>
# Reshape array y into a 4x2 matrix using command np.reshape()
y = y.reshape((4,2))
#y = <FILL IN>


The type of variable x_f is  <type 'list'>
The type of variable x is  <type 'numpy.ndarray'>
The type of variable y is  <type 'numpy.ndarray'>
Variable y has shape  (8,)
Variable x has shape  (4,)

In [5]:
Test.assertEquals(type(x_matrix),np.matrixlib.defmatrix.matrix,'x_matrix is not defined as a matrix')
Test.assertEqualsHashed(x_matrix,'f4239d385605dc62b73c9a6f8945fdc65e12e43b','Incorrect variable x_matrix')
Test.assertEquals(type(x_array),np.ndarray,'x_array is not defined as a numpy ndarray')
Test.assertEqualsHashed(x_array,'f4239d385605dc62b73c9a6f8945fdc65e12e43b','Incorrect variable x_array')
Test.assertEquals(type(y),np.ndarray,'y is not defined as a numpy ndarray')
Test.assertEqualsHashed(y,'66d90401cb8ed9e1b888b76b0f59c23c8776ea42','Incorrect variable y')


1 test passed.
1 test passed.
1 test passed.
1 test passed.
1 test passed.
1 test passed.
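As a complement to the exercise above, the following sketch contrasts a 1-D array with its 2-D column and row versions. The values mirror those of x, but the variable names v, col, row and grid are introduced here only for illustration.

```python
import numpy as np

v = np.array([5.0, 4.0, 3.0, 4.0])   # 1-D array, shape (4,)
col = v[:, np.newaxis]               # 2-D column, shape (4, 1)
row = v.reshape((1, 4))              # 2-D row, shape (1, 4)

# Transposing a 1-D array leaves its shape unchanged,
# whereas transposing a 2-D column produces a 2-D row
print(v.T.shape)      # (4,)
print(col.T.shape)    # (1, 4)
print(row.shape)      # (1, 4)

# Passing -1 to reshape lets numpy infer that dimension from the total size
grid = np.arange(8).reshape((4, -1))
print(grid.shape)     # (4, 2)
```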

Some other useful Numpy methods are:

  • flatten(): a method of numpy arrays and matrices (invoked as x.flatten()) that converts them into a vector by concatenating the elements along the different dimensions. Note that the result of the method keeps the type of the original variable, so the result is a 1-D ndarray when invoked on a numpy array, and a numpy matrix (and necessarily 2-D) when invoked on a matrix.
  • tolist(): a method of numpy arrays and matrices (invoked as x.tolist()) that converts them into a python list.

These uses are illustrated in the code fragment below.


In [6]:
print 'Using flatten on x_matrix (of type matrix)'
print 'x_matrix.flatten(): ', x_matrix.flatten()
print 'Its type is: ', type(x_matrix.flatten())
print 'Its shape is: ', x_matrix.flatten().shape

print '\nUsing flatten on y (of type ndarray)'
print 'y.flatten(): ', y.flatten()
print 'Its type is: ', type(y.flatten())
print 'Its shape is: ', y.flatten().shape

print '\nUsing tolist on x_matrix (of type matrix) and on the 2-D array y (of type ndarray)'
print 'x_matrix.tolist(): ', x_matrix.tolist()
print 'y.tolist(): ', y.tolist()


Using flatten on x_matrix (of type matrix)
x_matrix.flatten():  [[ 5.  4.  3.  4.]]
Its type is:  <class 'numpy.matrixlib.defmatrix.matrix'>
Its shape is:  (1, 4)

Using flatten on y (of type ndarray)
y.flatten():  [1 2 3 4 5 6 7 8]
Its type is:  <type 'numpy.ndarray'>
Its shape is:  (8,)

Using tolist on x_matrix (of type matrix) and on the 2-D array y (of type ndarray)
x_matrix.tolist():  [[5.0], [4.0], [3.0], [4.0]]
y.tolist():  [[1, 2], [3, 4], [5, 6], [7, 8]]

2.2. Products and powers of numpy arrays and matrices

  • * and ** when used with Numpy arrays implement elementwise product and exponentiation
  • * and ** when used with Numpy matrices implement matrix product and exponentiation
  • Method np.dot() implements matrix multiplication, and can be used both with numpy arrays and matrices.

So you have to be careful about the type of each variable: the same expression gives different results depending on whether its operands are arrays or matrices.
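A minimal sketch of the three rules above, using a small 2x2 example chosen here for illustration (A and M are not variables of the exercises):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])   # ndarray
M = np.matrix(A)                 # same values, matrix type

print(A * A)      # elementwise product: [[1, 4], [9, 16]]
print(A ** 2)     # elementwise square, same values as A * A
print(A.dot(A))   # matrix product: [[7, 10], [15, 22]]
print(M * M)      # matrix product, because M is a matrix
print(M ** 2)     # matrix power, same values as M * M
```

Note that the matrix type propagates: the result of M * M is again a numpy matrix, so later uses of * on it will also be matrix products.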


In [7]:
# The following command squares x_array elementwise. Try to run it on variable x_matrix instead, and see what happens
print x_array**2


[[ 25.]
 [ 16.]
 [  9.]
 [ 16.]]

In [8]:
# Recall the shapes of the two arrays used in this cell
print 'Remember that the shape of x_array is ', x_array.shape
print 'Remember that the shape of y is ', y.shape

# Complete the following exercises. You can print the partial results to visualize them

# Multiply the 2-D array `y` by 2
y_by2 = y * 2
#y_by2 = <FILL IN>

# Multiply each of the columns in `y` by the column vector x_array
z_4_2 = x_array * y
#z_4_2 = <FILL IN>

# Obtain the matrix product of the transpose of x_array and y
x_by_y = x_array.T.dot(y)
#x_by_y = <FILL IN>

# Repeat the previous calculation, this time using x_matrix (of type numpy matrix) instead of x_array
# Note that in this case you do not need to use method dot()
x_by_y2 = x_matrix.T * y
#x_by_y2 = <FILL IN>

# Multiply vector x_array by its transpose to obtain a 4 x 4 matrix
x_4_4 = x_array.dot(x_array.T)
#x_4_4 = <FILL IN>

# Multiply the transpose of vector x_array by vector x_array. The result is the squared-norm of the vector
x_norm2 = x_array.T.dot(x_array)
#x_norm2 = <FILL IN>


Remember that the shape of x_array is  (4, 1)
Remember that the shape of y is  (4, 2)

In [9]:
Test.assertEqualsHashed(y_by2,'120a3a46cdf65dc239cc9b128eb1336886c7c137','Incorrect result for variable y_by2')
Test.assertEqualsHashed(z_4_2,'607730d96899ee27af576ecc7a4f1105d5b2cfed','Incorrect result for variable z_4_2')
Test.assertEqualsHashed(x_by_y,'a3b24f229d1e02fa71e940adc0a4135779864358','Incorrect result for variable x_by_y')
Test.assertEqualsHashed(x_by_y2,'a3b24f229d1e02fa71e940adc0a4135779864358','Incorrect result for variable x_by_y2')
Test.assertEqualsHashed(x_4_4,'fff55c032faa93592e5d27bf13da9bb49c468687','Incorrect result for variable x_4_4')
Test.assertEqualsHashed(x_norm2,'6eacac8f346bae7b5c72bcc3381c7140eaa98b48','Incorrect result for variable x_norm2')


1 test passed.
1 test passed.
1 test passed.
1 test passed.
1 test passed.
1 test passed.

2.3. Numpy methods that can be carried out along different dimensions

Compare the result of the following commands:


In [10]:
print z_4_2.shape
print np.mean(z_4_2)
print np.mean(z_4_2,axis=0)
print np.mean(z_4_2,axis=1)


(4, 2)
17.0
[ 15.  19.]
[  7.5  14.   16.5  30. ]

Other numpy methods where you can specify the axis along which a certain operation should be carried out are:

  • np.median()
  • np.std()
  • np.var()
  • np.percentile()
  • np.sort()
  • np.argsort()

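For the statistics above (np.mean(), np.median(), np.std(), np.var(), np.percentile()), if the axis argument is not provided, the array is flattened before carrying out the corresponding operation. np.sort() and np.argsort() instead operate along the last axis by default, and flatten the array only when axis=None is passed.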
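The effect of the axis argument can be checked on a small array. The sketch below uses an arbitrary 2x2 example, and also shows that np.sort() works along the last axis (axis=-1) unless axis=None is requested.

```python
import numpy as np

a = np.array([[3, 1], [2, 4]])

print(np.mean(a))               # 2.5, mean over all four elements
print(np.mean(a, axis=0))       # one value per column: 2.5 and 2.5
print(np.mean(a, axis=1))       # one value per row: 2.0 and 3.0

print(np.sort(a))               # sorts inside each row: [[1, 3], [2, 4]]
print(np.sort(a, axis=0))       # sorts inside each column: [[2, 1], [3, 4]]
print(np.sort(a, axis=None))    # flattens first: [1, 2, 3, 4]
```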

2.4. Concatenating matrices and vectors

Provided that the necessary dimensions fit, horizontal and vertical stacking of matrices can be carried out with methods np.hstack() and np.vstack().

Complete the following exercises to practice with matrix concatenation:


In [11]:
# Preliminary check that you are working with the right matrices
Test.assertEqualsHashed(z_4_2,'607730d96899ee27af576ecc7a4f1105d5b2cfed','Wrong value for variable z_4_2')
Test.assertEqualsHashed(x_array,'f4239d385605dc62b73c9a6f8945fdc65e12e43b','Wrong value for variable x_array')

# Vertically stack matrix z_4_2 with itself
ex1_res = np.vstack((z_4_2,z_4_2))
#ex1_res = <FILL IN>

# Horizontally stack matrix z_4_2 and vector x_array
ex2_res = np.hstack((z_4_2,x_array))
#ex2_res = <FILL IN>

# Horizontally stack a column vector of ones with the result of the first exercise (variable ex1_res)
X = np.hstack((np.ones((8,1)),ex1_res))
#X = <FILL IN>


1 test passed.
1 test passed.

In [12]:
Test.assertEqualsHashed(ex1_res,'31e60c0fa3e3accedc7db24339452085975a6659','Wrong value for variable ex1_res')
Test.assertEqualsHashed(ex2_res,'189b90c5b2113d2415767915becb58c6525519b7','Wrong value for variable ex2_res')
Test.assertEqualsHashed(X,'426c2708350ac469bc2fc4b521e781b36194ba23','Wrong value for variable X')


1 test passed.
1 test passed.
1 test passed.
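As a recap of when the dimensions "fit", the sketch below stacks a small hypothetical array A with itself and with a column c, and shows that numpy raises a ValueError when the shapes are incompatible.

```python
import numpy as np

A = np.arange(6).reshape((3, 2))   # shape (3, 2)
c = np.ones((3, 1))                # column vector, shape (3, 1)

print(np.vstack((A, A)).shape)     # (6, 2): rows are appended, columns must match
print(np.hstack((A, c)).shape)     # (3, 3): columns are appended, rows must match

# Vertical stacking needs the same number of columns: 2 vs 1 does not fit
try:
    np.vstack((A, c))
except ValueError as err:
    print('vstack failed: %s' % err)
```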

2.5. Slicing

Particular elements of numpy arrays (both unidimensional and multidimensional) can be accessed using standard python slicing. When working with multidimensional arrays, slicing can be carried out along the different dimensions at once.
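Before attempting the exercise, it may help to look at a few slices of a small array. The 3x4 array M below is a hypothetical example, unrelated to the matrix X of the exercises.

```python
import numpy as np

M = np.arange(12).reshape((3, 4))
# M contains:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(M[0, :])          # first row: 0, 1, 2, 3
print(M[:, -1])         # last column: 3, 7, 11
print(M[:2, 1:3])       # rows 0-1 and columns 1-2: [[1, 2], [5, 6]]
print(M[::-1, :])       # same columns, rows in reverse order

print(M[:, 0].shape)    # (3,): an integer index removes that dimension
print(M[:, :1].shape)   # (3, 1): a slice keeps it
```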


In [13]:
# Keep last row of matrix X
X_sub1 = X[-1,]
#X_sub1 = <FILL IN>

# Keep first column of the three first rows of X
X_sub2 = X[:3,0]
#X_sub2 = <FILL IN>

# Keep first two columns of the three first rows of X
X_sub3 = X[:3,:2]
#X_sub3 = <FILL IN>

# Invert the order of the rows of X
X_sub4 = X[::-1,:]
#X_sub4 = <FILL IN>

In [14]:
Test.assertEqualsHashed(X_sub1,'0bcf8043a3dd569b31245c2e991b26686305b93f','Wrong value for variable X_sub1')
Test.assertEqualsHashed(X_sub2,'7c43c1137480f3bfea7454458fcfa2bc042630ce','Wrong value for variable X_sub2')
Test.assertEqualsHashed(X_sub3,'3cddc950ea2abc256192461728ef19d9e1d59d4c','Wrong value for variable X_sub3')
Test.assertEqualsHashed(X_sub4,'33190dec8f3cbe3ebc9d775349665877d7b892dd','Wrong value for variable X_sub4')


1 test passed.
1 test passed.
1 test passed.
1 test passed.

2.6. Matrix inversion

Non-singular matrices can be inverted with method np.linalg.inv(). Invert square matrices $X\cdot X^\top$ and $X^\top \cdot X$, and see what happens when trying to invert a singular matrix. The rank of a matrix can be studied with method numpy.linalg.matrix_rank().


In [15]:
print X.shape
print X.dot(X.T)
print X.T.dot(X)

print np.linalg.inv(X.T.dot(X))
#print np.linalg.inv(X.dot(X.T))


(8, 3)
[[  126.   221.   256.   461.   126.   221.   256.   461.]
 [  221.   401.   469.   849.   221.   401.   469.   849.]
 [  256.   469.   550.   997.   256.   469.   550.   997.]
 [  461.   849.   997.  1809.   461.   849.   997.  1809.]
 [  126.   221.   256.   461.   126.   221.   256.   461.]
 [  221.   401.   469.   849.   221.   401.   469.   849.]
 [  256.   469.   550.   997.   256.   469.   550.   997.]
 [  461.   849.   997.  1809.   461.   849.   997.  1809.]]
[[    8.   120.   152.]
 [  120.  2356.  2816.]
 [  152.  2816.  3408.]]
[[ 6.81140351  1.30701754 -1.38377193]
 [ 1.30701754  0.28508772 -0.29385965]
 [-1.38377193 -0.29385965  0.30482456]]
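In the cell above, X has 8 rows and 3 columns, so $X\cdot X^\top$ is an $8\times 8$ matrix whose rank cannot exceed 3, and it is therefore singular. The sketch below reproduces the same situation on a smaller, hypothetical matrix (3 rows, 2 columns). Depending on rounding, inverting a singular matrix may raise LinAlgError or return numerically meaningless values, so the example handles both cases.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])             # 3 patterns, 2 columns

small = X.T.dot(X)                     # 2x2 matrix
large = X.dot(X.T)                     # 3x3 matrix built from only 2 columns

print(np.linalg.matrix_rank(small))    # 2: full rank, hence invertible
print(np.linalg.matrix_rank(large))    # 2, smaller than its size 3: singular

inv_small = np.linalg.inv(small)
print(np.allclose(inv_small.dot(small), np.eye(2)))   # True

try:
    np.linalg.inv(large)
    print('inv returned a result, but it is numerically meaningless')
except np.linalg.LinAlgError as err:
    print('inv failed: %s' % err)
```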

2.7. Exercises

In this section, you will complete three exercises where you will carry out some common operations when working with data structures. For these exercises you will work with the 2-D numpy array X, assuming that it contains the values of two different variables for 8 data patterns. A first column of ones has already been introduced in a previous exercise:

$$X = \left[ \begin{array}{ccc} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(8)} & x_2^{(8)}\end{array}\right]$$

First of all, let us check that you are working with the right matrix


In [16]:
Test.assertEqualsHashed(X,'426c2708350ac469bc2fc4b521e781b36194ba23','Wrong value for variable X')


1 test passed.

2.7.1. Non-linear transformations

Create a new matrix Z, where additional features are created by carrying out the following non-linear transformations:

$$Z = \left[ \begin{array}{ccccc} 1 & x_1^{(1)} & x_2^{(1)} & \log\left(x_1^{(1)}\right) & \log\left(x_2^{(1)}\right)\\ 1 & x_1^{(2)} & x_2^{(2)} & \log\left(x_1^{(2)}\right) & \log\left(x_2^{(2)}\right) \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_1^{(8)} & x_2^{(8)} & \log\left(x_1^{(8)}\right) & \log\left(x_2^{(8)}\right)\end{array}\right] = \left[ \begin{array}{ccccc} 1 & z_1^{(1)} & z_2^{(1)} & z_3^{(1)} & z_4^{(1)}\\ 1 & z_1^{(2)} & z_2^{(2)} & z_3^{(2)} & z_4^{(2)} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & z_1^{(8)} & z_2^{(8)} & z_3^{(8)} & z_4^{(8)} \end{array}\right]$$

In other words, we are calculating the logarithmic values of the two original variables. From now on, any function involving linear transformations of the variables in Z will in fact be a non-linear function of the original variables.


In [17]:
# Obtain matrix Z
Z = np.hstack((X,np.log(X[:,1:])))
#Z = <FILL IN>

In [18]:
Test.assertEqualsHashed(Z,'d68d0394b57b4583ba95fc669c1c12aeec782410','Incorrect matrix Z')


1 test passed.

If you did not use map() in your solution, repeat the previous exercise, this time using map() together with function log_transform():


In [19]:
def log_transform(x):
    return np.hstack((x,np.log(x[1]),np.log(x[2])))
    #return <FILL IN>
    
Z_map = np.array(map(log_transform,X))

In [20]:
Test.assertEqualsHashed(Z_map,'d68d0394b57b4583ba95fc669c1c12aeec782410','Incorrect matrix Z')


1 test passed.

Repeat the previous exercise once again using a lambda function:


In [21]:
Z_lambda = np.array(map(lambda x: np.hstack((x,np.log(x[1]),np.log(x[2]))),X))
#Z_lambda = np.array(map(lambda x: <FILL IN>,X))

In [22]:
Test.assertEqualsHashed(Z_lambda,'d68d0394b57b4583ba95fc669c1c12aeec782410','Incorrect matrix Z')


1 test passed.
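The three solutions above assume Python 2, where map() returns a list. In Python 3, map() returns an iterator, and np.array() applied directly to it does not build the expected 2-D array. The sketch below, run on a small hypothetical matrix, wraps map() in list() so that it behaves the same under both versions, and compares the result with the vectorized solution.

```python
import numpy as np

# Hypothetical data: a column of ones plus two positive variables
X = np.array([[1.0, 2.0, 4.0],
              [1.0, 3.0, 9.0]])

def log_transform(x):
    # Append the logarithms of the two variables to the pattern x
    return np.hstack((x, np.log(x[1]), np.log(x[2])))

# list(...) works whether map() returns a list or an iterator
Z_map = np.array(list(map(log_transform, X)))

# Vectorized version, without any Python-level loop over the rows
Z_vec = np.hstack((X, np.log(X[:, 1:])))

print(Z_map.shape)                 # (2, 5)
print(np.allclose(Z_map, Z_vec))   # True
```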

2.7.2. Polynomial transformations

Similarly to the previous exercise, now we are interested in obtaining another matrix that will be used to evaluate a polynomial model. In order to do so, compute matrix Z_poly as follows:

$$Z_\text{poly} = \left[ \begin{array}{cccc} 1 & x_1^{(1)} & (x_1^{(1)})^2 & (x_1^{(1)})^3 \\ 1 & x_1^{(2)} & (x_1^{(2)})^2 & (x_1^{(2)})^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_1^{(8)} & (x_1^{(8)})^2 & (x_1^{(8)})^3 \end{array}\right]$$

Note that, in this case, only the first variable of each pattern is used.


In [23]:
# Calculate variable Z_poly, using any method that you want
Z_poly = np.array(map(lambda x: np.array([x[1]**k for k in range(4)]),X))
#Z_poly = <FILL IN>

In [24]:
Test.assertEqualsHashed(Z_poly,'ba0f38316dffe901b6c7870d13ccceccebd75201','Wrong variable Z_poly')


1 test passed.
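Since any method is allowed in this exercise, two vectorized alternatives are sketched below on hypothetical values of the first variable: np.vander() with increasing=True, and broadcasting a column against a row of exponents. Both build rows of the form $[1, x_1, x_1^2, x_1^3]$.

```python
import numpy as np

x1 = np.array([2.0, 3.0, 5.0])     # hypothetical values of the first variable

# Vandermonde matrix with 4 columns, powers in increasing order
Z_vander = np.vander(x1, 4, increasing=True)

# Broadcasting: a (3, 1) column raised to a (4,) row of exponents gives (3, 4)
Z_broadcast = x1[:, np.newaxis] ** np.arange(4)

print(Z_vander)                              # first row holds 1, 2, 4, 8
print(np.allclose(Z_vander, Z_broadcast))    # True
```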

2.7.3. Model evaluation

Finally, we can use previous data matrices Z and Z_poly to efficiently compute the output of the corresponding non-linear models over all the patterns in the data set. In this exercise, we consider the two following linear-in-the-parameters models to be evaluated:

$$f_\text{log}({\bf x}) = w_0 + w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot \log(x_1) + w_4 \cdot \log(x_2)$$
$$f_\text{poly}({\bf x}) = w_0 + w_1 \cdot x_1 + w_2 \cdot x_1^2 + w_3 \cdot x_1^3$$

Compute the output of the two models for the particular weights that are defined in the code below. Your output variables f_log and f_poly should contain the outputs of the model for all eight patterns in the data set.


In [25]:
w_log = np.array([3.3, 0.5, -2.4, 3.7, -2.9])
w_poly = np.array([3.2, 4.5, -3.2, 0.7])

f_log = Z_lambda.dot(w_log)
f_poly = Z_poly.dot(w_poly)
#f_log = <FILL IN>
#f_poly = <FILL IN>

In [26]:
Test.assertEqualsHashed(f_log,'cf81496c5371a0b31931625040f460ed3481fb3d','Incorrect evaluation of the logarithmic model')
Test.assertEqualsHashed(f_poly,'05307e30124daa103c970044828f24ee8b1a0bb9','Incorrect evaluation of the polynomial model')


1 test passed.
1 test passed.
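The reason a single call to dot() evaluates the model on all patterns is that each row of the data matrix holds the features of one pattern, so the matrix-vector product computes one weighted sum per row. The sketch below checks this on hypothetical data and weights by comparing against an explicit loop.

```python
import numpy as np

Z = np.array([[1.0, 2.0, 4.0],
              [1.0, 3.0, 9.0]])      # hypothetical: 2 patterns, 3 features
w = np.array([0.5, -1.0, 2.0])       # hypothetical weights

# Explicit version: one weighted sum per pattern
f_loop = np.array([sum(w_k * z_k for w_k, z_k in zip(w, z)) for z in Z])

# Vectorized version: a single matrix-vector product
f_vec = Z.dot(w)

print(f_vec)                         # the two outputs are 6.5 and 15.5
print(np.allclose(f_loop, f_vec))    # True
```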

3. MLlib Data types

MLlib is Apache Spark's scalable machine learning library. It implements several machine learning methods that can work over data distributed by means of RDDs. The regression methods that are part of MLlib are:

  • linear least squares
  • Lasso
  • ridge regression
  • isotonic regression
  • random forests
  • gradient-boosted trees

We will just use the first three methods, and we will also work on an implementation of KNN regression over Spark, using the data types provided by MLlib.

3.1. Local Vectors

  • Integer-typed and 0-based indices
  • Double-typed values
  • Stored on a single machine
  • Two kinds of vectors provided:
    • DenseVector: backed by a double array holding the values of all the entries
    • SparseVector: backed by two parallel arrays, one with the indices of the non-zero entries and one with their values
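The idea behind the two representations can be illustrated with plain numpy, without any Spark dependency. The sketch below is not the MLlib API: it only mimics the "size plus two parallel arrays" description of a sparse vector, and all variable names are chosen here for illustration.

```python
import numpy as np

dense = np.eye(30, k=5).flatten()      # 900 entries, most of them zero

# Sparse description: the size, plus two parallel arrays
size = dense.shape[0]
indices = np.nonzero(dense)[0]         # positions of the non-zero entries
values = dense[indices]                # the non-zero entries themselves

print('%d stored values instead of %d' % (len(values), size))

# The dense array can be rebuilt from the sparse description
rebuilt = np.zeros(size)
rebuilt[indices] = values
print(np.array_equal(rebuilt, dense))  # True
```

The saving grows with the proportion of zeros: here only 25 indices and 25 values are kept for a vector of length 900.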


In [27]:
# Import additional libraries for this part

from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

  • DenseVectors can be created from lists or from numpy arrays
  • SparseVector constructor requires three arguments: the length of the vector, an array with the indices of the non-zero coefficients, and the values of such positions (in the same order)

In [28]:
# We create a sparse vector of length 900, with only 25 non-zero values
Z = np.eye(30, k=5).flatten()
print 'The dimension of array Z is ', Z.shape

# Create a DenseVector containing the elements of array Z
dense_V = DenseVector(Z)
#dense_V = <FILL IN>

#Create a SparseVector containing the elements of array Z
#Nonzero elements are indexed by the following variable idx_nonzero
idx_nonzero = np.nonzero(Z)[0]
sparse_V = SparseVector(Z.shape[0], idx_nonzero, Z[idx_nonzero])
#sparse_V = <FILL IN>

#Standard matrix operations can be computed on DenseVectors and SparseVectors
#Calculate the squared norm of vector sparse_V, by multiplying sparse_V by the transpose of dense_V
print 'The squared norm of vector Z is', sparse_V.dot(dense_V)

#print sparse_V
#print dense_V


The dimension of array Z is  (900,)
The squared norm of vector Z is 25.0

In [29]:
Test.assertEqualsHashed(dense_V,'b331f43b23fda1ac19f5c29ee2c843fab6e34dfa', 'Incorrect vector dense_V')
Test.assertEqualsHashed(sparse_V,'954fe70f3f9acd720219fc55a30c7c303d02f05d', 'Incorrect vector sparse_V')
Test.assertEquals(type(dense_V),pyspark.mllib.linalg.DenseVector,'Incorrect type for dense_V')
Test.assertEquals(type(sparse_V),pyspark.mllib.linalg.SparseVector,'Incorrect type for sparse_V')


1 test passed.
1 test passed.
1 test passed.
1 test passed.

3.2. Labeled point

  • An association of a local vector and a label
  • The label is a double (also in classification)
  • Supervised MLlib methods rely on datasets of labeled points
  • In regression, the label can be any real number
  • In classification, labels are class indices starting from zero: 0, 1, 2, ...

The LabeledPoint constructor takes two arguments: the label, and a list / numpy array / DenseVector / SparseVector containing the features.


In [30]:
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, sparse_V)

# You can now easily access the label and features of the vector:

print 'The label of the first labeled point is', pos.label
print 'The features of the second labeled point are', neg.features


The label of the first labeled point is 1.0
The features of the second labeled point are (900,[5,36,67,98,129,160,191,222,253,284,315,346,377,408,439,470,501,532,563,594,625,656,687,718,749],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])

3.3. Distributed datasets

  • MLlib distributes the datasets using RDDs of vectors or labeled points