Introduction to Python for Data Sciences

Franck Iutzeler
Fall 2018

a) Importing and Using Libraries


Packages

Python has a large standard library, commonly cited as one of Python's greatest strengths, providing tools suited to many tasks. As of May 2017, the official repository of third-party software for Python contained over 107,000 packages.

A package is a collection of modules, i.e. groups of functions, classes, constants, types, etc.

To use a module, you have to import it using the command import.


In [1]:
import math

You can now use its contents in the code with the prefix package_name.


In [2]:
x = math.cos(2 * math.pi)

print(x)


1.0

To explore the function and other content of the library:


In [3]:
help(math)


Help on built-in module math:

NAME
    math

DESCRIPTION
    This module is always available.  It provides access to the
    mathematical functions defined by the C standard.

FUNCTIONS
    acos(...)
        acos(x)
        
        Return the arc cosine (measured in radians) of x.
    
    acosh(...)
        acosh(x)
        
        Return the inverse hyperbolic cosine of x.
    
    asin(...)
        asin(x)
        
        Return the arc sine (measured in radians) of x.
    
    asinh(...)
        asinh(x)
        
        Return the inverse hyperbolic sine of x.
    
    atan(...)
        atan(x)
        
        Return the arc tangent (measured in radians) of x.
    
    atan2(...)
        atan2(y, x)
        
        Return the arc tangent (measured in radians) of y/x.
        Unlike atan(y/x), the signs of both x and y are considered.
    
    atanh(...)
        atanh(x)
        
        Return the inverse hyperbolic tangent of x.
    
    ceil(...)
        ceil(x)
        
        Return the ceiling of x as an int.
        This is the smallest integral value >= x.
    
    copysign(...)
        copysign(x, y)
        
        Return a float with the magnitude (absolute value) of x but the sign 
        of y. On platforms that support signed zeros, copysign(1.0, -0.0) 
        returns -1.0.
    
    cos(...)
        cos(x)
        
        Return the cosine of x (measured in radians).
    
    cosh(...)
        cosh(x)
        
        Return the hyperbolic cosine of x.
    
    degrees(...)
        degrees(x)
        
        Convert angle x from radians to degrees.
    
    erf(...)
        erf(x)
        
        Error function at x.
    
    erfc(...)
        erfc(x)
        
        Complementary error function at x.
    
    exp(...)
        exp(x)
        
        Return e raised to the power of x.
    
    expm1(...)
        expm1(x)
        
        Return exp(x)-1.
        This function avoids the loss of precision involved in the direct evaluation of exp(x)-1 for small x.
    
    fabs(...)
        fabs(x)
        
        Return the absolute value of the float x.
    
    factorial(...)
        factorial(x) -> Integral
        
        Find x!. Raise a ValueError if x is negative or non-integral.
    
    floor(...)
        floor(x)
        
        Return the floor of x as an int.
        This is the largest integral value <= x.
    
    fmod(...)
        fmod(x, y)
        
        Return fmod(x, y), according to platform C.  x % y may differ.
    
    frexp(...)
        frexp(x)
        
        Return the mantissa and exponent of x, as pair (m, e).
        m is a float and e is an int, such that x = m * 2.**e.
        If x is 0, m and e are both 0.  Else 0.5 <= abs(m) < 1.0.
    
    fsum(...)
        fsum(iterable)
        
        Return an accurate floating point sum of values in the iterable.
        Assumes IEEE-754 floating point arithmetic.
    
    gamma(...)
        gamma(x)
        
        Gamma function at x.
    
    hypot(...)
        hypot(x, y)
        
        Return the Euclidean distance, sqrt(x*x + y*y).
    
    isfinite(...)
        isfinite(x) -> bool
        
        Return True if x is neither an infinity nor a NaN, and False otherwise.
    
    isinf(...)
        isinf(x) -> bool
        
        Return True if x is a positive or negative infinity, and False otherwise.
    
    isnan(...)
        isnan(x) -> bool
        
        Return True if x is a NaN (not a number), and False otherwise.
    
    ldexp(...)
        ldexp(x, i)
        
        Return x * (2**i).
    
    lgamma(...)
        lgamma(x)
        
        Natural logarithm of absolute value of Gamma function at x.
    
    log(...)
        log(x[, base])
        
        Return the logarithm of x to the given base.
        If the base not specified, returns the natural logarithm (base e) of x.
    
    log10(...)
        log10(x)
        
        Return the base 10 logarithm of x.
    
    log1p(...)
        log1p(x)
        
        Return the natural logarithm of 1+x (base e).
        The result is computed in a way which is accurate for x near zero.
    
    log2(...)
        log2(x)
        
        Return the base 2 logarithm of x.
    
    modf(...)
        modf(x)
        
        Return the fractional and integer parts of x.  Both results carry the sign
        of x and are floats.
    
    pow(...)
        pow(x, y)
        
        Return x**y (x to the power of y).
    
    radians(...)
        radians(x)
        
        Convert angle x from degrees to radians.
    
    sin(...)
        sin(x)
        
        Return the sine of x (measured in radians).
    
    sinh(...)
        sinh(x)
        
        Return the hyperbolic sine of x.
    
    sqrt(...)
        sqrt(x)
        
        Return the square root of x.
    
    tan(...)
        tan(x)
        
        Return the tangent of x (measured in radians).
    
    tanh(...)
        tanh(x)
        
        Return the hyperbolic tangent of x.
    
    trunc(...)
        trunc(x:Real) -> Integral
        
        Truncates x to the nearest Integral toward 0. Uses the __trunc__ magic method.

DATA
    e = 2.718281828459045
    pi = 3.141592653589793

FILE
    (built-in)



In [4]:
help(math.sqrt)


Help on built-in function sqrt in module math:

sqrt(...)
    sqrt(x)
    
    Return the square root of x.

Using the prefix package_name. can make the code quite verbose (e.g. scipy.optimize.minimize), so Python provides simpler ways to import:

  • **import** package_name **as** nickname : the prefix to call is now nickname.

In [5]:
import math as m

print(m.pi)


3.141592653589793
  • **from** package_name **import** function1,constant1 : function1 and constant1 can now be called directly. You can even import all the contents with **from** package_name **import** \*, but this may be dangerous as names may conflict with or override existing ones; it is thus not advised except for user-generated modules (a short illustration follows the example below).

In [6]:
from math import e,log

print(log(e**4))


4.0
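As an illustration of such a conflict (a hypothetical snippet, not from the original notebook): after a star import, names from the last imported package silently shadow earlier ones.

from math import *
from numpy import *   # numpy also defines cos, pi, e, ...: the math versions are now shadowed
print(type(cos(0)))   # numpy.float64, no longer the plain float returned by math.cos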

Creating your own modules

In order to reuse parts of your code, it is often efficient to write your own modules for frequently used functions. Importing such a module works as for other libraries, except that you have to give the (relative) path to the module.

In our example, the module file is pyds.py in the ./lib/ folder (you may open it to see what it contains; a sketch of its likely contents is given at the end of this subsection). There are two solutions to give the (relative) path to the module:

  • add the folder ./lib/ to Python's list of module folders (sys.path)

In [7]:
import sys
sys.path.append( "./lib/" )
import pyds as pyds1

help(pyds1)


Help on module pyds:

NAME
    pyds

DESCRIPTION
    ######################################
    ### 
    ### the pyds module
    ###
    ######################################

FUNCTIONS
    printQuote()

DATA
    teacherName = 'Franck Iutzeler'
    teacherWebsite = 'http://www.iutzeler.org'

FILE
    /workspace/Pro/Cours/2018-2019/Python/Factory/lib/pyds.py


  • Use the lib folder name as a package (this requires the presence of a - possibly empty - file named \__init\__.py in the folder)

In [8]:
import lib.pyds as pyds2

pyds2.printQuote()


The joy of coding Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code -- not in reams of trivial code that bores the reader to death.

 -- Guido van Rossum 
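For reference, a minimal sketch of what lib/pyds.py might look like, reconstructed from the help() output and the printed quote above (the actual file may differ):

# lib/pyds.py -- hypothetical reconstruction of the module used above
teacherName = 'Franck Iutzeler'
teacherWebsite = 'http://www.iutzeler.org'

def printQuote():
    # print the Guido van Rossum quote displayed above
    print("The joy of coding Python should be in seeing short, concise, readable classes "
          "that express a lot of action in a small amount of clear code -- not in reams "
          "of trivial code that bores the reader to death.\n\n -- Guido van Rossum")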

b) Numpy and Matplotlib


Numpy is a widely used package for numerical computing and linear algebra, notably providing the array type (vectors/matrices) used in almost all numerical projects. Documentation

Matplotlib is a module for generating 2D and 3D graphics. Documentation

It is common to import them with the respective nicknames np and plt (for matplotlib.pyplot). In Notebooks, it is common to add the Jupyter magic command %matplotlib inline to display the plots inside the notebook.


In [9]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Numpy arrays

In Numpy, the type array is used for vectors, matrices, and tensors (a matrix type also exists but is seldom used).

Numpy arrays can be defined either directly from a list or returned by a function.

One-dimensional arrays


In [10]:
x = np.array([1, 2.5, 5, 10])
print(x,type(x))


[  1.    2.5   5.   10. ] <class 'numpy.ndarray'>

In [11]:
y = np.random.rand(4)
print(y,type(y))


[ 0.61984909  0.61414974  0.52366044  0.91077306] <class 'numpy.ndarray'>

Plotting

Visualizing the data is quite simple with pyplot:

  • Initialize a figure with plt.figure()
  • Plot something with ... plt.plot
  • Fix labels, titles, axes
  • Optionally save the result
  • Show the figure with plt.show()

In [12]:
plt.figure()
plt.plot(x,y, 'bd--', label='y(x)')
plt.legend(loc='lower right')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Sample Figure')
plt.xlim([0, 10])
plt.ylim([0, 1])
plt.savefig('img/sample.png')
plt.show()


Matrices

The same process can be used to define matrices.


In [13]:
M = np.array([[0.25, 6.2, 1, 10],[12, 6.2, 6, -5.3]])
print(M,type(M))


[[  0.25   6.2    1.    10.  ]
 [ 12.     6.2    6.    -5.3 ]] <class 'numpy.ndarray'>
**Warning:** vectors and matrices share the same type, so there are some caveats!
  • The size of an array is the number of elements while the shape gives how they are arranged.

In [14]:
print(x.size)  # or equivalently np.size(x)
print(M.size)


4
8

In [15]:
print(x.shape)  # or equivalently np.shape(x)
print(M.shape)


(4,)
(2, 4)
  • Element access, assignment, typing, and copying work similarly to the list type.

In [16]:
print(x)
print(x[2],x[-1])
print(x[::-1])
x[0] = 6.1554
print(x)


[  1.    2.5   5.   10. ]
5.0 10.0
[ 10.    5.    2.5   1. ]
[  6.1554   2.5      5.      10.    ]

In [17]:
v = x
w = np.copy(x) 
print(v)
x[0]=1
print(v)
print(w)


[  6.1554   2.5      5.      10.    ]
[  1.    2.5   5.   10. ]
[  6.1554   2.5      5.      10.    ]

In [18]:
print(M)
print(M[1,2],type(M[1,2]))
print(M[1,:],type(M[1,:]),M[1,:].shape)
print(M[1])
print(M[:,0])


[[  0.25   6.2    1.    10.  ]
 [ 12.     6.2    6.    -5.3 ]]
6.0 <class 'numpy.float64'>
[ 12.    6.2   6.   -5.3] <class 'numpy.ndarray'> (4,)
[ 12.    6.2   6.   -5.3]
[  0.25  12.  ]

More advanced access and modification are also possible, for instance indexing with a list of indices (fancy indexing):


In [19]:
x = np.array([1, 2.5, 5, 10])
ind = [0,2,3]
print(x[ind])


[  1.   5.  10.]
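Boolean masks offer another form of advanced access; a small sketch (not from the original notebook):

import numpy as np

x = np.array([1, 2.5, 5, 10])
mask = x > 2          # boolean array: [False  True  True  True]
print(x[mask])        # select the elements greater than 2
x[mask] = 0           # assign to the selected elements only
print(x)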

Advanced array properties

An array has a type that can be accessed with dtype; it is the combination of a base type (int, float, complex, bool, object, etc.) and a precision in bits (int64, int16, float128, complex128).


In [20]:
print(x.dtype)


float64

Arrays only accept values that can be cast to their own type. However, the type of an array can be changed with astype.


In [21]:
try:
    x[0] = 1 + 2.0j
except Exception as e:
    print(e)


can't convert complex to float

In [22]:
y = x.astype(complex)
y[0] = 1 + 2.0j
print(y,type(y),y.dtype)


[  1.0+2.j   2.5+0.j   5.0+0.j  10.0+0.j] <class 'numpy.ndarray'> complex128

 Numpy array generation

See the corresponding documentation.

Number sequences


**arange** returns an array of evenly spaced numbers from start to (at most) stop with a fixed increment step
**linspace** returns an array of evenly spaced numbers from start to (exactly) stop with a fixed number of points num

In [23]:
x = np.arange(0, 10, 1.5)
print(x,type(x))


[ 0.   1.5  3.   4.5  6.   7.5  9. ] <class 'numpy.ndarray'>

In [24]:
y = np.linspace(0, 10, 25)
print(y,type(y))


[  0.           0.41666667   0.83333333   1.25         1.66666667
   2.08333333   2.5          2.91666667   3.33333333   3.75         4.16666667
   4.58333333   5.           5.41666667   5.83333333   6.25         6.66666667
   7.08333333   7.5          7.91666667   8.33333333   8.75         9.16666667
   9.58333333  10.        ] <class 'numpy.ndarray'>

Zeros and Ones


**zeros** returns an array (of floats) of zeros of the specified shape

**ones** returns an array (of floats) of ones of the specified shape

**eye** returns a square 2D-array (of floats) of the specified dimension with ones on the diagonal and zeros elsewhere


In [25]:
x = np.zeros(3)
print(x,x.shape,type(x),x.dtype)

x = np.zeros((3,))
print(x,x.shape,type(x),x.dtype)


[ 0.  0.  0.] (3,) <class 'numpy.ndarray'> float64
[ 0.  0.  0.] (3,) <class 'numpy.ndarray'> float64

In [26]:
try:
    x = np.zeros(3,3) # This causes an error as 3,3 is not a shape, it is (3,3) -> double parentheses
except Exception as error:
    print(error)

print(x,x.shape,type(x),x.dtype)


data type not understood
[ 0.  0.  0.] (3,) <class 'numpy.ndarray'> float64

In [27]:
x = np.zeros((3,3))
print(x,x.shape,type(x),x.dtype)


[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]] (3, 3) <class 'numpy.ndarray'> float64

In [28]:
M = np.eye(3)
print(M,M.shape,type(M),M.dtype)


[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]] (3, 3) <class 'numpy.ndarray'> float64
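ones follows the same pattern as zeros; a quick sketch (not from the original notebook):

import numpy as np

x = np.ones((2, 3))
print(x, x.shape, x.dtype)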

Random data

Random arrays can be generated by Numpy's random module.

**rand** returns an array (of floats) of numbers uniformly distributed in [0,1) of the specified dimensions

**randn** returns an array (of floats) of numbers drawn from the standard normal distribution of the specified dimensions

**randint** returns an array (of ints) of integers drawn from the discrete uniform distribution over the specified range


In [29]:
np.random.rand(5)


Out[29]:
array([ 0.91200941,  0.41933776,  0.87411083,  0.67564683,  0.88219004])

In [30]:
np.random.randn(5,2)


Out[30]:
array([[ 1.22951068,  0.7892105 ],
       [-1.42239273,  0.21920703],
       [ 0.51965297, -0.28082873],
       [ 0.95675861, -0.41684466],
       [ 0.8726305 ,  0.09333448]])

In [31]:
np.random.randint(0,100,size=(10,))


Out[31]:
array([66, 45, 97, 37, 37, 68, 18, 47, 31, 69])

In [32]:
a = np.random.randn(10000)
plt.figure()
plt.hist(a,40) # histogram of a with 40 bins
plt.show()



In [33]:
v = np.arange(0, 5)
print(v)


[0 1 2 3 4]

In [34]:
v * 2


Out[34]:
array([0, 2, 4, 6, 8])

In [35]:
v + 2.5


Out[35]:
array([ 2.5,  3.5,  4.5,  5.5,  6.5])

In [36]:
square = v**2
root = np.sqrt(v)
print(square,root)


[ 0  1  4  9 16] [ 0.          1.          1.41421356  1.73205081  2.        ]

In [37]:
plt.figure()
plt.subplot(1,2,1)
plt.plot(square,'g--', label='$y = x^2$')
plt.legend(loc=0)
plt.subplot(1,2,2)
plt.plot(root, 'r*-', label=r'$y = \sqrt{x}$')
plt.legend(loc=2)
plt.show()



In [38]:
A = np.array([[n+m*10 for n in range(5)] for m in range(4)])
print(A)


[[ 0  1  2  3  4]
 [10 11 12 13 14]
 [20 21 22 23 24]
 [30 31 32 33 34]]

In [39]:
A*2


Out[39]:
array([[ 0,  2,  4,  6,  8],
       [20, 22, 24, 26, 28],
       [40, 42, 44, 46, 48],
       [60, 62, 64, 66, 68]])

In [40]:
A+2.5


Out[40]:
array([[  2.5,   3.5,   4.5,   5.5,   6.5],
       [ 12.5,  13.5,  14.5,  15.5,  16.5],
       [ 22.5,  23.5,  24.5,  25.5,  26.5],
       [ 32.5,  33.5,  34.5,  35.5,  36.5]])

Matrices can be visualized as images.


In [41]:
C = np.random.randn(100,100)
plt.figure()
plt.imshow(C)
plt.colorbar()
plt.show()


/usr/lib/python3/dist-packages/matplotlib/collections.py:549: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == 'face':

Matrix and vector operations


**Warning:** The operation symbols + - \* / correspond to *elementwise* operations! To perform matrix/vector multiplication, dedicated functions must be used.

Elementwise operations


In [42]:
A = np.array([[n+m*10 for n in range(5)] for m in range(4)])
v = np.random.randn(5)
print(A,v)


[[ 0  1  2  3  4]
 [10 11 12 13 14]
 [20 21 22 23 24]
 [30 31 32 33 34]] [-1.48069106 -0.44341955  1.16428664  1.6038041   1.19890407]

In [43]:
A*A


Out[43]:
array([[   0,    1,    4,    9,   16],
       [ 100,  121,  144,  169,  196],
       [ 400,  441,  484,  529,  576],
       [ 900,  961, 1024, 1089, 1156]])

In [44]:
v*v


Out[44]:
array([ 2.192446  ,  0.19662089,  1.35556338,  2.57218758,  1.43737096])

In [45]:
A*v


Out[45]:
array([[ -0.        ,  -0.44341955,   2.32857328,   4.81141229,
          4.79561626],
       [-14.80691056,  -4.877615  ,  13.97143969,  20.84945324,
         16.78465692],
       [-29.61382113,  -9.31181046,  25.61430609,  36.8874942 ,
         28.77369757],
       [-44.42073169, -13.74600592,  37.2571725 ,  52.92553515,
         40.76273823]])

Transposition

It can be useful to transpose an array; this is simply done by suffixing .T (or equivalently using the function np.transpose). Similarly, .real and .imag give the real and imaginary parts, np.conj the complex conjugate (so that the Hermitian conjugate of A is np.conj(A).T), and np.abs the elementwise modulus.


In [46]:
print(A,A.shape)
print(A.T,A.T.shape)


[[ 0  1  2  3  4]
 [10 11 12 13 14]
 [20 21 22 23 24]
 [30 31 32 33 34]] (4, 5)
[[ 0 10 20 30]
 [ 1 11 21 31]
 [ 2 12 22 32]
 [ 3 13 23 33]
 [ 4 14 24 34]] (5, 4)
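The attributes .real and .imag (and the functions np.conj, np.abs) are mostly useful on complex arrays; a small sketch, not from the original notebook:

import numpy as np

Z = np.array([[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]])
print(Z.real)          # real parts
print(Z.imag)          # imaginary parts
print(np.conj(Z).T)    # Hermitian (conjugate) transpose
print(np.abs(Z))       # elementwise modulus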

Matrix/vector operations

$y=Av$ can be obtained by y = A.dot(v) (or equivalently y = np.dot(A,v)). This method works for arrays with compatible shapes (matrix-matrix, matrix-vector, vector-matrix, vector-vector, etc.).


In [47]:
y = np.dot(A,v)
print(A,A.shape,v,v.shape)
print(y,type(y),y.shape)


[[ 0  1  2  3  4]
 [10 11 12 13 14]
 [20 21 22 23 24]
 [30 31 32 33 34]] (4, 5) [-1.48069106 -0.44341955  1.16428664  1.6038041   1.19890407] (5,)
[ 11.49218228  31.92102428  52.34986627  72.77870826] <class 'numpy.ndarray'> (4,)
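Note that on Python 3.5+ with a recent Numpy (an assumption about your environment), the @ operator performs the same multiplication as dot; a quick check:

import numpy as np

A = np.arange(20).reshape(4, 5)
v = np.ones(5)
print(np.allclose(A @ v, np.dot(A, v)))   # True: @ is matrix/vector multiplication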

Example of vector-vector multiplication i.e. a scalar product


In [48]:
s = v.dot(v)
print(v, s, type(s))


[-1.48069106 -0.44341955  1.16428664  1.6038041   1.19890407] 7.7541888144 <class 'numpy.float64'>

Example of non-compatible shapes


In [49]:
try:
    A2 = np.dot(A,A)
except Exception as error:
    print(error)


shapes (4,5) and (4,5) not aligned: 5 (dim 1) != 4 (dim 0)

In [50]:
A3 = np.dot(A,A.T)
print(A3,A3.shape)


[[  30  130  230  330]
 [ 130  730 1330 1930]
 [ 230 1330 2430 3530]
 [ 330 1930 3530 5130]] (4, 4)

From a vector $v$, one can form the matrix $P=v v^T$ by P = np.outer(v,v).


In [51]:
P = np.outer(v,v)
print(P)


[[ 2.192446    0.65656736 -1.72394882 -2.37473838 -1.77520653]
 [ 0.65656736  0.19662089 -0.51626745 -0.71115808 -0.5316175 ]
 [-1.72394882 -0.51626745  1.35556338  1.86728768  1.39586799]
 [-2.37473838 -0.71115808  1.86728768  2.57218758  1.92280725]
 [-1.77520653 -0.5316175   1.39586799  1.92280725  1.43737096]]

Useful Functions

See the Documentation on arrays and array creation.

**Warning:** Operations such as transpose, reshape, etc. do not modify the array; if you want to keep the result of the operation, you have to assign it to a variable. The notable exceptions are indicated as *in-place* in the documentation.

In [52]:
A.reshape((2,10))


Out[52]:
array([[ 0,  1,  2,  3,  4, 10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24, 30, 31, 32, 33, 34]])

In [53]:
print(A)


[[ 0  1  2  3  4]
 [10 11 12 13 14]
 [20 21 22 23 24]
 [30 31 32 33 34]]

In [54]:
B = A.flatten()
print(B)


[ 0  1  2  3  4 10 11 12 13 14 20 21 22 23 24 30 31 32 33 34]

In [55]:
print(A.trace(),A.max(),A.argmax())


66 34 19

Some functions can be computed along an axis: axis=0 gives one result per column and axis=1 one result per row.


In [56]:
print(A.mean(),A.mean(axis=0),A.mean(axis=1))


17.0 [ 15.  16.  17.  18.  19.] [  2.  12.  22.  32.]

In [57]:
print(A.var(),A.var(axis=0),A.std(axis=1))


127.0 [ 125.  125.  125.  125.  125.] [ 1.41421356  1.41421356  1.41421356  1.41421356]

Repetition, tiling, concatenation


In [58]:
a = np.array([[1, 2], [3, 4]])
a


Out[58]:
array([[1, 2],
       [3, 4]])

In [59]:
# repeat each element 3 times
np.repeat(a, 3) #  1-d result


Out[59]:
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

In [60]:
# repeat *the matrix* 3 times
np.tile(a, 3)


Out[60]:
array([[1, 2, 1, 2, 1, 2],
       [3, 4, 3, 4, 3, 4]])

In [61]:
b = np.array([[5, 6]])

In [62]:
np.concatenate((a, b), axis=0)


Out[62]:
array([[1, 2],
       [3, 4],
       [5, 6]])

In [63]:
np.concatenate((a, b.T), axis=1)


Out[63]:
array([[1, 2, 5],
       [3, 4, 6]])

In [64]:
np.vstack((a,b))


Out[64]:
array([[1, 2],
       [3, 4],
       [5, 6]])

In [65]:
np.hstack((a,b.T))


Out[65]:
array([[1, 2, 5],
       [3, 4, 6]])

Iterating on arrays


In [66]:
v = np.array([1,2,3,4])

for element in v:
    print(element)


1
2
3
4

In [67]:
a = np.array([[1,2], [3,4]])

for row in a:
    print("row", row)
    
    for element in row:
        print(element)


row [1 2]
1
2
row [3 4]
3
4

enumerate can be used to get indexes along with elements.


In [68]:
for row_idx, row in enumerate(a):
    print("row_idx", row_idx, "row", row)
    
    for col_idx, element in enumerate(row):
        print("col_idx", col_idx, "element", element)
       
        # update the matrix a: square each element
        a[row_idx, col_idx] = element ** 2


row_idx 0 row [1 2]
col_idx 0 element 1
col_idx 1 element 2
row_idx 1 row [3 4]
col_idx 0 element 3
col_idx 1 element 4

In [69]:
a


Out[69]:
array([[ 1,  4],
       [ 9, 16]])

d) Linear Algebra


Numpy comes with an efficient linear algebra module named linalg (see the documentation). As in many languages, the more vectorized the operations are, the more efficient they are.

Decompositions

  • QR: linalg.qr Factor the matrix $A$ as $QR$, where $Q$ is orthonormal and $R$ is upper-triangular.
  • Cholesky: linalg.cholesky Return the Cholesky decomposition, $L L^H$, of the square matrix $A$, where $L$ is lower-triangular. $A$ must be Hermitian and positive-definite. Only $L$ is actually returned.
  • SVD: linalg.svd Factors the matrix $A$ as $U \textrm{diag}(s) V$, where $U$ and $V$ are unitary and $s$ is a 1-d array of $A$'s singular values.

In [70]:
A = np.random.randn(3,2)

In [71]:
Q, R = np.linalg.qr(A)
print(A)
print(Q)
print(R)


[[-0.03428186  1.24014753]
 [-0.40084178 -0.21138138]
 [-0.25740345 -1.2412635 ]]
[[-0.07177878  0.8106889 ]
 [-0.83927582  0.26569205]
 [-0.53894703 -0.5217195 ]]
[[ 0.47760435  0.75736629]
 [ 0.          1.59680286]]

In [72]:
np.allclose(A, np.dot(Q, R)) # check that A=QR


Out[72]:
True

In [73]:
U, s, V = np.linalg.svd(A)
print(U.shape, V.shape, s.shape)


(3, 3) (2, 2) (2,)

In [74]:
S = np.zeros(A.shape)
S[:A.shape[1], :A.shape[1]] = np.diag(s)
np.allclose(A, np.dot(U, np.dot(S, V)))


Out[74]:
True

By default, $U$ and $V$ have the shapes $(M, M)$ and $(N, N)$ respectively if $A$ is $(M,N)$. If full_matrices=False is passed, the shapes are $(M, K)$ and $(K, N)$, respectively, where $K = min(M, N)$.


In [75]:
U, s, V = np.linalg.svd(A, full_matrices=False)
print(U.shape, V.shape, s.shape)


(3, 2) (2, 2) (2,)

In [76]:
S = np.diag(s)
np.allclose(A, np.dot(U, np.dot(S, V)))


Out[76]:
True
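The Cholesky factorization listed above is not demonstrated in this notebook; a small sketch on a symmetric positive-definite matrix built for the purpose:

import numpy as np

B = np.random.randn(3, 3)
S = B.dot(B.T) + 3 * np.eye(3)      # a symmetric positive-definite matrix
L = np.linalg.cholesky(S)           # lower-triangular factor L
print(np.allclose(S, L.dot(L.T)))   # check that S = L L^T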

Eigenvalues

linalg.eig computes the eigenvalues and right eigenvectors of a square array and is the main function (linalg.eigvals computes only the eigenvalues of a general array, linalg.eigh returns the eigenvalues and eigenvectors of a symmetric or Hermitian array).


In [77]:
A = np.array([[1, -1], [1, 1]])
l, v = np.linalg.eig(A)
print(l,v)


[ 1.+1.j  1.-1.j] [[ 0.70710678+0.j          0.70710678-0.j        ]
 [ 0.00000000-0.70710678j  0.00000000+0.70710678j]]

In [78]:
A.dot(v[:,0])


Out[78]:
array([ 0.70710678+0.70710678j,  0.70710678-0.70710678j])

We can check that $Ax= \lambda x$.


In [79]:
np.allclose(A.dot(v[:,0]),l[0]*v[:,0])


Out[79]:
True
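For symmetric or Hermitian arrays, linalg.eigh returns real eigenvalues and orthonormal eigenvectors; a small sketch (not from the original notebook):

import numpy as np

S = np.array([[2.0, 1.0], [1.0, 3.0]])            # a symmetric matrix
w, V = np.linalg.eigh(S)                          # real eigenvalues, orthonormal eigenvectors
print(w)
print(np.allclose(S.dot(V), V.dot(np.diag(w))))   # check that S V = V diag(w)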

Norms and other numbers

The function linalg.norm is able to return one of eight different matrix norms, or one of an infinite number of vector norms (described below), depending on the value of the ord parameter.

| ord   | norm for matrices            | norm for vectors                 |
|-------|------------------------------|----------------------------------|
| None  | Frobenius norm               | 2-norm                           |
| 'fro' | Frobenius norm               | --                               |
| 'nuc' | nuclear norm                 | --                               |
| inf   | max(sum(abs(x), axis=1))     | max(abs(x))                      |
| -inf  | min(sum(abs(x), axis=1))     | min(abs(x))                      |
| 0     | --                           | sum(x != 0)                      |
| 1     | max(sum(abs(x), axis=0))     | as below                         |
| -1    | min(sum(abs(x), axis=0))     | as below                         |
| 2     | 2-norm (largest sing. value) | as below                         |
| -2    | smallest singular value      | as below                         |
| other | --                           | sum(abs(x)$^{ord}$)$^{(1./ord)}$ |

In [80]:
a = np.arange(9) - 4
B = a.reshape((3, 3))

In [81]:
print(a)
print("none \t",np.linalg.norm(a))
print("2 \t",np.linalg.norm(a,ord=2))
print("1 \t",np.linalg.norm(a,ord=1))
print("inf \t",np.linalg.norm(a,ord=np.inf))
print("0 \t",np.linalg.norm(a,ord=0))


[-4 -3 -2 -1  0  1  2  3  4]
none 	 7.74596669241
2 	 7.74596669241
1 	 20.0
inf 	 4.0
0 	 8.0

In [82]:
print(B)
print("none \t",np.linalg.norm(B))
print("2 \t",np.linalg.norm(B,ord=2))
print("1 \t",np.linalg.norm(B,ord=1))
print("inf \t",np.linalg.norm(B,ord=np.inf))
print("fro \t",np.linalg.norm(B,ord='fro'))


[[-4 -3 -2]
 [-1  0  1]
 [ 2  3  4]]
none 	 7.74596669241
2 	 7.34846922835
1 	 7.0
inf 	 9.0
fro 	 7.74596669241

Other useful functions include the condition number linalg.cond, the determinant linalg.det, and the rank linalg.matrix_rank.


In [83]:
A = np.array([[1, 0, -1], [0, 1, 0], [1, 0, 1]]) # some matrix
I = np.eye(4) # identity
Def =  np.eye(4); Def[0,0]=0 # rank deficient

In [84]:
print(np.linalg.cond(A), np.linalg.cond(I))


1.41421356237 1.0

In [85]:
np.linalg.cond(Def)


/usr/local/lib/python3.4/dist-packages/numpy/linalg/linalg.py:1487: RuntimeWarning: divide by zero encountered in true_divide
  return s[..., 0]/s[..., -1]
Out[85]:
inf

In [86]:
print(np.linalg.det(A), np.linalg.det(I), np.linalg.det(Def))


2.0 1.0 0.0

In [87]:
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(I), np.linalg.matrix_rank(Def))


3 4 3

Solving equations and inverting matrices

When $A$ is full-rank, finding $x$ such that $Ax=b$ can be done efficiently using linalg.solve (as a general remark, it is usually a bad idea to invert $A$ to solve such equations, although this can be done with linalg.inv).


In [88]:
A = np.array([[3,1], [1,2]])
b = np.array([9,8])
print(A)
print(b)


[[3 1]
 [1 2]]
[9 8]

In [89]:
x_sol = np.linalg.solve(A,b)
print(x_sol , np.allclose(A.dot(x_sol),b))


[ 2.  3.] True

In [90]:
A_inv = np.linalg.inv(A)
x_sol2 = A_inv.dot(b)
print(x_sol2 , np.allclose(A.dot(x_sol2),b))


[ 2.  3.] True

Why you don't want to invert a matrix unless really needed:


In [91]:
N= 100
A = np.random.randn(N,N)
b = np.random.randn(N)
%timeit x_sol = np.linalg.solve(A,b)
%timeit A_inv = np.linalg.inv(A) ; x_sol2 = A_inv.dot(b)


1000 loops, best of 3: 461 µs per loop
100 loops, best of 3: 2.33 ms per loop

If $A$ is not full-rank, the least squares solution of $Ax=b$ (i.e. the $x$ that minimizes $\|Ax-b\|_2$) can be obtained with linalg.lstsq, while linalg.pinv gives the Moore-Penrose pseudo-inverse.

Example: let us consider the problem of affine regression: fit a line $y = mx + c$ through some noisy data points.


In [92]:
x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])

We seek $v =[m,c]$ such that $Av=y$ with $A = [x ; 1]$, where $1$ is the vector of all ones.


In [93]:
A = np.vstack([x, np.ones(len(x))]).T
print(A)


[[ 0.  1.]
 [ 1.  1.]
 [ 2.  1.]
 [ 3.  1.]]

In [94]:
res = np.linalg.lstsq(A,y)
print(res)


(array([ 1.  , -0.95]), array([ 0.05]), 2, array([ 4.10003045,  1.09075677]))

The result is a tuple (solution, residuals, rank, singular values of A).


In [95]:
m,c = res[0]
print(m,c)


1.0 -0.95

In [96]:
plt.figure()
plt.plot(x, y, 'o', label='Original data', markersize=10)
plt.plot(x, m*x + c, 'r', label='Fitted line')
plt.legend()
plt.show()
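linalg.pinv, mentioned above, is not demonstrated in this notebook; as a sketch, the same least squares solution can be recovered through the pseudo-inverse:

import numpy as np

x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])
A = np.vstack([x, np.ones(len(x))]).T
m_p, c_p = np.linalg.pinv(A).dot(y)   # least squares solution via the Moore-Penrose pseudo-inverse
print(m_p, c_p)                       # should match the m, c obtained with lstsq above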


**Exercise 2-1.1:** Image SVD

The following code snippet reads an image, converts it to grayscale (as a 2D array), and plots it.

In [97]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.cm as cm
%matplotlib inline

#### IMAGE
img = mpimg.imread('img/flower.png')
img_gray =  0.2989 * img[:,:,0] + 0.5870 * img[:,:,1] + 0.1140 * img[:,:,2] # standard luminance coefficients for RGB-to-grayscale conversion
####

print(type(img_gray))

plt.figure()
plt.xticks([]),plt.yticks([])
plt.title("Original")
plt.imshow(img_gray, cmap = cm.Greys_r) 
plt.show()


<class 'numpy.ndarray'>
Investigate how nulling
  • the greatest singular values
  • the smallest singular values
  • random singular values
affects the image. Conclude about the relation between information and singular values.
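As a sketch of the mechanics only (on a small random matrix standing in for the image; adapting it to the three cases is the exercise):

import numpy as np

M = np.random.rand(6, 8)                  # stand-in for the grayscale image array
U, s, Vt = np.linalg.svd(M, full_matrices=False)
s_mod = s.copy()
s_mod[3:] = 0                             # e.g. null all but the 3 largest singular values
M_rec = U.dot(np.diag(s_mod)).dot(Vt)     # rebuild from the modified spectrum
print(np.linalg.norm(M - M_rec))          # reconstruction error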

In [98]:
# Compression percentage: number of eigenvalues set to zero - To modify
Compression = 75.0  # in percents

# SVD 
U, s, Vt = np.linalg.svd(img_gray, full_matrices=False)

# (i) Nulling the smallest singular values 
img_i = np.array([[0]]) #TODO


# (ii) Nulling the greatest singular values
img_ii = np.array([[0]])  #TODO


# (iii) Nulling random singular values
img_iii = np.array([[0]])  #TODO

###############################################



###############################################
## SHOW THE FIGURES
print('The image is {:d}x{:d}\n - it has {:d} singular values between {:.3f} and {:.3f}'.format( img_gray.shape[0], img_gray.shape[1]  , len(s)  ,  np.min(s) , np.max(s) ))

plt.figure()# new figure

plt.subplot(2,2,1)
plt.xticks([]),plt.yticks([])
plt.title("Original")
plt.imshow(img_gray, cmap = cm.Greys_r) 

plt.subplot(2,2,2)
plt.xticks([]),plt.yticks([])
plt.title("(i) smallest")
plt.imshow(img_i, cmap = cm.Greys_r) 

plt.subplot(2,2,3)
plt.xticks([]),plt.yticks([])
plt.title("(ii) greatest")
plt.imshow(img_ii, cmap = cm.Greys_r) 

plt.subplot(2,2,4)
plt.xticks([]),plt.yticks([])
plt.title("(iii) random")
plt.imshow(img_iii, cmap = cm.Greys_r) 

plt.show() #show the window
###############################################


The image is 480x640
 - it has 480 singular values between 0.005 and 162.934

Exercise 2-1.2: Linear Regression

In this example, we use linear algebra to extract information from data; more precisely, we predict the final grades of a group of students from their profiles with the [Student Performance dataset](https://archive.ics.uci.edu/ml/datasets/Student+Performance) which includes secondary education students of two Portuguese schools.

Profiles include features such as student grades, demographic, social and school-related features, collected using school reports and questionnaires. There are $m = 395$ students (examples) and we selected $n = 27$ features (see data/student.txt for the feature descriptions and data/student-mat.csv for the csv dataset).

Our goal is to predict a target feature (the $28$-th) which is the final grade of the student from the other features (the first $27$). We assume that the final grade can be explained by a linear combination of the other features. We are going to learn from this data using linear regression over the $m_{learn} = 300$ students (called the *learning set*). We will check our prediction by comparing the results for the other $m_{test} = 95$ students (the *testing set*).

In [99]:
import numpy as np

# File reading
dat_file = np.load('data/student.npz')
A_learn = dat_file['A_learn']
b_learn = dat_file['b_learn']
A_test = dat_file['A_test']
b_test = dat_file['b_test']

m = 395 # number of read examples (total:395)
n = 27 # features
m_learn = 300
Mathematically, we are given the $m_{learn} \times (n+1)$ *learning matrix* $A_{learn}$ (the number of columns is $n+1$ because a column of ones, called the *intercept* for statistical reasons, is appended), whose rows contain the feature values of each training student, and the vector $b_{learn}$ of target feature values; we seek a size-$(n+1)$ *regression vector* $x$ that minimizes the squared error between $A_{learn} x$ and $b_{learn}$. This problem boils down to the following least squares problem: $$ \min_{x\in\mathbb{R}^{n+1}} \| A_{learn} x - b_{learn} \|_2^2 . $$
  • Observe the rank of the $m_{learn} \times (n+1)$ matrix $A_{learn}$. Does it have full row rank? Full column rank? Conclude about the existence and uniqueness of solutions of the problem.
  • Solve the minimization problem in the least squares sense to retrieve a predictor x_reg (a possible sketch is given after the cell below).

In [100]:
rank_A_learn = 0 #.........................................
#print('Rank of matrix A_learn ({:d} rows, {:d} cols.): {:d}\n'.format(m_learn,n+1,rank_A_learn))

x_reg = 0 #..........................................
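One possible way to fill the placeholders, sketched on synthetic data of the same shapes (A_learn and b_learn would replace the demo arrays):

import numpy as np

A_demo = np.random.randn(300, 28)
b_demo = np.random.randn(300)
print(np.linalg.matrix_rank(A_demo))          # rank of the learning matrix
x_demo = np.linalg.lstsq(A_demo, b_demo)[0]   # least squares regression vector
print(x_demo.shape)                           # (28,)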
  • In order to test the goodness of our predictor x_reg, we use the rest of the data to compare our predictions with the actual observations. The test matrix $A_{test}$ has $m_{test} = 95$ rows (students) and $n+1 = 28$ columns (features+intercept). Construct the predicted grades from x_reg and compare with the actual observed grades in $b_{test}$ (set SHOW_PREDICTION = True in the code).
  • Compare the relative values of the coefficients of the predictor x_reg (set SHOW_PREDICTOR = True in the code). What can you observe about the relative importance of the features?

In [101]:
predict = 0 #.........................................

SHOW_PREDICTION = False
SHOW_PREDICTOR  = False

if SHOW_PREDICTION:
    print('\n\n  Predicted | True value')
    for i in range(predict.size):
        print('\t{:2d}     {:2d} '.format(int(predict[i]),int(b_test[i])))
        
if SHOW_PREDICTOR:
    filename = 'data/student.txt' 
    f = open(filename, 'r')
    f.readline() # read the first (description) line
    for i in range(n):
        print("{:2.3f} \t-- {:s}".format(x_reg[i],f.readline()))
    print("{:2.3f} \t-- Intercept".format(x_reg[n]))
    f.close()

Exercise 2-1.3: PageRank

In this part, we will compute the PageRank ordering of a small web graph (defined by its adjacency matrix A below). In PageRank, the score $x_i$ of page $i$ is equal to the sum, over the pages $j$ pointing toward $i$, of their scores $x_j$ divided by their number of outgoing links $n_j$. This leads to a ranking matrix $R$ defined from the scoring method as $$ x = Rx.$$

In [102]:
import numpy as np
import matplotlib.pyplot as plt

#### Graph matrix
A = np.array([[0,1,1,0,1],[0,0,0,1,1],[1,0,0,1,0],[0,0,1,0,1],[0,1,0,0,0]])
####

#  column stochastic normalization: divide each column of A by its sum
R = np.dot( A  , np.diag(1.0/np.sum(A,0)))
  • Explain how the ranking matrix $R$ is generated from the adjacency matrix $A$ and why the sums of its columns are equal to $1$.
  • Check numerically that $\|R\| = 1$ for some matrix norm and that the spectral radius of $R$ is equal to $1$.

In [ ]:

  • Iterate the matrix $R$ a large number of times and check whether the matrix is primitive. What do you notice about the eigenvalues and eigenvectors? How is the rank-1 matrix that you obtain defined? This way of computing eigenvectors/eigenvalues is called the *power method*.
  • Recover the *Perron* eigenvector of the matrix $R$. The entries of this vector are the PageRank scores of the nodes/pages of the graph. Give the PageRank ordering of the pages of the graph (a possible power-method sketch is given after the cell below).

In [103]:
v = 0 #......................................... 
#print("Perron vector: {0}".format(v))
  • (to go further) In this exercise, the graph we took led to a *primitive* matrix as seen above; this is necessary for the power method to work, as the eigenvalue $1$ has to be the only one of modulus $1$. This is the case when the graph is strongly connected, that is when you can go from any node to any other node by following the edges, with *enough* loops. When this is not the case, the problem becomes ill-posed. To overcome this problem, the ranking matrix $R$ is replaced by $$ M = (1-\alpha) R + \alpha J, ~~~~~~~~ \alpha\in]0,1[ $$ where $J$ is the $5\times 5$ matrix whose entries are all $1/5$. The value of $\alpha$ originally used by Google is $0.15$.
    Compute the ranking for the original graph but with the link from 2 to 5 suppressed.

In [104]:
#### New Graph matrix
A_2 = np.array([[0,1,1,0,1],[0,0,0,1,1],[1,0,0,1,0],[0,0,1,0,1],[0,0,0,0,0]])
####

Exercise 2-1.4: LibSVM Parsing

The LibSVM format README file reads:
The format of training and testing data file is:

<label> <index1>:<value1> <index2>:<value2> ...

Each line contains an instance and is ended by a '\n' character. For classification, <label> is an integer indicating the class label (multi-class is supported). For regression, <label> is the target value which can be any real number. For one-class SVM, it's not used so can be any number. The pair <index>:<value> gives a feature (attribute) value: <index> is an integer starting from 1 and <value> is a real number. The only exception is the precomputed kernel, where  <index> starts from 0; see the section of precomputed kernels. Indices must be in ASCENDING order. Labels in the testing file are only used to calculate accuracy or errors. If they are unknown, just fill the first column with any numbers.



Create a module including a function that parses a LibSVM file and returns the output as numpy arrays.
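A possible starting point (a sketch under the format assumptions above; the number of features n_features is passed explicitly here, whereas a real parser might infer it, and the output is dense):

import numpy as np

def parse_libsvm(filename, n_features):
    # Parse a LibSVM-format file into (labels, features) as dense numpy arrays.
    labels, rows = [], []
    with open(filename, 'r') as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue                            # skip empty lines
            labels.append(float(tokens[0]))         # first column: label / target value
            row = np.zeros(n_features)
            for pair in tokens[1:]:
                index, value = pair.split(':')
                row[int(index) - 1] = float(value)  # <index> starts from 1
            rows.append(row)
    return np.array(labels), np.array(rows)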

In [ ]:


Package Check and Styling



In [ ]:
import lib.notebook_setting as nbs

packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)

nbs.cssStyling()