Introduction to Python for Data Sciences
Franck Iutzeler, Fall 2018
Python has a large standard library, commonly cited as one of Python's greatest strengths, providing tools suited to many tasks. As of May 2017, PyPI, the official repository of third-party Python software, contained over 107,000 packages.
A package is a collection of modules, i.e. groups of functions, classes, constants, types, etc.
To use a module, you have to import it with the import statement.
In [1]:
import math
You can now use its contents in the code with the prefix package_name.
In [2]:
x = math.cos(2 * math.pi)
print(x)
To explore the function and other content of the library:
In [3]:
help(math)
In [4]:
help(math.sqrt)
Using the prefix package_name. can make the code quite verbose (e.g. scipy.optimize.minimize), so Python provides simpler ways to import:
In [5]:
import math as m
print(m.pi)
In [6]:
from math import e,log
print(log(e**4))
In order to reuse parts of your code, it is often efficient to write your own modules for frequently used functions. Importing such a module works the same as for other libraries, except that you have to give the (relative) path to the module.
In our example, the module file is pyds.py in the ./lib/ folder (you may open it to see what it contains). There are two ways to give the (relative) path to the module:
In [7]:
import sys
sys.path.append( "./lib/" )
import pyds as pyds1
help(pyds1)
In [8]:
import lib.pyds as pyds2
pyds2.printQuote()
Numpy is a numerical calculus and algebra package that is widely used, notably providing the array (vector/matrix format) type used in almost all numerical projects. Documentation
Matplotlib is a module for generating 2D and 3D graphics. Documentation
It is common to import them with the respective nicknames np and plt (for matplotlib.pyplot). In Notebooks, it is common to add the Jupyter magic command %matplotlib inline to display the plots inside the notebook.
In [9]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [10]:
x = np.array([1, 2.5, 5, 10])
print(x,type(x))
In [11]:
y = np.random.rand(4)
print(y,type(y))
Visualizing the data is quite simple with pyplot:
In [12]:
plt.figure()
plt.plot(x,y, 'bd--', label='y(x)')
plt.legend(loc='lower right')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Sample Figure')
plt.xlim([0, 10])
plt.ylim([0, 1])
plt.savefig('img/sample.png')
plt.show()
In [13]:
M = np.array([[0.25, 6.2, 1, 10],[12, 6.2, 6, -5.3]])
print(M,type(M))
In [14]:
print(x.size) # or equivalently np.size(x)
print(M.size)
In [15]:
print(x.shape) # or equivalently np.shape(x)
print(M.shape)
In [16]:
print(x)
print(x[2],x[-1])
print(x[::-1])
x[0] = 6.1554
print(x)
In [17]:
v = x
w = np.copy(x)
print(v)
x[0]=1
print(v)
print(w)
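Note that slices are views as well: modifying a slice modifies the original array, while an explicit copy is independent. A minimal sketch:

```python
import numpy as np

x = np.array([1.0, 2.5, 5.0, 10.0])
s = x[1:3]          # a view on x, not a copy
s[0] = -7.0         # modifies x as well
print(x)            # the second element is now -7.0
c = x[1:3].copy()   # an explicit copy is independent
c[0] = 99.0
print(x)            # unchanged by modifying the copy
```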
In [18]:
print(M)
print(M[1,2],type(M[1,2]))
print(M[1,:],type(M[1,:]),M[1,:].shape)
print(M[1])
print(M[:,0])
Advanced access to content and modification is possible:
In [19]:
x = np.array([1, 2.5, 5, 10])
ind = [0,2,3]
print(x[ind])
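Boolean masks are another form of advanced access; a short sketch:

```python
import numpy as np

x = np.array([1, 2.5, 5, 10])
mask = x > 2        # boolean array: [False, True, True, True]
print(x[mask])      # elements where the mask is True
x[x > 6] = 0        # masks can also be used for assignment
print(x)
```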
In [20]:
print(x.dtype)
Arrays only accept values that can be cast to their own type. However, the type of an array can be changed.
In [21]:
try:
x[0] = 1 + 2.0j
except Exception as e:
print(e)
In [22]:
y = x.astype(complex)
y[0] = 1 + 2.0j
print(y,type(y),y.dtype)
See the corresponding documentation.
In [23]:
x = np.arange(0, 10, 1.5)
print(x,type(x))
In [24]:
y = np.linspace(0, 10, 25)
print(y,type(y))
In [25]:
x = np.zeros(3)
print(x,x.shape,type(x),x.dtype)
x = np.zeros((3,))
print(x,x.shape,type(x),x.dtype)
In [26]:
try:
x = np.zeros(3,3) # This causes an error as 3,3 is not a shape, it is (3,3) -> double parentheses
except Exception as error:
print(error)
print(x,x.shape,type(x),x.dtype)
In [27]:
x = np.zeros((3,3))
print(x,x.shape,type(x),x.dtype)
In [28]:
M = np.eye(3)
print(M,M.shape,type(M),M.dtype)
Random arrays can be generated by Numpy's random module.
**rand** returns an array of floats uniformly distributed in [0,1), of the specified dimensions
**randn** returns an array of floats drawn from the standard normal distribution, of the specified dimensions
**randint** returns an array of integers drawn from the discrete uniform distribution
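For reproducible experiments, the generator can be seeded with np.random.seed; a minimal sketch (the seed value 0 is arbitrary):

```python
import numpy as np

np.random.seed(0)
a = np.random.rand(3)   # three uniform draws
np.random.seed(0)       # reset to the same state
b = np.random.rand(3)   # the same three draws again
print(np.allclose(a, b))
```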
In [29]:
np.random.rand(5)
Out[29]:
In [30]:
np.random.randn(5,2)
Out[30]:
In [31]:
np.random.randint(0,100,size=(10,))
Out[31]:
In [32]:
a = np.random.randn(10000)
plt.figure()
plt.hist(a,40) # histogram of a with 40 bins
plt.show()
In [33]:
v = np.arange(0, 5)
print(v)
In [34]:
v * 2
Out[34]:
In [35]:
v + 2.5
Out[35]:
In [36]:
square = v**2
root = np.sqrt(v)
print(square,root)
In [37]:
plt.figure()
plt.subplot(1,2,1)
plt.plot(square,'g--', label='$y = x^2$')
plt.legend(loc=0)
plt.subplot(1,2,2)
plt.plot(root, 'r*-', label='$y = \sqrt{x}$')
plt.legend(loc=2)
plt.show()
In [38]:
A = np.array([[n+m*10 for n in range(5)] for m in range(4)])
print(A)
In [39]:
A*2
Out[39]:
In [40]:
A+2.5
Out[40]:
Matrices can be visualized as images.
In [41]:
C = np.random.randn(100,100)
plt.figure()
plt.imshow(C)
plt.colorbar()
plt.show()
In [42]:
A = np.array([[n+m*10 for n in range(5)] for m in range(4)])
v = np.random.randn(5)
print(A,v)
In [43]:
A*A
Out[43]:
In [44]:
v*v
Out[44]:
In [45]:
A*v
Out[45]:
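The product A*v above works thanks to NumPy broadcasting: v, of shape (5,), is stretched against each row of A, of shape (4,5). A minimal sketch:

```python
import numpy as np

A = np.ones((4, 5))
v = np.arange(5.0)   # shape (5,)
B = A * v            # v is broadcast against each of the 4 rows of A
print(B.shape)       # (4, 5)
print(np.allclose(B[0], v))  # every row equals v here
```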
It can be useful to transpose an array; this is simply done by suffixing .T (or equivalently using the function np.transpose). Similarly, .conj().T gives the Hermitian conjugate, .real and .imag the real and imaginary parts, and np.abs the modulus (the function versions are respectively np.conjugate, np.real, np.imag, np.abs).
In [46]:
print(A,A.shape)
print(A.T,A.T.shape)
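For complex arrays, a short sketch of these real/imaginary/modulus operations:

```python
import numpy as np

z = np.array([1 + 2j, 3 - 4j])
print(z.real)           # real parts: [1. 3.]
print(z.imag)           # imaginary parts: [ 2. -4.]
print(np.conjugate(z))  # complex conjugates
print(np.abs(z))        # moduli
```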
In [47]:
y = np.dot(A,v)
print(A,A.shape,v,v.shape)
print(y,type(y),y.shape)
Example of vector-vector multiplication i.e. a scalar product
In [48]:
s = v.dot(v)
print(v, s, type(s))
Example of non-compatible shapes
In [49]:
try:
A2 = np.dot(A,A)
except Exception as error:
print(error)
In [50]:
A3 = np.dot(A,A.T)
print(A3,A3.shape)
From a vector $v$, one can form the matrix $P=v v^T$ with P = np.outer(v,v).
In [51]:
P = np.outer(v,v)
print(P)
See the Documentation on arrays and array creation.
In [52]:
A.reshape((2,10))
Out[52]:
In [53]:
print(A)
In [54]:
B = A.flatten()
print(B)
In [55]:
print(A.trace(),A.max(),A.argmax())
Some functions can be applied along the columns with axis=0 or along the rows with axis=1.
In [56]:
print(A.mean(),A.mean(axis=0),A.mean(axis=1))
In [57]:
print(A.var(),A.var(axis=0),A.std(axis=1))
In [58]:
a = np.array([[1, 2], [3, 4]])
a
Out[58]:
In [59]:
# repeat each element 3 times
np.repeat(a, 3) # 1-d result
Out[59]:
In [60]:
# repeat *the matrix* 3 times
np.tile(a, 3)
Out[60]:
In [61]:
b = np.array([[5, 6]])
In [62]:
np.concatenate((a, b), axis=0)
Out[62]:
In [63]:
np.concatenate((a, b.T), axis=1)
Out[63]:
In [64]:
np.vstack((a,b))
Out[64]:
In [65]:
np.hstack((a,b.T))
Out[65]:
In [66]:
v = np.array([1,2,3,4])
for element in v:
print(element)
In [67]:
a = np.array([[1,2], [3,4]])
for row in a:
print("row", row)
for element in row:
print(element)
enumerate can be used to get indices along with elements.
In [68]:
for row_idx, row in enumerate(a):
print("row_idx", row_idx, "row", row)
for col_idx, element in enumerate(row):
print("col_idx", col_idx, "element", element)
# update the matrix a: square each element
a[row_idx, col_idx] = element ** 2
In [69]:
a
Out[69]:
Numpy comes with an efficient linear algebra module named linalg (see the documentation). As in many languages, the more vectorized the operations are, the more efficient they are.
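To illustrate the gain from vectorization, a quick sketch comparing a plain Python loop with the vectorized scalar product:

```python
import time
import numpy as np

v = np.random.randn(100000)

t0 = time.perf_counter()
s_loop = 0.0
for x_i in v:              # explicit Python loop
    s_loop += x_i * x_i
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
s_vec = v.dot(v)           # vectorized scalar product
t_vec = time.perf_counter() - t0

print(np.isclose(s_loop, s_vec))  # same result, much faster vectorized
```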
In [70]:
A = np.random.randn(3,2)
In [71]:
Q, R = np.linalg.qr(A)
print(A)
print(Q)
print(R)
In [72]:
np.allclose(A, np.dot(Q, R)) # check that A=QR
Out[72]:
In [73]:
U, s, V = np.linalg.svd(A)
print(U.shape, V.shape, s.shape)
In [74]:
S = np.zeros(A.shape)
S[:A.shape[1], :A.shape[1]] = np.diag(s)
np.allclose(A, np.dot(U, np.dot(S, V)))
Out[74]:
By default, $U$ and $V$ have the shapes $(M, M)$ and $(N, N)$ respectively if $A$ is $(M,N)$. If full_matrices=False is passed, the shapes are $(M, K)$ and $(K, N)$, respectively, where $K = min(M, N)$.
In [75]:
U, s, V = np.linalg.svd(A, full_matrices=False)
print(U.shape, V.shape, s.shape)
In [76]:
S = np.diag(s)
np.allclose(A, np.dot(U, np.dot(S, V)))
Out[76]:
In [77]:
A = np.array([[1, -1], [1, 1]])
l, v = np.linalg.eig(A)
print(l,v)
In [78]:
A.dot(v[:,0])
Out[78]:
We can check that $Ax= \lambda x$.
In [79]:
np.allclose(A.dot(v[:,0]),l[0]*v[:,0])
Out[79]:
The function linalg.norm is able to return one of eight different matrix norms, or one of an infinite number of vector norms (described below), depending on the value of the ord parameter.
ord | norm for matrices | norm for vectors |
---|---|---|
None | Frobenius norm | 2-norm |
‘fro’ | Frobenius norm | – |
‘nuc’ | nuclear norm | – |
inf | max(sum(abs(x), axis=1)) | max(abs(x)) |
-inf | min(sum(abs(x), axis=1)) | min(abs(x)) |
0 | – | sum(x != 0) |
1 | max(sum(abs(x), axis=0)) | as below |
-1 | min(sum(abs(x), axis=0)) | as below |
2 | 2-norm (largest sing. value) | as below |
-2 | smallest singular value | as below |
other | – | sum(abs(x)$^{ord}$)$^{(1./ord)}$ |
In [80]:
a = np.arange(9) - 4
B = a.reshape((3, 3))
In [81]:
print(a)
print("none \t",np.linalg.norm(a))
print("2 \t",np.linalg.norm(a,ord=2))
print("1 \t",np.linalg.norm(a,ord=1))
print("inf \t",np.linalg.norm(a,ord=np.inf))
print("0 \t",np.linalg.norm(a,ord=0))
In [82]:
print(B)
print("none \t",np.linalg.norm(B))
print("2 \t",np.linalg.norm(B,ord=2))
print("1 \t",np.linalg.norm(B,ord=1))
print("inf \t",np.linalg.norm(B,ord=np.inf))
print("fro \t",np.linalg.norm(B,ord='fro'))
Other useful functions include the condition number linalg.cond, the determinant linalg.det, and the rank linalg.matrix_rank.
In [83]:
A = np.array([[1, 0, -1], [0, 1, 0], [1, 0, 1]]) # some matrix
I = np.eye(4) # identity
Def = np.eye(4); Def[0,0]=0 # rank deficient
In [84]:
print(np.linalg.cond(A), np.linalg.cond(I))
In [85]:
np.linalg.cond(Def)
Out[85]:
In [86]:
print(np.linalg.det(A), np.linalg.det(I), np.linalg.det(Def))
In [87]:
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(I), np.linalg.matrix_rank(Def))
In [88]:
A = np.array([[3,1], [1,2]])
b = np.array([9,8])
print(A)
print(b)
In [89]:
x_sol = np.linalg.solve(A,b)
print(x_sol , np.allclose(A.dot(x_sol),b))
In [90]:
A_inv = np.linalg.inv(A)
x_sol2 = A_inv.dot(b)
print(x_sol2 , np.allclose(A.dot(x_sol2),b))
Why you should not invert a matrix unless it is really needed:
In [91]:
N= 100
A = np.random.randn(N,N)
b = np.random.randn(N)
%timeit x_sol = np.linalg.solve(A,b)
%timeit A_inv = np.linalg.inv(A) ; x_sol2 = A_inv.dot(b)
If $A$ is not full-rank, the least squares solution of $Ax=b$ (i.e. the $x$ that minimizes $\|Ax-b\|_2$) can be obtained with linalg.lstsq; linalg.pinv gives the Moore-Penrose pseudoinverse.
Example: Let us see the problem of affine regression: fit a line, $y = mx + c$ through some noisy data-points
In [92]:
x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])
We seek $v =[m,c]$ such that $Av=y$ with $A = [x ; 1]$ and $1$ is the vector of all ones.
In [93]:
A = np.vstack([x, np.ones(len(x))]).T
print(A)
In [94]:
res = np.linalg.lstsq(A,y)
print(res)
The result is a tuple (solution, residuals, rank, singular values of A).
In [95]:
m,c = res[0]
print(m,c)
In [96]:
plt.figure()
plt.plot(x, y, 'o', label='Original data', markersize=10)
plt.plot(x, m*x + c, 'r', label='Fitted line')
plt.legend()
plt.show()
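The same fit can be obtained via the Moore-Penrose pseudoinverse linalg.pinv mentioned above; a sketch reusing the regression data:

```python
import numpy as np

x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])
A = np.vstack([x, np.ones(len(x))]).T

m, c = np.linalg.pinv(A).dot(y)            # x = A^+ y minimizes ||Ax - y||_2
m2, c2 = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.allclose([m, c], [m2, c2]))       # both methods give the same line
```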
In [97]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.cm as cm
%matplotlib inline
#### IMAGE
img = mpimg.imread('img/flower.png')
img_gray = 0.2989 * img[:,:,0] + 0.5870 * img[:,:,1] + 0.1140 * img[:,:,2] # standard (ITU-R BT.601) luminance coefficients for grayscale conversion
####
print(type(img_gray))
plt.figure()
plt.xticks([]),plt.yticks([])
plt.title("Original")
plt.imshow(img_gray, cmap = cm.Greys_r)
plt.show()
In [98]:
# Compression percentage: number of eigenvalues set to zero - To modify
Compression = 75.0 # in percents
# SVD
U, s, Vt = np.linalg.svd(img_gray, full_matrices=False)
# (i) Nulling the smallest singular values
img_i = np.array([[0]]) #TODO
# (ii) Nulling the greatest singular values
img_ii = np.array([[0]]) #TODO
# (iii) Nulling random singular values
img_iii = np.array([[0]]) #TODO
###############################################
###############################################
## SHOW THE FIGURES
print('The image is {:d}x{:d}\n - it has {:d} singular values between {:.3f} and {:.3f}'.format( img_gray.shape[0], img_gray.shape[1] , len(s) , np.min(s) , np.max(s) ))
plt.figure()# new figure
plt.subplot(2,2,1)
plt.xticks([]),plt.yticks([])
plt.title("Original")
plt.imshow(img_gray, cmap = cm.Greys_r)
plt.subplot(2,2,2)
plt.xticks([]),plt.yticks([])
plt.title("(i) smallest")
plt.imshow(img_i, cmap = cm.Greys_r)
plt.subplot(2,2,3)
plt.xticks([]),plt.yticks([])
plt.title("(ii) greatest")
plt.imshow(img_ii, cmap = cm.Greys_r)
plt.subplot(2,2,4)
plt.xticks([]),plt.yticks([])
plt.title("(iii) random")
plt.imshow(img_iii, cmap = cm.Greys_r)
plt.show() #show the window
###############################################
In [99]:
import numpy as np
# File reading
dat_file = np.load('data/student.npz')
A_learn = dat_file['A_learn']
b_learn = dat_file['b_learn']
A_test = dat_file['A_test']
b_test = dat_file['b_test']
m = 395 # number of read examples (total:395)
n = 27 # features
m_learn = 300
In [100]:
rank_A_learn = 0 #.........................................
#print('Rank of matrix A_learn ({:d} rows, {:d} cols.): {:d}\n'.format(m_learn,n+1,rank_A_learn))
x_reg = 0 #..........................................
In [101]:
predict = 0 #.........................................
SHOW_PREDICTION = False
SHOW_PREDICTOR = False
if SHOW_PREDICTION:
print('\n\n Predicted | True value')
for i in range(predict.size):
print('\t{:2d} {:2d} '.format(int(predict[i]),int(b_test[i])))
if SHOW_PREDICTOR:
filename = 'data/student.txt'
f = open(filename, 'r')
f.readline() # read the first (description) line
for i in range(n):
print("{:2.3f} \t-- {:s}".format(x_reg[i],f.readline()))
print("{:2.3f} \t-- Intercept".format(x_reg[n]))
f.close()
In [102]:
import numpy as np
import matplotlib.pyplot as plt
#### Graph matrix
A = np.array([[0,1,1,0,1],[0,0,0,1,1],[1,0,0,1,0],[0,0,1,0,1],[0,1,0,0,0]])
####
# column-stochastic normalization
R = np.dot( A , np.diag(1.0/np.sum(A,0)))
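A quick check that R is indeed column stochastic, i.e. that each column sums to 1:

```python
import numpy as np

A = np.array([[0,1,1,0,1],[0,0,0,1,1],[1,0,0,1,0],[0,0,1,0,1],[0,1,0,0,0]])
R = np.dot(A, np.diag(1.0 / np.sum(A, 0)))  # divide each column by its sum
print(np.allclose(R.sum(axis=0), 1.0))      # every column now sums to 1
```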
In [103]:
v = 0 #.........................................
#print("Perron vector: {0}".format(v))
In [104]:
#### New Graph matrix
A_2 = np.array([[0,1,1,0,1],[0,0,0,1,1],[1,0,0,1,0],[0,0,1,0,1],[0,0,0,0,0]])
####
The format of the training and testing data files is:
<label> <index1>:<value1> <index2>:<value2> ...
Each line contains an instance and is ended by a '\n' character. For classification, <label> is an integer indicating the class label (multi-class is supported). For regression, <label> is the target value which can be any real number. For one-class SVM, it's not used so can be any number. The pair <index>:<value> gives a feature (attribute) value: <index> is an integer starting from 1 and <value> is a real number. The only exception is the precomputed kernel, where <index> starts from 0; see the section of precomputed kernels. Indices must be in ASCENDING order. Labels in the testing file are only used to calculate accuracy or errors. If they are unknown, just fill the first column with any numbers.
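A minimal sketch of parsing one line in this format (the sample line and the helper name are hypothetical):

```python
def parse_libsvm_line(line):
    """Parse '<label> <index1>:<value1> <index2>:<value2> ...' into (label, {index: value})."""
    parts = line.strip().split()
    label = float(parts[0])
    features = {}
    for pair in parts[1:]:
        idx, val = pair.split(":")
        features[int(idx)] = float(val)
    return label, features

label, feats = parse_libsvm_line("1 1:0.5 3:-1.2 7:4.0")
print(label, feats)
```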
In [ ]:
import lib.notebook_setting as nbs
packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)
nbs.cssStyling()