Numerical Python (NumPy) is the fundamental package for high performance scientific computing and data analysis. It is the founding for most high-level tools. Features:
ndarray
: fast and space-efficient multidimensional array providing vectorized arithmetic operationsFor most data analysis applications, the main functionalities we might be interested in are:
NumPy provides the computational foundation for these operations but you might want to use pandas as your basis for data analysis as it provides a rich, high-level interface masking most common data tasks very concise and simple. It also provides some more domain specific functionality like time series manipulation, absent in NumPy.
dtype
is a special object containing the information the ndarrays needs to interpret a chunk of memory as a particular type of data:
arr = np.array([1,2,3], dtype=np.float64)
dtype
objects are part of what makes NumPy so powerful and flexible. In most of the cases they map directly onto an underlying machine representation. This makes it easy to read/write binary streams of data to disk and connect to code written in languages such as C or Fortran. The most common numerical types are floatXX
and intXX
where XX
stands for the number of bits (tipically 64 for float and 32 for int).
An array originally created with one type can be converter or casted to another type:
In [ ]:
import numpy as np
arr = np.array([1,2,3])
print(arr," is of type ",arr.dtype)
float_arr = arr.astype(np.float64)
print(float_arr," is of type ",float_arr.dtype)
However, pay attention to the fact that a astype()
produces a new object.
Numbers introduced as strings can also be casted to numerical values:
In [ ]:
numeric_string = np.array(input("Type a sequence of numbers separated by spaces: ").split())
In [ ]:
numeric_string.astype(float)
One array can be casted to another array's dtype
:
In [ ]:
int_array = np.arange(10)
other_array = np.array([0.5, 1.2], dtype=float)
int_array.astype(other_array.dtype)
Indexing and slicing in NumPy is a topic that could fill pages and pages. A single array element or a subset can be selected in multiple ways. The following will be a short reminder on array indexing and slicing in NumPy. In the case of one-dimensional arrays there are apparently no differences with respect to indexing in Python lists:
In [ ]:
vec = np.arange(10)*3 + 0.5
vec
We obtain the sixth element of the array similarly to what one would do in a list:
In [ ]:
vec[5]
The subset formed by the sixth, seventh, and eigth element:
In [ ]:
vec[5:8] #as in lists, first index is included, last index is omitted
We can modify the value of a single element:
In [ ]:
vec[5] = 11
vec
Or a subset of the vector:
In [ ]:
vec [:3] = [11, 12, 13]
vec
If an scalar is assigned to a slice of the array, the value is broadcasted:
In [ ]:
vec [-3:] = 66
vec
READ THIS An importat difference with respect to lists is that slices of a given array are views on the original array. This means that the data is not copied and any modification will be reflected in the source array:
In [ ]:
print("This is the vector",vec)
slice = vec[5:8] #we obtain a slice of the vector
slice[1] = 1111 #modify the second element in the slice
print("This is the vector",vec)
This can be quite surprising and even inconvenient for those coming from programming language such as Fortran90 that copy data more zealously. If you want to actually copy a slice of an ndarray instead of just viewing it, you can either create a new array or use the copy()
method:
slice = np.array(vec[5:8])
slice = vec[5:8].copy()
With higher dimensional arrays more possibilities are available. For instance, in the case of a 2D array, the elements at each index are no longer scalars but rather one-dimensional arrays. Similarly in the case of a 3D array, at each index we would have a 2D array composed of 1D arrays at each subindex, and so on.
In [ ]:
arr2D = np.arange(9).reshape(3,3)
arr2D[2]
Individual elements can be accessed recursively. You can use either the syntax employed in lists or the more compact and nice comma-separated list of indices to select individual elements.
In [ ]:
arr2D[2][1]
In [ ]:
arr2D[2,1]
In the case of multidimensional arrays, omitting later indices will produce lower-dimensional array consisting of all the data along the higher dimensions:
In [ ]:
arr3D = np.arange(27).reshape(3,3,3)
arr3D
Check that, for instance, arr3D[0]
is a 3x3 matrix:
In [ ]:
Similarly, arr3D[2,1]
gives us all the values of the three-indexed variable whose indices start with (1,0), forming a 1D array:
In [ ]:
In [ ]:
vec[1:3]
Higher dimensional objects can be sliced along one or more axes and also mix integers. For instance in the 2D case:
In [ ]:
arr2D[:2]
The matrix is sliced along its first axis (rows) an the first two elements are given. One can pass multiple slices:
In [ ]:
arr2D[:2,1:]
Slice the first two rows and columns from the second onwards. Keep in mind that slices are views of the original array. The colon by itself means that the entire axis is taken:
In [ ]:
arr2D[:,-1] #produce the last column of the matrix
To illustrate this indexing style let us build a 7x5 matrix filled with random numbers distributed between 0 and 20.
In [ ]:
arr2D = np.random.randint(0,20,35).reshape(7,5)
A second sequence, a 1D array composed by 7 elements.
In [ ]:
arr1D = np.random.randn(7)
Suppose each value in arr1D
corresponds to a row in the arr2D
array and we want to select all the rows corresponding to absolute values in arr1D
greater than 0.5. If we apply a logical operation onto the array the result is a NumPy array of the same shape with a boolean value according to the outcome of the logical operation on the corresponding element.
In [ ]:
np.abs(arr1D) > 0.5
This boolean array can be passed when indexing the array:
In [ ]:
arr2D[np.abs(arr1D) > 0.5]
Clearly, the boolean array must be of the same length as the axis it's indexing. Boolean arrays can be mixed with slices or integer indices:
In [ ]:
arr2D[np.abs(arr1D) > 0.5, -1] #last column of those rows matching the logical condition
Boolean conditions can be combined using logical operators like &
(equivalent to and
) and |
(or
)
In [ ]:
arr2D[(np.abs(arr1D) > 0.5) & (arr1D > 0)]
However, keep in mind that the Python keywords and
and or
do not work with boolean arrays.
We can also apply to the array a mask produced by a logical condition on the array itself. Which elements of arr2D
are even?
In [ ]:
In [ ]:
arr2D
In [ ]:
arr2D[[0,2]] #Fancy index is an array requesting rows 0 and 2
Using negative indices selects rows from the end:
In [ ]:
arr2D[[-1,-3]]
Passing multiple indices does something slightly different; it selects a 1D array of elements corresponding to each tuple of indices:
In [ ]:
arr2D[[0,2],[-1,-3]]
Check that we have selected elements (0,-1) and (2,-3):
In [ ]:
In [ ]:
arr2D.shape
In [ ]:
arr2D.T.shape
In [ ]:
arr2D
In [ ]:
arr2D.T
For higher dimensional arrays, transpose
will accept a tuple of axis numbers to permute the axes:
In [ ]:
arr = np.arange(16).reshape((2,2,4))
arr
In [ ]:
arr.transpose(1,0,2)
In [ ]:
arr[0]
In [ ]:
arr.transpose(1,0,2)[1]
A universal function (ufunc) is a function that performs elementwise operations on data stired in ndarrays. They can be seen as fast vectorized wrappers for simple functions that take an array of values and produce an array of results. Many such ufuncs are elementwise representations and are called unary ufuncs:
In [ ]:
arr1D = np.arange(10)
np.sqrt(arr1D)
Other functions take two arrays as arguments. They are called binary ufuncs and some examples are add
or maximum
.
In [ ]:
i = np.random.randint(8, size=8)
i
In [ ]:
j = np.random.randint(8, size=8)
j
In [ ]:
np.maximum(i,j) #produces element-wise maxima
See the following link for more on ufuncs.
Let us start with a simple example, we will evaluate the function $\sqrt{x^2 + y^2}$ on a grid of equally spaced $(x,y)$ values.
In [ ]:
points = np.arange(-5, 5, 0.01)
print(points.ndim, points.shape, points.size)
The np.mesgrid()
function takes two 1D arrays and produces two 2D matrices corresponding to all pairs of (x, y)
in the two arrays:
In [ ]:
import numpy as np
x, y = np.meshgrid(points, points)
In [ ]:
y
To evaluate the function we simply write the same expression we would write for two scalar values, only now x
and y
are arrays. Being of the same dimension, the operation is carried out element wise:
In [ ]:
zp = np.sqrt(x**2 + y**2)
In [ ]:
zp
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
plt.imshow(zp, cmap=plt.cm.gray)
plt.colorbar()
plt.title("Image plot of $\sqrt{x^2 + y^2}$ for a evenly spaced grid")
In [ ]:
xarr = np.linspace(1.1,1.5,5)
yarr = np.linspace(2.1,2.5,5)
We also have a boolean array of the same dimension, for instance:
In [ ]:
condition = np.array([True, False, True, True, False])
Let us build a new array combining xarr
and yarr
in such a way that whenever condition
is True we use the value in xarr
and yarr
otherwise:
In [ ]:
new_arr = [x if c else y for x,y,c in zip(xarr, yarr, condition)]
new_arr
The latter is not the most efficient solution:
The np.where
function offers a concise and fast alternative:
In [ ]:
new_arr = np.where(condition, xarr, yarr)
new_arr
Exercise
In [ ]:
arr = np.random.randn(16).reshape(4,4)
arr
In [ ]:
new1 = np.where(arr > 0, 2, -2)
new1
In [ ]:
new2 = np.where(arr > 0, 2, arr)
new2
Numpy offers a set of mathematical functions that compute statistics about an entire array or about the data along an axis. These functions are accessed as array methods. Reductions or aggregations such as sum
, mean
, and std
can be used by calling the array method or the Numpy function:
In [ ]:
import numpy as np
arr = np.random.randn(5,4)
In [ ]:
arr
In [ ]:
arr.mean()
In [ ]:
np.mean(arr)
Functions like mean
and sum
take an optional axis
argument which computes the statistic over the given axis, yielding an array with one fewer dimension
In [ ]:
arr.mean(axis=1)
In [ ]:
arr.mean(1)
Other methods like cumsum
and cumprod
do not aggregate but they produce an array of the intermediate results:
In [ ]:
arr = np.arange(9).reshape(3,3)
arr
In [ ]:
arr.cumsum() #cumulative sum of the complete matrix
In [ ]:
arr.cumsum(0) #cumulative sum over columns
In [ ]:
arr.cumsum(1) #cumulative sum over rows
In [ ]:
arr = np.random.randn(100)
(arr > 0).sum()
There are two additional methods that are particularly useful for boolean arrays: any
and all
. any
tests whether one or more values in an array is True
, while all
checks if every value is True
:
In [ ]:
np.random.randint?
In [ ]:
boolarr = np.array(np.random.randint(0,2,size=10),dtype=np.bool)
boolarr
In [ ]:
boolarr.any()
In [ ]:
boolarr.all()
By the way, these two methods can be applied to non-boolean arrays, non-zero elements evaluate to True
.
In [ ]:
arr = np.random.randn(10)
arr
In [ ]:
arr.sort()
arr
In the case of multidimensional arrays can have each 1D section of values sorted in-place along an axis.
In [ ]:
arr = np.random.randn(5,3) # 5x3 matrix with random values
arr
In [ ]:
arr.sort(1) #sort the rows
arr
In [ ]:
arr.sort(0) #sort the columns
arr
It is worthwhile mentioning that the np.sort()
function is not fully equivalent to the sort
method. While the method sorts the array in-place, the function creates a new function.
Exercise Write a piece code that:
In [ ]:
arr = np.random.randn(5)
arr.sort()
quantile = 0.1
arr[int(quantile*len(arr))]
In [ ]:
x = np.linspace(1,6,6).reshape(2,3)
y = np.random.randint(0,30,size=6).reshape(3,2)
In [ ]:
x
In [ ]:
y
In [ ]:
x.dot(y)
The numpy.linalg
module of the NumPy module has a set of matrix decomposition, inverse, and determinant functions. These are implemented under the hood using the same industry-standard Fortran libraries used in other languages like MATLAB and R, such as BLAS, LAPACK or Intel MKL:
In [ ]:
from numpy.linalg import inv, qr
In [ ]:
X = np.random.randn(5,5)
In [ ]:
mat = X.T.dot(X) #Matrix product between X and its transpose
In [ ]:
inv(mat)
In [ ]:
mat.dot(inv(mat))
Type qr?
:
In [ ]:
In [ ]:
q, r = qr(mat)
See Numpy Documentation for the list of numpy.linalg functions