Learning Objectives: Learn how to create, transform and visualize multidimensional data of a single type using NumPy.
NumPy is the foundation for scientific computing and data science in Python. Its main data object is a multidimensional array with the following characteristics:
While this notebook doesn't focus on plotting, Matplotlib will be used to make a few basic plots.
In [113]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
The vizarray package will be used to visualize NumPy arrays:
In [114]:
import antipackage
from github.ellisonbg.misc import vizarray as va
This is the canonical way to import NumPy:
In [115]:
import numpy as np
In [116]:
data = [0,2,4,6]
a = np.array(data)
In [117]:
type(a)
Out[117]:
In [118]:
a
Out[118]:
The va.vizarray function can be used to visualize a 1d or 2d NumPy array using a colormap:
In [119]:
va.vizarray(a)
Out[119]:
The shape of the array:
In [120]:
a.shape
Out[120]:
The number of array dimensions:
In [121]:
a.ndim
Out[121]:
The number of array elements:
In [122]:
a.size
Out[122]:
The number of bytes the array takes up:
In [123]:
a.nbytes
Out[123]:
The dtype attribute describes the "data type" of the elements:
In [124]:
a.dtype
Out[124]:
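The attributes above are related: nbytes is the number of elements (size) times the bytes per element (itemsize). The following is a minimal sketch of that relationship. It assumes an explicit dtype=np.int64, which the original cell does not use, so that the byte counts do not depend on the platform's default integer type.

```python
import numpy as np

# Explicit dtype (an assumption for this sketch) so the byte counts
# are the same on every platform.
a = np.array([0, 2, 4, 6], dtype=np.int64)

print(a.shape)     # (4,)
print(a.ndim)      # 1
print(a.size)      # 4
print(a.itemsize)  # 8 bytes per int64 element
print(a.nbytes)    # 32, which is a.size * a.itemsize
```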
Arrays can be created with nested lists or tuples:
In [125]:
data = [[0.0,2.0,4.0,6.0],[1.0,3.0,5.0,7.0]]
b = np.array(data)
In [126]:
b
Out[126]:
In [127]:
va.vizarray(b)
Out[127]:
In [128]:
b.shape, b.ndim, b.size, b.nbytes
Out[128]:
The arange function is similar to Python's builtin range function, but creates an array:
In [129]:
c = np.arange(0.0, 10.0, 1.0) # Step size of 1.0
c
Out[129]:
The linspace function is similar, but allows you to specify the number of points:
In [130]:
e = np.linspace(0.0, 5.0, 11) # 11 points
e
Out[130]:
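One difference between the two functions is easy to miss: arange excludes the stop value, while linspace includes it by default. A short sketch using the same arguments as the cells above:

```python
import numpy as np

c = np.arange(0.0, 10.0, 1.0)   # stop value 10.0 is excluded
e = np.linspace(0.0, 5.0, 11)   # stop value 5.0 is included

print(len(c), c[-1])   # 10 9.0
print(len(e), e[-1])   # 11 5.0
print(e[1] - e[0])     # 0.5, the spacing implied by 11 points on [0, 5]
```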
There are also empty, zeros and ones functions:
In [131]:
np.empty((4,4))
Out[131]:
In [132]:
np.zeros((3,3))
Out[132]:
In [133]:
np.ones((3,3))
Out[133]:
See also:
empty_like, ones_like, zeros_like
eye, identity, diag
Arrays have a dtype attribute that encapsulates the "data type" of each element. It can be set by passing the dtype argument to an array creation function.
Here is an integer-valued array:
In [134]:
a = np.array([0,1,2,3])
In [135]:
a, a.dtype
Out[135]:
All array creation functions accept an optional dtype argument:
In [136]:
b = np.zeros((2,2), dtype=np.complex64)
b
Out[136]:
In [137]:
c = np.arange(0, 10, 2, dtype=float) # the np.float alias was removed in NumPy 1.24
c
Out[137]:
You can use the astype method to create a copy of the array with a given dtype:
In [138]:
d = c.astype(dtype=int) # the np.int alias was removed in NumPy 1.24
d
Out[138]:
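Because astype returns a copy, changing the result leaves the original array alone. The sketch below shows this, and also shows that converting floats to integers truncates toward zero rather than rounding.

```python
import numpy as np

c = np.arange(0, 10, 2, dtype=float)
d = c.astype(int)

d[0] = 99          # modify the converted copy
print(c[0])        # 0.0, the original is untouched

# Float-to-integer conversion drops the fractional part.
truncated = np.array([1.9, -1.9]).astype(int)
print(truncated)   # [ 1 -1]
```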
IPython's tab completion and wildcard search (the ? suffix) are useful for exploring the various available dtypes:
In [139]:
np.float*?
The NumPy documentation on dtypes describes the many other ways of specifying dtypes.
Basic mathematical operations are elementwise, both between two arrays and between an array and a scalar.
Fill an array with a value:
In [140]:
a = np.empty((3,3))
a.fill(0.1)
a
Out[140]:
In [141]:
b = np.ones((3,3))
b
Out[141]:
Addition is elementwise:
In [142]:
a+b
Out[142]:
Division is elementwise:
In [143]:
b/a
Out[143]:
As are powers:
In [144]:
a**2
Out[144]:
Scalar multiplication is also elementwise:
In [145]:
np.pi*b
Out[145]:
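To make "elementwise" concrete, the sketch below spells out what a+b computes using explicit loops and checks that the two agree. It uses np.full as a shortcut for the empty-then-fill pattern shown above.

```python
import numpy as np

a = np.full((3, 3), 0.1)   # same values as np.empty((3,3)) followed by fill(0.1)
b = np.ones((3, 3))

# Explicit loops that do what the single expression a + b does.
looped = np.empty((3, 3))
for i in range(3):
    for j in range(3):
        looped[i, j] = a[i, j] + b[i, j]

same = np.array_equal(a + b, looped)
print(same)          # True

scaled = np.pi * b   # the scalar is applied to every element
print(scaled.shape)  # (3, 3)
```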
Indexing and slicing provide an efficient way of getting the values in an array and modifying them.
In [146]:
a = np.random.rand(10,10)
The enable function is part of vizarray and enables a nice display of arrays:
In [147]:
va.enable()
In [148]:
a
Out[148]:
Like Python lists and tuples, NumPy arrays have zero-based indexing and use the [] syntax for getting and setting values:
In [149]:
a[0,0]
Out[149]:
An index of -1 refers to the last element along that axis:
In [150]:
a[-1,-1] == a[9,9]
Out[150]:
Extract the 0th column using the : syntax, which denotes all elements along that axis.
In [151]:
a[:,0]
Out[151]:
The last row:
In [152]:
a[-1,:]
Out[152]:
You can also slice ranges:
In [153]:
a[0:2,0:2]
Out[153]:
Assignment also works with slices:
In [154]:
a[0:5,0:5] = 1.0
In [155]:
a
Out[155]:
Note that even though we assigned the value to a slice, the original array changed. This shows that slices are views of the same data, not copies.
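The view behavior can be checked directly. In this sketch, np.shares_memory reports whether two arrays overlap in memory, and the copy method produces an independent array when a view is not wanted.

```python
import numpy as np

a = np.zeros((4, 4))

view = a[0:2, 0:2]                 # basic slicing returns a view
view[:] = 1.0                      # writes through to a
print(a[0, 0])                     # 1.0
print(np.shares_memory(a, view))   # True

independent = a[0:2, 0:2].copy()   # an explicit copy owns its own data
independent[:] = 5.0
print(a[0, 0])                     # 1.0, unchanged by the copy
```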
In [156]:
va.disable()
Arrays can be indexed using other arrays that have boolean values.
In [157]:
ages = np.array([23,56,67,89,23,56,27,12,8,72])
genders = np.array(['m','m','f','f','m','f','m','m','m','f'])
Boolean expressions involving arrays create new arrays with a bool dtype and the elementwise result of the expression:
In [158]:
ages > 30
Out[158]:
In [159]:
genders == 'm'
Out[159]:
Boolean expressions provide an extremely fast and flexible way of querying arrays:
In [160]:
(ages > 10) & (ages < 50)
Out[160]:
You can use a boolean array to index into the original or another array. This selects the ages of all females in the genders array:
In [161]:
mask = (genders == 'f')
ages[mask]
Out[161]:
In [162]:
ages[ages>30]
Out[162]:
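Masks built from different arrays can be combined, as long as the arrays have the same shape. A short sketch using the ages and genders arrays from above; note the parentheses around each comparison, which are needed because & binds more tightly than > and ==.

```python
import numpy as np

ages = np.array([23, 56, 67, 89, 23, 56, 27, 12, 8, 72])
genders = np.array(['m', 'm', 'f', 'f', 'm', 'f', 'm', 'm', 'm', 'f'])

# Elementwise "and" of two masks.
older_females = ages[(genders == 'f') & (ages > 60)]
print(older_females)   # [67 89 72]

# ~ negates a mask elementwise.
female_mean = ages[genders == 'f'].mean()
other_mean = ages[~(genders == 'f')].mean()
print(female_mean)     # 71.0
```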
In [163]:
va.enable()
In [164]:
a = np.random.rand(3,4)
In [165]:
a
Out[165]:
The T attribute contains the transpose of the original array:
In [166]:
a.T
Out[166]:
The reshape method can be used to change the shape and even the number of dimensions:
In [167]:
a.reshape(2,6)
Out[167]:
In [168]:
a.reshape(6,2)
Out[168]:
The ravel method strings the array out in one dimension:
In [169]:
a.ravel()
Out[169]:
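Reshaping never changes the number of elements, so the new shape must multiply out to the same size. The sketch below uses a small deterministic array instead of random values, shows the -1 placeholder that lets NumPy infer one dimension, and shows the error raised for an incompatible shape.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

print(a.T.shape)                # (4, 3)
print(a.reshape(2, -1).shape)   # (2, 6), -1 is inferred from the size
flat = a.ravel()
print(flat.shape)               # (12,)

# 12 elements cannot fill a 5x5 array, so this raises ValueError.
try:
    a.reshape(5, 5)
    failed = False
except ValueError:
    failed = True
print(failed)                   # True
```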
In [170]:
va.disable()
Universal functions, or "ufuncs," are functions that take and return arrays or scalars. They have the following characteristics:
In [171]:
va.set_block_size(5)
va.enable()
Here is a linear sequence of values:
In [172]:
t = np.linspace(0.0, 4*np.pi, 100)
t
Out[172]:
Take the $\sin$ of each element of the array:
In [173]:
np.sin(t)
Out[173]:
As the next two examples show, multiple ufuncs can be used to create complex mathematical expressions that can be computed efficiently:
In [174]:
np.exp(np.sqrt(t))
Out[174]:
In [175]:
va.disable()
va.set_block_size(30)
In [176]:
plt.plot(t, np.exp(-0.1*t)*np.sin(t))
Out[176]:
In general, you should try to use ufuncs rather than do computations using for loops. These types of array-based computations are referred to as vectorized.
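As an illustration of what vectorization replaces, the sketch below computes the damped sine from the plot above twice: once with an explicit Python loop over scalars and once with ufuncs. The two results agree; the vectorized form is shorter and avoids the per-element Python overhead.

```python
import math
import numpy as np

t = np.linspace(0.0, 4 * np.pi, 100)

# Loop version: one Python-level iteration per element.
looped = np.empty_like(t)
for i, value in enumerate(t):
    looped[i] = math.exp(-0.1 * value) * math.sin(value)

# Vectorized version: the ufuncs operate on the whole array at once.
vectorized = np.exp(-0.1 * t) * np.sin(t)

agree = np.allclose(looped, vectorized)
print(agree)   # True
```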
In [177]:
ages = np.array([23,56,67,89,23,56,27,12,8,72])
genders = np.array(['m','m','f','f','m','f','m','m','m','f'])
NumPy has a basic set of methods and functions for computing simple statistics about data.
In [178]:
ages.min(), ages.max()
Out[178]:
Compute the mean:
In [179]:
ages.mean()
Out[179]:
Compute the variance and standard deviation:
In [180]:
ages.var(), ages.std()
Out[180]:
The bincount function counts how many times each value occurs in the array:
In [181]:
np.bincount(ages)
Out[181]:
The cumsum and cumprod methods compute cumulative sums and products:
In [182]:
ages.cumsum()
Out[182]:
In [183]:
ages.cumprod()
Out[183]:
Most of the functions and methods above take an axis argument that will apply the action along a particular axis:
In [184]:
a = np.random.randint(0,10,(3,4))
a
Out[184]:
With axis=0 the action runs down the rows, collapsing axis 0 and giving one result per column:
In [185]:
a.sum(axis=0)
Out[185]:
With axis=1 the action runs across the columns, collapsing axis 1 and giving one result per row:
In [186]:
a.sum(axis=1)
Out[186]:
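Since the array above is random, the effect of axis is easier to verify on a small deterministic array. In this sketch the axis that is named is the one that disappears from the result's shape.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns

col_totals = a.sum(axis=0)   # collapses the row axis: one sum per column
row_totals = a.sum(axis=1)   # collapses the column axis: one sum per row

print(col_totals)   # [12 15 18 21]
print(row_totals)   # [ 6 22 38]
```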
The unique function is extremely useful in working with categorical data:
In [187]:
np.unique(genders)
Out[187]:
In [188]:
np.unique(genders, return_counts=True)
Out[188]:
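With return_counts=True, unique returns a pair of arrays: the sorted distinct values and how often each occurs. One convenient way to use the pair, sketched here, is to zip it into a plain dictionary.

```python
import numpy as np

genders = np.array(['m', 'm', 'f', 'f', 'm', 'f', 'm', 'm', 'm', 'f'])

values, counts = np.unique(genders, return_counts=True)

# Convert to built-in Python types for a readable lookup table.
tally = dict(zip(values.tolist(), counts.tolist()))
print(tally)   # {'f': 4, 'm': 6}
```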
The where function allows you to apply conditional logic to arrays. The first value is used where the condition is true and the second where it is false. Here is a rough sketch of its signature:
def where(condition, if_true, if_false):
In [189]:
np.where(ages>30, 0, 1)
Out[189]:
The if_true and if_false values can be arrays themselves:
In [190]:
np.where(ages<30, 0, ages)
Out[190]:
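The selection logic can be written out in plain Python. The function below, where_sketch, is a hypothetical illustration and not how NumPy implements where; it only makes the argument order explicit: the second argument is taken where the condition is true, the third where it is false.

```python
import numpy as np

def where_sketch(condition, if_true, if_false):
    """Hypothetical pure-Python illustration of np.where (not NumPy's implementation)."""
    # Broadcast so that scalars and arrays can be mixed freely.
    cond, yes, no = np.broadcast_arrays(condition, if_true, if_false)
    picked = [y if c else n
              for c, y, n in zip(cond.ravel(), yes.ravel(), no.ravel())]
    return np.array(picked).reshape(cond.shape)

ages = np.array([23, 56, 67, 89, 23, 56, 27, 12, 8, 72])

result = where_sketch(ages < 30, 0, ages)
matches = np.array_equal(result, np.where(ages < 30, 0, ages))
print(matches)   # True
```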
NumPy has a number of different functions for reading and writing arrays to and from disk.
In [191]:
a = np.random.rand(10)
a
Out[191]:
Save the array to a binary file named array1.npy:
In [192]:
np.save('array1', a)
In [193]:
ls
Using %pycat to look at the file shows that it is binary:
In [194]:
%pycat array1.npy
Load the array back into memory:
In [195]:
a_copy = np.load('array1.npy')
In [196]:
a_copy
Out[196]:
In [197]:
b = np.random.randint(0,10,(5,3))
b
Out[197]:
The savetxt function saves arrays in a simple, textual format that is less efficient, but easier for other languages to read:
In [198]:
np.savetxt('array2.txt', b)
In [199]:
ls
Using %pycat to look at the contents shows that the file is indeed a plain text file:
In [200]:
%pycat array2.txt
In [201]:
np.loadtxt('array2.txt')
Out[201]:
The savez function provides an efficient way of saving multiple arrays to a single file:
In [202]:
np.savez('arrays.npz', a=a, b=b)
The load function returns a dictionary-like object that provides access to the individual arrays:
In [203]:
a_and_b = np.load('arrays.npz')
In [204]:
a_and_b['a']
Out[204]:
In [205]:
a_and_b['b']
Out[205]:
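The three formats can be compared in one round trip. This sketch writes into a temporary directory (an assumption made here to avoid leaving files behind; the cells above write to the working directory) and checks what survives each format. Note that loadtxt returns floats by default, so the integer dtype of b is not preserved by the text format.

```python
import os
import tempfile
import numpy as np

a = np.arange(5, dtype=float)
b = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    # np.save appends ".npy" when the file name has no extension.
    np.save(os.path.join(tmp, 'array1'), a)
    a_copy = np.load(os.path.join(tmp, 'array1.npy'))

    # Text round trip: values survive, the integer dtype does not.
    np.savetxt(os.path.join(tmp, 'array2.txt'), b)
    b_text = np.loadtxt(os.path.join(tmp, 'array2.txt'))

    # Several arrays in one archive; the with block closes the file.
    np.savez(os.path.join(tmp, 'arrays.npz'), a=a, b=b)
    with np.load(os.path.join(tmp, 'arrays.npz')) as archive:
        names = sorted(archive.files)
        b_copy = archive['b']

print(names)          # ['a', 'b']
print(b_copy.dtype == b.dtype)   # True
print(b_text.dtype)   # float64
```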
NumPy has excellent linear algebra capabilities.
In [206]:
a = np.random.rand(5,5)
b = np.random.rand(5,5)
Remember that array operations are elementwise. Thus, this is not matrix multiplication:
In [207]:
a*b
Out[207]:
To get matrix multiplication use np.dot:
In [208]:
np.dot(a, b)
Out[208]:
Alternatively, NumPy has a matrix subclass for which matrix operations are the default:
In [209]:
m1 = np.matrix(a)
m2 = np.matrix(b)
In [210]:
m1*m2
Out[210]:
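A third option, not used in the cells above, is the @ operator (available for arrays since Python 3.5), which performs matrix multiplication on ordinary arrays. The current NumPy documentation recommends regular arrays over the matrix class, so this sketch may be the more future-proof spelling.

```python
import numpy as np

a = np.arange(4.0).reshape(2, 2)          # [[0, 1], [2, 3]]
b = np.array([[1.0, 2.0], [3.0, 4.0]])

elementwise = a * b    # [[0, 2], [6, 12]]
matmul = a @ b         # [[3, 4], [11, 16]]

same_as_dot = np.array_equal(matmul, np.dot(a, b))
print(same_as_dot)     # True
```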
The np.linalg package has a wide range of fast linear algebra operations.
Here is the determinant:
In [211]:
np.linalg.det(a)
Out[211]:
Matrix inverse:
In [212]:
np.linalg.inv(a)
Out[212]:
Eigenvalues:
In [213]:
np.linalg.eigvals(a)
Out[213]:
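These three operations are linked by identities that make handy sanity checks: a matrix times its inverse is the identity, and the determinant equals the product of the eigenvalues. A small sketch on a fixed 2x2 matrix (chosen here for illustration, with determinant 5):

```python
import numpy as np

a = np.array([[2.0, 1.0],
              [1.0, 3.0]])

det = np.linalg.det(a)
inv = np.linalg.inv(a)
eigvals = np.linalg.eigvals(a)

# Compare with a tolerance, since the results are floating point.
print(np.isclose(det, 5.0))                # True
print(np.allclose(a @ inv, np.eye(2)))     # True
print(np.isclose(np.prod(eigvals), det))   # True
```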
NumPy can be built against fast BLAS/LAPACK implementations for these linear algebra operations.
In [214]:
c = np.random.rand(2000,2000)
In [215]:
%timeit -n1 -r1 evs = np.linalg.eigvals(c)
NumPy has functions for creating arrays of random numbers from different distributions in np.random, as well as handling things like permutation, shuffling, and choosing.
Here is the numpy.random documentation.
In [216]:
plt.hist(np.random.random(250))
plt.title('Uniform Random Distribution $[0,1]$')
plt.xlabel('value')
plt.ylabel('count')
Out[216]:
In [217]:
plt.hist(np.random.randn(250))
plt.title('Standard Normal Distribution')
plt.xlabel('value')
plt.ylabel('count')
Out[217]:
The shuffle function shuffles an array in place:
In [218]:
a = np.arange(0,10)
np.random.shuffle(a)
a
Out[218]:
The permutation function does the same thing but first makes a copy:
In [219]:
a = np.arange(0,10)
print(np.random.permutation(a))
print(a)
The choice function provides a powerful way of creating synthetic data sets of discrete data:
In [220]:
np.random.choice(['m','f'], 20, p=[0.25,0.75])
Out[220]:
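The functions above draw from NumPy's global random state, so results differ from run to run. For reproducible results, one option is the Generator interface created by np.random.default_rng, sketched below. This assumes NumPy 1.17 or later; the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators with the same seed produce the same stream.
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

u1 = rng1.random(5)
u2 = rng2.random(5)
print(np.array_equal(u1, u2))   # True

# Generator counterparts of the functions used above.
normal = rng1.standard_normal(250)
perm = rng1.permutation(np.arange(10))   # returns a shuffled copy
sample = rng1.choice(['m', 'f'], size=20, p=[0.25, 0.75])
```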