Learning Objectives: Learn how to create, transform and visualize multidimensional data of a single type using NumPy.
NumPy is the foundation for scientific computing and data science in Python. Its main data object is a multidimensional array whose elements all share a single type.
While this notebook doesn't focus on plotting, Matplotlib will be used to make a few basic plots.
In [3]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
The vizarray package will be used to visualize NumPy arrays:
In [4]:
import antipackage
from github.ellisonbg.misc import vizarray as va
This is the canonical way to import NumPy:
In [5]:
import numpy as np
In [6]:
data = [0,2,4,6]
a = np.array(data)
In [7]:
type(a)
Out[7]:
In [8]:
a
Out[8]:
The va.vizarray function can be used to visualize a 1d or 2d NumPy array using a colormap:
In [9]:
va.vizarray(a)
Out[9]:
The shape of the array:
In [10]:
a.shape
Out[10]:
The number of array dimensions:
In [11]:
a.ndim
Out[11]:
The number of array elements:
In [12]:
a.size
Out[12]:
The number of bytes the array takes up:
In [13]:
a.nbytes
Out[13]:
The dtype attribute describes the "data type" of the elements:
In [14]:
a.dtype
Out[14]:
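As a quick check of how the attributes above relate to each other, here is a minimal sketch. The array name x and the explicit int64 dtype are assumptions chosen for illustration (so that the byte counts are the same on every platform), and itemsize is an additional array attribute not used elsewhere in this notebook:

```python
import numpy as np

# A small 1d array; the dtype is fixed so the byte counts are platform independent.
x = np.array([0, 2, 4, 6], dtype=np.int64)

print(x.shape)     # (4,)
print(x.ndim)      # 1
print(x.size)      # 4
print(x.itemsize)  # 8 bytes per int64 element
print(x.nbytes)    # 32, which is x.size * x.itemsize
```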
Arrays can be created with nested lists or tuples:
In [15]:
data = [[0.0,2.0,4.0,6.0],[1.0,3.0,5.0,7.0]]
b = np.array(data)
In [16]:
b
Out[16]:
In [17]:
va.vizarray(b)
Out[17]:
In [18]:
b.shape, b.ndim, b.size, b.nbytes
Out[18]:
The arange function is similar to Python's built-in range function, but creates an array:
In [19]:
c = np.arange(0.0, 10.0, 1.0) # Step size of 1.0
c
Out[19]:
The linspace function is similar, but allows you to specify the number of points:
In [20]:
e = np.linspace(0.0, 5.0, 11) # 11 points
e
Out[20]:
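One practical difference between the two functions is how they treat the end point. The sketch below, with illustrative variable names r and s, shows that arange excludes the stop value while linspace includes it by default:

```python
import numpy as np

# arange takes start, stop (excluded) and step.
r = np.arange(0.0, 1.0, 0.25)

# linspace takes start, stop (included by default) and the number of points.
s = np.linspace(0.0, 1.0, 5)

print(r)  # [0.   0.25 0.5  0.75]
print(s)  # [0.   0.25 0.5  0.75 1.  ]
```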
There are also empty, zeros and ones functions:
In [21]:
np.empty((4,4))
Out[21]:
In [22]:
np.zeros((3,3))
Out[22]:
In [23]:
np.ones((3,3))
Out[23]:
See also:
empty_like, ones_like, zeros_like
eye, identity, diag
Arrays have a dtype attribute that encapsulates the "data type" of each element. It can be set, for example, by passing the dtype argument to an array creation function.
Here is an integer valued array:
In [24]:
a = np.array([0,1,2,3])
In [25]:
a, a.dtype
Out[25]:
All array creation functions accept an optional dtype argument:
In [26]:
b = np.zeros((2,2), dtype=np.complex64)
b
Out[26]:
In [27]:
c = np.arange(0, 10, 2, dtype=np.float64)
c
Out[27]:
You can use the astype method to create a copy of the array with a given dtype:
In [28]:
d = c.astype(dtype=np.int64)
d
Out[28]:
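Two properties of astype are worth seeing in a small example: converting floats to integers drops the fractional part, and the result is an independent copy. This is only a sketch; the names f and i are illustrative:

```python
import numpy as np

f = np.array([0.5, 1.9, -1.9])
i = f.astype(np.int64)   # the fractional part is dropped (truncation toward zero)

print(i)        # [ 0  1 -1]

i[0] = 99       # modifying the converted copy...
print(f[0])     # 0.5, so the original array is untouched
```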
IPython's tab completion is useful for exploring the various available dtypes:
In [29]:
np.float*?
The NumPy documentation on dtypes describes the many other ways of specifying dtypes.
Basic mathematical operations are elementwise, both between two arrays and between a scalar and an array.
Fill an array with a value:
In [30]:
a = np.empty((3,3))
a.fill(0.1)
a
Out[30]:
In [31]:
b = np.ones((3,3))
b
Out[31]:
Addition is elementwise:
In [32]:
a+b
Out[32]:
Division is elementwise:
In [33]:
b/a
Out[33]:
As are powers:
In [34]:
a**2
Out[34]:
Scalar multiplication is also elementwise:
In [35]:
np.pi*b
Out[35]:
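The elementwise behavior is easiest to verify on small arrays whose results can be checked by hand. A minimal sketch, using illustrative arrays x and y:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

print(x + y)    # [11. 22. 33.]
print(y / x)    # [10. 10. 10.]
print(x ** 2)   # [1. 4. 9.]
print(2 * x)    # [2. 4. 6.]  the scalar is applied to every element
```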
Indexing and slicing provide an efficient way of getting the values in an array and modifying them.
In [36]:
a = np.random.rand(10,10)
The enable function is part of vizarray and enables a nice display of arrays:
In [37]:
va.enable()
In [38]:
a
Out[38]:
Like Python lists and tuples, NumPy arrays have zero-based indexing and use the [] syntax for getting and setting values:
In [39]:
a[0,0]
Out[39]:
An index of -1 refers to the last element along that axis:
In [40]:
a[-1,-1] == a[9,9]
Out[40]:
Extract the 0th column using the : syntax, which denotes all elements along that axis.
In [41]:
a[:,0]
Out[41]:
The last row:
In [42]:
a[-1,:]
Out[42]:
You can also slice ranges:
In [43]:
a[0:2,0:2]
Out[43]:
Assignment also works with slices:
In [44]:
a[0:5,0:5] = 1.0
In [45]:
a
Out[45]:
Note how even though we assigned the value to the slice, the original array was changed. This clarifies that slices are views of the same data, not a copy.
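To make the view behavior explicit, the following sketch binds a slice to its own name, writes through it, and then contrasts it with an explicit copy. The names base, view and independent are illustrative, and np.shares_memory is used here only as a convenient check:

```python
import numpy as np

base = np.zeros((4, 4))

view = base[0:2, 0:2]          # basic slicing returns a view of base
view[:] = 1.0                  # writing through the view changes base
print(base[0, 0])              # 1.0

independent = base[0:2, 0:2].copy()   # an explicit copy owns its own data
independent[:] = 5.0
print(base[0, 0])              # still 1.0

print(np.shares_memory(base, view))         # True
print(np.shares_memory(base, independent))  # False
```

When a slice must be modified without touching the original array, calling copy first is the usual approach.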
In [46]:
va.disable()
Arrays can be indexed using other arrays that have boolean values.
In [47]:
ages = np.array([23,56,67,89,23,56,27,12,8,72])
genders = np.array(['m','m','f','f','m','f','m','m','m','f'])
Boolean expressions involving arrays create new arrays with a bool dtype and the elementwise result of the expression:
In [48]:
ages > 30
Out[48]:
In [49]:
genders == 'm'
Out[49]:
Boolean expressions provide an extremely fast and flexible way of querying arrays:
In [50]:
(ages > 10) & (ages < 50)
Out[50]:
You can use a boolean array to index into the original or another array. This selects the ages of all females in the genders array:
In [51]:
mask = (genders == 'f')
ages[mask]
Out[51]:
In [52]:
ages[ages>30]
Out[52]:
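Masks built from different arrays can be combined before indexing. The sketch below reuses the ages and genders data from above; the name young_males is illustrative:

```python
import numpy as np

ages = np.array([23, 56, 67, 89, 23, 56, 27, 12, 8, 72])
genders = np.array(['m', 'm', 'f', 'f', 'm', 'f', 'm', 'm', 'm', 'f'])

# Combine conditions with & (and), | (or) and ~ (not).
# The parentheses are required because & binds tighter than the comparisons.
young_males = ages[(genders == 'm') & (ages < 30)]
print(young_males)              # [23 23 27 12  8]

# Summing a boolean mask counts the matches, since True counts as 1.
print((genders == 'f').sum())   # 4
```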
In [53]:
va.enable()
In [54]:
a = np.random.rand(3,4)
In [55]:
a
Out[55]:
The T attribute contains the transpose of the original array:
In [56]:
a.T
Out[56]:
The reshape method can be used to change the shape and even the number of dimensions:
In [57]:
a.reshape(2,6)
Out[57]:
In [58]:
a.reshape(6,2)
Out[58]:
The ravel method flattens the array into one dimension:
In [59]:
a.ravel()
Out[59]:
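Because the arrays above are random, the effect of these methods is easier to follow on an array with known contents. A small sketch, using an illustrative array m; passing -1 to reshape (not shown above) asks NumPy to infer that dimension from the total size:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)   # values 0..11 laid out in 3 rows of 4

print(m.T.shape)                  # (4, 3)
print(m.reshape(2, -1).shape)     # (2, 6), the -1 is inferred as 6
print(m.ravel())                  # [ 0  1  2  3  4  5  6  7  8  9 10 11]
```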
In [60]:
va.disable()
Universal functions, or "ufuncs," are functions that take and return arrays or scalars and operate elementwise.
In [61]:
va.set_block_size(5)
va.enable()
Here is a linear sequence of values:
In [62]:
t = np.linspace(0.0, 4*np.pi, 100)
t
Out[62]:
Take the $\sin$ of each element of the array:
In [63]:
np.sin(t)
Out[63]:
As the next two examples show, multiple ufuncs can be used to create complex mathematical expressions that can be computed efficiently:
In [64]:
np.exp(np.sqrt(t))
Out[64]:
In [65]:
va.disable()
va.set_block_size(30)
In [66]:
plt.plot(t, np.exp(-0.1*t)*np.sin(t))
Out[66]:
In general, you should always try to use ufuncs rather than do computations using for loops. These types of array based computations are referred to as vectorized.
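To see what vectorization replaces, here is a sketch that computes the damped sine curve from the plot above twice: once with an explicit Python loop and once with ufuncs. The names looped and vectorized are illustrative; no timings are claimed here, only that the two results agree:

```python
import math
import numpy as np

t = np.linspace(0.0, 4 * np.pi, 100)

# Explicit Python loop: one scalar computation per element.
looped = np.empty_like(t)
for k in range(t.size):
    looped[k] = math.exp(-0.1 * t[k]) * math.sin(t[k])

# Vectorized: np.exp and np.sin operate on the whole array at once.
vectorized = np.exp(-0.1 * t) * np.sin(t)

print(np.allclose(looped, vectorized))  # True
```

The vectorized line is shorter and keeps the loop inside NumPy's compiled code rather than in the Python interpreter.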
In [67]:
ages = np.array([23,56,67,89,23,56,27,12,8,72])
genders = np.array(['m','m','f','f','m','f','m','m','m','f'])
NumPy has a basic set of methods and functions for computing simple summary statistics of data. Compute the minimum and maximum:
In [68]:
ages.min(), ages.max()
Out[68]:
Compute the mean:
In [69]:
ages.mean()
Out[69]:
Compute the variance and standard deviation:
In [70]:
ages.var(), ages.std()
Out[70]:
The bincount function counts how many times each value occurs in the array:
In [71]:
np.bincount(ages)
Out[71]:
The cumsum and cumprod methods compute cumulative sums and products:
In [72]:
ages.cumsum()
Out[72]:
In [73]:
ages.cumprod()
Out[73]:
Most of the functions and methods above take an axis argument that will apply the action along a particular axis:
In [74]:
a = np.random.randint(0,10,(3,4))
a
Out[74]:
With axis=0 the action runs down the rows, giving one result per column:
In [75]:
a.sum(axis=0)
Out[75]:
With axis=1 the action runs across the columns, giving one result per row:
In [76]:
a.sum(axis=1)
Out[76]:
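Since the array above is random, a fixed array makes the effect of the axis argument easier to check by hand. A minimal sketch with an illustrative array m:

```python
import numpy as np

m = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])   # shape (3, 4)

print(m.sum(axis=0))  # [15 18 21 24]  one sum per column, shape (4,)
print(m.sum(axis=1))  # [10 26 42]     one sum per row, shape (3,)
print(m.sum())        # 78             no axis: sum over all elements
```

A useful way to remember this: the axis you name is the one that disappears from the shape of the result.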
The unique function is extremely useful in working with categorical data:
In [77]:
np.unique(genders)
Out[77]:
In [78]:
np.unique(genders, return_counts=True)
Out[78]:
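The where function allows you to apply conditional logic to arrays. Here is a rough sketch of how it works; the second argument is used where the condition is true and the third where it is false:
def where(condition, if_true, if_false): ...  # elementwise: if_true where condition holds, else if_false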
In [79]:
np.where(ages>30, 0, 1)
Out[79]:
The if_true and if_false values can be arrays themselves:
In [80]:
np.where(ages<30, 0, ages)
Out[80]:
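The argument order of np.where is easy to get backwards, so here is a sketch that compares it against an equivalent pure-Python expression. The labels 'older' and 'younger' and the name expected are made up for illustration:

```python
import numpy as np

ages = np.array([23, 56, 67, 89, 23, 56, 27, 12, 8, 72])

# The order is np.where(condition, value_if_true, value_if_false).
labels = np.where(ages > 30, 'older', 'younger')
print(labels[:3])     # ['younger' 'older' 'older']

# Equivalent pure-Python loop, for comparison only.
expected = ['older' if age > 30 else 'younger' for age in ages]
print(labels.tolist() == expected)  # True
```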
NumPy has a number of different functions for reading and writing arrays to and from disk.
In [81]:
a = np.random.rand(10)
a
Out[81]:
Save the array to a binary file named array1.npy:
In [84]:
np.save('array1', a)
In [85]:
ls
Using %pycat to look at the file shows that it is binary:
In [86]:
%pycat array1.npy
Load the array back into memory:
In [85]:
a_copy = np.load('array1.npy')
In [86]:
a_copy
Out[86]:
In [87]:
b = np.random.randint(0,10,(5,3))
b
Out[87]:
The savetxt function saves arrays in a simple, textual format that is less efficient, but easier for other languages to read:
In [88]:
np.savetxt('array2.txt', b)
In [89]:
ls
Using %pycat to look at the contents shows that the file is indeed a plain text file:
In [90]:
%pycat array2.txt
In [91]:
np.loadtxt('array2.txt')
Out[91]:
The savez function provides an efficient way of saving multiple arrays to a single file:
In [92]:
np.savez('arrays.npz', a=a, b=b)
The load function returns a dictionary-like object that provides access to the individual arrays:
In [93]:
a_and_b = np.load('arrays.npz')
In [94]:
a_and_b['a']
Out[94]:
In [95]:
a_and_b['b']
Out[95]:
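A round trip through savez and load can be checked programmatically. The sketch below writes to a temporary directory so that it leaves no files behind; the file name pair.npz and the array names x and y are hypothetical, and the files attribute and the with-statement form of np.load are used here as conveniences not shown above:

```python
import os
import tempfile
import numpy as np

x = np.arange(5)
y = np.linspace(0.0, 1.0, 3)

# Work in a temporary directory that is removed when the block exits.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'pair.npz')   # hypothetical file name
    np.savez(path, x=x, y=y)               # keyword names become the keys
    with np.load(path) as archive:         # the file is closed on exit
        names = sorted(archive.files)
        x_back = archive['x']
        y_back = archive['y']

print(names)                         # ['x', 'y']
print(np.array_equal(x, x_back))     # True
print(np.array_equal(y, y_back))     # True
```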
NumPy has excellent linear algebra capabilities.
In [96]:
a = np.random.rand(5,5)
b = np.random.rand(5,5)
Remember that array operations are elementwise. Thus, this is not matrix multiplication:
In [97]:
a*b
Out[97]:
To get matrix multiplication use np.dot:
In [98]:
np.dot(a, b)
Out[98]:
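The difference between the elementwise product and the matrix product is easiest to see on small integer matrices. In this sketch p and q are illustrative, and the @ operator (not used elsewhere in this notebook) is shown as an alternative spelling of the matrix product for 2d arrays:

```python
import numpy as np

p = np.array([[1, 2],
              [3, 4]])
q = np.array([[0, 1],
              [1, 0]])

print(p * q)          # elementwise product, rows [0 2] and [3 0]
print(np.dot(p, q))   # matrix product, rows [2 1] and [4 3]
print(p @ q)          # same matrix product as np.dot for 2d arrays
```

Multiplying by q on the right swaps the two columns of p, which is why the matrix product differs from the elementwise one.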
Or, NumPy has a matrix subclass for which matrix operations are the default:
In [99]:
m1 = np.matrix(a)
m2 = np.matrix(b)
In [100]:
m1*m2
Out[100]:
The np.linalg package has a wide range of fast linear algebra operations.
Here is the determinant:
In [101]:
np.linalg.det(a)
Out[101]:
Matrix inverse:
In [ ]:
np.linalg.inv(a)
Eigenvalues:
In [ ]:
np.linalg.eigvals(a)
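The results of these functions can be sanity-checked against each other. The following sketch uses a small triangular matrix m, chosen for illustration because its eigenvalues can be read off the diagonal:

```python
import numpy as np

m = np.array([[2.0, 0.0],
              [1.0, 3.0]])

# A matrix times its inverse should give the identity (up to rounding).
inv = np.linalg.inv(m)
print(np.allclose(m @ inv, np.eye(2)))              # True

# For a triangular matrix the eigenvalues are the diagonal entries, 2 and 3.
eig = np.linalg.eigvals(m)
print(np.allclose(np.sort(eig), [2.0, 3.0]))        # True

# The determinant equals the product of the eigenvalues.
print(np.isclose(np.linalg.det(m), np.prod(eig)))   # True
```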
NumPy can be built against fast BLAS/LAPACK implementations for these linear algebra operations.
In [118]:
c = np.random.rand(2000,2000)
In [119]:
%timeit -n1 -r1 evs = np.linalg.eigvals(c)
NumPy has functions for creating arrays of random numbers from different distributions in np.random, as well as handling things like permutation, shuffling, and choosing.
Here is the numpy.random documentation.
In [104]:
plt.hist(np.random.random(250))
plt.title('Uniform Random Distribution $[0,1]$')
plt.xlabel('value')
plt.ylabel('count')
Out[104]:
In [105]:
plt.hist(np.random.randn(250))
plt.title('Standard Normal Distribution')
plt.xlabel('value')
plt.ylabel('count')
Out[105]:
The shuffle function shuffles an array in place:
In [106]:
a = np.arange(0,10)
np.random.shuffle(a)
a
Out[106]:
The permutation function does the same thing but first makes a copy:
In [107]:
a = np.arange(0,10)
print(np.random.permutation(a))
print(a)
The choice function provides a powerful way of creating synthetic data sets of discrete data:
In [110]:
np.random.choice(['m','f'], 20, p=[0.25,0.75])
Out[110]:
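The in-place versus copy distinction, and the effect of the p argument, can be checked with a short sketch. The seed value, the sample size of 1000, and the names deck, shuffled_copy and draws are all arbitrary choices made for illustration; np.random.seed is used only to make the run repeatable:

```python
import numpy as np

np.random.seed(0)   # fix the seed so the run is repeatable

deck = np.arange(10)

# permutation returns a shuffled copy and leaves deck unchanged.
shuffled_copy = np.random.permutation(deck)
print(np.array_equal(deck, np.arange(10)))            # True

# shuffle reorders deck in place; the elements themselves are preserved.
np.random.shuffle(deck)
print(np.array_equal(np.sort(deck), np.arange(10)))   # True

# With p=[0.25, 0.75], roughly three quarters of the draws should be 'f'.
draws = np.random.choice(['m', 'f'], 1000, p=[0.25, 0.75])
print((draws == 'f').mean())   # close to 0.75; the exact value depends on the seed
```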