NumPy is the fundamental package for scientific computing with Python. It provides high-performance vector, matrix and higher-dimensional data structures for Python. It is implemented in C and Fortran, so when calculations are vectorized, performance is very good.
So, in a nutshell:
If you are a MATLAB® user, I recommend reading NumPy for MATLAB Users.
I'm a supporter of the Open Science Movement, thus I humbly suggest you take a look at the Science Code Manifesto.
NumPy's main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type.
In NumPy, dimensions are called axes.
The number of axes is called rank.
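As a minimal sketch of this terminology (the variable names `table` and `vector` are only illustrative), the number of axes of an array can be read from its `ndim` attribute, and the length of each axis from `shape`:

```python
import numpy as np

# A 2 x 3 table of zeros: two axes, so its rank (ndim) is 2
table = np.zeros((2, 3))
print(table.ndim)    # number of axes
print(table.shape)   # length of each axis

# A plain vector has a single axis
vector = np.zeros(3)
print(vector.ndim, vector.shape)
```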
The most important attributes of an ndarray object are:
To use numpy, you need to import the module, for example:
In [2]:
import numpy as np # naming import convention
In the numpy package the terminology used for vectors, matrices and higher-dimensional data sets is array.
On the web: http://docs.scipy.org/
Interactive help:
In [ ]:
np.array?
If you're looking for something
Let's start by creating some numpy.array objects in order to get hands-on with the details of numpy's basic data structure.
NumPy is a very flexible library, and provides many ways to create (and initialize) new numpy arrays.
One way is to use functions dedicated to generating numpy arrays (usually, arrays of numbers)[+]
[+] More on data types, later on !-)
NumPy provides many functions to generate arrays with specific properties (e.g. size or shape).
Later we will see examples in which we generate an ndarray from explicit Python lists.
However, for larger arrays, using Python lists is simply impractical.
In standard Python, we use the range function to generate an iterable object of integers within a specific range (at a specified step, default: 1)
In [3]:
r = range(10)
print(list(r))
print(type(r)) # NOTE: if this prints <type 'list'>, you are using Python 2.7
Similarly, in numpy there is the arange function which instead generates a numpy.ndarray
In [4]:
ra = np.arange(10)
print(ra)
print(type(ra))
However, we are working with the Numerical Python library, so we should expect more when it comes to numbers.
In fact, we can create an array within a floating point step-wise range:
In [5]:
# floating point step-wise range generation
raf = np.arange(-1, 1, 0.1)
print(raf)
Apart from the actual content, which is of course different because specified ranges are different, the ra and raf arrays differ by their dtype:
In [6]:
print(f"dtype of 'ra': {ra.dtype}, dtype of 'raf': {raf.dtype}")
In [7]:
ra.itemsize # bytes per element
Out[7]:
In [8]:
ra.nbytes # number of bytes
Out[8]:
In [9]:
ra.ndim # number of dimensions
Out[9]:
In [10]:
ra.shape # shape, i.e. number of elements per-dimension/axis
Out[10]:
In [ ]:
## please replicate the same set of operations here for `raf`
In [ ]:
# your code here
Q: Do you notice any relevant difference?
Like np.arange, numpy provides two other "similar" functions: np.linspace and np.logspace.
Looking at the examples below, can you spot the difference?
In [11]:
np.linspace(0, 10, 20)
Out[11]:
In [12]:
np.logspace(0, np.e**2, 10, base=np.e)
Out[12]:
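A small side-by-side comparison may help here. The sketch below uses deliberately tiny ranges (the values and the names `by_step`, `by_count` and `by_exponent` are chosen only for illustration) so that the three outputs can be compared at a glance:

```python
import numpy as np

# arange: fixed step, the stop value is excluded
by_step = np.arange(0, 1, 0.25)

# linspace: fixed number of points, the stop value is included by default
by_count = np.linspace(0, 1, 5)

# logspace: like linspace, but the endpoints are exponents of the base
by_exponent = np.logspace(0, 2, 3, base=10)

print(by_step)
print(by_count)
print(by_exponent)
```

In short, arange is driven by the step, while linspace and logspace are driven by the number of points.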
In [13]:
# uniform random numbers in [0,1]
ru = np.random.rand(10)
In [14]:
ru
Out[14]:
Note: numbers and the content of the array may vary
In [15]:
# standard normal distributed random numbers
rs = np.random.randn(10)
In [16]:
rs
Out[16]:
Note: numbers and the content of the array may vary
Q: What if I ask you to generate random numbers in a way that we both obtain the very same numbers? (Provided we share the same CPU architecture)
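One possible answer, sketched here with the legacy np.random.seed function (the seed value 42 is an arbitrary choice), is to fix the seed of the random number generator before drawing the numbers:

```python
import numpy as np

np.random.seed(42)          # 42 is an arbitrary choice
first = np.random.rand(5)

np.random.seed(42)          # resetting the seed restarts the sequence
second = np.random.rand(5)

print(first)
print(np.array_equal(first, second))
```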
In [17]:
Z = np.zeros((3,3))
print(Z)
In [18]:
O = np.ones((3, 3))
print(O)
In [19]:
E = np.empty(10)
print(E)
In [ ]:
# TRY THIS!
np.empty(9)
In [20]:
# a diagonal matrix
np.diag([1,2,3])
Out[20]:
In [21]:
# diagonal with offset from the main diagonal
np.diag([1,2,3], k=1)
Out[21]:
In [22]:
# a diagonal matrix with ones on the main diagonal
np.eye(3, dtype='int') # 3 is the number of rows (and, by default, of columns)
Out[22]:
To create new vector or matrix arrays from Python lists we can use the numpy.array constructor function:
In [23]:
v = np.array([1,2,3,4])
v
Out[23]:
In [24]:
print(type(v))
Alternatively, there is also the np.asarray function, which converts a Python list into a numpy array:
In [25]:
v = np.asarray([1, 2, 3, 4])
v
Out[25]:
In [26]:
print(type(v))
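For lists the two functions behave alike. A difference appears when the input is already an array, as the following sketch suggests (the names `from_list`, `same` and `copy` are illustrative):

```python
import numpy as np

values = [1, 2, 3, 4]
from_list = np.asarray(values)   # a list is converted into a new array

same = np.asarray(from_list)     # already an ndarray: returned as-is
copy = np.array(from_list)       # np.array copies by default

print(same is from_list)
print(copy is from_list)
```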
We can use the very same strategy for higher-dimensional arrays.
E.g. Let's create a matrix from a list of lists:
In [27]:
M = np.array([[1, 2], [3, 4]])
M
Out[27]:
In [28]:
v.shape, M.shape
Out[28]:
So far the numpy.ndarray looks awfully similar to a Python list (or nested list).
Why not simply use Python lists for computations instead of creating a new array type?
There are several reasons:
numpy arrays are implemented in a compiled language (C and Fortran are used).
In [29]:
L = range(100000)
In [30]:
%timeit [i**2 for i in L]
In [31]:
a = np.arange(100000)
In [32]:
%timeit a**2 # This operation is called Broadcasting - more on this later!
In [33]:
%timeit [element**2 for element in a]
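The %timeit magic is only available in IPython and Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard timeit module; the repetition count below is an arbitrary assumption and the absolute timings will vary from machine to machine:

```python
import timeit
import numpy as np

n = 100_000
L = range(n)
a = np.arange(n, dtype=np.int64)

# Time each variant over a few repetitions (5 is an arbitrary choice)
t_list = timeit.timeit(lambda: [i**2 for i in L], number=5)
t_vec = timeit.timeit(lambda: a**2, number=5)
t_loop = timeit.timeit(lambda: [element**2 for element in a], number=5)

print(f"list comprehension over range: {t_list:.4f} s")
print(f"vectorized a**2:               {t_vec:.4f} s")
print(f"Python loop over the array:    {t_loop:.4f} s")
```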
Create simple one and two dimensional arrays. First, redo the examples from above. And then create your own.
Use the functions len, shape and ndim on some of those arrays and
observe their output.
In [ ]:
Experiment with arange, linspace, ones, zeros, eye and diag.
Create different kinds of arrays with random numbers.
Try setting the seed before creating an array with random values
np.random.seed
In [ ]:
NumPy has a multidimensional array object called ndarray. It consists of two parts as follows:
The majority of array operations leave the raw data untouched. The only aspect that changes is the metadata.
This internal separation between actual data (i.e. the content of the array --> the memory) and metadata (i.e. properties and attributes of the data) allows, for example, efficient memory management.
For example, the shape of a NumPy array can be modified without copying and/or affecting the actual data, which makes it a fast operation even for large arrays.
In [34]:
a = np.arange(45)
a
Out[34]:
In [35]:
a.shape
Out[35]:
In [36]:
A = a.reshape(9, 5)
A
Out[36]:
In [37]:
n, m = A.shape
In [38]:
B = A.reshape((1,n*m))
B
Out[38]:
Q: What is the difference (in terms of shape) between B and the original a?
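That reshaping leaves the underlying data untouched can be checked directly. The sketch below relies on np.shares_memory and on writing through the reshaped array:

```python
import numpy as np

a = np.arange(45)
A = a.reshape(9, 5)

print(np.shares_memory(a, A))   # both arrays look at the same buffer

A[0, 0] = -1                    # writing through the reshaped array...
print(a[0])                     # ...is visible through the original one
```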
In [39]:
A = np.array([[1, 2, 3], [4, 5, 6]])
A.ravel()
Out[39]:
By default, np.ravel performs the operation row-wise, à la C. NumPy also supports a Fortran-style order of indices (i.e. column-major indexing).
In [40]:
A.ravel('F') # order F (Fortran) is column-major, C (default) row-major
Out[40]:
Alternatively, we can also use the ndarray.flatten method to turn a higher-dimensional array into a vector. However, this method creates a copy of the data.
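The difference between the two can be sketched as follows (the names `raveled` and `flattened` are illustrative); note that ravel returns a view only when the memory layout allows it, as it does for this contiguous array:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

raveled = A.ravel()       # a view when possible (as for this contiguous array)
flattened = A.flatten()   # always a copy

print(np.shares_memory(A, raveled))
print(np.shares_memory(A, flattened))
```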
In [41]:
A.T
Out[41]:
In [42]:
A.T.ravel()
Out[42]:
(1) We can always add as many axes as we want:
In [43]:
A = np.arange(20).reshape(10, 2)
A = A[np.newaxis, ...] # this is called ellipsis
print(A.shape)
(2) We can also permute axes:
In [44]:
A = A.swapaxes(0, 2) # swap axis 0 with axis 2 --> new shape: (2, 10, 1)
print(A.shape)
Again, changing and manipulating the axes will not touch the memory; it will just change the parameters (i.e. strides and offset) used to navigate the data.
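A quick way to see this is to compare the strides of an array with those of its transpose. In this sketch the dtype is fixed to int64 (an assumption made only so that the stride values are the same on every platform):

```python
import numpy as np

A = np.arange(20, dtype=np.int64).reshape(10, 2)
T = A.T   # permuting the axes of a 2D array

print(A.shape, A.strides)       # bytes to step along each axis
print(T.shape, T.strides)       # same buffer, strides swapped
print(np.shares_memory(A, T))
```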
In NumPy, talking about int or float does not make "real sense". This is mainly for two reasons:
(a) int or float are assumed at the maximum precision available on your machine (presumably int64 and float64, respectively).
(b) Different precisions imply different numerical ranges, and so different memory sizes (i.e. the number of bytes required to represent all the numbers in the corresponding numerical range).
NumPy supports the following numerical types:
bool | This stores a boolean (True or False) as a byte
intp | This is a platform integer (normally either int32 or int64)
int8 | This is an integer ranging from -128 to 127
int16 | This is an integer ranging from -32768 to 32767
int32 | This is an integer ranging from -2 ** 31 to 2 ** 31 - 1
int64 | This is an integer ranging from -2 ** 63 to 2 ** 63 - 1
uint8 | This is an unsigned integer ranging from 0 to 255
uint16 | This is an unsigned integer ranging from 0 to 65535
uint32 | This is an unsigned integer ranging from 0 to 2 ** 32 - 1
uint64 | This is an unsigned integer ranging from 0 to 2 ** 64 - 1
float16 | This is a half precision float with sign bit, 5 bits exponent, and 10 bits mantissa
float32 | This is a single precision float with sign bit, 8 bits exponent, and 23 bits mantissa
float64 or float | This is a double precision float with sign bit, 11 bits exponent, and 52 bits mantissa
complex64 | This is a complex number represented by two 32-bit floats (real and imaginary components)
complex128 or complex | This is a complex number represented by two 64-bit floats (real and imaginary components)
The numerical dtype of an array should be selected very carefully, as it directly affects the numerical representation of elements, that is:
We can always specify the dtype of an array when we create one. If we do not, the dtype of the array will be inferred from the data, typically np.int_ or np.float64 depending on the case.
In [45]:
a = np.arange(10)
print(a)
print(a.dtype)
In [46]:
au = np.arange(10, dtype=np.uint8)
print(au)
print(au.dtype)
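The inference rule can be sketched with a few small arrays (the names `ints`, `floats` and `forced` are illustrative). The exact integer type is platform dependent, so only its kind is printed here:

```python
import numpy as np

ints = np.array([1, 2, 3])         # only integers: an integer dtype is inferred
floats = np.array([1, 2.5, 3])     # a single float promotes the whole array
forced = np.array([1, 2, 3], dtype=np.float32)   # explicit dtype wins

print(ints.dtype.kind)   # 'i' stands for signed integer
print(floats.dtype)
print(forced.dtype)
```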
So, then: what happens if I try to represent a number that is out of range?
Let's have a go with integers, i.e., int8 and uint8
In [47]:
x = np.zeros(4, 'int8') # Integer ranging from -128 to 127
x
Out[47]:
Spoiler Alert: very simple example of indexing in NumPy
Well...it works as expected, doesn't it?
In [48]:
x[0] = 127
x
Out[48]:
In [49]:
x[0] = 128
x
Out[49]:
In [50]:
x[1] = 129
x
Out[50]:
In [51]:
x[2] = 257 # i.e. (128 x 2) + 1
x
Out[51]:
In [52]:
ux = np.zeros(4, 'uint8') # Integer ranging from 0 to 255, dtype also as string!
ux
Out[52]:
In [53]:
ux[0] = 255
ux[1] = 256
ux[2] = 257
ux[3] = 513 # (256 x 2) + 1
ux
Out[53]:
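Depending on the NumPy version in use, assigning an out-of-range Python integer directly to an element may emit a warning or raise an OverflowError instead of wrapping around. As a version-independent sketch, the same wrap-around can be observed by casting an existing array with astype (the names `values` and `uvalues` are illustrative):

```python
import numpy as np

values = np.array([127, 128, 129, 257])
wrapped = values.astype(np.int8)      # values outside [-128, 127] wrap around
print(wrapped)

uvalues = np.array([255, 256, 257, 513])
uwrapped = uvalues.astype(np.uint8)   # values outside [0, 255] wrap around
print(uwrapped)
```

The results are the input values taken modulo 256 and mapped back into the range of the target type.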
Numpy provides two functions to inspect the information of supported integer and floating-point types, namely np.iinfo and np.finfo:
In [54]:
np.iinfo(np.int32)
Out[54]:
In [55]:
np.finfo(np.float16)
Out[55]:
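The objects returned by these functions expose the limits as attributes. A short sketch that checks some of the ranges listed in the table of numerical types:

```python
import numpy as np

# Integer limits: min and max attributes
for int_type in (np.int8, np.int16, np.int32, np.uint8):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max)

# Floating-point limits: largest value and machine epsilon
half = np.finfo(np.float16)
print(half.max, half.eps)
```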
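In addition, the MachAr class provides information on the current machine (note that MachAr has been deprecated in recent NumPy releases, where np.finfo exposes the same information):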
In [56]:
machine_info = np.MachAr()
In [57]:
machine_info.epsilon
Out[57]:
In [58]:
machine_info.huge
Out[58]:
In [59]:
np.finfo(np.float64).max == machine_info.huge
Out[59]:
In [ ]:
# TRY THIS!
help(machine_info)
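The same quantities can also be read from np.finfo, which is a reasonable alternative where MachAr is not available. The sketch below additionally illustrates what the machine epsilon means in practice (the name `double_info` is illustrative):

```python
import numpy as np

double_info = np.finfo(np.float64)
eps = double_info.eps

print(eps)                    # machine epsilon for float64
print(double_info.max)        # largest representable float64

print(1.0 + eps > 1.0)        # eps is large enough to change 1.0
print(1.0 + eps / 2 == 1.0)   # half of it is rounded away
```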
Data type objects are instances of the numpy.dtype class.
Once again, arrays have a data type.
To be precise, every element in a NumPy array has the same data type.
The data type object can tell you the size of the data in bytes.
(Recall: The size in bytes is given by the itemsize attribute of the dtype class)
In [60]:
a = np.arange(7, dtype=np.uint16)
print('a itemsize: ', a.itemsize)
print('a.dtype.itemsize: ', a.dtype.itemsize)
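As a quick check of how these attributes relate to each other, the total number of bytes of an array is the item size multiplied by the number of elements:

```python
import numpy as np

a = np.arange(7, dtype=np.uint16)

print(a.itemsize)                       # bytes per element (2 for uint16)
print(a.size)                           # number of elements
print(a.nbytes == a.itemsize * a.size)  # total size in bytes
```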
Character codes are included for backward compatibility with Numeric, the predecessor of NumPy.
Their use is not recommended, but these codes pop up in several places.
You should use the dtype objects instead.
integer | i
unsigned integer | u
single precision float | f
double precision float | d
bool | ?
complex | D
string | S
unicode | U
In [61]:
np.dtype(float)
Out[61]:
In [62]:
np.dtype('f')
Out[62]:
In [63]:
np.dtype('d')
Out[63]:
In [64]:
np.dtype('f8')
Out[64]:
In [65]:
np.dtype('U10') # Unicode string of up to 10 chars
Out[65]:
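The correspondence between these string codes and the dtype objects can be verified with a few comparisons. This is only a sketch; a code followed by a number, such as 'f8' or 'i4', specifies the size in bytes:

```python
import numpy as np

print(np.dtype('f') == np.float32)    # single precision
print(np.dtype('d') == np.float64)    # double precision
print(np.dtype('f8') == np.float64)   # float with 8 bytes
print(np.dtype('i4') == np.int32)     # integer with 4 bytes
print(np.dtype('U10').itemsize)       # 10 characters, 4 bytes each
```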
Note: A listing of all data type names can be found by calling np.sctypeDict.keys()
We can use the np.dtype constructor to create a custom record type.
In [66]:
rt = np.dtype([('name', np.str_, 40), ('numitems', np.int32), ('price', np.float32)])
In [67]:
rt['name'] # see the difference with Python 2
Out[67]:
In [68]:
rt['numitems']
Out[68]:
In [69]:
rt['price']
Out[69]:
Now we can create an array with dtype equal to rt (the record type):
In [70]:
record_items = np.array([('Meaning of life DVD', 42, 3.14), ('Butter', 13, 2.72)],
dtype=rt)
In [71]:
print(record_items)
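Once the array has a record dtype, its fields can be accessed by name. A minimal sketch, repeating the definitions from the cells above so that it runs on its own:

```python
import numpy as np

rt = np.dtype([('name', np.str_, 40), ('numitems', np.int32), ('price', np.float32)])
record_items = np.array([('Meaning of life DVD', 42, 3.14), ('Butter', 13, 2.72)],
                        dtype=rt)

print(record_items['name'])             # one field across all records
print(record_items[1]['numitems'])      # one field of a single record
print(record_items['numitems'].sum())   # fields behave like regular arrays
```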