From NumPy's website we have the following description:
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
You can think of Numpy as standard Python lists on steroids!
There are a few reasons why Numpy is so much faster than lists:
- every element in an array has the same fixed data type, so there is no per-element type checking
- elements are stored in one contiguous block of memory, rather than as scattered pointers
- operations run in pre-compiled C loops instead of the Python interpreter
And since a Data Scientist is always learning, here's an excellent resource on Arrays - scipy array tip sheet
Arrays hold elements of a uniform data type, with an arbitrary number of dimensions. What's a dimension? It's just a big word to denote how many levels deep the array goes.
Dimensions are nothing more than lists inside lists inside lists...
As we saw earlier with Matplotlib, there are some conventions for importing Numpy too.
In [2]:
import numpy as np
In [3]:
# Create an array with the statement np.array
a = np.array([1,2,3,4])
print('a is of type:', type(a))
print('dimension of a:', a.ndim) # To find the dimension of 'a'
In [4]:
arr1 = np.array([1,2,3,4])
arr1.ndim
Out[4]:
In [5]:
arr2 = np.array([[1,2],[2,3],[3,4],[4,5]])
arr2.ndim
Out[5]:
In [6]:
# Doesn't make a difference to a computer how you represent it,
# but if humans are going to read your code, this might be useful
arr3 = np.array([[[1,2],[2,3]],
[[2,3],[3,4]],
[[4,5],[5,6]],
[[6,7],[7,8]]
])
arr3.ndim
Out[6]:
In [7]:
arr4 = np.array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
arr4.ndim
Out[7]:
One easy way to tell the number of dimensions - look at the number of square brackets at the beginning. [[ = 2 dimensions. [[[ = 3 dimensions.
Remember, dimensions are nothing more than lists inside lists inside lists...
Why use Numpy Arrays, and not just list? One reason right here.
In [8]:
a_list = [1,2,3,4,5]
b_list = [5,10,15,20,25]
# Multiplying these will give an error
print(a_list * b_list)
In [9]:
a_list = np.array([1,2,3,4,5])
b_list = np.array([5,10,15,20,25])
print(a_list * b_list)
Numpy allows for vectorisation, i.e. operations are applied to whole arrays instead of individual elements. To get the result of a_list * b_list in plain Python, you would have had to write a for loop. When dealing with millions or billions of values, that can be very inefficient. We will spend some more time on operations of this nature when we get to Broadcasting.
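To make the comparison concrete, here is a minimal sketch of the loop you would have to write without vectorisation, next to the one-line NumPy version:

```python
import numpy as np

a_list = [1, 2, 3, 4, 5]
b_list = [5, 10, 15, 20, 25]

# Element-wise product with a plain Python loop
result_loop = []
for a, b in zip(a_list, b_list):
    result_loop.append(a * b)
print(result_loop)   # [5, 20, 45, 80, 125]

# The same result, vectorised -- no explicit Python loop
result_vec = np.array(a_list) * np.array(b_list)
print(result_vec)    # [  5  20  45  80 125]
```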
In [10]:
arr1 = np.arange(16)
print(arr1)
We can even reshape these arrays into our desired shape. But remember, when we say desired shape, we are not speaking of circles or pentagons. Think squares, rectangles, cubes and the like.
In [11]:
arr1.reshape(4,4)
Out[11]:
In [12]:
arr1.reshape(2,8)
Out[12]:
In [13]:
arr1.reshape(8,2)
Out[13]:
In [14]:
arr1.reshape(16,1)
Out[14]:
In [15]:
np.random.seed(42)
rand_arr = np.random.randint(0,1000,20)
print(rand_arr)
Translating from Python to English: "call the randint function from numpy's random module, select 20 numbers between 0 and 999 at random - i.e. 0 is included, 1000 is excluded - and assign the result to an array named rand_arr".
In [16]:
rand_arr.reshape(5,4)
Out[16]:
In [17]:
rand_arr.reshape(4,5)
Out[17]:
In [18]:
rand_arr.reshape(2,10)
Out[18]:
Remember, the first number always represents the number of rows.
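A quick sanity check of that rule, using a small illustrative array:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
print(arr.shape)   # (3, 4) -- 3 rows, 4 columns
print(len(arr))    # 3 -- iterating over a 2D array goes row by row
```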
In [19]:
np.random.seed(42)
np.random.rand(5)
Out[19]:
In [20]:
np.random.seed(42)
np.random.rand(3,2)
Out[20]:
From the official documentation:
Return a sample (or samples) from the “standard normal” distribution.
For random samples from $N(\mu, \sigma^2)$ use:
sigma * np.random.randn(...) + mu
Don't get scared by the formula - it's actually very simple, and we will cover this in brief later on in the mathematics section.
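For instance, here is the recipe from the documentation in action - drawing samples from a normal distribution with a mean and standard deviation of your choosing (the values of mu and sigma below are just illustrative):

```python
import numpy as np

np.random.seed(42)
mu, sigma = 100.0, 15.0  # illustrative mean and standard deviation
samples = sigma * np.random.randn(1000) + mu

# With 1000 samples, the sample statistics land close to mu and sigma
print(round(samples.mean(), 1))
print(round(samples.std(), 1))
```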
In [21]:
np.random.seed(42)
np.random.randn(5)
Out[21]:
In [22]:
np.zeros(16)
Out[22]:
In [23]:
np.zeros((4,4))
Out[23]:
In [24]:
np.ones(5)
Out[24]:
In [25]:
np.ones((4,4))
Out[25]:
In [27]:
np.eye(10)
Out[27]:
From Numpy's official documentation:
Return evenly spaced numbers over a specified interval.
Returns num evenly spaced samples, calculated over the interval [start, stop].
The endpoint of the interval can optionally be excluded.
Here's an interesting discussion on SO about when to use Linspace v range - https://stackoverflow.com/questions/5779270/linspace-vs-range
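The endpoint behaviour mentioned above is controlled by the endpoint argument (the numbers below are just illustrative):

```python
import numpy as np

# Default: the stop value is included
print(np.linspace(0, 10, 5))                  # [ 0.   2.5  5.   7.5 10. ]

# With endpoint=False the interval is half-open, like arange
print(np.linspace(0, 10, 5, endpoint=False))  # [0. 2. 4. 6. 8.]
```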
In [28]:
# 5 evenly spaced numbers between -5 and 5
np.linspace(-5,5,5)
Out[28]:
In [29]:
import numpy as np
np.random.seed(42)
arr1 = np.random.randint(1,1000,100)
arr1 = arr1.reshape(10,10)
In [30]:
arr1.shape
Out[30]:
In [31]:
arr1
Out[31]:
Now imagine this is just a small snippet of a large array with millions, or even billions of numbers. Does that sound crazy? Well, Data Scientists regularly work with large arrays of numbers. The Data Scientists at Netflix, for example, deal with high-dimensional sparse matrices.
For smaller datasets, let's say, number of people who boarded a particular flight every day for the past hundred days, we have a few useful tools to find the highest or lowest values, and their corresponding locations.
In [32]:
# Find the highest value in arr1
arr1.max()
Out[32]:
In [33]:
# Find the lowest value in arr1
arr1.min()
Out[33]:
In [34]:
# Find the location of the highest value in arr1
arr1.argmax()
Out[34]:
Keep in mind that if the highest (or lowest) value appears more than once, only the index of the first occurrence will be returned.
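A tiny example of that behaviour, with a deliberately duplicated maximum:

```python
import numpy as np

arr = np.array([3, 9, 1, 9, 2])
print(arr.argmax())  # 1 -- the index of the first 9, not the second
```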
In [35]:
arr1.argmin()
Out[35]:
In [36]:
# From earlier
rand_arr = np.random.randint(0,1000,20)
rand_arr
Out[36]:
In [37]:
rand_arr = rand_arr.reshape(4,5)
In [38]:
rand_arr.shape
Out[38]:
In [39]:
rand_arr
Out[39]:
Secret! You already know how to select values from a numpy array.
In [40]:
import numpy as np
np.random.seed(42)
arr1 = np.arange(1,6)
arr1
Out[40]:
In [41]:
arr1[0]
Out[41]:
In [42]:
arr1[0:3]
Out[42]:
In [43]:
arr1[-1]
Out[43]:
Remember our old friend, lists?
And there you have it - you're already an expert in Numpy Indexing! And very soon, you will learn to be an expert at indexing 2D Matrices too.
In [54]:
import numpy as np
np.random.seed(42)
rand_arr = np.random.randint(0,1000,20)
print(rand_arr)
In [55]:
rand_arr = rand_arr.reshape(5,4)
rand_arr
Out[55]:
In [56]:
rand_arr[0]
Out[56]:
In [57]:
rand_arr[1]
Out[57]:
In [58]:
rand_arr[0][-1]
Out[58]:
In [59]:
# Another way to write the same thing
rand_arr[0,-1]
Out[59]:
Remember, rows before columns. Always!
How do we get entire rows, or snippets of values from rows?
Exactly the same as before. Nothing to worry about here!
In [60]:
import numpy as np
np.random.seed(42)
arr1 = np.arange(1,101)
arr1
Out[60]:
In [61]:
arr1 = arr1.reshape(10,10)
arr1
Out[61]:
In [62]:
# Step 1 - Narrow down the row
arr1[2] # 3rd row
Out[62]:
In [63]:
# 26 is at index 5, we need all the numbers from the 6th column onwards
arr1[2,5:]
Out[63]:
In [64]:
# Step 1: Identify the Row
arr1[7:]
Out[64]:
In [65]:
# Now we need the first three columns
arr1[7:,:3]
Out[65]:
In [ ]:
# Your code here
In [ ]:
# Your code here
In [ ]:
# Your code here
While there are many ways to index, one of the more common methods that Data Scientists use is Boolean Indexing. You can read more about indexing methods here.
In [102]:
import numpy as np
np.random.seed(42)
arr1 = np.random.randint(0,1000,100)
arr1
Out[102]:
In [103]:
# We check what values are greater than 150
arr1>150
Out[103]:
In [104]:
# Assign this operation to a variable named mask
mask = arr1>150
# Create a new array which subsets arr1 based on a boolean operation
arr2 = arr1[mask]
arr2
Out[104]:
In [105]:
# Check the shape
arr2.shape
Out[105]:
In [106]:
list1 = [1,3,5,7]
list2 = [2,4,6,8]
In [107]:
arr1 = np.arange(1,101)
arr1
Out[107]:
In [108]:
arr_even = arr1[list1]
arr_even
Out[108]:
In [109]:
# Alternatively
arr_even = arr1[[1,3,5,7]]
arr_even
Out[109]:
In [110]:
arr_odd = arr1[list2]
arr_odd
Out[110]:
This is similar to Fancy Indexing, but is arguably easier to use, at least for me. You might develop a preference for this technique too. Additionally, Wes McKinney - the creator of Pandas - reports that "take" is faster than "fancy indexing".
In [111]:
arr1 = np.arange(1,101)
arr1
Out[111]:
In [112]:
indices = [0,2,4,10,20,80,91,97,99]
np.take(arr1, indices)
Out[112]:
It works with multi-dimensional indices too.
In [113]:
np.take(arr1, [[0, 1], [11, 18]])
Out[113]:
Broadcasting is a way for Numpy to work with arrays of different shapes.
The easiest example to explain broadcasting would be to use a scalar value. What's a scalar? A quantity having only magnitude, but not direction. Speed is a scalar, velocity is a vector. For our practical Numpy purposes, scalars are real numbers - 1, 2, 3, 4, ...
Broadcasting is fast and efficient because all the underlying looping occurs in C, and happens on the fly without making copies of the data.
In [114]:
arr_1 = np.arange(1,11)
print(arr_1)
print(arr_1 * 10)
Here we have broadcast 10 across every element in the array. Remember Vectorisation? Same principles!
In [115]:
arr_1 = np.array([[1,2],[3,4]])
a = 2
arr_1 + a
Out[115]:
What about arrays of different dimensions and/or sizes? Well, for that, we have the broadcasting rule.
In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.
Umm....
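In plainer terms: line the shapes up from the right; each pair of trailing dimensions must either be equal, or one of them must be 1. A small sketch of shapes that satisfy the rule (the arrays of ones are just placeholders):

```python
import numpy as np

# Trailing axes 3 and 3 match, so (4,3) + (3,) works
a = np.ones((4, 3))
b = np.ones(3)
print((a + b).shape)  # (4, 3)

# A trailing axis of 1 stretches to match, so (5,1) + (3,) also works
c = np.ones((5, 1))
d = np.ones(3)
print((c + d).shape)  # (5, 3)
```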
In [116]:
arr1 = np.arange(1,13)
arr1
Out[116]:
In [117]:
arr1.shape
Out[117]:
In [118]:
arr1 = arr1.reshape(4,3).astype('float')
In [119]:
arr1
Out[119]:
A quick digression, in case you are wondering: the .astype('float')
was just a quick operation to convert the integers to floats, as you are already familiar with. If you want to find out the data type of the elements in a numpy array, simply use the .dtype attribute
In [ ]:
arr1.dtype
In [ ]:
arr_example = np.array([1,2,3,4])
print(arr_example)
print('arr_example is an',arr_example.dtype)
arr_example = arr_example.astype('float')
print('arr_example is now a',arr_example.dtype)
Back to our array, arr1
In [120]:
arr1
Out[120]:
In [121]:
arr1.shape
Out[121]:
In [122]:
arr2 = np.array([0.0,1.0,2.0])
print(arr2)
print(arr2.shape)
In [123]:
arr1 + arr2
Out[123]:
Do you see what happened here? Our row with 3 elements was sequentially added to each 3-element row in arr1.
The 1d array's shape is represented as (3,), but think of it as simply (3). The trailing axes have to match, so (4,3) and (3) are compatible. What happens when it's (4,3) and (4)? It won't work! Let's prove it here.
In [128]:
arr3 = np.arange(0,4)
arr3 = arr3.astype('float')
print(arr3)
print(arr3.shape)
In [129]:
# Let's generate our error
arr1 + arr3
A final example now, with a (5,1) and a (3) array. Read the rule once again - and it will be clear that the new array will be a (5,3) array.
In [130]:
arr4 = np.arange(1,6)
arr4
Out[130]:
In [135]:
arr4 = arr4.reshape(5,1).astype('float')
arr4.shape
Out[135]:
In [136]:
arr2
Out[136]:
In [133]:
arr4 * arr2
Out[133]:
So let's begin with some good news here. You have already performed some advanced algebraic operations! That's the power of numpy.
Let's look at a few more operations now that come in quite handy.
In [137]:
a1 = np.arange(1,21)
a1 = a1.reshape(4,5)
a1
Out[137]:
In [138]:
# Let's get the first column
a1[:,0]
Out[138]:
In [139]:
# Assign to new array
new_a1 = a1[:,0]
new_a1
Out[139]:
In [140]:
# Recall that this is how you select all values
new_a1[:] = 42
new_a1
Out[140]:
So what happened to our original array? Let's find out.
In [141]:
a1
Out[141]:
Why did that happen?! We never touched a1, and even went on to create a whole new array!
This is because Numpy is very efficient in the way it uses memory: basic slicing returns a view of the original data, not a copy. If you want a copy, be explicit, else Numpy will make changes to the original array too. Here's how you make a copy.
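If you ever need to check whether two arrays share memory, np.shares_memory makes the view-versus-copy distinction explicit:

```python
import numpy as np

a1 = np.arange(1, 21).reshape(4, 5)
view = a1[:, 0]         # basic slicing returns a view into a1
copy = a1[:, 0].copy()  # .copy() returns independent memory

print(np.shares_memory(a1, view))  # True
print(np.shares_memory(a1, copy))  # False
```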
In [142]:
a1_copy = a1.copy()
a1_copy
Out[142]:
In [143]:
a1_copy = np.arange(1,21)
a1_copy = a1_copy.reshape(4,5)
a1_copy
Out[143]:
In [144]:
a1
Out[144]:
In [145]:
np.square(a1)
Out[145]:
In [146]:
np.sqrt(a1)
Out[146]:
That's all folks!
You are on your way to becoming a Numpy expert, but please, constantly educate yourself. This is only a beginning, and it's just not humanly possible to cover all of Numpy in a tutorial or even a book. I meet Data Scientists at PyCon and other conferences who are always pleasantly surprised to discover new tips and tricks or features they never knew about. To be the best, you have to constantly update your skills too.