Introduction to NumPy

From NumPy's website we have the following description:

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

You can think of Numpy as standard Python lists on steroids!

There are a few reasons why Numpy is so much faster than lists:

  • Numpy's underlying code is written in C
  • The contents of Numpy's arrays are homogeneous, i.e. all of the same type
  • Numpy's arrays have a smaller memory footprint
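The memory claim can be checked directly. The sketch below is a rough comparison, not a benchmark. The variable names are illustrative, and the list figure is an approximation that counts the list object plus each Python integer it points to.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)  # fixed dtype so the size is predictable

# A list stores pointers to separate Python int objects
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An array stores the raw values side by side: 8 bytes per int64
array_bytes = np_arr.nbytes

print('list :', list_bytes, 'bytes (approximate)')
print('array:', array_bytes, 'bytes')
```

The exact list figure depends on your Python build, but the array should come out several times smaller.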

And since a Data Scientist is always learning, here's an excellent resource on Arrays - scipy array tip sheet

Creating Arrays

Arrays contain a uniform data type, and can have an arbitrary number of dimensions. What's a dimension? It's just a big word to denote how many levels deep the array goes.
Dimensions are nothing more than lists inside lists inside lists...

As we saw earlier with Matplotlib, there are some conventions for importing Numpy too.


In [2]:
import numpy as np

In [3]:
# Create an array with the function np.array
a = np.array([1,2,3,4])
print('a is of type:', type(a))
print('dimension of a:', a.ndim) # To find the dimension of 'a'


a is of type: <class 'numpy.ndarray'>
dimension of a: 1

In [4]:
arr1 = np.array([1,2,3,4])
arr1.ndim


Out[4]:
1

In [5]:
arr2 = np.array([[1,2],[2,3],[3,4],[4,5]])
arr2.ndim


Out[5]:
2

In [6]:
# Doesn't make a difference to a computer how you represent it,
# but if humans are going to read your code, this might be useful
arr3 = np.array([[[1,2],[2,3]],
                 [[2,3],[3,4]],
                 [[4,5],[5,6]],
                 [[6,7],[7,8]]
                ])
arr3.ndim


Out[6]:
3

In [7]:
arr4 = np.array([[ 0,  1,  2,  3,  4], 
                 [ 5,  6,  7,  8,  9], 
                 [10, 11, 12, 13, 14]])
arr4.ndim


Out[7]:
2

One easy way to tell the number of dimensions - look at the number of square brackets at the beginning. [[ = 2 dimensions. [[[ = 3 dimensions.
Remember, dimensions are nothing more than lists inside lists inside lists...
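If counting brackets gets tedious, the shape attribute tells the same story. A small sketch, reusing the values of arr3 from above under the illustrative name arr: the length of the shape tuple is the number of dimensions.

```python
import numpy as np

arr = np.array([[[1, 2], [2, 3]],
                [[2, 3], [3, 4]],
                [[4, 5], [5, 6]],
                [[6, 7], [7, 8]]])

print(arr.ndim)                    # 3
print(arr.shape)                   # (4, 2, 2)
print(len(arr.shape) == arr.ndim)  # True
```

Read the shape from the outside in: 4 blocks, each holding 2 rows of 2 values.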

Why use Numpy arrays, and not just lists? Here's one reason.


In [8]:
a_list = [1,2,3,4,5]
b_list = [5,10,15,20,25]
# Multiplying these will give an error
print(a_list * b_list)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-fdbb40c844d8> in <module>()
      2 b_list = [5,10,15,20,25]
      3 # Multiplying these will give an error
----> 4 print(a_list * b_list)

TypeError: can't multiply sequence by non-int of type 'list'

In [9]:
a_list = np.array([1,2,3,4,5])
b_list = np.array([5,10,15,20,25])
print(a_list * b_list)


[  5  20  45  80 125]

Numpy allows for vectorisation, i.e. operations are applied to whole arrays instead of individual elements. To get the results of a_list * b_list using traditional python, you would have had to write a for loop. When dealing with millions or billions of lines of data, that can be inefficient. We will spend some more time on operations of this nature when we get to Broadcasting.
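To make the comparison concrete, here is a minimal sketch of the loop you would otherwise write, next to the vectorised version. The names loop_result and vec_result are just for illustration.

```python
import numpy as np

a_list = [1, 2, 3, 4, 5]
b_list = [5, 10, 15, 20, 25]

# Plain Python: multiply pair by pair in an explicit loop
loop_result = []
for x, y in zip(a_list, b_list):
    loop_result.append(x * y)

# Numpy: one expression, the loop runs inside compiled code
vec_result = np.array(a_list) * np.array(b_list)

print(loop_result)   # [5, 20, 45, 80, 125]
print(vec_result)
```

Both give the same numbers; the difference is where the looping happens.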

Built-in methods to generate an array

Numpy provides us many methods to generate numbers for our array.


In [10]:
arr1 = np.arange(16)
print(arr1)


[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]

We can even reshape these arrays into our desired shape. But remember, when we say desired shape, we are not speaking of circles or pentagons. Think squares, rectangles, cubes and the like.


In [11]:
arr1.reshape(4,4)


Out[11]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [12]:
arr1.reshape(2,8)


Out[12]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15]])

In [13]:
arr1.reshape(8,2)


Out[13]:
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15]])

In [14]:
arr1.reshape(16,1)


Out[14]:
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14],
       [15]])
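All the reshapes above work because rows times columns equals 16. The sketch below shows two related behaviours: passing -1 lets Numpy work out one dimension for you, and a shape that cannot hold exactly 16 values raises a ValueError. The variable names are illustrative.

```python
import numpy as np

arr = np.arange(16)

# -1 asks Numpy to work out that dimension: 16 / 4 = 4 columns
auto = arr.reshape(4, -1)
print(auto.shape)   # (4, 4)

# 3 * 5 = 15 slots cannot hold 16 values, so this raises ValueError
try:
    arr.reshape(3, 5)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print('ValueError:', err)
```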

The arange function generates a sequential series, though. What if we want random numbers?

Randint


In [15]:
np.random.seed(42) 
rand_arr = np.random.randint(0,1000,20)
print(rand_arr)


[102 435 860 270 106  71 700  20 614 121 466 214 330 458  87 372  99 871
 663 130]

Translating from Python to English: "call the randint function from the random module of numpy, select 20 integers at random between 0 and 999 (0 is included, 1000 is excluded), and assign them to an array named rand_arr".
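The bounds of randint are easy to check on a larger sample. This is a quick sanity check rather than a proof; the names sample and inclusive are illustrative.

```python
import numpy as np

np.random.seed(42)
sample = np.random.randint(0, 1000, 100000)

print(sample.min() >= 0)     # True: values never go below the lower bound
print(sample.max() < 1000)   # True: the upper bound itself never appears

# To allow 1000 itself, raise the upper bound by one
inclusive = np.random.randint(0, 1001, 5)
print(inclusive)
```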


In [16]:
rand_arr.reshape(5,4)


Out[16]:
array([[102, 435, 860, 270],
       [106,  71, 700,  20],
       [614, 121, 466, 214],
       [330, 458,  87, 372],
       [ 99, 871, 663, 130]])

In [17]:
rand_arr.reshape(4,5)


Out[17]:
array([[102, 435, 860, 270, 106],
       [ 71, 700,  20, 614, 121],
       [466, 214, 330, 458,  87],
       [372,  99, 871, 663, 130]])

In [18]:
rand_arr.reshape(2,10)


Out[18]:
array([[102, 435, 860, 270, 106,  71, 700,  20, 614, 121],
       [466, 214, 330, 458,  87, 372,  99, 871, 663, 130]])

Remember, the first number always represents the number of rows.

Random Array with a Uniform Distribution

From the official documentation:

Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).

A uniform distribution is one in which every value in the interval is equally likely.


In [19]:
np.random.seed(42) 
np.random.rand(5)


Out[19]:
array([ 0.37454012,  0.95071431,  0.73199394,  0.59865848,  0.15601864])

In [20]:
np.random.seed(42) 
np.random.rand(3,2)


Out[20]:
array([[ 0.37454012,  0.95071431],
       [ 0.73199394,  0.59865848],
       [ 0.15601864,  0.15599452]])

Random Array with Standard Normal Distribution

From the official documentation:

Return a sample (or samples) from the “standard normal” distribution.

For random samples from $$N(\mu, \sigma^2)$$ use:
sigma * np.random.randn(...) + mu

Don't get scared by the formula - it's actually very simple, and we will cover this in brief later on in the mathematics section.
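Here is the formula in action. The values of mu and sigma below are made up for illustration; with a large sample, the mean and standard deviation of the result should land close to them.

```python
import numpy as np

np.random.seed(42)
mu, sigma = 10.0, 2.0   # illustrative values

# Scale the standard normal samples by sigma, then shift them by mu
samples = sigma * np.random.randn(100000) + mu

print(samples.mean())   # close to 10
print(samples.std())    # close to 2
```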


In [21]:
np.random.seed(42) 
np.random.randn(5)


Out[21]:
array([ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986, -0.23415337])

Array of Zeroes


In [22]:
np.zeros(16)


Out[22]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.])

In [23]:
np.zeros((4,4))


Out[23]:
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

Array of Ones


In [24]:
np.ones(5)


Out[24]:
array([ 1.,  1.,  1.,  1.,  1.])

In [25]:
np.ones((4,4))


Out[25]:
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

Identity Matrix

An identity matrix is a square matrix, with all the values on the diagonal equal to 1, and the remaining values equal to 0.


In [27]:
np.eye(10)


Out[27]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])
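Why the name "identity"? Multiplying any matrix by the identity matrix gives that matrix back, much like multiplying a number by 1. A small sketch with an illustrative 3x3 matrix m:

```python
import numpy as np

m = np.arange(1, 10).reshape(3, 3)
identity = np.eye(3)

# @ is matrix multiplication; * would multiply element by element
product = m @ identity

print(product)
print(np.array_equal(product, m))   # True
```

Note that np.eye returns floats, so the product is a float array even though m holds integers.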

Linspace

From Numpy's official documentation:

Return evenly spaced numbers over a specified interval.

Returns num evenly spaced samples, calculated over the interval [start, stop].

The endpoint of the interval can optionally be excluded.

Here's an interesting discussion on SO about when to use Linspace v range - https://stackoverflow.com/questions/5779270/linspace-vs-range


In [28]:
# 5 evenly spaced numbers between -5 and 5
np.linspace(-5,5,5)


Out[28]:
array([-5. , -2.5,  0. ,  2.5,  5. ])
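A short sketch of the difference between linspace and arange, and of the optional endpoint exclusion mentioned above. The variable names are illustrative.

```python
import numpy as np

# linspace: you choose how many points; the stop value is included
with_end = np.linspace(-5, 5, 5)                      # -5, -2.5, 0, 2.5, 5

# endpoint=False leaves the stop value out, so the step becomes 2.0
without_end = np.linspace(-5, 5, 5, endpoint=False)   # -5, -3, -1, 1, 3

# arange: you choose the step; the stop value is excluded
stepped = np.arange(-5, 5, 2.5)                       # -5, -2.5, 0, 2.5

print(with_end)
print(without_end)
print(stepped)
```

As a rule of thumb, reach for linspace when you know how many points you want, and arange when you know the step.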

Quick Operations on Numpy Arrays


In [29]:
import numpy as np
np.random.seed(42) 

arr1 = np.random.randint(1,1000,100)
arr1 = arr1.reshape(10,10)

In [30]:
arr1.shape


Out[30]:
(10, 10)

In [31]:
arr1


Out[31]:
array([[103, 436, 861, 271, 107,  72, 701,  21, 615, 122],
       [467, 215, 331, 459,  88, 373, 100, 872, 664, 131],
       [662, 309, 770, 344, 492, 414, 806, 386, 192, 956],
       [277, 161, 460, 314,  22, 253, 748, 857, 561, 475],
       [ 59, 511, 682, 476, 700, 976, 783, 190, 958, 687],
       [958, 563, 876, 567, 244, 832, 505, 131, 485, 819],
       [647,  21, 841, 167, 274, 388, 601, 316,  14, 242],
       [777, 346, 565, 898, 340,  92, 367, 956, 455, 428],
       [509, 776, 943,  35, 206,  81, 932, 562, 872, 388],
       [  2, 390, 566, 106, 772, 822, 477, 703, 402, 730]])

Now imagine this is just a small snippet of a large array with millions, or even billions of numbers. Does that sound crazy? Well, Data Scientists regularly work with large arrays of numbers. The Netflix Data Scientists, for example, deal with a high dimensional sparse matrix.

For smaller datasets, let's say, number of people who boarded a particular flight every day for the past hundred days, we have a few useful tools to find the highest or lowest values, and their corresponding locations.


In [32]:
# Find the highest value in arr1
arr1.max()


Out[32]:
976

In [33]:
# Find the lowest value in arr1
arr1.min()


Out[33]:
2

In [34]:
# Find the location of the highest value in arr1
arr1.argmax()


Out[34]:
45

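Keep in mind that argmax returns a position in the flattened array, counting row by row, which is why we get 45 and not a row and column. Also, if the highest value occurs more than once, only the index of the first occurrence is returned.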


In [35]:
arr1.argmin()


Out[35]:
90
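The numbers returned by argmax and argmin are positions in the flattened array. To turn one into a row and column, np.unravel_index does the arithmetic for you. The sketch below uses a made-up grid with a known maximum planted at row 4, column 5.

```python
import numpy as np

grid = np.zeros((10, 10), dtype=int)
grid[4, 5] = 976          # plant a known maximum at row 4, column 5

flat_index = grid.argmax()
row, col = np.unravel_index(flat_index, grid.shape)

print(flat_index)   # 45, i.e. 4 * 10 + 5
print(row, col)     # 4 5
```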

In [36]:
# From earlier
rand_arr = np.random.randint(0,1000,20)
rand_arr


Out[36]:
array([555, 161, 201, 957, 995, 269, 862, 815, 270, 455, 461, 726, 251,
       701, 295, 724, 719, 748, 337, 878])

In [37]:
rand_arr = rand_arr.reshape(4,5)

In [38]:
rand_arr.shape


Out[38]:
(4, 5)

In [39]:
rand_arr


Out[39]:
array([[555, 161, 201, 957, 995],
       [269, 862, 815, 270, 455],
       [461, 726, 251, 701, 295],
       [724, 719, 748, 337, 878]])

Selecting Values

Secret! You already know how to select values from a numpy array.


In [40]:
import numpy as np
np.random.seed(42)

arr1 = np.arange(1,6)
arr1


Out[40]:
array([1, 2, 3, 4, 5])

In [41]:
arr1[0]


Out[41]:
1

In [42]:
arr1[0:3]


Out[42]:
array([1, 2, 3])

In [43]:
arr1[-1]


Out[43]:
5

Remember our old friend, lists?

And there you have it - you're already an expert in Numpy Indexing! And very soon, you will learn to be an expert at indexing 2D Matrices too.

Indexing 2D Matrices


In [54]:
import numpy as np
np.random.seed(42)

rand_arr = np.random.randint(0,1000,20)
print(rand_arr)


[102 435 860 270 106  71 700  20 614 121 466 214 330 458  87 372  99 871
 663 130]

In [55]:
rand_arr = rand_arr.reshape(5,4)
rand_arr


Out[55]:
array([[102, 435, 860, 270],
       [106,  71, 700,  20],
       [614, 121, 466, 214],
       [330, 458,  87, 372],
       [ 99, 871, 663, 130]])

In [56]:
rand_arr[0]


Out[56]:
array([102, 435, 860, 270])

In [57]:
rand_arr[1]


Out[57]:
array([106,  71, 700,  20])

In [58]:
rand_arr[0][-1]


Out[58]:
270

In [59]:
# Another way to write the same thing
rand_arr[0,-1]


Out[59]:
270

Remember, rows before columns. Always!
How do we get entire rows, or snippets of values from rows?
Exactly the same as before. Nothing to worry about here!
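"Rows before columns" applies to slices too: whatever comes before the comma selects rows, whatever comes after selects columns. A minimal sketch on a small illustrative grid:

```python
import numpy as np

grid = np.arange(1, 21).reshape(4, 5)

first_row = grid[0, :]      # same as grid[0]
first_col = grid[:, 0]      # every row, column 0
block = grid[1:3, 2:4]      # rows 1-2, columns 2-3

print(first_row)   # [1 2 3 4 5]
print(first_col)   # [ 1  6 11 16]
print(block)
```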


In [60]:
import numpy as np
np.random.seed(42)

arr1 = np.arange(1,101)
arr1


Out[60]:
array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

In [61]:
arr1 = arr1.reshape(10,10)
arr1


Out[61]:
array([[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
       [ 11,  12,  13,  14,  15,  16,  17,  18,  19,  20],
       [ 21,  22,  23,  24,  25,  26,  27,  28,  29,  30],
       [ 31,  32,  33,  34,  35,  36,  37,  38,  39,  40],
       [ 41,  42,  43,  44,  45,  46,  47,  48,  49,  50],
       [ 51,  52,  53,  54,  55,  56,  57,  58,  59,  60],
       [ 61,  62,  63,  64,  65,  66,  67,  68,  69,  70],
       [ 71,  72,  73,  74,  75,  76,  77,  78,  79,  80],
       [ 81,  82,  83,  84,  85,  86,  87,  88,  89,  90],
       [ 91,  92,  93,  94,  95,  96,  97,  98,  99, 100]])

Exercise

Select 26 to 30


In [62]:
# Step 1 - Narrow down the row
arr1[2] # 3rd row


Out[62]:
array([21, 22, 23, 24, 25, 26, 27, 28, 29, 30])

In [63]:
# 26 is at index 5, we need all the numbers from the 6th column onwards
arr1[2,5:]


Out[63]:
array([26, 27, 28, 29, 30])

Exercise

Select:
[71, 72, 73]
[81, 82, 83]
[91, 92, 93]


In [64]:
# Step 1: Identify the Row
arr1[7:]


Out[64]:
array([[ 71,  72,  73,  74,  75,  76,  77,  78,  79,  80],
       [ 81,  82,  83,  84,  85,  86,  87,  88,  89,  90],
       [ 91,  92,  93,  94,  95,  96,  97,  98,  99, 100]])

In [65]:
# Now we need the first three columns
arr1[7:,:3]


Out[65]:
array([[71, 72, 73],
       [81, 82, 83],
       [91, 92, 93]])

Exercise

Select:
[56, 57, 58, 59, 60]
[66, 67, 68, 69, 70]
[76, 77, 78, 79, 80]


In [ ]:
# Your code here

Exercise

Select:
[ 44, 45]
[ 54, 55]


In [ ]:
# Your code here

Exercise

Create at least 4 challenges for yourself, so you can practice indexing.


In [ ]:
# Your code here

Fancy Indexing

Method 1: Boolean Masks

While there are many ways to index, one of the more common methods that Data Scientists use is Boolean Indexing. You can read more about indexing methods here.


In [102]:
import numpy as np
np.random.seed(42)

arr1 = np.random.randint(0,1000,100)
arr1


Out[102]:
array([102, 435, 860, 270, 106,  71, 700,  20, 614, 121, 466, 214, 330,
       458,  87, 372,  99, 871, 663, 130, 661, 308, 769, 343, 491, 413,
       805, 385, 191, 955, 276, 160, 459, 313,  21, 252, 747, 856, 560,
       474,  58, 510, 681, 475, 699, 975, 782, 189, 957, 686, 957, 562,
       875, 566, 243, 831, 504, 130, 484, 818, 646,  20, 840, 166, 273,
       387, 600, 315,  13, 241, 776, 345, 564, 897, 339,  91, 366, 955,
       454, 427, 508, 775, 942,  34, 205,  80, 931, 561, 871, 387,   1,
       389, 565, 105, 771, 821, 476, 702, 401, 729])

In [103]:
# We check what values are greater than 150
arr1>150


Out[103]:
array([False,  True,  True,  True, False, False,  True, False,  True,
       False,  True,  True,  True,  True, False,  True, False,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
        True,  True, False,  True, False,  True,  True,  True,  True,
       False,  True,  True, False,  True,  True,  True,  True,  True,  True], dtype=bool)

In [104]:
# Assign the boolean array to a variable named mask
mask = arr1>150
# Create a new array which subsets arr1 based on a boolean operation
arr2 = arr1[mask]
arr2


Out[104]:
array([435, 860, 270, 700, 614, 466, 214, 330, 458, 372, 871, 663, 661,
       308, 769, 343, 491, 413, 805, 385, 191, 955, 276, 160, 459, 313,
       252, 747, 856, 560, 474, 510, 681, 475, 699, 975, 782, 189, 957,
       686, 957, 562, 875, 566, 243, 831, 504, 484, 818, 646, 840, 166,
       273, 387, 600, 315, 241, 776, 345, 564, 897, 339, 366, 955, 454,
       427, 508, 775, 942, 205, 931, 561, 871, 387, 389, 565, 771, 821,
       476, 702, 401, 729])

In [105]:
# Check the shape
arr2.shape


Out[105]:
(82,)
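Masks can also be combined, and used on the left-hand side of an assignment. The sketch below is one way to do it; the names in_range and capped are illustrative.

```python
import numpy as np

np.random.seed(42)
values = np.random.randint(0, 1000, 100)

# Combine conditions with & (and) or | (or); the parentheses are required
in_range = values[(values > 150) & (values < 500)]

# A mask can also select the elements to overwrite
capped = values.copy()
capped[capped > 500] = 500

print(in_range.min(), in_range.max())
print(capped.max())
```

Python's own `and` and `or` keywords raise an error here, because they expect a single True or False rather than a whole array of them.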

Method 2: Indexing with Array of Integers

Don't get intimidated by the big words - it just means indexing with a Python list (or array) of integer positions.


In [106]:
list1 = [1,3,5,7]
list2 = [2,4,6,8]

In [107]:
arr1 = np.arange(1,101)
arr1


Out[107]:
array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

In [108]:
arr_even = arr1[list1]
arr_even


Out[108]:
array([2, 4, 6, 8])

In [109]:
# Alternatively
arr_even = arr1[[1,3,5,7]]
arr_even


Out[109]:
array([2, 4, 6, 8])

In [110]:
arr_odd = arr1[list2]
arr_odd


Out[110]:
array([3, 5, 7, 9])
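The same idea extends to 2D arrays, with one twist: when you pass a list of rows and a list of columns, Numpy pairs them up element by element. A sketch on an illustrative 10x10 grid:

```python
import numpy as np

grid = np.arange(1, 101).reshape(10, 10)

rows = [0, 1, 2]
cols = [0, 1, 2]

# Index lists are paired up: (0,0), (1,1), (2,2)
diagonal = grid[rows, cols]
print(diagonal)            # [ 1 12 23]

# A single list picks whole rows, in the order given
picked_rows = grid[[9, 0]]
print(picked_rows[:, 0])   # [91  1]
```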

Take

This is similar to Fancy Indexing, but is arguably easier to use, at least for me. I am sure you might develop a preference for this technique too. Additionally, Wes McKinney - the creator of Pandas, reports that "take" is faster than "fancy indexing".


In [111]:
arr1 = np.arange(1,101)
arr1


Out[111]:
array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

In [112]:
indices = [0,2,4,10,20,80,91,97,99]
np.take(arr1, indices)


Out[112]:
array([  1,   3,   5,  11,  21,  81,  92,  98, 100])

Works with Multi-Dimensional Indices


In [113]:
np.take(arr1, [[0, 1], [11, 18]])


Out[113]:
array([[ 1,  2],
       [12, 19]])
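np.take also accepts an axis argument, which matters once the array itself is 2D. A small sketch, using an illustrative 3x4 grid:

```python
import numpy as np

grid = np.arange(1, 13).reshape(3, 4)

# Without axis, take indexes the flattened array
flat_pick = np.take(grid, [0, 5])
print(flat_pick)   # [1 6]

# With axis=1, the indices select columns
col_pick = np.take(grid, [0, 3], axis=1)
print(col_pick)
```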

Broadcasting

Broadcasting is a way for Numpy to work with arrays of different shapes.

The easiest example to explain broadcasting would be to use a scalar value. What's a scalar? A quantity having only magnitude, but not direction. Speed is a scalar, velocity is a vector. For our practical Numpy purposes, scalars are real numbers - 1,2,3,4.....

Broadcasting is fast and efficient because all the underlying looping occurs in C, and happens on the fly without making copies of the data.


In [114]:
arr_1 = np.arange(1,11)
print(arr_1)
print(arr_1 * 10)


[ 1  2  3  4  5  6  7  8  9 10]
[ 10  20  30  40  50  60  70  80  90 100]

Here we have broadcast 10 to all other elements in the array. Remember Vectorisation? Same principles!


In [115]:
arr_1 = np.array([[1,2],[3,4]])
a = 2
arr_1 + a


Out[115]:
array([[3, 4],
       [5, 6]])

Broadcasting Rule

What about arrays of different dimensions and/or sizes? Well, for that, we have the broadcasting rule.

In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.

Umm....

Ok, let's not offend our friend Samuel Jackson here, so here's what it means in plain English.


In [116]:
arr1 = np.arange(1,13)
arr1


Out[116]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [117]:
arr1.shape


Out[117]:
(12,)

In [118]:
arr1 = arr1.reshape(4,3).astype('float')

In [119]:
arr1


Out[119]:
array([[  1.,   2.,   3.],
       [  4.,   5.,   6.],
       [  7.,   8.,   9.],
       [ 10.,  11.,  12.]])

A quick digression: in case you are wondering, .astype('float') is just a quick way to convert integers to floats, which you are already familiar with. If you want to find out the data type of the elements in a numpy array, simply use the .dtype attribute.


In [ ]:
arr1.dtype

In [ ]:
arr_example = np.array([1,2,3,4])
print(arr_example)
print('arr_example is an',arr_example.dtype)
arr_example = arr_example.astype('float')
print('arr_example is now a',arr_example.dtype)

Back to our array, arr1


In [120]:
arr1


Out[120]:
array([[  1.,   2.,   3.],
       [  4.,   5.,   6.],
       [  7.,   8.,   9.],
       [ 10.,  11.,  12.]])

In [121]:
arr1.shape


Out[121]:
(4, 3)

In [122]:
arr2 = np.array([0.0,1.0,2.0])
print(arr2)
print(arr2.shape)


[ 0.  1.  2.]
(3,)

In [123]:
arr1 + arr2


Out[123]:
array([[  1.,   3.,   5.],
       [  4.,   6.,   8.],
       [  7.,   9.,  11.],
       [ 10.,  12.,  14.]])

Do you see what happened here? Our row with 3 elements was added to each 3-element row in arr1 in turn.

The 1d array is represented as (3,), but think of it simply as (3). The trailing axes have to match. So (4,3) and (3) match. What happens when it's (4,3) and (4)? It won't work! Let's prove it here.


In [128]:
arr3 = np.arange(0,4)
arr3 = arr3.astype('float')
print(arr3)
print(arr3.shape)


[ 0.  1.  2.  3.]
(4,)

In [129]:
# Let's generate our error
arr1 + arr3


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-129-d0fb9fd7b448> in <module>()
      1 # Let's generate our error
----> 2 arr1 + arr3

ValueError: operands could not be broadcast together with shapes (4,3) (4,) 
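If you really do want to add one value per row, the fix is to give the 1d array a trailing axis of size 1. The sketch below rebuilds arr1 and arr3 from above so it runs on its own; the names column and result are illustrative.

```python
import numpy as np

arr1 = np.arange(1, 13).reshape(4, 3).astype('float')
arr3 = np.arange(0, 4).astype('float')   # shape (4,)

# Turn (4,) into (4, 1); now the trailing axes are 3 and 1, which is allowed
column = arr3.reshape(4, 1)      # arr3[:, np.newaxis] does the same
result = arr1 + column

print(result.shape)   # (4, 3)
print(result)
```

Each row of arr1 now has its own value added to it: 0 to the first row, 1 to the second, and so on.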

A final example now, with a (5,1) and (3) array. Read the rule once again - and it will be clear that the new array will be a 5X3 array.


In [130]:
arr4 = np.arange(1,6)
arr4


Out[130]:
array([1, 2, 3, 4, 5])

In [135]:
arr4 = arr4.reshape(5,1).astype('float')
arr4.shape


Out[135]:
(5, 1)

In [136]:
arr2


Out[136]:
array([ 0.,  1.,  2.])

In [133]:
arr4 * arr2


Out[133]:
array([[  0.,   1.,   2.],
       [  0.,   2.,   4.],
       [  0.,   3.,   6.],
       [  0.,   4.,   8.],
       [  0.,   5.,  10.]])

Other Array Operations

So let's begin with some good news here. You have already performed some advanced algebraic operations! That's the power of numpy.

Let's look at a few more operations now that come in quite handy.

Copying Arrays


In [137]:
a1 = np.arange(1,21)
a1 = a1.reshape(4,5)
a1


Out[137]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])

In [138]:
# Let's get the first column
a1[:,0]


Out[138]:
array([ 1,  6, 11, 16])

In [139]:
# Assign to new array
new_a1 = a1[:,0]
new_a1


Out[139]:
array([ 1,  6, 11, 16])

In [140]:
# Recall that this is how you select all values
new_a1[:] = 42
new_a1


Out[140]:
array([42, 42, 42, 42])

So what happened to our original array? Let's find out.


In [141]:
a1


Out[141]:
array([[42,  2,  3,  4,  5],
       [42,  7,  8,  9, 10],
       [42, 12, 13, 14, 15],
       [42, 17, 18, 19, 20]])

Why did that happen?! We never touched a1, and even went on to create a whole new array!

This is because Numpy is very efficient in the way it uses memory: slicing returns a view onto the original data rather than a new array, so changes made through new_a1 show up in a1 too. If you want a copy, be explicit. Here's how you make a copy.


In [142]:
a1_copy = a1.copy()
a1_copy


Out[142]:
array([[42,  2,  3,  4,  5],
       [42,  7,  8,  9, 10],
       [42, 12, 13, 14, 15],
       [42, 17, 18, 19, 20]])

In [143]:
# Overwrite the copy in place with the original values; a1 should not change
a1_copy[:] = np.arange(1,21).reshape(4,5)
a1_copy


Out[143]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])

In [144]:
a1


Out[144]:
array([[42,  2,  3,  4,  5],
       [42,  7,  8,  9, 10],
       [42, 12, 13, 14, 15],
       [42, 17, 18, 19, 20]])
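If you are ever unsure whether two arrays share the same data, np.shares_memory will tell you. A minimal sketch, with the illustrative names view and col_copy:

```python
import numpy as np

a1 = np.arange(1, 21).reshape(4, 5)

view = a1[:, 0]              # slicing gives a view onto a1's data
col_copy = a1[:, 0].copy()   # .copy() allocates new memory

print(np.shares_memory(a1, view))       # True
print(np.shares_memory(a1, col_copy))   # False

col_copy[:] = 42             # changes only the copy
print(a1[:, 0])              # [ 1  6 11 16]
```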

Squaring Arrays


In [145]:
np.square(a1)


Out[145]:
array([[1764,    4,    9,   16,   25],
       [1764,   49,   64,   81,  100],
       [1764,  144,  169,  196,  225],
       [1764,  289,  324,  361,  400]])

Square Roots


In [146]:
np.sqrt(a1)


Out[146]:
array([[ 6.4807407 ,  1.41421356,  1.73205081,  2.        ,  2.23606798],
       [ 6.4807407 ,  2.64575131,  2.82842712,  3.        ,  3.16227766],
       [ 6.4807407 ,  3.46410162,  3.60555128,  3.74165739,  3.87298335],
       [ 6.4807407 ,  4.12310563,  4.24264069,  4.35889894,  4.47213595]])

That's all folks!

You are on your way to becoming a Numpy expert, but please, constantly educate yourself. This is only a beginning, and it's just not humanly possible to cover all of Numpy in a tutorial or even a book. I meet Data Scientists at PyCon and other conferences who are always pleasantly surprised to discover new tips and tricks or features they never knew about. To be the best, you have to constantly update your skills too.