NumPy Array Basics

http://numpy.org

NumPy is the fundamental package for scientific computing in Python. It contains:

  • N-dimensional array objects
  • Vectorization of functions
  • Tools for integrating C/C++ and Fortran code
  • Linear algebra, Fourier transform, and random number tools

NumPy is the basis for a lot of other packages, including scikit-learn, SciPy, and pandas, and it provides a lot of power in and of itself. Those packages keep numpy pretty abstract, but it provides a strong foundation for us to learn some of the operations and concepts we’ll be applying later on.

Let’s go ahead and get started.


In [3]:
import sys
print(sys.version)


3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]

First we’ll need to import numpy. The convention is to import it as ‘np’. This is extremely common and is what I’ll use every time I import numpy.


In [4]:
import numpy as np

One of the reasons that numpy is such a popular tool is that it is typically far more efficient than standard Python lists.

I'm not going to go into the details, but things like vectorization and boolean selection not only improve readability but also make operations faster.

Feel free to post questions on the side and I can dive into the details for you all. However, the big takeaways are that we can access data in memory more efficiently, functions can be applied to whole arrays or matrices, and boolean selection allows for simple filtering.
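If you want to see the efficiency point for yourself, here is a minimal sketch that sums the same numbers as a Python list and as a numpy array and times both. The size of one million and the variable names are my own choices for illustration, and the timings will differ from machine to machine, so treat the numbers as a rough indication rather than a benchmark.

```python
import timeit

import numpy as np

n = 1000000  # hypothetical size, chosen only for illustration

py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)  # explicit 64-bit ints so the sum cannot overflow

# Both approaches compute the same total
list_total = sum(py_list)
array_total = int(arr.sum())

# Time 10 repetitions of each; absolute numbers vary by machine
list_seconds = timeit.timeit(lambda: sum(py_list), number=10)
array_seconds = timeit.timeit(lambda: arr.sum(), number=10)

print("totals match:", list_total == array_total)
print("list sum:  {:.4f} s".format(list_seconds))
print("array sum: {:.4f} s".format(array_seconds))
```

On most setups the array version should come out noticeably faster, but the exact ratio depends on your hardware and your numpy version.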

Let's create an array in numpy with np.arange, then we’ll get the mean.


In [5]:
range(10)


Out[5]:
range(0, 10)

In [6]:
np.arange(10)


Out[6]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [7]:
npa = np.arange(10)

In [8]:
?npa

Getting summary statistics is much easier in numpy, since arrays provide convenient methods for them.


In [9]:
npa.mean()


Out[9]:
4.5

In [10]:
npa.sum()


Out[10]:
45

In [11]:
npa.max()


Out[11]:
9

In [12]:
npa.min()


Out[12]:
0

In [13]:
[x * x for x in npa]


Out[13]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

You’ll see that we can do things like list comprehensions on arrays. However, this is not the recommended method; the recommended method is to vectorize our operation.

Vectorization, in simplest terms, allows you to apply a function to an entire array instead of doing it value by value, similar to what we were doing with map and filter in the previous videos. This typically makes things much more concise and readable. The benefit is not necessarily obvious in trivial examples like the ones in these initial videos, but when we move along into more complicated analysis the speed improvements are significant.
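As a small sketch of what that looks like, here is the squaring from the list comprehension above written as a single vectorized expression. The names squares_loop and squares_vec are just mine for this illustration.

```python
import numpy as np

npa = np.arange(10)

# Value by value: a list comprehension that returns a plain Python list
squares_loop = [x * x for x in npa]

# Vectorized: the multiplication is applied to the whole array at once
# and the result is itself a numpy array
squares_vec = npa * npa

print(squares_vec.tolist())
```

Both give the same values. The vectorized form stays a numpy array, so you can keep chaining array operations such as squares_vec.sum() onto it.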

A good rule of thumb is that if you’re hard-coding for loops with numpy arrays or with certain things in pandas, you're likely doing it wrong. There are much more efficient ways of doing it. This will become apparent over the next several videos. However, before we get there, I want to talk about boolean selection.

