Numpy and Matplotlib

Reference documents

What is Numpy?

While the Python language is an excellent tool for general-purpose programming, it was not designed specifically for mathematical and scientific computing. Numpy allows for:

Efficient array computing in Python
Efficient indexing/slicing of arrays
Mathematical functions that operate on an entire array

The critical thing to know is that Python for loops are very slow! One should try to use array-operations as much as possible.

First, let's import the numpy module. There are multiple ways we can do this. In the following examples, zeros is a numpy routine which we will see later, and depending on how we import numpy we call it in different ways:

'import numpy' imports the entire numpy module --> 'numpy.zeros()'
'import numpy as np' imports the entire numpy module and renames it --> 'np.zeros()'
'from numpy import *' imports the entire numpy module (more or less) --> 'zeros()'
'from numpy import zeros' imports only the zeros() routine

After all that preamble, let's get started:



In [1]:

    
import numpy as np

Creating numpy arrays

You can create an array from scratch, similar to how you might create a list / tuple / whatever:



In [3]:

    
a = np.array((1,2,3))    # or np.array((1,2,3))
print a

You can also convert a list or tuple of elements into an array:



In [4]:

    
myList = [0.2, 0.4, 0.6]
myArray = np.array(myList)
print myList
print myArray









    



[0.2, 0.4, 0.6]
[ 0.2  0.4  0.6]



In [5]:

    
print type(myArray)









    



<type 'numpy.ndarray'>



In [6]:

    
print myArray.dtype









    



float64

Arrays can be created with non-numbers as well, but all the elements of an array have to have the same type. i.e., arrays are homogeneous. Once an array has been created, its dtype is fixed and it can only store elements of the same type. However, the dtype can explicitly be changed (we'll see this later).



In [7]:

    
# an array of booleans

print np.array([True, False, True])









    



[ True False  True]



In [8]:

    
# an array of characters/strings

print np.array(['a', 'b', 'c'])









    



['a' 'b' 'c']



In [9]:

    
print np.array([2, 3, 'c'])









    



['2' '3' 'c']



In [10]:

    
print np.array([2, 3, 0.4])









    



[ 2.   3.   0.4]



In [12]:

    
print np.array([2, 3, 'C'], dtype=int)









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-bd4c2998445d> in <module>()
----> 1 print np.array([2, 3, 'C'], dtype=int)

ValueError: invalid literal for long() with base 10: 'C'

You can access elements of an array in the same way you access elements of a list:



In [17]:

    
print myArray









    



[ 0.2  0.4  0.6]



In [19]:

    
print myArray[-2]

0.4



In [23]:

    
print myArray[1]
print myArray[1:2]









    



0.4
[ 0.4]

Multidimensional arrays work like you might expect:



In [26]:

    
newArray = np.array([ [3, 8, 0, 1],
                      [4, 0, 0, 9],
                      [2, 2, 7, 1],
                      [5, 1, 0, 8]] )
print newArray
print newArray.shape









    



[[3 8 0 1]
 [4 0 0 9]
 [2 2 7 1]
 [5 1 0 8]]
(4, 4)



In [34]:

    
print newArray[:,-1]



In [35]:

    
print newArray[1,3]



In [36]:

    
print newArray[3,0:2]



In [37]:

    
print newArray[:,0]   # print out an individual column

If you know the size of the array you want, you can create an array of ones or zeros or an empty array:



In [39]:

    
b = np.ones(3)
print b
print b.shape









    



[ 1.  1.  1.]
(3,)



In [40]:

    
c = np.zeros((1,3), int)
print c
print type(c)
print c.dtype









    



[[0 0 0]]
<type 'numpy.ndarray'>
int64



In [42]:

    
print [3, 5, 6]
print (3, 5, 6)









    



[3, 5, 6]
(3, 5, 6)



In [45]:

    
a = [3,5,6]
b = (3,5,6)
b[0] = 1
print b









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-ce2feb4c0bbe> in <module>()
      1 a = [3,5,6]
      2 b = (3,5,6)
----> 3 b[0] = 1
      4 print b

TypeError: 'tuple' object does not support item assignment



In [41]:

    
d = np.zeros(3, complex)
print d
print d.dtype









    



[ 0.+0.j  0.+0.j  0.+0.j]
complex128



In [47]:

    
# slightly faster, but be careful to fill it with actual values!

f = np.empty(4)
f.fill(3.14)
f[-1] = 0
print f









    



[ 3.14  3.14  3.14  0.  ]

Create an identity array:



In [48]:

    
print np.eye(5, dtype=int)    # default data type is 'float'









    



[[1 0 0 0 0]
 [0 1 0 0 0]
 [0 0 1 0 0]
 [0 0 0 1 0]
 [0 0 0 0 1]]

Number generation

Here are a few ways to generate uniformly spaced numbers over an interval (including both endpoints), similar to range():



In [49]:

    
print np.arange(-5, 5, 0.5)    # excludes upper endpoint









    



[-5.  -4.5 -4.  -3.5 -3.  -2.5 -2.  -1.5 -1.  -0.5  0.   0.5  1.   1.5  2.
  2.5  3.   3.5  4.   4.5]



In [52]:

    
print np.linspace(-3, 3, 21)     # includes both endpoints









    



[-3.  -2.7 -2.4 -2.1 -1.8 -1.5 -1.2 -0.9 -0.6 -0.3  0.   0.3  0.6  0.9  1.2
  1.5  1.8  2.1  2.4  2.7  3. ]



In [51]:

    
print np.logspace(1, 4, 4)









    



[    10.    100.   1000.  10000.]

Some examples of random number generation using numpy



In [53]:

    
print np.random.rand(10)
print np.random.rand(2,2)









    



[ 0.92068137  0.72312715  0.27224283  0.25495314  0.86758637  0.09651996
  0.58958672  0.69966758  0.69903476  0.49420249]
[[ 0.57298993  0.55321172]
 [ 0.50420865  0.08798024]]



In [55]:

    
print np.()









    



[ nan]



In [56]:

    
print np.random.randint(2,100,5)









    



[57 53  3 32 56]



In [57]:

    
print np.random.normal(10, 3, (2,4))









    



[[ 11.10114537  10.88844565  12.22318922  11.10845409]
 [ 10.66948063  10.44214431   7.6900988    5.2654093 ]]



In [58]:

    
print np.random.randn(5)    # samples 5 times from the standard normal distribution









    



[-1.10944935 -0.19801826 -0.32755373 -1.46807797 -0.97970284]



In [59]:

    
print np.random.normal(3, 1, 5)    # samples 5 times from a Gaussian with mean 3 and std dev 1









    



[ 2.22645411  4.34850779  3.648477    4.16902654  2.53367402]

Working with and manipulating arrays



In [60]:

    
print newArray



In [63]:

    
print newArray.reshape(2,7)









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-63-974c92a34970> in <module>()
----> 1 print newArray.reshape(2,7)

ValueError: total size of new array must be unchanged



In [62]:

    
print newArray.reshape(-1,2)









    



[[3 8]
 [0 1]
 [4 0]
 [0 9]
 [2 2]
 [7 1]
 [5 1]
 [0 8]]

None of these manipulations have modified the original newArray:



In [64]:

    
print newArray



In [65]:

    
reshapedArray = newArray.reshape(2,8)
print reshapedArray
print newArray









    



[[3 8 0 1 4 0 0 9]
 [2 2 7 1 5 1 0 8]]
[[3 8 0 1]
 [4 0 0 9]
 [2 2 7 1]
 [5 1 0 8]]

Above we saw how to index arrays with single numbers and slices, just like Python lists. But arrays allow for a more sophisticated kind of indexing which is very powerful: you can index an array with another array, and in particular with an array of boolean values. This is particluarly useful to extract information from an array that matches a certain condition.



In [66]:

    
redshifts =  np.array((0.2, 1.56, 6.3, 0.003, 0.9, 4.54, 1.1))



In [67]:

    
print redshifts









    



[  2.00000000e-01   1.56000000e+00   6.30000000e+00   3.00000000e-03
   9.00000000e-01   4.54000000e+00   1.10000000e+00]



In [68]:

    
close = redshifts < 1
print close
print redshifts[close]









    



[ True False False  True  True False False]
[ 0.2    0.003  0.9  ]



In [72]:

    
far = np.where(redshifts > 2)
print far[0][0]
print redshifts[far]









    



2
[ 6.3   4.54]



In [73]:

    
middle = np.where( (redshifts >= 1) & (redshifts <= 2) )
print middle
print redshifts[middle]









    



(array([1, 6]),)
[ 1.56  1.1 ]



In [75]:

    
print (redshifts >= 1) & (redshifts <= 2)









    



[False  True False False False False  True]

Mathematical operations on arrays

Math with arrays is straightforward and easy. For instance, let's say we want to multiply every number in a group by 3. If we try to do that with a list:



In [76]:

    
myList = [3, 6, 7, 2]
print 2*myList









    



[3, 6, 7, 2, 3, 6, 7, 2]



In [77]:

    
myArray = np.array([3, 6, 7, 2])
print 2*myArray









    



[ 6 12 14  4]

In general, mathematical operations on arrays are done element-by-element:



In [78]:

    
arr1 = np.arange(4)
arr2 = np.arange(10,14)
print arr1
print arr2









    



[0 1 2 3]
[10 11 12 13]



In [79]:

    
print arr1 + arr2
print arr1 - arr2
print arr1 * arr2
print arr1 / arr2









    



[10 12 14 16]
[-10 -10 -10 -10]
[ 0 11 24 39]
[0 0 0 0]

Notice in particular that multiplication is element-wise and is NOT a dot product or regular matrix multiplication.



In [80]:

    
print 3.5 + arr1









    



[ 3.5  4.5  5.5  6.5]

In the last example, numpy understood the command to be "add 3.5 to every element in arr1." That is, it converts the scalar 3.5 into an array of the appropriate size. Since the new array is filled with floats, and arr1 is filled with ints, the summed array is an array of floats.

Broadcasting with numpy

The broadcasting rules allow numpy to:

create new dimensions of length 1 (since this doesn't change the size of the array)
'stretch' a dimension of length 1 that needs to be matched to a dimension of a different size.

So in the above example, the scalar 1.5 is effectively:

first 'promoted' to a 1-dimensional array of length 1
then, this array is 'stretched' to length 4 to match the dimension of arr1.

After these two operations are complete, the addition can proceed as now both operands are one-dimensional arrays of length 4.

This broadcasting behavior is in practice enormously powerful, especially because when numpy broadcasts to create new dimensions or to 'stretch' existing ones, it doesn't actually replicate the data. In the example above the operation is carried as if the 1.5 was a 1-d array with 1.5 in all of its entries, but no actual array was ever created. This can save lots of memory in cases when the arrays in question are large and can have significant performance implications.



In [81]:

    
arr1 += 10
print arr1









    



[10 11 12 13]



In [82]:

    
arr1.fill(0)
print arr1



In [83]:

    
print arr2
print np.mean(arr2), arr2.mean()
print np.sum(arr2), arr2.sum()
print np.min(arr2), arr2.min()









    



[10 11 12 13]
11.5 11.5
46 46
10 10

Numpy I/O with arrays

Numpy makes it easy to write arrays to files and read them. It can write both text and binary files. In a text file, the number $\pi$ could be written as "3.141592653589793", for example: a string of digits that a human can read, with in this case 15 decimal digits. In contrast, that same number written to a binary file would be encoded as 8 characters (bytes) that are not readable by a human but which contain the exact same data that the variable pi had in the computer's memory.

Text mode: occupies more space, precision can be lost (if not all digits are written to disk), but is readable and editable by hand with a text editor. Can only be used for one- and two-dimensional arrays.
Binary mode: compact and exact representation of the data in memory, can't be read or edited by hand. Arrays of any size and dimensionality can be saved and read without loss of information.

In the following examples, we'll only be talking about reading and writing arrays in text mode.



In [85]:

    
arr3 = np.random.rand(6,5)
print arr3









    



[[ 0.39558761  0.40688139  0.17304988  0.8445151   0.57160838]
 [ 0.95838636  0.57521364  0.58982494  0.95418991  0.01794814]
 [ 0.60268159  0.65890725  0.89238717  0.75554629  0.74659523]
 [ 0.3253617   0.9003241   0.83747841  0.61111402  0.7536519 ]
 [ 0.42639395  0.54634399  0.05459465  0.96242149  0.42202277]
 [ 0.28657583  0.93159786  0.27406637  0.2621851   0.25151161]]



In [97]:

    
np.savetxt('arrayFile.txt', arr3, fmt='%.2e', header='')



In [98]:

    
arr4 = np.loadtxt('arrayFile.txt')



In [99]:

    
print arr4









    



[[ 0.396   0.407   0.173   0.845   0.572 ]
 [ 0.958   0.575   0.59    0.954   0.0179]
 [ 0.603   0.659   0.892   0.756   0.747 ]
 [ 0.325   0.9     0.837   0.611   0.754 ]
 [ 0.426   0.546   0.0546  0.962   0.422 ]
 [ 0.287   0.932   0.274   0.262   0.252 ]]



In [100]:

    
## what if we want to skip the first and last rows? Or we just want columns 2 and 3?
arr5 = np.genfromtxt('arrayFile.txt', skip_header=1, skip_footer=1, usecols=(2,3))



In [101]:

    
print arr5









    



[[ 0.59    0.954 ]
 [ 0.892   0.756 ]
 [ 0.837   0.611 ]
 [ 0.0546  0.962 ]]

What is Matplotlib?

The matplotlib library is a powerful tool capable that can quickly produce simple plots for data visualization as well as complex publication-quality figures with fine layout control. Here, we will only provide a minimal introduction, but Google is your friend when it comes to creating the perfect plots. The pyplot tutorial (http://matplotlib.org/users/pyplot_tutorial.html) is a great place to get started, and the matplotlib gallery (http://matplotlib.org/gallery.html) has tons of information as well.

Just as we typically use the shorthand np for Numpy, we will use plt for the matplotlib.pyplot module where the plotting functions reside.



In [102]:

    
import matplotlib.pyplot as plt
%matplotlib inline

To plot a collection of x- and y-values:



In [103]:

    
x = np.linspace(0, 2*np.pi, 50)
y = np.sin(x)
plt.plot(x, y)









    Out[103]:





[<matplotlib.lines.Line2D at 0x10634a390>]



In [104]:

    
### If you don't give it x values:
plt.plot(np.random.rand(100))









    Out[104]:





[<matplotlib.lines.Line2D at 0x10657d750>]



In [108]:

    
### Multiple data sets on the same plot
# the semicolon at the end suppresses the display of some usually unnecessary information

plt.plot(x, np.sin(x), label=r'$\sin(x)$')
plt.plot(x, np.cos(x), label='cos(x)')
plt.xlabel('x values', size=24)
plt.ylabel('y values')
plt.title('two functions')
plt.legend();



In [ ]:



In [ ]:

Notice how matplotlib automatically assigned different colors to the two data sets. We can finetune what these look like:



In [109]:

    
plt.plot(x, np.sin(x), linewidth=3, color='red', linestyle='--')









    Out[109]:





[<matplotlib.lines.Line2D at 0x107067b90>]



In [112]:

    
plt.plot(x, np.sin(x), 'o--', markersize=5, color='k')









    Out[112]:





[<matplotlib.lines.Line2D at 0x106f804d0>]



In [118]:

    
### logarithmic plots
x = np.linspace(-5, 5)
y = np.exp(-x**2)
plt.semilogy(x,y, lw=2, color='0.1')









    Out[118]:





[<matplotlib.lines.Line2D at 0x108070310>]



In [119]:

    
plt.loglog(x,y)
plt.xlim(1e-1, 1e1)
plt.ylim(1e-2, 1e1)









    Out[119]:





(0.01, 10.0)



In [121]:

    
# A more complicated example

mu, sigma = 5.9, 0.2
measurements = mu + sigma * np.random.randn(10000)

# the histogram of the data
plt.hist(measurements, 50, normed=False, facecolor='g', alpha=0.75, label='hist')

plt.xlabel('Height (feet)')
plt.ylabel('Number of people')
plt.title('Height distribution')
# This will put a text fragment at the position given:
plt.text(5.1, 600, r'$\mu=100$, $\sigma=15$', fontsize=14)
plt.axis([5,6.7,0,750])
plt.grid(True)
plt.legend()









    Out[121]:





<matplotlib.legend.Legend at 0x1082545d0>



In [122]:

    
### Error bars and subplots

# generate some fake data, with errors in y
x = np.arange(0.2, 4, 0.5)
y = np.exp(-x)
errors = 0.1 + 0.1*np.sqrt(x)


fig = plt.figure()

noErrs = fig.add_subplot(121)
plt.plot(x, y)
plt.xticks((0,1,2,3,4))
noErrs.set_title('No errors')
noErrs.set_ylim(-0.3,1)

withErrs = fig.add_subplot(122)
withErrs.errorbar(x, y, yerr=errors)
withErrs.set_title('With errors')
plt.ylim(-0.3,1)
plt.xticks(np.arange(0,4.5,0.5))

fig.suptitle('fake data', size=16, weight='bold')









    Out[122]:





<matplotlib.text.Text at 0x108aa9590>



In [123]:

    
# normal distribution center at x=0 and y=5
x = np.random.randn(100000)
y = np.random.randn(100000)+5

# let's make this plot bigger
plt.figure(figsize=(12,8))
plt.hist2d(x, y, bins=40)
plt.colorbar()









    Out[123]:





<matplotlib.colorbar.Colorbar instance at 0x1092026c8>

A final note

Since we often need to use numpy + matplotlib (+ SciPy + others) at the same time, we can import all of these modules at once using a single command. Instead of:

    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline

etc., just use:

    %pylab inline

Breakout session

We want to reproduce the Global Land-Ocean Temperature index plot below (available at this GISS website). Download the data this link and recreate the plot as closely as you can (don't worry about the two green points or matching the font). We've broken up this problem into steps to get you started. Feel free to follow them as you wish, or be a lone wolf and solve this on your own.

Save the data as a text file, and then read the data into a numpy array. You'll have to deal with the extra lines at the top and bottom of the file, as well as the missing values (the asterixes) that appear. There are multiple ways to solve these problems; do what works for you. If you get stuck, remember that we talked about slicing arrays at the beginning of this lecture.
Plot "Annual Mean" as a set of black squares connected by a thick black line. Try Googling something about "matplotlib square markers". Don't forget to give it a label so it can go in the legend later.
Plot "5-Year Running Mean" as a thick red line. Don't forget the label.
Set the x and y ranges of your plot so that they match the ones in the plot we are trying to recreate.
Add a grid. The default grid that shows up is fine; don't worry about making it match exactly.
Label the y axis. Don't worry about the "degree" symbol for now.
Add the title.
Add the legend (default settings are fine).

BONUS TASKS:

Notice how, in the the plot above, there are labels at every 20 years along the x axis but grid marks at every ten (and similar for the y axis)? Figure out how to recreate this.
On that note, match the formatting they have for the y axis ticks.
Make all the font sizes match (yours are probably smaller than you'd like).
Add in the "degree" symbol for the y axis label. Hint: LaTeX or unicode.
Move the legend to the right place. Remove the box (the "frame") that surrounds the legend. Notice how your legend has a line with two markers for "Annual mean" and their legend has a line with only one marker? Match that.
Move the title up a little bit so the spacing better matches the plot above.



In [ ]: