Scientific Python Quickstart

ANU

This is a fast-paced, hands-on introduction to scientific computing with Python, contained in a Jupyter notebook. The main focus will be on introducing Python's four most important scientific libraries: NumPy, SciPy, Pandas and Matplotlib.

If you don't know how to use this notebook, you should first work through this page.

A slower, more detailed and more systematic treatment of Python for scientific applications can be found at quant-econ.net. But this notebook is a good place to start for those who like to learn by doing.

Here's some information on the version of Python that I'm using:


In [241]:
import sys
print(sys.version)


3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Dec  7 2015, 11:16:01) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]

Basic NumPy

Perhaps the single most important scientific library for Python is NumPy. NumPy provides foundational data structures and routines on which many other libraries rely.


In [242]:
import numpy as np     # Import library and give it alias np
print(np.__version__)  #  The version I'm using


1.10.2

NumPy defines a basic data type called an array (actually a numpy.ndarray)


In [243]:
a = np.zeros(3)     # Create an array of zeros
a                   # Print a


Out[243]:
array([ 0.,  0.,  0.])

In [244]:
type(a)


Out[244]:
numpy.ndarray

Note that array data must be homogeneous

The most important data types are:

  • float64: 64 bit floating point number
  • float32: 32 bit floating point number
  • int64: 64 bit integer
  • int32: 32 bit integer
  • bool: 8 bit True or False

There are also dtypes to represent complex numbers, unsigned integers, etc

On most machines, the default dtype for arrays is float64


In [245]:
a = np.zeros(3)
type(a[1])


Out[245]:
numpy.float64
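If you need a different dtype, you can pass it explicitly when the array is created. Here's a minimal sketch (the exact integer type is platform dependent --- numpy.int64 on most 64-bit systems):


In [ ]:
z = np.zeros(3, dtype=int)  # Integer zeros instead of the default float64
type(z[0])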

When we create an array such as


In [246]:
z = np.zeros(10)

z is a "flat" array with no dimension--- neither row nor column vector:


In [247]:
z.shape


Out[247]:
(10,)

Here the shape tuple has only one element, which is the length of the array (tuples with one element end with a comma)

To give it dimension, we can change the shape attribute

For example, let's make it a column vector


In [248]:
z.shape = (10, 1)

In [249]:
z


Out[249]:
array([[ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.]])

In [250]:
z = np.zeros(4)
z.shape = (2, 2)
z


Out[250]:
array([[ 0.,  0.],
       [ 0.,  0.]])
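Assigning to the shape attribute modifies the array in place. If you want to keep the original intact, the reshape method returns a reshaped view instead; a small sketch:


In [ ]:
z = np.zeros(4)
w = z.reshape(2, 2)  # A 2 x 2 view of the data; z itself is unchanged
w.shape, z.shape     # ((2, 2), (4,))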

Creating arrays

Creating empty arrays --- memory is allocated but left uninitialized:


In [251]:
z = np.empty(3)
z


Out[251]:
array([ 0.,  0.,  0.])

These are just garbage numbers --- whatever happened to be in those memory slots (here they happen to look like zeros)

Here's how to make a regular grid sequence


In [252]:
z = np.linspace(2, 4, 5)  # From 2 to 4, with 5 elements
z


Out[252]:
array([ 2. ,  2.5,  3. ,  3.5,  4. ])

Creating an array of ones


In [253]:
z = np.ones(3)
z


Out[253]:
array([ 1.,  1.,  1.])

In [254]:
z = np.identity(2)
z


Out[254]:
array([[ 1.,  0.],
       [ 0.,  1.]])

Arrays can be made from Python lists or tuples


In [255]:
z = np.array([10, 20]) 
z


Out[255]:
array([10, 20])

In [256]:
z = np.array((10, 20), dtype=float) 
z


Out[256]:
array([ 10.,  20.])

In [257]:
z = np.array([[1, 2], [3, 4]])         # 2D array from a list of lists
z


Out[257]:
array([[1, 2],
       [3, 4]])

Array indexing


In [258]:
z = np.linspace(1, 2, 5)
z


Out[258]:
array([ 1.  ,  1.25,  1.5 ,  1.75,  2.  ])

In [259]:
z[0]  # First element --- Python sequences are zero based, like C, Java, etc.


Out[259]:
1.0

In [260]:
z[-1]  # Special syntax for last element


Out[260]:
2.0

In [261]:
z[0:2]  # Meaning: Two elements, starting from element 0


Out[261]:
array([ 1.  ,  1.25])

In [262]:
z = np.array([[1, 2], [3, 4]])
z


Out[262]:
array([[1, 2],
       [3, 4]])

In [263]:
z[0, 0]


Out[263]:
1

In [264]:
z[0,:]  # First row


Out[264]:
array([1, 2])

In [265]:
z[:,0]  # First column


Out[265]:
array([1, 3])

In [266]:
z = np.linspace(2, 4, 5)
z


Out[266]:
array([ 2. ,  2.5,  3. ,  3.5,  4. ])

In [267]:
d = np.array([0, 1, 1, 0, 0], dtype=bool)
d


Out[267]:
array([False,  True,  True, False, False], dtype=bool)

In [268]:
z[d]


Out[268]:
array([ 2.5,  3. ])
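Boolean masks also work on the left-hand side of an assignment, giving a concise way to overwrite selected elements. A minimal sketch on a fresh array, so the values above are unaffected:


In [ ]:
w = np.linspace(2, 4, 5)
w[d] = 0.0   # Zero out the elements where the mask is True
w            # array([ 2. ,  0. ,  0. ,  3.5,  4. ])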

Array methods


In [269]:
A = np.array((4, 3, 2, 1))
A


Out[269]:
array([4, 3, 2, 1])

In [270]:
A.sort()

In [271]:
A


Out[271]:
array([1, 2, 3, 4])

In [272]:
A.mean()


Out[272]:
2.5

In [273]:
A.sum()


Out[273]:
10

In [274]:
A.max()


Out[274]:
4

In [275]:
A.cumsum()


Out[275]:
array([ 1,  3,  6, 10])

In [276]:
A.var()


Out[276]:
1.25

In [277]:
A.shape = (2, 2)
A


Out[277]:
array([[1, 2],
       [3, 4]])

In [278]:
A.T  # Transpose, equivalent to A.transpose()


Out[278]:
array([[1, 3],
       [2, 4]])

Operations on arrays

Standard arithmetic operations on arrays act elementwise


In [279]:
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

In [280]:
a + b


Out[280]:
array([ 6,  8, 10, 12])

In [281]:
a - b


Out[281]:
array([-4, -4, -4, -4])

In [282]:
a + 10


Out[282]:
array([11, 12, 13, 14])
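The scalar 10 above is broadcast across all elements of a. The same broadcasting rules let NumPy combine arrays of different but compatible shapes; here's a small sketch:


In [ ]:
m = np.ones((2, 2))
v = np.array([10, 20])
m + v   # The length-2 vector is added to each row of m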

In [283]:
a.shape = 2, 2
b.shape = 2, 2

In [284]:
a


Out[284]:
array([[1, 2],
       [3, 4]])

In [285]:
b


Out[285]:
array([[5, 6],
       [7, 8]])

In [286]:
a * b # Elementwise multiplication --- not matrix multiplication!


Out[286]:
array([[ 5, 12],
       [21, 32]])

In [287]:
np.dot(a, b) # Matrix multiplication


Out[287]:
array([[19, 22],
       [43, 50]])

For Python $\geq 3.5$ and NumPy $\geq 1.10$ the @ operator also works.


In [288]:
a @ b


Out[288]:
array([[19, 22],
       [43, 50]])

I'll continue to use np.dot below for the benefit of those who are using older versions. But in my opinion the @ operator is much nicer.

Comparisons


In [289]:
z = np.array([2, 3])
y = np.array([2, 3])
z == y


Out[289]:
array([ True,  True], dtype=bool)

In [290]:
y[0] = 3
z == y


Out[290]:
array([False,  True], dtype=bool)

In [291]:
z = np.linspace(0, 10, 5)
z


Out[291]:
array([  0. ,   2.5,   5. ,   7.5,  10. ])

In [292]:
z > 3


Out[292]:
array([False, False,  True,  True,  True], dtype=bool)

In [293]:
z[z > 3]  # Conditional extraction


Out[293]:
array([  5. ,   7.5,  10. ])
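Conditions can be combined elementwise with & and | --- note that the parentheses below are required because of operator precedence. A quick sketch:


In [ ]:
z[(z > 3) & (z < 10)]   # array([ 5. ,  7.5])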

Matplotlib

Matplotlib is an outstanding plotting and visualization library for Python that interacts nicely with NumPy. Here are a few quick examples. We'll see more below when we discuss the SciPy library.


In [294]:
import matplotlib.pyplot as plt  # Import main functionality

Display figures in this browser window rather than having them open up separately:


In [295]:
%matplotlib inline

Create something to plot


In [296]:
x = np.linspace(-2, 2, 100)
y = x**2

In [297]:
fig, ax = plt.subplots()  # Create axes and figure window
ax.plot(x, y, 'b-')


Out[297]:
[<matplotlib.lines.Line2D at 0x7f77a347d048>]

Here's a slightly more complex plot


In [298]:
y3 = x**3
fig, ax = plt.subplots()  # Create axes and figure window
ax.plot(x, y, 'b-', lw=2, alpha=0.8, label='$x^2$')
ax.plot(x, y3, 'g-', lw=2, alpha=0.8, label='$x^3$')
ax.legend(loc='lower right')


Out[298]:
<matplotlib.legend.Legend at 0x7f77a228fcf8>
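A figure can be written to disk with the savefig method. A minimal sketch (the filename here is arbitrary --- the format is inferred from the extension):


In [ ]:
fig.savefig('cubics.png')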

SciPy

Let's just cover some simple examples --- references for further reading are below

Statistics and distributions

Let's use scipy.stats to generate some data from the Beta distribution


In [299]:
from scipy.stats import beta
q = beta(5, 5)      # Beta(a, b), with a = b = 5
obs = q.rvs(2000)   # 2000 observations

Now let's histogram it and compare it to the original density


In [300]:
fig, ax = plt.subplots()
ax.hist(obs, bins=40, normed=True)
grid = np.linspace(0.01, 0.99, 100)
ax.plot(grid, q.pdf(grid), 'k-', linewidth=2)


Out[300]:
[<matplotlib.lines.Line2D at 0x7f77a22d2ef0>]

Other methods


In [301]:
type(q)


Out[301]:
scipy.stats._distn_infrastructure.rv_frozen

In [302]:
dir(q)  # Let's see all its methods


Out[302]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'a',
 'args',
 'b',
 'cdf',
 'dist',
 'entropy',
 'expect',
 'interval',
 'isf',
 'kwds',
 'logcdf',
 'logpdf',
 'logpmf',
 'logsf',
 'mean',
 'median',
 'moment',
 'pdf',
 'pmf',
 'ppf',
 'random_state',
 'rvs',
 'sf',
 'stats',
 'std',
 'var']

In [303]:
q.cdf(0.5)


Out[303]:
0.50000000000000011

In [304]:
q.pdf(0.5)


Out[304]:
2.4609375000000009

In [305]:
q.mean()


Out[305]:
0.5
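The ppf method (the quantile function) is the inverse of cdf. Since this Beta distribution is symmetric, the median should coincide with the mean; a quick check:


In [ ]:
q.ppf(0.5)   # Should be (approximately) 0.5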

Basic linear regression:


In [306]:
from scipy.stats import linregress
n = 100
alpha, beta, sigma = 1, 2, 1.5  # NB: this rebinds the name beta imported from scipy.stats above
x = np.random.randn(n)  # n standard normals
y = alpha + beta * x + sigma * np.random.randn(n)
beta_hat, alpha_hat, r_value, p_value, std_err = linregress(x, y)
print("gradient = {}".format(beta_hat))
print("intercept = {}".format(alpha_hat))


gradient = 1.874172371477785
intercept = 1.4035329152756753

Let's plot the data along with the line of best fit


In [307]:
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, y, 'bo', alpha=0.6, label='observations')
xgrid = np.linspace(-3, 3, 2)
ax.plot(xgrid, alpha_hat + beta_hat * xgrid, 'k-', lw=2, alpha=0.8, label='best fit')
ax.grid()
ax.legend(loc='upper left')


Out[307]:
<matplotlib.legend.Legend at 0x7f77a21026a0>

Roots and fixed points

Let's choose an arbitrary function to work with


In [308]:
fig, ax = plt.subplots()
def f(x):
    return np.sin(4 * (x - 0.25)) + x + x**20 - 1
x = np.linspace(0, 1, 100)
ax.plot(x, f(x))
ax.plot(x, 0 * x)


Out[308]:
[<matplotlib.lines.Line2D at 0x7f77a2129c18>]

In [309]:
from scipy.optimize import bisect  # Bisection algorithm --- slow but robust
bisect(f, 0, 1)


Out[309]:
0.4082935042797544

In [310]:
from scipy.optimize import newton  #  Newton's method --- fast but less robust
newton(f, 0.2)   # Start the search at initial condition x = 0.2


Out[310]:
0.40829350427935679

In [311]:
newton(f, 0.7)   # Start the search at x = 0.7 instead


Out[311]:
0.70017000000002816

Here we see that the algorithm gets it wrong --- newton is fast but not robust

Let's try a hybrid method


In [312]:
from scipy.optimize import brentq
brentq(f, 0, 1) # Hybrid method


Out[312]:
0.40829350427936706

In [313]:
%timeit bisect(f, 0, 1)


10000 loops, best of 3: 64.7 µs per loop

In [314]:
%timeit newton(f, 0.2)


100000 loops, best of 3: 14.2 µs per loop

In [315]:
%timeit brentq(f, 0, 1)


100000 loops, best of 3: 16.1 µs per loop

Note that the hybrid method is robust but still quite fast...

Numerical optimization and integration


In [316]:
from scipy.optimize import fminbound
fminbound(lambda x: x**2, -1, 2)  # Search in [-1, 2]


Out[316]:
0.0

In [317]:
from scipy.integrate import quad
integral, error = quad(lambda x: x**2, 0, 1)
integral


Out[317]:
0.33333333333333337
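quad also accepts infinite limits of integration via np.inf. As a quick sketch, $\int_0^\infty e^{-x} dx = 1$, and quad recovers this closely:


In [ ]:
integral, error = quad(lambda x: np.exp(-x), 0, np.inf)
integral   # Approximately 1.0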

Linear Algebra

Let's look at some of the most common routines from linear and matrix algebra


In [318]:
import scipy.linalg as la

We'll experiment with matrices

$$ A = \begin{bmatrix} 2 & -1 \\ 3 & 0 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} 1 \\ 1 \end{bmatrix} $$

In [319]:
A = [[2, -1],
     [3, 0]]
A = np.array(A) # Convert from list to NumPy array
b = np.ones((2, 1))  # Shape is 2 x 1

In [320]:
A


Out[320]:
array([[ 2, -1],
       [ 3,  0]])

In [321]:
b


Out[321]:
array([[ 1.],
       [ 1.]])

In [322]:
x = la.solve(A, b)  # Solve for x in Ax = b
print(x)


[[ 0.33333333]
 [-0.33333333]]

Let's check that $Ax = b$


In [323]:
np.dot(A, x)


Out[323]:
array([[ 1.],
       [ 1.]])

We can also invert directly


In [324]:
la.inv(A)


Out[324]:
array([[ 0.        ,  0.33333333],
       [-1.        ,  0.66666667]])

In [325]:
np.dot(A, la.inv(A))  # Should be the identity


Out[325]:
array([[ 1.,  0.],
       [ 0.,  1.]])

Let's compute the eigenvalues and eigenvectors


In [326]:
eigvals, eigvecs = la.eig(A)

In [327]:
print("eigenvalues = {}".format(eigvals))


eigenvalues = [ 1.+1.41421356j  1.-1.41421356j]

In [328]:
print("first eigenvector = {}".format(eigvecs[:, 0]))


first eigenvector = [ 0.28867513+0.40824829j  0.86602540+0.j        ]
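As a sanity check, each eigenpair should satisfy $A v = \lambda v$; here's a quick verification of the first one:


In [ ]:
v, lam = eigvecs[:, 0], eigvals[0]
np.allclose(np.dot(A, v), lam * v)   # Should be True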

More information

For more on SciPy, see the SciPy documentation; a slower, more systematic treatment is available at quant-econ.net.

Pandas

Pandas is a very popular library for working with data sets. In pandas, data is held in a DataFrame, which is a bit like a spreadsheet


In [329]:
import pandas as pd

Let's start by writing a test data set to the present working directory, so we can read it back in as a dataframe using pandas. We use an IPython magic to write the data from a cell to a file:


In [330]:
%%file test_data.csv
"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"


Overwriting test_data.csv

In [331]:
%ls ./*.csv # Check it's there


./test_data.csv

In [332]:
df = pd.read_csv('./test_data.csv')

In [333]:
df


Out[333]:
         country country isocode  year          POP       XRAT           tcgdp         cc         cg
0      Argentina             ARG  2000    37335.653   0.999500   295072.218690  75.716805   5.578804
1      Australia             AUS  2000    19053.186   1.724830   541804.652100  67.759026   6.720098
2          India             IND  2000  1006300.297  44.941600  1728144.374800  64.575551  14.072206
3         Israel             ISR  2000     6114.570   4.077330   129253.894230  64.436451  10.266688
4         Malawi             MWI  2000    11801.505  59.543808     5026.221784  74.707624  11.658954
5   South Africa             ZAF  2000    45064.098   6.939830   227242.369490  72.718710   5.726546
6  United States             USA  2000   282171.957   1.000000  9898700.000000  72.347054   6.032454
7        Uruguay             URY  2000     3219.793  12.099592    25255.961693  78.978740   5.108068

Let's try that again but this time using the country as the index column


In [334]:
df = pd.read_csv('./test_data.csv', index_col='country')

In [335]:
df


Out[335]:
              country isocode  year          POP       XRAT           tcgdp         cc         cg
country
Argentina                 ARG  2000    37335.653   0.999500   295072.218690  75.716805   5.578804
Australia                 AUS  2000    19053.186   1.724830   541804.652100  67.759026   6.720098
India                     IND  2000  1006300.297  44.941600  1728144.374800  64.575551  14.072206
Israel                    ISR  2000     6114.570   4.077330   129253.894230  64.436451  10.266688
Malawi                    MWI  2000    11801.505  59.543808     5026.221784  74.707624  11.658954
South Africa              ZAF  2000    45064.098   6.939830   227242.369490  72.718710   5.726546
United States             USA  2000   282171.957   1.000000  9898700.000000  72.347054   6.032454
Uruguay                   URY  2000     3219.793  12.099592    25255.961693  78.978740   5.108068
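With country as the index, rows can be selected by label via the loc attribute; a minimal sketch:


In [ ]:
df.loc['India']            # The full row for India
df.loc['India', 'tcgdp']   # A single entry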

Let's drop the year since it's not very informative


In [336]:
df.drop(['year'], axis=1, inplace=True)
df


Out[336]:
              country isocode          POP       XRAT           tcgdp         cc         cg
country
Argentina                 ARG    37335.653   0.999500   295072.218690  75.716805   5.578804
Australia                 AUS    19053.186   1.724830   541804.652100  67.759026   6.720098
India                     IND  1006300.297  44.941600  1728144.374800  64.575551  14.072206
Israel                    ISR     6114.570   4.077330   129253.894230  64.436451  10.266688
Malawi                    MWI    11801.505  59.543808     5026.221784  74.707624  11.658954
South Africa              ZAF    45064.098   6.939830   227242.369490  72.718710   5.726546
United States             USA   282171.957   1.000000  9898700.000000  72.347054   6.032454
Uruguay                   URY     3219.793  12.099592    25255.961693  78.978740   5.108068

Let's add a column for GDP per capita


In [337]:
df['GDP percap'] = df['tcgdp'] / df['POP']

In [338]:
df


Out[338]:
              country isocode          POP       XRAT           tcgdp         cc         cg  GDP percap
country
Argentina                 ARG    37335.653   0.999500   295072.218690  75.716805   5.578804    7.903229
Australia                 AUS    19053.186   1.724830   541804.652100  67.759026   6.720098   28.436433
India                     IND  1006300.297  44.941600  1728144.374800  64.575551  14.072206    1.717325
Israel                    ISR     6114.570   4.077330   129253.894230  64.436451  10.266688   21.138673
Malawi                    MWI    11801.505  59.543808     5026.221784  74.707624  11.658954    0.425897
South Africa              ZAF    45064.098   6.939830   227242.369490  72.718710   5.726546    5.042648
United States             USA   282171.957   1.000000  9898700.000000  72.347054   6.032454   35.080382
Uruguay                   URY     3219.793  12.099592    25255.961693  78.978740   5.108068    7.843971

Let's sort the whole data frame by GDP per capita


In [339]:
df.sort_values(by='GDP percap', inplace=True)

In [340]:
df


Out[340]:
              country isocode          POP       XRAT           tcgdp         cc         cg  GDP percap
country
Malawi                    MWI    11801.505  59.543808     5026.221784  74.707624  11.658954    0.425897
India                     IND  1006300.297  44.941600  1728144.374800  64.575551  14.072206    1.717325
South Africa              ZAF    45064.098   6.939830   227242.369490  72.718710   5.726546    5.042648
Uruguay                   URY     3219.793  12.099592    25255.961693  78.978740   5.108068    7.843971
Argentina                 ARG    37335.653   0.999500   295072.218690  75.716805   5.578804    7.903229
Israel                    ISR     6114.570   4.077330   129253.894230  64.436451  10.266688   21.138673
Australia                 AUS    19053.186   1.724830   541804.652100  67.759026   6.720098   28.436433
United States             USA   282171.957   1.000000  9898700.000000  72.347054   6.032454   35.080382

Now we'll plot per capita GDP using the dataframe's plot method


In [341]:
df['GDP percap'].plot(kind='bar')


Out[341]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f77a207cda0>

Exercises

Here are two exercises. Feel free to consult documentation such as can be found here. The solutions are below. The cell with "solution below" is meant to push them below your line of sight and save you from temptation.

Exercise 1

Generate 10000 data points from the exponential distribution with density

$$ f(x; \alpha) = \alpha \exp(-\alpha x) \qquad (x > 0, \alpha > 0) $$

using scipy.stats and taking $\alpha = 0.5$. Then, after looking up the maximum likelihood estimator of $\alpha$, compute the estimate given your data and check that it is in fact close to $\alpha$.


In [342]:
# Put your solution here

Exercise 2

Using the same data set, implement maximum likelihood again, but this time pretending that you don't know the analytical expression for the maximum likelihood estimator. Set up the log likelihood function and maximize it numerically using a routine from scipy.optimize.


In [343]:
# Put your solution here

Solutions


In [344]:
# Print some nonsense to partially hide solutions
filler_text = "solution below\n" * 25
print(filler_text)


solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below

Solution to Exercise 1

After checking the docs for the exponential distribution we proceed as follows


In [345]:
from scipy.stats import expon
alpha = 0.5
n = 10000
ep = expon(scale=1.0/alpha)  # The scale parameter is 1/alpha
x = ep.rvs(n)

Let's check we've got the right distribution here


In [346]:
fig, ax = plt.subplots(figsize=(8, 5))
xmin, xmax = 0.001, 10.0
ax.set_xlim(xmin, xmax)
ax.hist(x, normed=True, bins=40, alpha=0.3)
grid = np.linspace(xmin, xmax, 200)
ax.plot(grid, ep.pdf(grid), 'g-', lw=2, label='true density')
ax.legend()


Out[346]:
<matplotlib.legend.Legend at 0x7f77a2118048>

It's well known that the MLE of $\alpha$ is $1/\bar x$, where $\bar x$ is the mean of the sample. Let's check that it is indeed close to $\alpha$.
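For completeness, here's the short derivation: the log likelihood of the sample is maximized where its derivative in $\alpha$ vanishes

$$ \ell(\alpha) = n \log \alpha - \alpha \sum_{i=1}^n x_i, \qquad \ell'(\alpha) = \frac{n}{\alpha} - \sum_{i=1}^n x_i = 0 \quad \Longrightarrow \quad \hat \alpha = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar x} $$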


In [347]:
alpha_mle = 1.0 / x.mean()
print("max likelihood estimate of alpha is {}".format(alpha_mle))


max likelihood estimate of alpha is 0.5042739418869412

Solution to Exercise 2

We set up the negative log likelihood so we can minimize it numerically


In [348]:
s = x.sum()
def neg_loglike(a):
    "Minus the log likelihood function for exponential"
    return - n * np.log(a) + a * s

Minimize over a reasonable parameter space


In [349]:
from scipy.optimize import fminbound
fminbound(neg_loglike, 0.01, 10.0)


Out[349]:
0.5042730725954131

This is very close to the analytical value of the maximum likelihood estimator we got in Exercise 1

