Scientific Python Quickstart

ANU

This is a fast-paced, hands-on introduction to scientific computing with Python, contained in a Jupyter notebook. The main focus will be on introducing Python's four most important scientific libraries: NumPy, SciPy, Pandas and Matplotlib.

If you don't know how to use this notebook, you should first work through this page.

A slower, more detailed and more systematic treatment of Python for scientific applications can be found at quant-econ.net. But this notebook is a good place to start for those who like to learn by doing.

Here's some information on the version of Python that I'm using:


In [241]:
import sys
print(sys.version)


3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Dec  7 2015, 11:16:01) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]

Basic NumPy

Perhaps the single most important scientific library for Python is NumPy. NumPy provides foundational data structures and routines on which many other libraries rely.


In [242]:
import numpy as np     # Import library and give it alias np
print(np.__version__)  #  The version I'm using


1.10.2

NumPy defines a basic data type called an array (actually a numpy.ndarray)


In [243]:
a = np.zeros(3)     # Create an array of zeros
a                   # Print a


Out[243]:
array([ 0.,  0.,  0.])

In [244]:
type(a)


Out[244]:
numpy.ndarray

Note that array data must be homogeneous

The most important data types are:

  • float64: 64 bit floating point number
  • float32: 32 bit floating point number
  • int64: 64 bit integer
  • int32: 32 bit integer
  • bool: 8 bit True or False

There are also dtypes to represent complex numbers, unsigned integers, etc

On most machines, the default dtype for arrays is float64


In [245]:
a = np.zeros(3)
type(a[1])


Out[245]:
numpy.float64
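If you need a different dtype, you can pass it explicitly when the array is created. Here's a minimal sketch (the exact integer type is platform dependent --- numpy.int64 on most 64-bit systems):


In [ ]:
z = np.zeros(3, dtype=int)  # Integer zeros instead of the default float64
type(z[0])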

When we create an array such as


In [246]:
z = np.zeros(10)

z is a "flat" array with no dimension--- neither row nor column vector:


In [247]:
z.shape


Out[247]:
(10,)

Here the shape tuple has only one element, which is the length of the array (tuples with one element end with a comma)

To give it dimension, we can change the shape attribute

For example, let's make it a column vector


In [248]:
z.shape = (10, 1)

In [249]:
z


Out[249]:
array([[ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.]])

In [250]:
z = np.zeros(4)
z.shape = (2, 2)
z


Out[250]:
array([[ 0.,  0.],
       [ 0.,  0.]])
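Assigning to the shape attribute modifies the array in place. If you want to keep the original intact, the reshape method returns a reshaped view instead; a small sketch:


In [ ]:
z = np.zeros(4)
w = z.reshape(2, 2)  # A 2 x 2 view of the data; z itself is unchanged
w.shape, z.shape     # ((2, 2), (4,))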

Creating arrays

Creating empty arrays --- memory is allocated but left uninitialized:


In [251]:
z = np.empty(3)
z


Out[251]:
array([ 0.,  0.,  0.])

These are just garbage numbers --- whatever happened to be in those memory slots (here they happen to look like zeros)

Here's how to make a regular grid sequence


In [252]:
z = np.linspace(2, 4, 5)  # From 2 to 4, with 5 elements
z


Out[252]:
array([ 2. ,  2.5,  3. ,  3.5,  4. ])

Creating an array of ones


In [253]:
z = np.ones(3)
z


Out[253]:
array([ 1.,  1.,  1.])

In [254]:
z = np.identity(2)
z


Out[254]:
array([[ 1.,  0.],
       [ 0.,  1.]])

Arrays can be made from Python lists or tuples


In [255]:
z = np.array([10, 20]) 
z


Out[255]:
array([10, 20])

In [256]:
z = np.array((10, 20), dtype=float) 
z


Out[256]:
array([ 10.,  20.])

In [257]:
z = np.array([[1, 2], [3, 4]])         # 2D array from a list of lists
z


Out[257]:
array([[1, 2],
       [3, 4]])

Array indexing


In [258]:
z = np.linspace(1, 2, 5)
z


Out[258]:
array([ 1.  ,  1.25,  1.5 ,  1.75,  2.  ])

In [259]:
z[0]  # First element --- Python sequences are zero based, like C, Java, etc.


Out[259]:
1.0

In [260]:
z[-1]  # Special syntax for last element


Out[260]:
2.0

In [261]:
z[0:2]  # Meaning: Two elements, starting from element 0


Out[261]:
array([ 1.  ,  1.25])

In [262]:
z = np.array([[1, 2], [3, 4]])
z


Out[262]:
array([[1, 2],
       [3, 4]])

In [263]:
z[0, 0]


Out[263]:
1

In [264]:
z[0,:]  # First row


Out[264]:
array([1, 2])

In [265]:
z[:,0]  # First column


Out[265]:
array([1, 3])

In [266]:
z = np.linspace(2, 4, 5)
z


Out[266]:
array([ 2. ,  2.5,  3. ,  3.5,  4. ])

In [267]:
d = np.array([0, 1, 1, 0, 0], dtype=bool)
d


Out[267]:
array([False,  True,  True, False, False], dtype=bool)

In [268]:
z[d]


Out[268]:
array([ 2.5,  3. ])
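Boolean masks also work on the left-hand side of an assignment, giving a concise way to overwrite selected elements. A minimal sketch on a fresh array, so the values above are unaffected:


In [ ]:
w = np.linspace(2, 4, 5)
w[d] = 0.0   # Zero out the elements where the mask is True
w            # array([ 2. ,  0. ,  0. ,  3.5,  4. ])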

Array methods


In [269]:
A = np.array((4, 3, 2, 1))
A


Out[269]:
array([4, 3, 2, 1])

In [270]:
A.sort()

In [271]:
A


Out[271]:
array([1, 2, 3, 4])

In [272]:
A.mean()


Out[272]:
2.5

In [273]:
A.sum()


Out[273]:
10

In [274]:
A.max()


Out[274]:
4

In [275]:
A.cumsum()


Out[275]:
array([ 1,  3,  6, 10])

In [276]:
A.var()


Out[276]:
1.25

In [277]:
A.shape = (2, 2)
A


Out[277]:
array([[1, 2],
       [3, 4]])

In [278]:
A.T  # Transpose, equivalent to A.transpose()


Out[278]:
array([[1, 3],
       [2, 4]])

Operations on arrays

Standard arithmetic operations on arrays act elementwise


In [279]:
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

In [280]:
a + b


Out[280]:
array([ 6,  8, 10, 12])

In [281]:
a - b


Out[281]:
array([-4, -4, -4, -4])

In [282]:
a + 10


Out[282]:
array([11, 12, 13, 14])
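The scalar 10 above is broadcast across all elements of a. The same broadcasting rules let NumPy combine arrays of different but compatible shapes; here's a small sketch:


In [ ]:
m = np.ones((2, 2))
v = np.array([10, 20])
m + v   # The length-2 vector is added to each row of m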

In [283]:
a.shape = 2, 2
b.shape = 2, 2

In [284]:
a


Out[284]:
array([[1, 2],
       [3, 4]])

In [285]:
b


Out[285]:
array([[5, 6],
       [7, 8]])

In [286]:
a * b # Elementwise multiplication --- not matrix multiplication!


Out[286]:
array([[ 5, 12],
       [21, 32]])

In [287]:
np.dot(a, b) # Matrix multiplication


Out[287]:
array([[19, 22],
       [43, 50]])

For Python $\geq 3.5$ and NumPy $\geq 1.10$ the @ operator also works.


In [288]:
a @ b


Out[288]:
array([[19, 22],
       [43, 50]])

I'll continue to use np.dot below for the benefit of those who are using older versions. But in my opinion the @ operator is much nicer.

Comparisons


In [289]:
z = np.array([2, 3])
y = np.array([2, 3])
z == y


Out[289]:
array([ True,  True], dtype=bool)

In [290]:
y[0] = 3
z == y


Out[290]:
array([False,  True], dtype=bool)

In [291]:
z = np.linspace(0, 10, 5)
z


Out[291]:
array([  0. ,   2.5,   5. ,   7.5,  10. ])

In [292]:
z > 3


Out[292]:
array([False, False,  True,  True,  True], dtype=bool)

In [293]:
z[z > 3]  # Conditional extraction


Out[293]:
array([  5. ,   7.5,  10. ])
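Conditions can be combined elementwise with & and | --- note that the parentheses below are required because of operator precedence. A quick sketch:


In [ ]:
z[(z > 3) & (z < 10)]   # array([ 5. ,  7.5])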

Matplotlib

Matplotlib is an outstanding plotting and visualization library for Python that interacts nicely with NumPy. Here are a few quick examples. We'll see more below when we discuss the SciPy library.


In [294]:
import matplotlib.pyplot as plt  # Import main functionality

Display figures in this browser window rather than having them open up separately:


In [295]:
%matplotlib inline

Create something to plot


In [296]:
x = np.linspace(-2, 2, 100)
y = x**2

In [297]:
fig, ax = plt.subplots()  # Create axes and figure window
ax.plot(x, y, 'b-')


Out[297]:
[<matplotlib.lines.Line2D at 0x7f77a347d048>]

Here's a slightly more complex plot


In [298]:
y3 = x**3
fig, ax = plt.subplots()  # Create axes and figure window
ax.plot(x, y, 'b-', lw=2, alpha=0.8, label='$x^2$')
ax.plot(x, y3, 'g-', lw=2, alpha=0.8, label='$x^3$')
ax.legend(loc='lower right')


Out[298]:
<matplotlib.legend.Legend at 0x7f77a228fcf8>
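A figure can be written to disk with the savefig method. A minimal sketch (the filename here is arbitrary --- the format is inferred from the extension):


In [ ]:
fig.savefig('cubics.png')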

SciPy

Let's just cover some simple examples --- references for further reading are below

Statistics and distributions

Let's use scipy.stats to generate some data from the Beta distribution


In [299]:
from scipy.stats import beta
q = beta(5, 5)      # Beta(a, b), with a = b = 5
obs = q.rvs(2000)   # 2000 observations

Now let's histogram it and compare it to the original density


In [300]:
fig, ax = plt.subplots()
ax.hist(obs, bins=40, normed=True)
grid = np.linspace(0.01, 0.99, 100)
ax.plot(grid, q.pdf(grid), 'k-', linewidth=2)


Out[300]:
[<matplotlib.lines.Line2D at 0x7f77a22d2ef0>]

Other methods


In [301]:
type(q)


Out[301]:
scipy.stats._distn_infrastructure.rv_frozen

In [302]:
dir(q)  # Let's see all its methods


Out[302]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'a',
 'args',
 'b',
 'cdf',
 'dist',
 'entropy',
 'expect',
 'interval',
 'isf',
 'kwds',
 'logcdf',
 'logpdf',
 'logpmf',
 'logsf',
 'mean',
 'median',
 'moment',
 'pdf',
 'pmf',
 'ppf',
 'random_state',
 'rvs',
 'sf',
 'stats',
 'std',
 'var']

In [303]:
q.cdf(0.5)


Out[303]:
0.50000000000000011

In [304]:
q.pdf(0.5)


Out[304]:
2.4609375000000009

In [305]:
q.mean()


Out[305]:
0.5
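The ppf method (the quantile function) is the inverse of cdf. Since this Beta distribution is symmetric, the median should coincide with the mean; a quick check:


In [ ]:
q.ppf(0.5)   # Should be (approximately) 0.5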

Basic linear regression:


In [306]:
from scipy.stats import linregress
n = 100
alpha, beta, sigma = 1, 2, 1.5  # NB: this rebinds the name beta imported from scipy.stats above
x = np.random.randn(n)  # n standard normals
y = alpha + beta * x + sigma * np.random.randn(n)
beta_hat, alpha_hat, r_value, p_value, std_err = linregress(x, y)
print("gradient = {}".format(beta_hat))
print("intercept = {}".format(alpha_hat))


gradient = 1.874172371477785
intercept = 1.4035329152756753

Let's plot the data along with the line of best fit


In [307]:
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, y, 'bo', alpha=0.6, label='observations')
xgrid = np.linspace(-3, 3, 2)
ax.plot(xgrid, alpha_hat + beta_hat * xgrid, 'k-', lw=2, alpha=0.8, label='best fit')
ax.grid()
ax.legend(loc='upper left')


Out[307]:
<matplotlib.legend.Legend at 0x7f77a21026a0>

Roots and fixed points

Let's choose an arbitrary function to work with


In [308]:
fig, ax = plt.subplots()
def f(x):
    return np.sin(4 * (x - 0.25)) + x + x**20 - 1
x = np.linspace(0, 1, 100)
ax.plot(x, f(x))
ax.plot(x, 0 * x)


Out[308]:
[<matplotlib.lines.Line2D at 0x7f77a2129c18>]

In [309]:
from scipy.optimize import bisect  # Bisection algorithm --- slow but robust
bisect(f, 0, 1)


Out[309]:
0.4082935042797544

In [310]:
from scipy.optimize import newton  #  Newton's method --- fast but less robust
newton(f, 0.2)   # Start the search at initial condition x = 0.2


Out[310]:
0.40829350427935679

In [311]:
newton(f, 0.7)   # Start the search at x = 0.7 instead


Out[311]:
0.70017000000002816

Here we see that the algorithm gets it wrong --- newton is fast but not robust

Let's try a hybrid method


In [312]:
from scipy.optimize import brentq
brentq(f, 0, 1) # Hybrid method


Out[312]:
0.40829350427936706

In [313]:
%timeit bisect(f, 0, 1)


10000 loops, best of 3: 64.7 µs per loop

In [314]:
%timeit newton(f, 0.2)


100000 loops, best of 3: 14.2 µs per loop

In [315]:
%timeit brentq(f, 0, 1)


100000 loops, best of 3: 16.1 µs per loop

Note that the hybrid method is robust but still quite fast...

Numerical optimization and integration


In [316]:
from scipy.optimize import fminbound
fminbound(lambda x: x**2, -1, 2)  # Search in [-1, 2]


Out[316]:
0.0

In [317]:
from scipy.integrate import quad
integral, error = quad(lambda x: x**2, 0, 1)
integral


Out[317]:
0.33333333333333337
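quad also accepts infinite limits of integration via np.inf. As a quick sketch, $\int_0^\infty e^{-x} dx = 1$, and quad recovers this closely:


In [ ]:
integral, error = quad(lambda x: np.exp(-x), 0, np.inf)
integral   # Approximately 1.0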

Linear Algebra

Let's look at some of the most common routines from linear and matrix algebra


In [318]:
import scipy.linalg as la

We'll experiment with matrices

$$ A = \begin{bmatrix} 2 & -1 \\ 3 & 0 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} 1 \\ 1 \end{bmatrix} $$

In [319]:
A = [[2, -1],
     [3, 0]]
A = np.array(A) # Convert from list to NumPy array
b = np.ones((2, 1))  # Shape is 2 x 1

In [320]:
A


Out[320]:
array([[ 2, -1],
       [ 3,  0]])

In [321]:
b


Out[321]:
array([[ 1.],
       [ 1.]])

In [322]:
x = la.solve(A, b)  # Solve for x in Ax = b
print(x)


[[ 0.33333333]
 [-0.33333333]]

Let's check that $Ax = b$


In [323]:
np.dot(A, x)


Out[323]:
array([[ 1.],
       [ 1.]])

We can also invert directly


In [324]:
la.inv(A)


Out[324]:
array([[ 0.        ,  0.33333333],
       [-1.        ,  0.66666667]])

In [325]:
np.dot(A, la.inv(A))  # Should be the identity


Out[325]:
array([[ 1.,  0.],
       [ 0.,  1.]])

Let's compute the eigenvalues and eigenvectors


In [326]:
eigvals, eigvecs = la.eig(A)

In [327]:
print("eigenvalues = {}".format(eigvals))


eigenvalues = [ 1.+1.41421356j  1.-1.41421356j]

In [328]:
print("first eigenvector = {}".format(eigvecs[:, 0]))


first eigenvector = [ 0.28867513+0.40824829j  0.86602540+0.j        ]
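As a sanity check, each eigenpair should satisfy $A v = \lambda v$; here's a quick verification of the first one:


In [ ]:
v, lam = eigvecs[:, 0], eigvals[0]
np.allclose(np.dot(A, v), lam * v)   # Should be True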

More information

For more on SciPy, see the SciPy documentation; a slower, more systematic treatment is available at quant-econ.net.

Pandas

Pandas is a very popular library for working with data sets. In pandas, data is held in a DataFrame, which is a bit like a spreadsheet


In [329]:
import pandas as pd

Let's start by writing a test data set to the present working directory, so we can read it back in as a dataframe using pandas. We use an IPython magic to write the data from a cell to a file:


In [330]:
%%file test_data.csv
"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"


Overwriting test_data.csv

In [331]:
%ls ./*.csv # Check it's there


./test_data.csv

In [332]:
df = pd.read_csv('./test_data.csv')

In [333]:
df


Out[333]:
         country country isocode  year          POP       XRAT           tcgdp         cc         cg
0      Argentina             ARG  2000    37335.653   0.999500   295072.218690  75.716805   5.578804
1      Australia             AUS  2000    19053.186   1.724830   541804.652100  67.759026   6.720098
2          India             IND  2000  1006300.297  44.941600  1728144.374800  64.575551  14.072206
3         Israel             ISR  2000     6114.570   4.077330   129253.894230  64.436451  10.266688
4         Malawi             MWI  2000    11801.505  59.543808     5026.221784  74.707624  11.658954
5   South Africa             ZAF  2000    45064.098   6.939830   227242.369490  72.718710   5.726546
6  United States             USA  2000   282171.957   1.000000  9898700.000000  72.347054   6.032454
7        Uruguay             URY  2000     3219.793  12.099592    25255.961693  78.978740   5.108068

Let's try that again but this time using the country as the index column


In [334]:
df = pd.read_csv('./test_data.csv', index_col='country')

In [335]:
df


Out[335]:
              country isocode  year          POP       XRAT           tcgdp         cc         cg
country
Argentina                 ARG  2000    37335.653   0.999500   295072.218690  75.716805   5.578804
Australia                 AUS  2000    19053.186   1.724830   541804.652100  67.759026   6.720098
India                     IND  2000  1006300.297  44.941600  1728144.374800  64.575551  14.072206
Israel                    ISR  2000     6114.570   4.077330   129253.894230  64.436451  10.266688
Malawi                    MWI  2000    11801.505  59.543808     5026.221784  74.707624  11.658954
South Africa              ZAF  2000    45064.098   6.939830   227242.369490  72.718710   5.726546
United States             USA  2000   282171.957   1.000000  9898700.000000  72.347054   6.032454
Uruguay                   URY  2000     3219.793  12.099592    25255.961693  78.978740   5.108068
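With country as the index, rows can be selected by label via the loc attribute; a minimal sketch:


In [ ]:
df.loc['India']            # The full row for India
df.loc['India', 'tcgdp']   # A single entry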

Let's drop the year since it's not very informative


In [336]:
df.drop(['year'], axis=1, inplace=True)
df


Out[336]:
              country isocode          POP       XRAT           tcgdp         cc         cg
country
Argentina                 ARG    37335.653   0.999500   295072.218690  75.716805   5.578804
Australia                 AUS    19053.186   1.724830   541804.652100  67.759026   6.720098
India                     IND  1006300.297  44.941600  1728144.374800  64.575551  14.072206
Israel                    ISR     6114.570   4.077330   129253.894230  64.436451  10.266688
Malawi                    MWI    11801.505  59.543808     5026.221784  74.707624  11.658954
South Africa              ZAF    45064.098   6.939830   227242.369490  72.718710   5.726546
United States             USA   282171.957   1.000000  9898700.000000  72.347054   6.032454
Uruguay                   URY     3219.793  12.099592    25255.961693  78.978740   5.108068

Let's add a column for GDP per capita


In [337]:
df['GDP percap'] = df['tcgdp'] / df['POP']

In [338]:
df


Out[338]:
              country isocode          POP       XRAT           tcgdp         cc         cg  GDP percap
country
Argentina                 ARG    37335.653   0.999500   295072.218690  75.716805   5.578804    7.903229
Australia                 AUS    19053.186   1.724830   541804.652100  67.759026   6.720098   28.436433
India                     IND  1006300.297  44.941600  1728144.374800  64.575551  14.072206    1.717325
Israel                    ISR     6114.570   4.077330   129253.894230  64.436451  10.266688   21.138673
Malawi                    MWI    11801.505  59.543808     5026.221784  74.707624  11.658954    0.425897
South Africa              ZAF    45064.098   6.939830   227242.369490  72.718710   5.726546    5.042648
United States             USA   282171.957   1.000000  9898700.000000  72.347054   6.032454   35.080382
Uruguay                   URY     3219.793  12.099592    25255.961693  78.978740   5.108068    7.843971

Let's sort the whole data frame by GDP per capita


In [339]:
df.sort_values(by='GDP percap', inplace=True)

In [340]:
df


Out[340]:
              country isocode          POP       XRAT           tcgdp         cc         cg  GDP percap
country
Malawi                    MWI    11801.505  59.543808     5026.221784  74.707624  11.658954    0.425897
India                     IND  1006300.297  44.941600  1728144.374800  64.575551  14.072206    1.717325
South Africa              ZAF    45064.098   6.939830   227242.369490  72.718710   5.726546    5.042648
Uruguay                   URY     3219.793  12.099592    25255.961693  78.978740   5.108068    7.843971
Argentina                 ARG    37335.653   0.999500   295072.218690  75.716805   5.578804    7.903229
Israel                    ISR     6114.570   4.077330   129253.894230  64.436451  10.266688   21.138673
Australia                 AUS    19053.186   1.724830   541804.652100  67.759026   6.720098   28.436433
United States             USA   282171.957   1.000000  9898700.000000  72.347054   6.032454   35.080382

Now we'll plot per capita GDP using the dataframe's plot method


In [341]:
df['GDP percap'].plot(kind='bar')


Out[341]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f77a207cda0>

Exercises

Here are two exercises. Feel free to consult documentation such as can be found here. The solutions are below. The cell with "solution below" is meant to push them below your line of sight and save you from temptation.

Exercise 1

Generate 10000 data points from the exponential distribution with density

$$ f(x; \alpha) = \alpha \exp(-\alpha x) \qquad (x > 0, \alpha > 0) $$

using scipy.stats and taking $\alpha = 0.5$. Then, after looking up the maximum likelihood estimator of $\alpha$, compute the estimate given your data and check that it is in fact close to $\alpha$.


In [342]:
# Put your solution here

Exercise 2

Using the same data set, implement maximum likelihood again, but this time pretending that you don't know the analytical expression for the maximum likelihood estimator. Set up the log likelihood function and maximize it numerically using a routine from scipy.optimize.


In [343]:
# Put your solution here

Solutions


In [344]:
# Print some nonsense to partially hide solutions
filler_text = "solution below\n" * 25
print(filler_text)


solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below

Solution to Exercise 1

After checking the docs for the exponential distribution we proceed as follows


In [345]:
from scipy.stats import expon
alpha = 0.5
n = 10000
ep = expon(scale=1.0/alpha)  # The scale parameter is 1/alpha
x = ep.rvs(n)

Let's check we've got the right distribution here


In [346]:
fig, ax = plt.subplots(figsize=(8, 5))
xmin, xmax = 0.001, 10.0
ax.set_xlim(xmin, xmax)
ax.hist(x, normed=True, bins=40, alpha=0.3)
grid = np.linspace(xmin, xmax, 200)
ax.plot(grid, ep.pdf(grid), 'g-', lw=2, label='true density')
ax.legend()


Out[346]:
<matplotlib.legend.Legend at 0x7f77a2118048>

It's well known that the MLE of $\alpha$ is $1/\bar x$, where $\bar x$ is the mean of the sample. Let's check that it is indeed close to $\alpha$.
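For completeness, here's the short derivation: the log likelihood of the sample is maximized where its derivative in $\alpha$ vanishes

$$ \ell(\alpha) = n \log \alpha - \alpha \sum_{i=1}^n x_i, \qquad \ell'(\alpha) = \frac{n}{\alpha} - \sum_{i=1}^n x_i = 0 \quad \Longrightarrow \quad \hat \alpha = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar x} $$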


In [347]:
alpha_mle = 1.0 / x.mean()
print("max likelihood estimate of alpha is {}".format(alpha_mle))


max likelihood estimate of alpha is 0.5042739418869412

Solution to Exercise 2

We set up the negative log likelihood so we can minimize it numerically


In [348]:
s = x.sum()
def neg_loglike(a):
    "Minus the log likelihood function for exponential"
    return - n * np.log(a) + a * s

Minimize over a reasonable parameter space


In [349]:
from scipy.optimize import fminbound
fminbound(neg_loglike, 0.01, 10.0)


Out[349]:
0.5042730725954131

This is very close to the analytical value of the maximum likelihood estimator we got in Exercise 1

