Introduction to the SciPy and statsmodels libraries


In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

The SciPy library is one of the core packages that make up the SciPy stack. It provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization.

1. File input/output - scipy.io

SciPy provides an io module to load and save some common data types. For instance, we can easily read and write MATLAB .mat files using io.loadmat and io.savemat.


In [ ]:
from scipy.io import loadmat, savemat
a = np.ones((3, 3))
savemat('file.mat', {'a': a}) # savemat expects a dictionary
data = loadmat('file.mat', struct_as_record=True)
data['a']
EXERCISE - `scipy.io`:
  • Load the matfile `data/spectra.mat` using `scipy.io.loadmat`.
  • Extract the two variables (`spectra`, `frequency`) from the loaded dictionary. You should call `ravel` on the `frequency` array to obtain a 1-D array.
  • Plot the spectra as a function of frequency. A possible solution is sketched after the empty cells below.

In [ ]:


In [ ]:


In [ ]:


In [ ]:
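
Below is a possible solution sketch; it assumes the file stores the two variables under the keys `spectra` and `frequency` (as stated in the exercise), with one spectrum per row of `spectra`.


In [ ]:
mat_data = loadmat('data/spectra.mat')
spectra = mat_data['spectra']              # assumed shape: (n_spectra, n_freqs)
frequency = mat_data['frequency'].ravel()  # flatten to a 1-D array
plt.plot(frequency, spectra.T)             # one curve per spectrum
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
plt.show()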

2. Signal interpolation - scipy.interpolate

The scipy.interpolate module is useful for fitting a function to experimental data and thus evaluating points where no measurement exists. Imagine experimental data close to a sine function:


In [ ]:
measured_time = np.linspace(0, 1, 10)
noise = (np.random.random(10)*2 - 1) * 1e-1
measures = np.sin(2 * np.pi * measured_time) + noise

The scipy.interpolate.interp1d class can build a linear interpolation function:


In [ ]:
from scipy.interpolate import interp1d
linear_interp = interp1d(measured_time, measures)

Then the linear_interp instance needs to be evaluated at the times of interest:


In [ ]:
computed_time = np.linspace(0, 1, 50)
linear_results = linear_interp(computed_time)

A cubic interpolation can also be selected by providing the kind optional keyword argument:


In [ ]:
cubic_interp = interp1d(measured_time, measures, kind='cubic')
cubic_results = cubic_interp(computed_time)

Let's see the difference by plotting the results.


In [ ]:
plt.plot(measured_time, measures, 'or', label='Measures')
plt.plot(computed_time, linear_results, label='Linear interpolation')
plt.plot(computed_time, cubic_results, label='Cubic interpolation')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.show()
EXERCISE - `scipy.interpolate`:
  • Interpolate the spectra at the integer frequencies {401, 402, ..., 3999} using `scipy.interpolate.interp1d`.
  • Plot the interpolated spectra as a function of frequency. A possible solution is sketched after the empty cells below.

In [ ]:


In [ ]:
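
Below is a possible solution sketch, reusing the `spectra` and `frequency` arrays from the previous exercise. By default, `interp1d` interpolates along the last axis, which matches spectra stored row-wise.


In [ ]:
frequency_interp = np.arange(401, 4000)  # integer frequencies {401, ..., 3999}
spectra_interp = interp1d(frequency, spectra)(frequency_interp)
plt.plot(frequency_interp, spectra_interp.T)
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
plt.show()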

3. Optimization - scipy.optimize

Optimization is the problem of finding a numerical solution to a minimization problem or an equation.

The scipy.optimize module provides useful algorithms for function minimization (scalar or multi-dimensional), curve fitting and root finding.


In [ ]:
from scipy import optimize

Finding the minimum of a scalar function

Let’s define the following function:


In [ ]:
def f(x):
    return x ** 2 + 10 * np.sin(x)

and plot it:


In [ ]:
x = np.arange(-10, 10, 0.1)
plt.plot(x, f(x))
plt.show()

This function has a global minimum around -1.3 and a local minimum around 3.8.

The general and efficient way to find a minimum of this function is to conduct a gradient descent starting from a given initial point. The BFGS algorithm (here its limited-memory, bound-constrained variant L-BFGS-B) is a good way of doing this:


In [ ]:
res = optimize.minimize(f, 0, method='L-BFGS-B')
res

A possible issue with this approach is that, if the function has local minima, the algorithm may find one of them instead of the global minimum, depending on the initial point:


In [ ]:
res2 = optimize.minimize(f, 3, method='L-BFGS-B')
res2

If we don’t know the neighborhood of the global minimum to choose the initial point, we need to resort to costlier global optimization. To find the global minimum, we use scipy.optimize.basinhopping() (which combines a local optimizer with stochastic sampling of starting points for the local optimizer):


In [ ]:
optimize.basinhopping(f, 3, niter=1000)

Finding the roots of a scalar function

To find a root, i.e. a point where $f(x) = 0$, of the function f above we can use for example scipy.optimize.fsolve():


In [ ]:
root = optimize.fsolve(f, 1)  # our initial guess is 1
root

Note that only one root is found. Inspecting the plot of f reveals that there is a second root around -2.5. We can find it by adjusting our initial guess:


In [ ]:
root2 = optimize.fsolve(f, -2.5)
root2

Curve fitting

Suppose we have data sampled from $f$ with some noise:


In [ ]:
xdata = np.linspace(-10, 10, num=100)
ydata = f(xdata) + np.random.normal(0, 2, xdata.shape)

Now if we know the functional form of the function from which the samples were drawn ($a x^2 + b \sin(x)$ in this case) but not the amplitudes of the terms, we can find those by least-squares curve fitting. First we have to define the function to fit:


In [ ]:
def f2(x, a, b):
    return a*x**2 + b*np.sin(x)

Then we can use scipy.optimize.curve_fit() to find $a$ and $b$:


In [ ]:
guess = [2, 2]
params, params_covariance = optimize.curve_fit(f2, xdata, ydata, guess)
params

Summary in a single plot


In [ ]:
x = np.arange(-10, 10, 0.1)
plt.plot(xdata, ydata, label='data')
# plot the minima found earlier (global and local)
plt.plot(res.x, f(res.x), 'or', label='minimum')
plt.plot(res2.x, f(res2.x), 'or')
# plot the roots
plt.plot(root, f(root), '^g', label='roots')
plt.plot(root2, f(root2), '^g')
# plot the fitted curve
plt.plot(x, f2(x, params[0], params[1]), '--', label='fitted')
plt.legend()
plt.show()
EXERCISE - `scipy.optimize`: The previous spectra can be modelled using a simple function `model_bi_functions` which we defined as:

$$ S(f)=\left\{ \begin{array}{ll} a f + b, & 0 < f < \mu - 3 \sigma \\ \left( a (\mu - 3 \sigma) + b \right) + \frac{s}{\sigma \sqrt{2 \pi}} \exp\left( - \frac{(f - \mu)^{2}}{2 \sigma^{2}} \right), & f \geq \mu - 3 \sigma\\ \end{array} \right. $$ where $s$ is the `scale` parameter of the Gaussian profile. See below a plot which illustrates the profile of this function.
  • Using `scipy.optimize.curve_fit`, fit `model_bi_functions` to the first spectrum from `spectra_interp`. You also have to use `frequency_interp` as the `x` values. Use the initial parameters `[0.0, 0.01, 100, 3300, 300]`.
  • Plot the results. A possible solution is sketched after the helper code below.

In [ ]:
# import helper regarding normal distribution
from scipy.stats import norm

def find_nearest_index(array, value):
    """Find the nearest index of a value in an array."""
    idx = (np.abs(array - value)).argmin()
    return idx

def model_bi_functions(freqs, a=1e-5, b=0.01,
                       scale=100, mu=3300, sigma=300):
    """Model to be fitted.
    
    It corresponds to a line from [0, f0] and a
    Normal distribution profile from [f0, end].
    
    Parameters
    ----------
    freqs : ndarray, shape (n_freqs,)
        Frequencies for which the spectrum will be calculated
        
    a : float, (default=1e-5)
        Slope of the line.
        
    b : float, (default=0.01)
Value where the line cuts the y-axis.
        
    scale : float, (default=100)
        Scaling factor for the amplitude of the Gaussian profile.
        
    mu : float, (default=3300)
        Central value of the Gaussian profile.
        
    sigma : float, (default=300)
        Standard deviation of the Gaussian profile.
    
    """
    y = np.zeros(freqs.shape)
    # find the index of the inflexion point
    f0_idx = find_nearest_index(freqs, mu - 3 * sigma)
    # line equation
    y[:f0_idx] = a * freqs[:f0_idx] + b
    # Gaussian profile
    y[f0_idx:] = ((a * freqs[f0_idx] + b) +
                  (scale * norm.pdf(freqs[f0_idx:], mu, sigma)))
    return y

In [ ]:
y = model_bi_functions(frequency_interp)
plt.plot(frequency_interp, y)
plt.xlabel('Frequency')
plt.ylabel('Amplitude')

In [ ]:


In [ ]:
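
Below is a possible solution sketch; it assumes `frequency_interp` and `spectra_interp` were computed in the interpolation exercise, with the first spectrum in `spectra_interp[0]`.


In [ ]:
popt, pcov = optimize.curve_fit(model_bi_functions, frequency_interp,
                                spectra_interp[0],  # assumed first spectrum
                                p0=[0.0, 0.01, 100, 3300, 300])
plt.plot(frequency_interp, spectra_interp[0], label='Data')
plt.plot(frequency_interp, model_bi_functions(frequency_interp, *popt),
         '--', label='Fitted model')
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
plt.legend()
plt.show()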

4. Numerical integration - scipy.integrate

Given a function object, the most generic integration routine is scipy.integrate.quad().


In [ ]:
from scipy.integrate import quad
res, err = quad(np.sin, 0, np.pi / 2)
res

If only fixed samples are given, the trapezoidal rule (scipy.integrate.trapz()) or Simpson's rule (scipy.integrate.simps()) can be used.


In [ ]:
x = np.linspace(0, np.pi / 2, num=200)
y = np.sin(x)

In [ ]:
from scipy.integrate import simps
res = simps(y, x)
res
EXERCISE - `scipy.integrate`: We are interested in the area under the Gaussian profile since it is related to what we want to quantify.
  • Using `scipy.integrate.simps`, compute the area under the Gaussian profile between $\mu - 3 \sigma$ and $\mu + 3 \sigma$. Those parameters can be found in the results of the curve fitting previously done. The indexes corresponding to the interval bounds can be computed using `find_nearest_index`.
  • You can do the same using the original data to see the difference in quantification. A possible solution is sketched below.

In [ ]:
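
Below is a possible solution sketch, reusing the fitted parameters `popt` from the curve-fitting exercise; note that the linear baseline is not subtracted from the computed areas.


In [ ]:
a, b, scale, mu, sigma = popt
idx_min = find_nearest_index(frequency_interp, mu - 3 * sigma)
idx_max = find_nearest_index(frequency_interp, mu + 3 * sigma)
# area under the fitted Gaussian profile
fitted = model_bi_functions(frequency_interp, *popt)
area_fitted = simps(fitted[idx_min:idx_max],
                    frequency_interp[idx_min:idx_max])
# same computation on the original data for comparison
area_data = simps(spectra_interp[0][idx_min:idx_max],
                  frequency_interp[idx_min:idx_max])
area_fitted, area_data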

5. Linear algebra - scipy.linalg

The scipy.linalg module offers basic operations used in linear algebra, such as the inverse (scipy.linalg.inv), pseudo-inverse (scipy.linalg.pinv), and determinant (scipy.linalg.det), as well as standard decompositions such as SVD, QR, or Cholesky, among others.

`np.array` vs. `np.matrix`:

By default, multiplying two `np.array`s with the `*` operator does not perform a matrix multiplication; it is an elementwise operation. You need to use `np.dot` to perform a matrix multiplication.

Another possibility is to convert the `np.array`s to `np.matrix`, for which the `*` operator performs matrix multiplication. The operations become more readable when a lot of algebraic operations are involved.

We illustrate this behaviour in the example below.

Let's declare two arrays of shape $3 \times 3$ and $3 \times 1$, respectively.


In [ ]:
A = np.array([[ 3,  3, -1],
              [ 2, -3,  4],
              [-1, .5, -1]])

b = np.array([[ 1],
              [-2],
              [ 0]])

Using the * operator does not lead to a matrix multiplication, since the returned matrix has shape $3 \times 3$ while a matrix product would be $3 \times 1$. Instead, each column of $A$ is multiplied elementwise by the vector $b$ (broadcasting).


In [ ]:
A * b

You need to use the function np.dot to obtain the matrix multiplication.


In [ ]:
np.dot(A, b)

However, by converting $A$ and $b$ to matrices (i.e., np.matrix), it is possible to use the * operator directly.


In [ ]:
A = np.matrix(A)
b = np.matrix(b)

A * b
EXERCISE - `scipy.linalg`:
  • Solve the following system of linear equations using the normal equation.

$$ \left[\begin{array}{ccc} 3 & 3 & -1 \\ 2 & -3 & 4 \\ -1 & 0.5 & -1 \end{array}\right] \left[\begin{array}{c} x_1 \\ x_2 \\ x_3 \end{array}\right] = \left[\begin{array}{c} -1 \\ -2 \\ 0 \end{array}\right] $$ This problem can be seen as: $$ A x = b $$ $x$ can be found such that: $$ x = (A^{T} A)^{-1} A^{T} b $$ Find $x$ using the above equation (a possible solution is sketched below).

In [ ]:


In [ ]:
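
Below is a possible solution sketch applying the normal equation, with `A` and `b` taken from the exercise statement.


In [ ]:
from scipy.linalg import inv

A = np.array([[ 3,  3, -1],
              [ 2, -3,  4],
              [-1, .5, -1]])
b = np.array([[-1],
              [-2],
              [ 0]])
# normal equation: x = (A^T A)^{-1} A^T b
x = np.dot(inv(np.dot(A.T, A)), np.dot(A.T, b))
x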

EXERCISE:
  • Solve the following system of linear equations using SVD.

The above problem can also be solved using an SVD decomposition, such that: $$ x = V S^{-1} (U^{T} b) $$ where $U$, $S$, and $V^{T}$ can be found with `scipy.linalg.svd`, such that `U, S, Vh = svd(A)`. A possible solution is sketched below.

In [ ]:
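
Below is a possible solution sketch, reusing `A` and `b` from the previous sketch; `svd` returns `Vh` (i.e., $V^{T}$) and the singular values `S` as a 1-D array, so `np.diag` rebuilds the matrix $S^{-1}$.


In [ ]:
from scipy.linalg import svd

U, S, Vh = svd(A)
# x = V S^{-1} (U^T b)
x_svd = np.dot(Vh.T, np.dot(np.diag(1. / S), np.dot(U.T, b)))
x_svd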

6. Statistics - scipy.stats and statsmodels

scipy.stats

scipy.stats mainly contains helpers for the most common continuous and discrete distributions.

In addition, this module contains statistical functions, such as routines to perform statistical tests.
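
As a small illustration, here is a sketch drawing random samples from a standard normal distribution and evaluating its density:


In [ ]:
from scipy.stats import norm
samples = norm.rvs(loc=0, scale=1, size=5, random_state=0)
samples, norm.pdf(0.)

For the statistical tests below, we first load a dataset of brain size and IQ measures: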


In [ ]:
import pandas as pd
data = pd.read_csv('data/brain_size.csv', sep=';', na_values=".")
data.head()

1-sample t-test

scipy.stats.ttest_1samp() tests if the population mean of data is likely to be equal to a given value. Let's see if the VIQ of our population is equal to 0.


In [ ]:
from scipy.stats import ttest_1samp

ttest_1samp(data['VIQ'], 0)

With a p-value of $10^{-28}$ we can reject the null hypothesis and claim that the population mean of IQ (the VIQ measure) is not 0.

2-sample t-test

scipy.stats.ttest_ind() can compare two populations and check whether the difference is significant. We can study whether there is a difference in VIQ between males and females.


In [ ]:
groupby_gender = data.groupby('Gender')
for gender, value in groupby_gender['VIQ']:
    print((gender, value.mean()))

To see if this difference is significant, we can use scipy.stats.ttest_ind().


In [ ]:
from scipy.stats import ttest_ind
female_viq = data[data['Gender'] == 'Female']['VIQ']
male_viq = data[data['Gender'] == 'Male']['VIQ']
ttest_ind(female_viq, male_viq)
EXERCISE:
  • Test the difference between weights in males and females. You can fill the missing data using `pandas.fillna()` with the mean weight of the population.
  • Use non-parametric statistics to test the difference between VIQ in males and females (refer to `scipy.stats.mannwhitneyu`). A possible solution is sketched after the empty cells below.


In [ ]:


In [ ]:


In [ ]:
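
Below is a possible solution sketch; it fills the missing weights with the mean weight of the population before running the tests.


In [ ]:
from scipy.stats import mannwhitneyu

# fill the missing weights with the mean weight of the population
weight = data['Weight'].fillna(data['Weight'].mean())
female_weight = weight[data['Gender'] == 'Female']
male_weight = weight[data['Gender'] == 'Male']
print(ttest_ind(female_weight, male_weight))
# non-parametric test on the VIQ
print(mannwhitneyu(female_viq, male_viq))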

statsmodels

Given two sets of observations, x and y, we want to test the hypothesis that y is a linear function of x. In other terms: $$ y = x \times \text{coef} + \text{intercept} + e $$ where $e$ is observation noise. We will use the statsmodels module to:

  • Fit a linear model. We will use the simplest strategy, ordinary least squares (OLS).
  • Test that coef is non zero.

In [ ]:
x = np.linspace(-5, 5, 20)
np.random.seed(1)
# normally distributed noise
y = -5 + 3 * x + 4 * np.random.normal(size=x.shape)
# Create a data frame containing all the relevant variables
data = pd.DataFrame({'x': x, 'y': y})

Then we specify an OLS model and fit it:


In [ ]:
from statsmodels.formula.api import ols
model = ols("y ~ x + 1", data).fit()

We can inspect the various statistics derived from the fit:


In [ ]:
print(model.summary())

Intercept: We can remove the intercept using - 1 in the formula, or force the use of an intercept using + 1.
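
For instance, a minimal sketch fitting the same model without an intercept:


In [ ]:
model_no_intercept = ols("y ~ x - 1", data).fit()
print(model_no_intercept.params)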

Let's see another example: can VIQ be predicted using Gender?


In [ ]:
from statsmodels.formula.api import ols
data = pd.read_csv('data/brain_size.csv', sep=';', na_values=".")
model = ols("VIQ ~ Gender + 1", data).fit()
print(model.summary())
EXERCISE:
  • Run an OLS to check if Weight can be predicted using Gender and Height. A possible solution is sketched below.


In [ ]:
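
Below is a possible solution sketch using the same formula interface:


In [ ]:
model = ols("Weight ~ Gender + Height", data).fit()
print(model.summary())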