Introduction to Scipy and Statsmodels libraries
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
The SciPy library is one of the core packages of the SciPy stack. It provides many user-friendly and efficient numerical routines, for instance for numerical integration and optimization.
SciPy provides an io module to load and save data in several formats. For instance, MATLAB .mat files can easily be read and written using io.loadmat and io.savemat.
In [ ]:
from scipy.io import loadmat, savemat
a = np.ones((3, 3))
savemat('file.mat', {'a': a}) # savemat expects a dictionary
data = loadmat('file.mat', struct_as_record=True)
data['a']
The scipy.interpolate module is useful for fitting a function to experimental data and thus evaluating it at points where no measure exists. Let's imagine experimental data close to a sine function:
In [ ]:
measured_time = np.linspace(0, 1, 10)
noise = (np.random.random(10)*2 - 1) * 1e-1
measures = np.sin(2 * np.pi * measured_time) + noise
The scipy.interpolate.interp1d class can build a linear interpolation function:
In [ ]:
from scipy.interpolate import interp1d
linear_interp = interp1d(measured_time, measures)
Then the linear_interp instance needs to be evaluated at the times of interest:
In [ ]:
computed_time = np.linspace(0, 1, 50)
linear_results = linear_interp(computed_time)
A cubic interpolation can also be selected by providing the kind optional keyword argument:
In [ ]:
cubic_interp = interp1d(measured_time, measures, kind='cubic')
cubic_results = cubic_interp(computed_time)
Let's see the difference by plotting the results.
In [ ]:
plt.plot(measured_time, measures, 'or', label='Measures')
plt.plot(computed_time, linear_results, label='Linear interpolation')
plt.plot(computed_time, cubic_results, label='Cubic interpolation')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.show()
Optimization is the problem of finding a numerical solution to a minimization or equality problem.
The scipy.optimize module provides useful algorithms for function minimization (scalar or multi-dimensional), curve fitting and root finding.
In [ ]:
from scipy import optimize
Let's define the following function:
In [ ]:
def f(x):
    return x ** 2 + 10 * np.sin(x)
and plot it:
In [ ]:
x = np.arange(-10, 10, 0.1)
plt.plot(x, f(x))
plt.show()
This function has a global minimum around -1.3 and a local minimum around 3.8.
A general and efficient way to find a minimum of this function is to conduct a gradient-based descent starting from a given initial point. The L-BFGS-B algorithm, used below, is a good way of doing this:
In [ ]:
res = optimize.minimize(f, 0, method='L-BFGS-B')
res
A possible issue with this approach is that, if the function has local minima, the algorithm may find one of them instead of the global minimum, depending on the initial point:
In [ ]:
res2 = optimize.minimize(f, 3, method='L-BFGS-B')
res2
If we don't know the neighborhood of the global minimum to choose the initial point, we need to resort to costlier global optimization. To find the global minimum, we use scipy.optimize.basinhopping() (which combines a local optimizer with stochastic sampling of starting points for the local optimizer):
In [ ]:
optimize.basinhopping(f, 3, niter=1000)
To find the roots of f, i.e. the points where f(x) = 0, we can use scipy.optimize.fsolve():
In [ ]:
root = optimize.fsolve(f, 1) # our initial guess is 1
root
Note that only one root is found. Inspecting the plot of f reveals that there is a second root around -2.5. We can find its exact value by adjusting our initial guess:
In [ ]:
root2 = optimize.fsolve(f, -2.5)
root2
Now suppose we have data sampled from f with some noise:
In [ ]:
xdata = np.linspace(-10, 10, num=100)
ydata = f(xdata) + np.random.normal(0, 2, xdata.shape)
Now if we know the functional form of the function from which the samples were drawn ($x^2 + \sin(x)$ in this case) but not the amplitudes of the terms, we can find those by least squares curve fitting. First we have to define the function to fit:
In [ ]:
def f2(x, a, b):
    return a * x**2 + b * np.sin(x)
Then we can use scipy.optimize.curve_fit() to find $a$ and $b$:
In [ ]:
guess = [2, 2]
params, params_covariance = optimize.curve_fit(f2, xdata, ydata, guess)
params
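The diagonal of the returned covariance matrix gives an estimate of the variance of each fitted parameter; a small sketch of how it is commonly used:
In [ ]:
# one-standard-deviation uncertainties on a and b
perr = np.sqrt(np.diag(params_covariance))
perr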
We can summarize all of these results on a single plot:
In [ ]:
x = np.arange(-10, 10, 0.1)
plt.plot(xdata, ydata)
# plot the minima found
plt.plot(res.x, f(res.x), 'or', label='minimum')
plt.plot(res2.x, f(res2.x), 'or')
# plot the roots
plt.plot(root, f(root), '^g', label='roots')
plt.plot(root2, f(root2), '^g')
# plot the fitted curve
plt.plot(x, f2(x, params[0], params[1]), '--', label='fitted')
plt.legend()
plt.show()
We can combine these tools. For instance, the following model describes a spectrum made of a straight line followed by a Gaussian profile, using the normal distribution helper from scipy.stats:
In [ ]:
# import helper regarding normal distribution
from scipy.stats import norm
def find_nearest_index(array, value):
    """Find the nearest index of a value in an array."""
    idx = (np.abs(array - value)).argmin()
    return idx

def model_bi_functions(freqs, a=1e-5, b=0.01,
                       scale=100, mu=3300, sigma=300):
    """Model to be fitted.

    It corresponds to a line from [0, f0] and a
    Normal distribution profile from [f0, end].

    Parameters
    ----------
    freqs : ndarray, shape (n_freqs,)
        Frequencies for which the spectrum will be calculated.
    a : float, (default=1e-5)
        Slope of the line.
    b : float, (default=0.01)
        Value where the line cuts the y-axis.
    scale : float, (default=100)
        Scaling factor for the amplitude of the Gaussian profile.
    mu : float, (default=3300)
        Central value of the Gaussian profile.
    sigma : float, (default=300)
        Standard deviation of the Gaussian profile.
    """
    y = np.zeros(freqs.shape)
    # find the index of the inflexion point
    f0_idx = find_nearest_index(freqs, mu - 3 * sigma)
    # line equation
    y[:f0_idx] = a * freqs[:f0_idx] + b
    # Gaussian profile
    y[f0_idx:] = ((a * freqs[f0_idx] + b) +
                  (scale * norm.pdf(freqs[f0_idx:], mu, sigma)))
    return y
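The next cell assumes a frequency grid named frequency_interp, presumably built earlier in the notebook; if it is not available, a simple stand-in grid can be used (an assumption made here purely for illustration):
In [ ]:
# hypothetical stand-in: a regular frequency grid wide enough to
# cover the Gaussian bump centered at mu=3300
frequency_interp = np.linspace(0, 6000, 1000)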
In [ ]:
y = model_bi_functions(frequency_interp)
plt.plot(frequency_interp, y)
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
Given a function object, the most generic integration routine is scipy.integrate.quad().
In [ ]:
from scipy.integrate import quad
res, err = quad(np.sin, 0, np.pi / 2)
res
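quad() also returns an estimate of the absolute error. Here the exact value of the integral is 1, so we can check that estimate directly (a quick sketch):
In [ ]:
print(err)           # estimated absolute error reported by quad
print(abs(res - 1))  # actual error: the exact integral of sin on [0, pi/2] is 1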
If only fixed samples are given, the trapezoidal rule (scipy.integrate.trapz()) or Simpson's rule (scipy.integrate.simps()) can be used.
In [ ]:
x = np.linspace(0, np.pi / 2, num=200)
y = np.sin(x)
In [ ]:
from scipy.integrate import simps
res = simps(y, x)
res
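For comparison, the trapezoidal rule on the same samples gives a similar, typically slightly less accurate, result (a small sketch):
In [ ]:
from scipy.integrate import trapz
res_trapz = trapz(y, x)
res_trapz  # also close to 1.0, the exact value of the integral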
The scipy.linalg module offers basic operations used in linear algebra, such as the inverse (scipy.linalg.inv), pseudo-inverse (scipy.linalg.pinv), and determinant (scipy.linalg.det), as well as standard decompositions such as SVD, QR, or Cholesky, among others.
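As a small sketch of these routines (using a 2 x 2 invertible matrix chosen here only for illustration):
In [ ]:
from scipy import linalg
M = np.array([[3., 2.],
              [1., 4.]])
print(linalg.det(M))                         # determinant: 10.0
M_inv = linalg.inv(M)                        # inverse of M
print(np.allclose(M.dot(M_inv), np.eye(2)))  # True
U, s, Vh = linalg.svd(M)                     # singular value decomposition
print(s)                                     # singular values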
Let's declare two arrays of shape $3 \times 3$ and $3 \times 1$, respectively.
In [ ]:
A = np.array([[ 3, 3, -1],
              [ 2, -3, 4],
              [-1, .5, -1]])
b = np.array([[ 1],
              [-2],
              [ 0]])
Using the * operator does not lead to a matrix multiplication, since the array returned is a $3 \times 3$ array rather than the expected $3 \times 1$ result. Instead, it multiplies each column of $A$ element-wise by the vector $b$.
In [ ]:
A * b
You need to use the function np.dot to obtain the matrix multiplication.
In [ ]:
np.dot(A, b)
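Since Python 3.5, the @ operator also performs matrix multiplication on plain NumPy arrays, which is equivalent here:
In [ ]:
A @ b  # same result as np.dot(A, b)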
However, by converting $A$ and $b$ to matrices (i.e., np.matrix), it is possible to use the * operator directly.
In [ ]:
A = np.matrix(A)
b = np.matrix(b)
A * b
scipy.stats mainly contains helpers for the most common continuous and discrete distributions. In addition, this module contains statistical functions to perform statistical tests, for instance.
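As a small sketch of the distribution helpers, using the standard normal distribution:
In [ ]:
from scipy.stats import norm
print(norm.pdf(0.))      # density of the standard normal at 0 (about 0.399)
print(norm.cdf(1.96))    # cumulative probability (about 0.975)
print(norm.rvs(size=3))  # three random draws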
In [ ]:
import pandas as pd
data = pd.read_csv('data/brain_size.csv', sep=';', na_values=".")
data.head()
scipy.stats.ttest_1samp() tests whether the population mean of a sample is equal to a given value. For instance, we can test whether the mean VIQ differs from 0:
In [ ]:
from scipy.stats import ttest_1samp
ttest_1samp(data['VIQ'], 0)
With a p-value of about $10^{-28}$, we can claim that the population mean of the VIQ measure is not 0.
scipy.stats.ttest_ind() can compare two populations and check whether the difference is significant. We can study whether there is a difference in VIQ between males and females.
In [ ]:
groupby_gender = data.groupby('Gender')
for gender, value in groupby_gender['VIQ']:
    print((gender, value.mean()))
To see if this difference is significant, we can use scipy.stats.ttest_ind().
In [ ]:
from scipy.stats import ttest_ind
female_viq = data[data['Gender'] == 'Female']['VIQ']
male_viq = data[data['Gender'] == 'Male']['VIQ']
ttest_ind(female_viq, male_viq)
Given two sets of observations, x and y, we want to test the hypothesis that y is a linear function of x. In other terms: $$ y = x \times \mathrm{coef} + \mathrm{intercept} + e $$ where e is observation noise. We will use the statsmodels module to fit this model by ordinary least squares (OLS) and test whether the coefficient is significantly different from zero.
In [ ]:
x = np.linspace(-5, 5, 20)
np.random.seed(1)
# normally distributed noise
y = -5 + 3 * x + 4 * np.random.normal(size=x.shape)
# Create a data frame containing all the relevant variables
data = pd.DataFrame({'x': x, 'y': y})
Then we specify an OLS model and fit it:
In [ ]:
from statsmodels.formula.api import ols
model = ols("y ~ x + 1", data).fit()
We can inspect the various statistics derived from the fit:
In [ ]:
print(model.summary())
Intercept: We can remove the intercept using - 1 in the formula, or force the use of an intercept using + 1.
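For example, a small sketch of the same fit without an intercept (since the true intercept here is -5, the fit will be noticeably worse):
In [ ]:
model_no_intercept = ols("y ~ x - 1", data).fit()
print(model_no_intercept.params)  # only a slope is fitted, no intercept term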
Let's see another example: can VIQ be predicted using Gender?
In [ ]:
from statsmodels.formula.api import ols
data = pd.read_csv('data/brain_size.csv', sep=';', na_values=".")
model = ols("VIQ ~ Gender + 1", data).fit()
print(model.summary())
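The fitted coefficients can also be inspected directly; the Gender coefficient corresponds to the mean VIQ difference between the two groups, and its t-test is equivalent to the two-sample t-test computed earlier with scipy.stats.ttest_ind() (assuming equal variances):
In [ ]:
print(model.params)  # Intercept and Gender[T.Male] coefficients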