This notebook shows how I think we should do Mahalanobis distance for the SECIM project. From JMP Website:
The Mahalanobis distance takes into account the correlation structure of the data and the individual scales. For each value, the Mahalanobis distance is denoted $M_i$ and is computed as
$M_i = \sqrt{(Y_i - \bar Y)^T S^{-1}(Y_i - \bar Y)}$
where:
In [10]:
import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats as stats
import matplotlib.pyplot as plt
import cPickle as pickle
import os
%matplotlib inline
In [11]:
covSim = np.array([[1.0, .8, .2, .2],
[.8, 1.0, .3, .3],
[.3, .3, 1.0, .8],
[.2, .2, .8, 1.0]])
np.random.seed(111)
datSim = np.random.multivariate_normal([2, 3, 8, 9], covSim, size=1000)
dfSim = pd.DataFrame(data=datSim, columns=['sample1', 'sample2', 'sample3', 'sample4'])
# Save for comparing in sas
dfSim.to_csv('/home/jfear/tmp/dfSim.csv', index=False)
dfSim.head()
Out[11]:
In [12]:
# Calculate the covaranice matrix from the data
covHat = dfSim.cov()
covHat
Out[12]:
In [13]:
# Get the inverse of the covarance matrix
covHatInv = np.linalg.inv(covHat)
covHatInv
Out[13]:
In [14]:
# Calculate the column means
colMean = dfSim.mean(axis=0)
colMean
Out[14]:
In [15]:
# Subtract the mean from each value
dfSimCenter = (dfSim - colMean).T
dfSimCenter.head()
Out[15]:
In [16]:
# Calculate the mahalanobis distance
MD = np.sqrt(np.dot(np.dot(dfSimCenter.T, covHatInv), dfSimCenter))
MDval = np.diag(MD)
In [17]:
plt.scatter(x=range(len(MDval)), y=MDval)
plt.axhline(np.percentile(MDval, 95), ls='--', lw=2)
plt.show()
In [ ]: