In [60]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')
Name: Abraham D. Flaxman
Date: 2/22/2015
Totally new example project: measuring quality of health facilities. Approach based on "structural quality" in Donabedian's paradigm, but since I want to understand how method can breakdown, I will be using entirely simulated data.
"Ultimately, the secret of quality is love." ---Avedis Donabedian
Generate simulated logit_love
for each facility from mixture of normal distributions; then generate facility-specific indicators of technical resource availability from a distribution that includes the level of love, a thing-level effect, and interactions for complementarity and substitutability.
In [61]:
# set random seed for reproducibility
np.random.seed(123456)
n = 200 # number of facilities
# means and spreads for the logit_love mixture distribution
mu = [2, 0, -2]
sigma = [.6, .4, .6]
# sample logit_love from mixture of gaussians
cluster = np.empty(n, dtype=int)
logit_love = np.empty(n)
for i in range(n):
cluster[i] = np.random.choice(range(len(mu)))
logit_love[i] = np.random.normal(mu[cluster[i]], sigma[cluster[i]])
In [62]:
sns.distplot(logit_love, rug=True)
Out[62]:
In [63]:
# now generate preliminary resource availability indicators
p = 20
logit_thing = np.random.normal(0., 2., size=p)
X = np.empty((n,p))
for i in range(n):
for j in range(p):
X[i,j] = np.random.uniform() < 1 / (1 + np.exp(-(logit_love[i] + logit_thing[j])))
In [64]:
# now simulate complementarity and substitutability
comp_pr = .1 # expect 10% of resources to be complementary
complementary_to = np.empty(p)
for j in range(p):
if np.random.uniform() < comp_pr:
complementary_to[j] = np.random.choice(range(p))
else:
complementary_to[j] = np.nan
subs_pr = .1 # expect 10% of resources to be complementary
substitutable_to = np.empty(p)
for j in range(p):
if np.random.uniform() < subs_pr:
substitutable_to[j] = np.random.choice(range(p))
else:
substitutable_to[j] = np.nan
In [65]:
for i in range(n):
for j in range(p):
# only include a complementary resource if the resource it complements is present
j_prime = complementary_to[j]
if not np.isnan(j_prime):
if X[i,j_prime] == False:
X[i,j] = False
# only include a substitutable resource if the resource that substitutes for it is absent
j_prime = substitutable_to[j]
if not np.isnan(j_prime):
if X[i,j_prime] == True:
X[i,j] = False
TODO: refactor that into a function!
In [66]:
sns.distplot(X.sum(1) + np.random.normal(0., .1, size=n), rug=True)
Out[66]:
In simulation, we can check how the quality scores compare to the true level of love in each facility. But this is "unsupervised learning", because in our real application there is no direct way to check our work.
Since we will not have a real way to check our work, the qualitative/visual inspection of a scatter plot of logit(love) versus the predicted quality seems like a good way to go for now.
In [67]:
plt.plot(logit_love, X.sum(1), 'o')
plt.xlabel('logit(love)')
plt.ylabel('sum of indicators')
Out[67]:
In [68]:
import sklearn.decomposition
In [69]:
pca = sklearn.decomposition.PCA(n_components=1)
pca_quality = pca.fit_transform(X)
In [70]:
plt.plot(logit_love, pca_quality, 'o')
plt.xlabel('logit(love)')
plt.ylabel('$PC_0$')
Out[70]:
In [71]:
pca = sklearn.decomposition.KernelPCA(n_components=1, kernel='rbf', gamma=.001)
pca_quality = pca.fit_transform(X)
plt.plot(logit_love, pca_quality, 'o')
plt.xlabel('logit(love)')
plt.ylabel('$PC_0$')
Out[71]:
Lots of fun to be had with this, although it is not clear that it is a good idea...
In [ ]: