In [60]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')

Improved Model One Notebook

Name: Abraham D. Flaxman

Date: 2/22/2015

A Brief Recap of the Data

A totally new example project: measuring the quality of health facilities. The approach is based on "structural quality" in Donabedian's paradigm, but since I want to understand how the method can break down, I will be using entirely simulated data.

"Ultimately, the secret of quality is love." ---Avedis Donabedian

Generate a simulated logit_love for each facility from a mixture of normal distributions; then generate facility-specific indicators of technical resource availability from a distribution that depends on the facility's level of love, a thing-level effect, and interactions for complementarity and substitutability.
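
In symbols (my notation, matching the code below): if $\ell_i$ is the logit_love of facility $i$ and $t_j$ is the thing-level effect of resource $j$, then before the complementarity/substitutability step each preliminary indicator is drawn as

$$X_{ij} \sim \mathrm{Bernoulli}\!\left(\frac{1}{1 + e^{-(\ell_i + t_j)}}\right).$$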


In [61]:
# set random seed for reproducibility
np.random.seed(123456)

n = 200 # number of facilities

# means and spreads for the logit_love mixture distribution
mu = [2, 0, -2]
sigma = [.6, .4, .6]

# sample logit_love from mixture of gaussians
cluster = np.empty(n, dtype=int)
logit_love = np.empty(n)

for i in range(n):
    cluster[i] = np.random.choice(range(len(mu)))
    logit_love[i] = np.random.normal(mu[cluster[i]], sigma[cluster[i]])

In [62]:
sns.distplot(logit_love, rug=True)


Out[62]:
<matplotlib.axes.AxesSubplot at 0x7f6353901910>

In [63]:
# now generate preliminary resource availability indicators
p = 20

logit_thing = np.random.normal(0., 2., size=p)

X = np.empty((n,p))
for i in range(n):
    for j in range(p):
        X[i,j] = np.random.uniform() < 1 / (1 + np.exp(-(logit_love[i] + logit_thing[j])))

In [64]:
# now simulate complementarity and substitutability

comp_pr = .1 # expect 10% of resources to be complementary
complementary_to = np.empty(p)
for j in range(p):
    if np.random.uniform() < comp_pr:
        complementary_to[j] = np.random.choice(range(p))
    else:
        complementary_to[j] = np.nan

subs_pr = .1 # expect 10% of resources to be substitutable
substitutable_to = np.empty(p)
for j in range(p):
    if np.random.uniform() < subs_pr:
        substitutable_to[j] = np.random.choice(range(p))
    else:
        substitutable_to[j] = np.nan

In [65]:
for i in range(n):
    for j in range(p):
        # only include a complementary resource if the resource it complements is present
        j_prime = complementary_to[j]
        if not np.isnan(j_prime):
            if X[i, int(j_prime)] == 0:
                X[i, j] = 0

        # only include a substitutable resource if the resource that substitutes for it is absent
        j_prime = substitutable_to[j]
        if not np.isnan(j_prime):
            if X[i, int(j_prime)] == 1:
                X[i, j] = 0

TODO: refactor that into a function!
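
A minimal sketch of what that refactor might look like (the name apply_interactions is just a placeholder, and the function is not used elsewhere in this notebook):


In [ ]:
def apply_interactions(X, complementary_to, substitutable_to):
    """Apply the complementarity/substitutability rules to an indicator matrix."""
    X = X.copy()
    n, p = X.shape
    for i in range(n):
        for j in range(p):
            # drop a complementary resource when the resource it complements is absent
            j_prime = complementary_to[j]
            if not np.isnan(j_prime) and X[i, int(j_prime)] == 0:
                X[i, j] = 0

            # drop a substitutable resource when the resource that substitutes for it is present
            j_prime = substitutable_to[j]
            if not np.isnan(j_prime) and X[i, int(j_prime)] == 1:
                X[i, j] = 0
    return X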

A Brief Description of the Baseline Model

You might not believe this, but the baseline model, which we have already used and published results with, is to sum up the rows and call it done!


In [66]:
# jitter the integer sums slightly so the rug and density plot are easier to read
sns.distplot(X.sum(1) + np.random.normal(0., .1, size=n), rug=True)


Out[66]:
<matplotlib.axes.AxesSubplot at 0x7f6354376350>

A Brief Description of the Metric(s) of Success

In simulation, we can check how the quality scores compare to the true level of love in each facility. But this is "unsupervised learning", because in our real application there is no direct way to check our work.

A Brief Description of How to Measure the Performance of the Baseline Model

Since we will not have a real way to check our work, the qualitative/visual inspection of a scatter plot of logit(love) versus the predicted quality seems like a good way to go for now.


In [67]:
plt.plot(logit_love, X.sum(1), 'o')
plt.xlabel('logit(love)')
plt.ylabel('sum of indicators')


Out[67]:
<matplotlib.text.Text at 0x7f6352e76810>

A Brief Description of the Approach for Model One

For Model One, I am going to see whether Principal Component Analysis (PCA) does any better than simply summing the indicators.

Now try it out!


In [68]:
import sklearn.decomposition

In [69]:
pca = sklearn.decomposition.PCA(n_components=1)
pca_quality = pca.fit_transform(X)

In [70]:
plt.plot(logit_love, pca_quality, 'o')
plt.xlabel('logit(love)')
plt.ylabel('$PC_0$')


Out[70]:
<matplotlib.text.Text at 0x7f6352e69890>

How did it do?

Seems like I'll need a better metric of success to really compare.
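
One candidate metric, possible only because the data are simulated: the Spearman rank correlation between the true logit_love and each quality score. Just a sketch using scipy.stats:


In [ ]:
import scipy.stats

# rank correlation between the (simulation-only) truth and each quality score
for label, score in [('sum of indicators', X.sum(1)),
                     ('first principal component', pca_quality[:, 0])]:
    rho, p_val = scipy.stats.spearmanr(logit_love, score)
    print('%s: Spearman rho = %.2f' % (label, rho))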

What are the most promising directions to explore next?

Kernelized PCA!


In [71]:
pca = sklearn.decomposition.KernelPCA(n_components=1, kernel='rbf', gamma=.001)
pca_quality = pca.fit_transform(X)

plt.plot(logit_love, pca_quality, 'o')
plt.xlabel('logit(love)')
plt.ylabel('$PC_0$')


Out[71]:
<matplotlib.text.Text at 0x7f6352e039d0>

Lots of fun to be had with this, although it is not clear that it is a good idea...
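
One quick thing to try, sketched here: sweep a few values of gamma and eyeball how the first kernel PC tracks logit(love).


In [ ]:
# sketch: compare a few gamma values for the RBF kernel
for k, gamma in enumerate([.0001, .001, .01, .1]):
    kpca = sklearn.decomposition.KernelPCA(n_components=1, kernel='rbf', gamma=gamma)
    plt.subplot(2, 2, k + 1)
    plt.plot(logit_love, kpca.fit_transform(X), 'o')
    plt.title('gamma=%s' % gamma)
plt.tight_layout()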


In [ ]: