In [24]:

    
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')

Baseline Model Notebook

Name: Abraham D. Flaxman

Date: 2/17/2015

A Brief Recap of the Data (which may have changed since last week...)

Text of the Biostatistics and Global Health Metrics competencies, taken from the SPH degree competencies website (https://sph.washington.edu/prospective/programs/competencies.asp).

A Brief Description of the Baseline Model

I will start with penalized logistic regression, with an L2 penalty, and explore a range of $C$ values.
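
For reference, the penalized objective can be written as below. This is a sketch in the parameterization that, as far as I know, matches how scikit-learn defines $C$. The notation is an assumption for illustration: labels are coded as $y_i \in \{-1, +1\}$ (the code in this notebook uses string labels), $x_i$ is the vector of word counts for competency $i$, $\beta$ is the coefficient vector, and $\beta_0$ is the intercept.

$$\min_{\beta,\, \beta_0} \; \frac{1}{2} \beta^\top \beta + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top \beta + \beta_0\right)\right)\right)$$

Because $C$ multiplies the data-fit term, a small $C$ shrinks $\beta$ strongly toward zero, and a large $C$ gives a nearly unpenalized fit.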

A Brief Description of the Metric(s) of Success

Accuracy of predictions of whether a competency is for Biostats or GHM (labeled 'GH-HME' in the code below).

A Brief Description of how to measure the performance of the baseline model

I will use accuracy (out-of-sample, of course), estimated with 10-fold cross-validation (1 replicate, for now, to keep things relatively speedy). When I find the best $C$ value, I will refit on the whole dataset, look at the corresponding $\beta$ values, and see if they are interesting.
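
To make the measurement procedure concrete, here is a minimal pure-Python sketch of stratified k-fold accuracy. The helper names (`stratified_folds`, `cross_validated_accuracy`, `always_hme`) and the toy labels are hypothetical. The round-robin fold assignment is a simplification and not necessarily how scikit-learn assigns folds.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign each item index to a fold so every fold has a similar class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        # Deal this class's items out to the folds in round-robin order.
        for position, idx in enumerate(by_class[label]):
            folds[position % n_folds].append(idx)
    return folds

def cross_validated_accuracy(labels, n_folds, fit_predict):
    """Average held-out accuracy; fit_predict(train_idx, test_idx) returns predictions."""
    fold_scores = []
    for test_idx in stratified_folds(labels, n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predictions = fit_predict(train_idx, test_idx)
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        fold_scores.append(correct / float(len(test_idx)))
    return sum(fold_scores) / len(fold_scores)

def always_hme(train_idx, test_idx):
    """Stand-in 'model' (hypothetical): ignore the training data, always say GH-HME."""
    return ['GH-HME'] * len(test_idx)

# Hypothetical toy labels, not the real competency data.
toy_labels = ['GH-HME'] * 12 + ['Biostats'] * 8
toy_accuracy = cross_validated_accuracy(toy_labels, 4, always_hme)
print(toy_accuracy)
```

Every prediction is scored on a fold that was held out of training, which is what makes the accuracy out-of-sample.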

Now to try it out



In [2]:

    
corpus_biostats = []
label_biostats = []

# Biostatistics competencies for all MPH Students

for txt in """Select and interpret appropriate graphical displays and numerical summaries for both quantitative and categorical data.
Explain the logic and interpret the results of statistical hypothesis tests and confidence intervals.
Select appropriate measures of association of nominal and continuous variables.
Select appropriate methods for statistical inference to compare one group to a standard, or two or more groups to each other.
Develop or evaluate a statistical analysis plan to address the major research questions of a public health or biomedical study based on the data collected and the design of the study.
Explain the roles of sample size, power, and precision in standard study designs.""".split('\n'):
    corpus_biostats.append(txt)
    label_biostats.append('Biostats')

# Upon satisfactory completion of the MPH in Biostatistics, graduates will be able to:

for txt in """Meet the generic SPH learning objectives for the MPH degree;
Meet the Core-Specific Learning Objectives for all MPH students;
Select and interpret appropriate graphical displays and numerical summaries for both quantitative and categorical data;
Explain the logic and interpret the results of statistical hypothesis tests and confidence intervals;
Select appropriate methods for statistical inference to compare one group to a standard, or two or more groups to each other;
Use appropriate statistical techniques to perform multiple comparisons, to account for confounding or to gain precision;
Use appropriate regression analysis techniques for continuous, binary, count and censored-time-to-event outcomes to analyze independent data from medical and other public health studies;
Explain different modeling strategies employed in regression analysis, depending on whether the purpose of the analysis is to develop a predictive model or to make adjusted comparisons;
Develop or evaluate a statistical analysis plan to address the major research questions of a biomedical study based on the data collected and the design of the study;
Determine the sample size needed for a study; and
Communicate the aims and results of regression analyses of continuous, binary, count and censored-time-to-event outcomes, to an audience of non-statistician collaborators, including a full interpretation of relevant parameter estimates.""".split('\n'):
    corpus_biostats.append(txt)
    label_biostats.append('Biostats')



In [3]:

    
corpus_hme = []
label_hme = []

# Upon satisfactory completion of the MPH in Global Health, HME track, graduates will be able to:

for txt in """Meet the generic SPH learning objectives for the MPH degree;
Meet the Core-Specific Learning Objectives for all MPH students;
Meet the generic learning objectives of the DGH core curriculum:
Describe the most commoncauses of morbidity and mortality globally, both communicable and non-communicable, among newborns, children, adolescents, women, and men and apply this knowledge in the design, implementation, or evaluation of health services or programs;
Describe the major components ofhealth information systems (e.g., surveillance, national registries, surveys, administrative data) and some of the uses, challenges and limitations of gathering and using health statistics;
Analyze the role of leading factors, institutions and policy frameworks in shaping the organization and governance of international health since the mid-20th century;
Analyze how historical, political, and economic factors have and are shaping, maintaining and reforming health and health care systems;
Apply scientific methods to plan, scale up and/or evaluate interventions to improve determinants of health and health systems;
Discuss the major causes of disease burden, the pattern and variability in health issues around the globe, as well as think critically about the magnitude and complex nature of global health challenges and ways to address them;
Identify and describe the world's most significant diseases, injuries and risk factors, including their causes, symptoms, treatment, prevention, and associated risk factors;
Elaborate on specific topics such as: defining and quantifying health, measuring mortality and trends in adult and child mortality, diseases and risk factors in populations, the epidemiological transition, health inequalities, framework for health systems performance assessment, financing of health care;
Compare and contrast the health status of different populations with respect to their disease burden, epidemics, human resources for health, organization and quality of health care delivery, health reforms;
Describe the rationale, conceptual and historical basis of population health measurement;
Critically examine different measures of population health and health system performance;
Compare and contrast the main sources of information on population health and health system performance;
Apply and develop statistical methods and analytic techniques;
Demonstrate proficiency in at least two statistical packages, e.g. STATA, R, etc.;
Demonstrate proficiency in analyzing large survey datasets and compute quantities of interest while taking into consideration complex sampling frames;
Exhibit knowledge and technical acumen of a number of statistical models including, but not limited to: linear regression, logic and profit models, count models, hierarchical models;
Calculate and interpret important health statistics such as disease incidence and prevalence, maternal mortality rates and ratios, disability‐adjusted life years, attributable burden, and avoidable risk;
Analyze systematically the evidence presented in published research on global health problems, potential solutions, system barriers and political/economic dimensions, using appropriate techniques and methods;
Describe and explain the use of health metrics in health policy, planning and priority setting;
State and interpret the concepts and steps in designing impact evaluation studies;
Describe and critique select high‐profiled impact evaluation studies in global health;
Apply appropriate methods to control for confounding in evaluation studies;
Demonstrate ability to implement statistical methods used in evaluation studies including various types of matching, instrumental variables and panel regression;
Distinguish between the various types of evaluation studies and recognize the circumstances that they should be used in;
Describe the key steps in survey design, list the main types of surveys and distinguish the advantages and disadvantages of each one. Categorize the bias present in available data sources for evaluation studies and demonstrate ability to correct for it using statistical techniques;
Demonstrate ability to communicate effectively in oral and written format, and to lay and professional audiences;
Use appropriately on‐line resources to perform comprehensive literature reviews;
Demonstrate ability to organize and construct grant proposals and scientific papers; and
Critique journal articles.""".split('\n'):
    corpus_hme.append(txt)
    label_hme.append('GH-HME')



In [4]:

    
import sklearn.feature_extraction
vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df=1)



In [5]:

    
X = vectorizer.fit_transform(corpus_hme + corpus_biostats)
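
As a rough picture of what `fit_transform` produces here, the following pure-Python sketch imitates the default `CountVectorizer` behavior as I understand it: lowercase the text, keep tokens of two or more word characters, and count them per document. The helper names `tokenize` and `count_matrix` are hypothetical. The real vectorizer returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens of two or more characters."""
    return TOKEN_PATTERN.findall(text.lower())

def count_matrix(documents):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set(token for c in counts for token in c))
    rows = [[c[token] for token in vocabulary] for c in counts]
    return vocabulary, rows

# Two competency statements copied from the corpora above.
docs = ["Determine the sample size needed for a study; and",
        "Critique journal articles."]
vocabulary, rows = count_matrix(docs)
print(vocabulary)
print(rows)
```

Note that one-letter words such as "a" are dropped, while common words such as "the", "for", and "and" are kept as features.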



In [17]:

    
y = np.array(label_hme + label_biostats)



In [18]:

    
import sklearn.linear_model

Try it out with an arbitrary value of C:



In [29]:

    
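clf = sklearn.linear_model.LogisticRegression(penalty='l2', C=10)
clf.fit(X, y)

# clf.classes_ is sorted, so positive coefficients favor 'GH-HME' and negative ones 'Biostats'.
# (Newer scikit-learn releases rename get_feature_names to get_feature_names_out.)
feature_names = vectorizer.get_feature_names()

# Words that push a prediction toward GH-HME
print(' '.join(feature_names[i] for i in np.where(clf.coef_ > .5)[1]))
print('')
# Words that push a prediction toward Biostats
print(' '.join(feature_names[i] for i in np.where(clf.coef_ < -.5)[1]))
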
apply articles critique curriculum demonstrate describe dgh evaluation health in journal methods models studies

analysis appropriate confidence continuous data explain for hypothesis interpret intervals or precision results sample select size study tests



In [30]:

    
import sklearn.cross_validation, pandas as pd



In [31]:

    
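C_list = [.01, .1, 1., 10., 100.]

# One list of per-fold accuracies for each candidate C
scores = {}
for C in C_list:
    scores[C] = []

# 10 stratified folds; every C is evaluated on the same train/test splits
cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10)
for train, test in cv:
    for C in C_list:
        clf = sklearn.linear_model.LogisticRegression(penalty='l2', C=C)
        clf.fit(X[train], y[train])

        # Out-of-sample accuracy on the held-out fold
        y_pred = clf.predict(X[test])
        scores[C].append(np.mean(y_pred == y[test]))

# Mean accuracy across folds, indexed by C
pd.DataFrame(scores).mean()

    Out[31]:
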
0.01      0.640000
0.10      0.721667
1.00      0.848333
10.00     0.848333
100.00    0.848333
dtype: float64
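
The `sklearn.cross_validation` module used above was later replaced by `sklearn.model_selection`. The sketch below shows roughly the same search with the newer API. It runs on a small made-up corpus (`toy_text` and `toy_y` are hypothetical stand-ins, not the competency data) and uses 3 folds because the toy data is tiny. Newer releases also use a different default solver, so numbers from a rerun on the real data may not match the table above exactly.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical toy corpus: six short statements per class.
toy_text = ["select appropriate statistical tests for the data",
            "interpret confidence intervals and hypothesis tests",
            "determine the sample size needed for a study",
            "explain power and precision in study designs",
            "use regression analysis for binary outcomes",
            "develop a statistical analysis plan",
            "describe global causes of morbidity and mortality",
            "analyze health systems and health policy",
            "evaluate interventions to improve health systems",
            "describe the use of health metrics in planning",
            "critique impact evaluation studies in global health",
            "compare population health across countries"]
toy_y = np.array(['Biostats'] * 6 + ['GH-HME'] * 6)

toy_X = CountVectorizer(min_df=1).fit_transform(toy_text)

# The folds are now built by passing X and y at split time, not in the constructor.
cv = StratifiedKFold(n_splits=3)

mean_accuracy = {}
for C in [.01, .1, 1., 10., 100.]:
    clf = LogisticRegression(C=C)  # the L2 penalty is the default
    fold_scores = cross_val_score(clf, toy_X, toy_y, cv=cv, scoring='accuracy')
    mean_accuracy[C] = fold_scores.mean()

for C in sorted(mean_accuracy):
    print('%g: %.3f' % (C, mean_accuracy[C]))
```

`cross_val_score` refits a fresh copy of the classifier on each training split, so it plays the role of the explicit loop over `train` and `test` above.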



In [35]:

    
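clf = sklearn.linear_model.LogisticRegression(penalty='l2', C=1)
clf.fit(X, y)

# Refit on the whole dataset at the chosen C, with a lower threshold on the coefficients.
# (Newer scikit-learn releases rename get_feature_names to get_feature_names_out.)
feature_names = vectorizer.get_feature_names()

# Words that push a prediction toward GH-HME
print(' '.join(feature_names[i] for i in np.where(clf.coef_ > .1)[1]))
print('')
# Words that push a prediction toward Biostats
print(' '.join(feature_names[i] for i in np.where(clf.coef_ < -.1)[1]))
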
ability analytic and apply appropriately articles at comprehensive control core critique curriculum demonstrate describe dgh distinguish etc evaluation factors generic health historical impact in journal knowledge learning least line literature meet methods models objectives packages performance population proficiency resources reviews risk scientific stata steps studies system systems types used using various

analysis appropriate association binary both categorical censored comparisons confidence continuous count data designs determine displays event explain for from graphical hypothesis independent interpret intervals logic measures medical needed nominal numerical of or other outcomes power precision public quantitative regression results roles sample select size standard statistical study summaries tests the time to variables

How did it do?

Not perfect, but not hopeless. Some of the words make sense, but many of them (for example, "for", "of", "the", and "to") should have been treated as stop words and ignored. About 85% cross-validated accuracy (for $C \ge 1$) is pretty good, actually, and I'm not sure much better is possible.
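
One way to put that accuracy in context is to compare it with always guessing the more common label. The counts below are read off the two corpora defined earlier in this notebook, assuming the text has not changed since: 32 GH-HME statements and 17 Biostats statements.

```python
# Counts taken from the corpora defined above (assumed unchanged).
n_hme = 32        # statements in corpus_hme
n_biostats = 17   # statements in corpus_biostats (6 core + 11 MPH in Biostatistics)

# Accuracy of a "model" that always predicts the majority class.
majority_accuracy = max(n_hme, n_biostats) / float(n_hme + n_biostats)
print(round(majority_accuracy, 3))
```

This is close to the 0.64 score at $C = 0.01$, which suggests the most heavily penalized model is doing little more than predicting the majority class. The models with $C \ge 1$ improve on that by roughly 20 percentage points.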

What are the most promising directions to explore next?

Data cleaning: by using different options in the vectorizer, I can probably get results that I like better.

Non-linear models: kernelized SVMs can explore all interactions, but will not be easily interpretable; decision trees might strike a more interesting balance.
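
For the data-cleaning direction, here is a small sketch of two vectorizer options that might help: `stop_words='english'` drops common function words, and `min_df=2` drops words that appear in only one document. The choice of these two settings is my assumption about a reasonable first try, not something tested on the full data. The three statements are copied from the Biostats corpus above.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_text = ["Determine the sample size needed for a study; and",
            "Explain the roles of sample size, power, and precision in standard study designs.",
            "Critique journal articles."]

# Settings used in this notebook: keep every token of two or more characters.
plain = CountVectorizer(min_df=1).fit(toy_text)

# Candidate cleanup: remove English stop words and words seen in fewer than 2 documents.
cleaned = CountVectorizer(stop_words='english', min_df=2).fit(toy_text)

print(sorted(plain.vocabulary_))
print(sorted(cleaned.vocabulary_))
```

On the real corpus, `min_df` would need care: with only 49 short statements, a high threshold could discard most of the informative vocabulary.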


