In [32]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')
Name: Abraham D. Flaxman
Date: 2/11/2015
I will use penalized logistic regression to distinguish 10+10 exercise text from Biostats competencies, and use the coefficient values to identify what is and is not Biostats in your proposed projects.
The text comes from the 10+10 exercise and from the SPH degree competencies website (https://sph.washington.edu/prospective/programs/competencies.asp).
First, some text processing to gather all of this text from various files into one place.
In [2]:
# load all 10+10 exercises
import glob
collected_work_dirs = glob.glob('/projects/46a44406-fa15-4d4a-98a8-3133426c492b/example_course-collect/Week_5/*/')
In [3]:
def extract_project_idea(txt):
    """ The final bit of the 10+10.md file is the selected project idea
    Here is a hacky way to get it.
    """
    txt = txt.split("==================================================================================")
    #print txt
    return txt[-1]
#extract_project_idea(txt)
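The split above depends on the separator matching character for character. A slightly more forgiving variant could look like the sketch below. It assumes that the separator is any line made only of `=` characters (ten or more), which is a guess about the exercise template; `extract_last_section` and `SEPARATOR` are hypothetical names, not part of the notebook.

```python
import re

# Assumption: a separator is a line consisting only of '=' characters
# (ten or more). The real template may use a different length.
SEPARATOR = re.compile(r'^={10,}\s*$', flags=re.MULTILINE)

def extract_last_section(txt):
    """Return the text after the last separator line, stripped of
    surrounding whitespace (the whole text if there is no separator)."""
    return SEPARATOR.split(txt)[-1].strip()

example = "Idea 1\nIdea 2\n" + "=" * 40 + "\nMy selected project idea\n"
print(extract_last_section(example))
```

One caveat: a Markdown heading underlined with a long run of `=` would also count as a separator under this pattern.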
For each directory in the collected work list, load its 10+10.md file, extract the project idea, and append it to the corpus, labeling each result AI4HM.
In [4]:
corpus_ai4hm = []
label_ai4hm = []
for dir_name in collected_work_dirs:
    fname = dir_name + '10+10.md'
    with open(fname) as f:
        txt = f.read()
    txt = extract_project_idea(txt)
    if len(txt.strip()) > 0:
        corpus_ai4hm.append(txt)
        label_ai4hm.append('AI4HM')
In [5]:
# how many did we get?
len(corpus_ai4hm)
Out[5]:
In [6]:
# have a look at one
print(np.random.choice(corpus_ai4hm))
Now load the Biostats competencies:
In [8]:
corpus_biostats = []
label_biostats = []
# Biostatistics competencies for all MPH Students
for txt in """Select and interpret appropriate graphical displays and numerical summaries for both quantitative and categorical data.
Explain the logic and interpret the results of statistical hypothesis tests and confidence intervals.
Select appropriate measures of association of nominal and continuous variables.
Select appropriate methods for statistical inference to compare one group to a standard, or two or more groups to each other.
Develop or evaluate a statistical analysis plan to address the major research questions of a public health or biomedical study based on the data collected and the design of the study.
Explain the roles of sample size, power, and precision in standard study designs.""".split('\n'):
    corpus_biostats.append(txt)
    label_biostats.append('Biostats')
In [9]:
# Upon satisfactory completion of the MPH in Biostatistics, graduates will be able to:
for txt in """Meet the generic SPH learning objectives for the MPH degree;
Meet the Core-Specific Learning Objectives for all MPH students;
Select and interpret appropriate graphical displays and numerical summaries for both quantitative and categorical data;
Explain the logic and interpret the results of statistical hypothesis tests and confidence intervals;
Select appropriate methods for statistical inference to compare one group to a standard, or two or more groups to each other;
Use appropriate statistical techniques to perform multiple comparisons, to account for confounding or to gain precision;
Use appropriate regression analysis techniques for continuous, binary, count and censored-time-to-event outcomes to analyze independent data from medical and other public health studies;
Explain different modeling strategies employed in regression analysis, depending on whether the purpose of the analysis is to develop a predictive model or to make adjusted comparisons;
Develop or evaluate a statistical analysis plan to address the major research questions of a biomedical study based on the data collected and the design of the study;
Determine the sample size needed for a study; and
Communicate the aims and results of regression analyses of continuous, binary, count and censored-time-to-event outcomes, to an audience of non-statistician collaborators, including a full interpretation of relevant parameter estimates.""".split('\n'):
    corpus_biostats.append(txt)
    label_biostats.append('Biostats')
In [10]:
len(corpus_biostats)
Out[10]:
In [11]:
corpus_hme = []
label_hme = []
# Upon satisfactory completion of the MPH in Global Health, HME track, graduates will be able to:
for txt in """Meet the generic SPH learning objectives for the MPH degree;
Meet the Core-Specific Learning Objectives for all MPH students;
Meet the generic learning objectives of the DGH core curriculum:
Describe the most common causes of morbidity and mortality globally, both communicable and non-communicable, among newborns, children, adolescents, women, and men and apply this knowledge in the design, implementation, or evaluation of health services or programs;
Describe the major components of health information systems (e.g., surveillance, national registries, surveys, administrative data) and some of the uses, challenges and limitations of gathering and using health statistics;
Analyze the role of leading factors, institutions and policy frameworks in shaping the organization and governance of international health since the mid-20th century;
Analyze how historical, political, and economic factors have and are shaping, maintaining and reforming health and health care systems;
Apply scientific methods to plan, scale up and/or evaluate interventions to improve determinants of health and health systems;
Discuss the major causes of disease burden, the pattern and variability in health issues around the globe, as well as think critically about the magnitude and complex nature of global health challenges and ways to address them;
Identify and describe the world's most significant diseases, injuries and risk factors, including their causes, symptoms, treatment, prevention, and associated risk factors;
Elaborate on specific topics such as: defining and quantifying health, measuring mortality and trends in adult and child mortality, diseases and risk factors in populations, the epidemiological transition, health inequalities, framework for health systems performance assessment, financing of health care;
Compare and contrast the health status of different populations with respect to their disease burden, epidemics, human resources for health, organization and quality of health care delivery, health reforms;
Describe the rationale, conceptual and historical basis of population health measurement;
Critically examine different measures of population health and health system performance;
Compare and contrast the main sources of information on population health and health system performance;
Apply and develop statistical methods and analytic techniques;
Demonstrate proficiency in at least two statistical packages, e.g. STATA, R, etc.;
Demonstrate proficiency in analyzing large survey datasets and compute quantities of interest while taking into consideration complex sampling frames;
Exhibit knowledge and technical acumen of a number of statistical models including, but not limited to: linear regression, logic and profit models, count models, hierarchical models;
Calculate and interpret important health statistics such as disease incidence and prevalence, maternal mortality rates and ratios, disability‐adjusted life years, attributable burden, and avoidable risk;
Analyze systematically the evidence presented in published research on global health problems, potential solutions, system barriers and political/economic dimensions, using appropriate techniques and methods;
Describe and explain the use of health metrics in health policy, planning and priority setting;
State and interpret the concepts and steps in designing impact evaluation studies;
Describe and critique select high‐profiled impact evaluation studies in global health;
Apply appropriate methods to control for confounding in evaluation studies;
Demonstrate ability to implement statistical methods used in evaluation studies including various types of matching, instrumental variables and panel regression;
Distinguish between the various types of evaluation studies and recognize the circumstances that they should be used in;
Describe the key steps in survey design, list the main types of surveys and distinguish the advantages and disadvantages of each one. Categorize the bias present in available data sources for evaluation studies and demonstrate ability to correct for it using statistical techniques;
Demonstrate ability to communicate effectively in oral and written format, and to lay and professional audiences;
Use appropriately on‐line resources to perform comprehensive literature reviews;
Demonstrate ability to organize and construct grant proposals and scientific papers; and
Critique journal articles.""".split('\n'):
    corpus_hme.append(txt)
    label_hme.append('GH-HME')
Let's see what we've got:
In [12]:
pd.Series(label_ai4hm + label_biostats + label_hme).value_counts()
Out[12]:
In [13]:
print('\n'.join(corpus_hme))
In [14]:
import sklearn.feature_extraction
vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df=1)
In [15]:
X = vectorizer.fit_transform(corpus_ai4hm + corpus_biostats)
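To make the features concrete, here is a rough pure-Python imitation of what `CountVectorizer` counts for a single text under its default settings: lowercase everything, then keep tokens of two or more word characters. It is only a sketch. The real vectorizer builds one shared vocabulary over the whole corpus and returns a sparse matrix with one column per term. `ngram_counts` is a hypothetical helper name. Since `min_df=1` is the default, no term is dropped for being rare.

```python
import re
from collections import Counter

# Imitates CountVectorizer's default tokenization: lowercase the text and
# keep tokens made of two or more word characters.
TOKEN = re.compile(r'(?u)\b\w\w+\b')

def ngram_counts(text, ngram_range=(1, 1)):
    """Count word n-grams for every n in the inclusive ngram_range."""
    tokens = TOKEN.findall(text.lower())
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for start in range(len(tokens) - n + 1):
            counts[' '.join(tokens[start:start + n])] += 1
    return counts

sentence = "Determine the sample size needed for a study"
print(sorted(ngram_counts(sentence).items()))
print(sorted(ngram_counts(sentence, ngram_range=(1, 2)).items()))
```

The one-letter word "a" is discarded before n-grams are formed, so with `ngram_range=(1, 2)` the bigram "for study" appears even though those words are not adjacent in the sentence.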
In [16]:
np.random.choice(vectorizer.get_feature_names(), size=10)
Out[16]:
In [17]:
y = label_ai4hm + label_biostats
I can't resist, let's see what it does:
In [18]:
import sklearn.linear_model
In [22]:
clf = sklearn.linear_model.LogisticRegression(penalty='l1', C=10)
clf.fit(X, y)
np.where(clf.coef_ != 0)
Out[22]:
In [26]:
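# look up the vocabulary once, then print the words with clearly positive coefficients
feature_names = vectorizer.get_feature_names()
print(' '.join(feature_names[i] for i in np.where(clf.coef_ > .1)[1]))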
In [24]:
np.where(clf.coef_ < -.1)
Out[24]:
In [25]:
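# look up the vocabulary once, then print the words with clearly negative coefficients
feature_names = vectorizer.get_feature_names()
print(' '.join(feature_names[i] for i in np.where(clf.coef_ < -.1)[1]))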
Well, that did not do anything interesting...
But I will not give up yet.
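For reference, the two penalties tried in this notebook correspond to the objectives below, written in the form I recall from the scikit-learn user guide. The notation is the guide's, not the notebook's: labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, weights $w$, intercept $c$. Details such as how the solver treats the intercept may differ.

```latex
\text{L1: } \min_{w, c} \; \|w\|_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (x_i^{T} w + c)\right) + 1\right)

\text{L2: } \min_{w, c} \; \tfrac{1}{2} w^{T} w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (x_i^{T} w + c)\right) + 1\right)
```

Because `C` multiplies the data-fit term, a larger `C` means weaker regularization. The L1 penalty tends to set many coefficients exactly to zero, which is why checking `clf.coef_ != 0` is informative after the L1 fit. The L2 penalty shrinks coefficients but rarely zeroes them, so thresholds such as `.1` are used to pick out the larger ones.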
In [27]:
clf = sklearn.linear_model.LogisticRegression(penalty='l2', C=10)
clf.fit(X, y)
np.where(clf.coef_ > .1)[1]
Out[27]:
In [28]:
np.where(clf.coef_ < -.1)[1]
Out[28]:
In [29]:
# words with clearly positive coefficients, separated by ' | '
feature_names = vectorizer.get_feature_names()
print(' | '.join(feature_names[i] for i in np.where(clf.coef_ > .1)[1]))
In [30]:
# words with clearly negative coefficients, separated by ' | '
feature_names = vectorizer.get_feature_names()
print(' | '.join(feature_names[i] for i in np.where(clf.coef_ < -.1)[1]))
That was fun, though perhaps useless. I'll play around a little more.
In [ ]:
# use single words and two-word phrases as features
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus_ai4hm + corpus_biostats)
y = label_ai4hm + label_biostats
clf = sklearn.linear_model.LogisticRegression(penalty='l2', C=100)
clf.fit(X, y)
feature_names = vectorizer.get_feature_names()
# positive coefficients push toward clf.classes_[1] ('Biostats', since labels
# are sorted), negative ones toward clf.classes_[0] ('AI4HM')
print('class A')
print(' | '.join(feature_names[i] for i in np.where(clf.coef_ > .1)[1]))
print('')
print('class B')
print(' | '.join(feature_names[i] for i in np.where(clf.coef_ < -.1)[1]))
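The lists printed as 'class A' and 'class B' are easier to read once each sign is tied to a label. For a binary scikit-learn `LogisticRegression`, positive coefficients push toward `clf.classes_[1]` and negative ones toward `clf.classes_[0]`. The labels are stored in sorted order, so here that should be 'Biostats' and 'AI4HM' respectively. The helper below is a sketch of that bookkeeping on plain Python sequences. `top_terms` is a hypothetical name, and the toy weights are made up for illustration.

```python
def top_terms(coef, feature_names, classes, k=5):
    """Return up to k of the most negative and most positive terms,
    keyed by the class that each sign points toward.

    coef: one weight per feature (for a fitted binary scikit-learn
        LogisticRegression this would be clf.coef_[0]).
    classes: the two labels in sorted order (clf.classes_).
    """
    ranked = sorted(zip(coef, feature_names))
    negative = [name for weight, name in ranked[:k] if weight < 0]
    positive = [name for weight, name in reversed(ranked[-k:]) if weight > 0]
    return {classes[0]: negative, classes[1]: positive}

# Made-up weights for illustration only; these are not results from the notebook.
toy_names = ['app', 'data', 'regression', 'sample size', 'sensor']
toy_coef = [-0.8, 0.05, 1.2, 0.9, -0.3]
print(top_terms(toy_coef, toy_names, ['AI4HM', 'Biostats'], k=2))
```

Ranking by magnitude, rather than cutting at a fixed threshold, keeps the output the same length when `C` changes the overall scale of the coefficients.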
In [66]:
i=0
In [67]:
print(i)
print(corpus_ai4hm[i])
In [64]:
i += 1
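While paging through the project ideas one at a time, it may help to see which words drive the classifier's score for a given text, since the stated goal is to use the coefficients to find what is and is not Biostats in each project. With count features, a linear model's score is the intercept plus the sum of count times weight over the terms, so it splits cleanly into per-term pieces. The sketch below assumes single-word features only. `word_contributions` is a hypothetical helper, and the weights are invented. With the fitted model, the mapping could be built as `dict(zip(vectorizer.get_feature_names(), clf.coef_[0]))` and the intercept taken from `clf.intercept_[0]`.

```python
import re

# Same default tokenization idea as CountVectorizer: lowercase, 2+ word chars.
TOKEN = re.compile(r'\b\w\w+\b')

def word_contributions(text, weight_by_term, intercept=0.0):
    """Split a linear model's score for one text into per-term pieces.

    Returns (score, pieces), where pieces is a list of (term, count * weight)
    sorted from most negative to most positive contribution.
    """
    counts = {}
    for token in TOKEN.findall(text.lower()):
        if token in weight_by_term:
            counts[token] = counts.get(token, 0) + 1
    pieces = {term: n * weight_by_term[term] for term, n in counts.items()}
    score = intercept + sum(pieces.values())
    return score, sorted(pieces.items(), key=lambda item: item[1])

# Hypothetical weights, not fitted values from the notebook.
toy_weights = {'regression': 1.2, 'sample': 0.9, 'app': -0.8, 'sensor': -0.3}
score, pieces = word_contributions(
    "An app that uses sensor data and regression, regression everywhere",
    toy_weights, intercept=-0.1)
print(score)
print(pieces)
```

A positive score leans toward the second class in sorted order ('Biostats' here), and the sorted pieces show which words pulled the text in each direction.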