In [32]:

    
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')

Data Cleaning Notebook

Name: Abraham D. Flaxman

Date: 2/11/2015

A Brief Description of the Data

I will use penalized logistic regression to distinguish 10+10 exercise text from Biostats competencies, and use the coefficient values to identify what is and is not Biostats in your proposed projects.

What is available

Text from the 10+10 exercise and from the SPH competencies-for-degrees website (https://sph.washington.edu/prospective/programs/competencies.asp)

What is needed

A bunch of text processing to get all this text from various files into one place.



In [2]:

    
# load all 10+10 exercises
import glob

collected_work_dirs = glob.glob('/projects/46a44406-fa15-4d4a-98a8-3133426c492b/example_course-collect/Week_5/*/')



In [3]:

    
def extract_project_idea(txt):
    """ The final bit of the 10+10.md file is the selected project idea
    
    Here is a hacky way to get it.
    """
    
    txt = txt.split("==================================================================================")
    #print txt
    return txt[-1]
    
#extract_project_idea(txt)
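As a quick check of the splitting idea, here is a minimal sketch on a made-up file body. It assumes the divider is simply a long run of `=` characters and matches it with a regular expression rather than the exact divider string; the name `extract_last_section`, the `min_len` parameter, and the sample text are all hypothetical, and the code is written for Python 3.

```python
import re

def extract_last_section(txt, min_len=10):
    """Return the text after the last run of at least `min_len` '=' characters."""
    sections = re.split(r"={%d,}" % min_len, txt)
    return sections[-1].strip()

# Made-up stand-in for a 10+10.md file: ideas separated by divider lines,
# with the selected project idea at the very end.
sample = (
    "Idea 1: cluster causes of death\n"
    + "=" * 40 + "\n"
    + "Idea 2: forecast expenditure\n"
    + "=" * 40 + "\n"
    + "Selected project: forecast expenditure with ML regression\n"
)

print(extract_last_section(sample))
```

If nothing follows the last divider, the result is an empty string, which is the case a `len(txt.strip()) > 0` check guards against.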

For each directory in the collected-work list, load the 10+10.md file, extract the project idea, and append it to the corpus, labeling each result AI4HM.



In [4]:

    
corpus_ai4hm = []
label_ai4hm = []

for dir_name in collected_work_dirs:
    fname = dir_name + '10+10.md'
    with open(fname) as f:
        txt = f.read()
        txt = extract_project_idea(txt)
        
        if len(txt.strip()) > 0: 
            corpus_ai4hm.append(txt)
            label_ai4hm.append('AI4HM')



In [5]:

    
# how many did we get?
len(corpus_ai4hm)









    Out[5]:





14



In [6]:

    
# have a look at one

print np.random.choice(corpus_ai4hm)









    




The DEX team works with healthcare expenditure and volume data on a patient-level basis that comes from a variety of different sources and settings. This variability, along with sparseness and the wide variety of trends in the data present a challenge when trying to produce an overarching regression technique. Machine Learning Regression would allow cross-validation and would result in models that are unique to individual causes and functions, improving our estimates and alleviating the oversmoothing currently encountered.

Now load the biostats competencies:



In [8]:

    
corpus_biostats = []
label_biostats = []

# Biostatistics competencies for all MPH Students

for txt in """Select and interpret appropriate graphical displays and numerical summaries for both quantitative and categorical data.
Explain the logic and interpret the results of statistical hypothesis tests and confidence intervals.
Select appropriate measures of association of nominal and continuous variables.
Select appropriate methods for statistical inference to compare one group to a standard, or two or more groups to each other.
Develop or evaluate a statistical analysis plan to address the major research questions of a public health or biomedical study based on the data collected and the design of the study.
Explain the roles of sample size, power, and precision in standard study designs.""".split('\n'):
    corpus_biostats.append(txt)
    label_biostats.append('Biostats')



In [9]:

    
# Upon satisfactory completion of the MPH in Biostatistics, graduates will be able to:

for txt in """Meet the generic SPH learning objectives for the MPH degree;
Meet the Core-Specific Learning Objectives for all MPH students;
Select and interpret appropriate graphical displays and numerical summaries for both quantitative and categorical data;
Explain the logic and interpret the results of statistical hypothesis tests and confidence intervals;
Select appropriate methods for statistical inference to compare one group to a standard, or two or more groups to each other;
Use appropriate statistical techniques to perform multiple comparisons, to account for confounding or to gain precision;
Use appropriate regression analysis techniques for continuous, binary, count and censored-time-to-event outcomes to analyze independent data from medical and other public health studies;
Explain different modeling strategies employed in regression analysis, depending on whether the purpose of the analysis is to develop a predictive model or to make adjusted comparisons;
Develop or evaluate a statistical analysis plan to address the major research questions of a biomedical study based on the data collected and the design of the study;
Determine the sample size needed for a study; and
Communicate the aims and results of regression analyses of continuous, binary, count and censored-time-to-event outcomes, to an audience of non-statistician collaborators, including a full interpretation of relevant parameter estimates.""".split('\n'):
    corpus_biostats.append(txt)
    label_biostats.append('Biostats')



In [10]:

    
len(corpus_biostats)









    Out[10]:





17



In [11]:

    
corpus_hme = []
label_hme = []

# Upon satisfactory completion of the MPH in Global Health, HME track, graduates will be able to:

for txt in """Meet the generic SPH learning objectives for the MPH degree;
Meet the Core-Specific Learning Objectives for all MPH students;
Meet the generic learning objectives of the DGH core curriculum:
Describe the most common causes of morbidity and mortality globally, both communicable and non-communicable, among newborns, children, adolescents, women, and men and apply this knowledge in the design, implementation, or evaluation of health services or programs;
Describe the major components of health information systems (e.g., surveillance, national registries, surveys, administrative data) and some of the uses, challenges and limitations of gathering and using health statistics;
Analyze the role of leading factors, institutions and policy frameworks in shaping the organization and governance of international health since the mid-20th century;
Analyze how historical, political, and economic factors have and are shaping, maintaining and reforming health and health care systems;
Apply scientific methods to plan, scale up and/or evaluate interventions to improve determinants of health and health systems;
Discuss the major causes of disease burden, the pattern and variability in health issues around the globe, as well as think critically about the magnitude and complex nature of global health challenges and ways to address them;
Identify and describe the world's most significant diseases, injuries and risk factors, including their causes, symptoms, treatment, prevention, and associated risk factors;
Elaborate on specific topics such as: defining and quantifying health, measuring mortality and trends in adult and child mortality, diseases and risk factors in populations, the epidemiological transition, health inequalities, framework for health systems performance assessment, financing of health care;
Compare and contrast the health status of different populations with respect to their disease burden, epidemics, human resources for health, organization and quality of health care delivery, health reforms;
Describe the rationale, conceptual and historical basis of population health measurement;
Critically examine different measures of population health and health system performance;
Compare and contrast the main sources of information on population health and health system performance;
Apply and develop statistical methods and analytic techniques;
Demonstrate proficiency in at least two statistical packages, e.g. STATA, R, etc.;
Demonstrate proficiency in analyzing large survey datasets and compute quantities of interest while taking into consideration complex sampling frames;
Exhibit knowledge and technical acumen of a number of statistical models including, but not limited to: linear regression, logic and profit models, count models, hierarchical models;
Calculate and interpret important health statistics such as disease incidence and prevalence, maternal mortality rates and ratios, disability‐adjusted life years, attributable burden, and avoidable risk;
Analyze systematically the evidence presented in published research on global health problems, potential solutions, system barriers and political/economic dimensions, using appropriate techniques and methods;
Describe and explain the use of health metrics in health policy, planning and priority setting;
State and interpret the concepts and steps in designing impact evaluation studies;
Describe and critique select high‐profiled impact evaluation studies in global health;
Apply appropriate methods to control for confounding in evaluation studies;
Demonstrate ability to implement statistical methods used in evaluation studies including various types of matching, instrumental variables and panel regression;
Distinguish between the various types of evaluation studies and recognize the circumstances that they should be used in;
Describe the key steps in survey design, list the main types of surveys and distinguish the advantages and disadvantages of each one. Categorize the bias present in available data sources for evaluation studies and demonstrate ability to correct for it using statistical techniques;
Demonstrate ability to communicate effectively in oral and written format, and to lay and professional audiences;
Use appropriately on‐line resources to perform comprehensive literature reviews;
Demonstrate ability to organize and construct grant proposals and scientific papers; and
Critique journal articles.""".split('\n'):
    corpus_hme.append(txt)
    label_hme.append('GH-HME')

Let's see what we've got:



In [12]:

    
pd.Series(label_ai4hm + label_biostats + label_hme).value_counts()









    Out[12]:





GH-HME      32
Biostats    17
AI4HM       14
dtype: int64



In [13]:

    
print '\n'.join(corpus_hme)









    



Meet the generic SPH learning objectives for the MPH degree;
Meet the Core-Specific Learning Objectives for all MPH students;
Meet the generic learning objectives of the DGH core curriculum:
Describe the most common causes of morbidity and mortality globally, both communicable and non-communicable, among newborns, children, adolescents, women, and men and apply this knowledge in the design, implementation, or evaluation of health services or programs;
Describe the major components of health information systems (e.g., surveillance, national registries, surveys, administrative data) and some of the uses, challenges and limitations of gathering and using health statistics;
Analyze the role of leading factors, institutions and policy frameworks in shaping the organization and governance of international health since the mid-20th century;
Analyze how historical, political, and economic factors have and are shaping, maintaining and reforming health and health care systems;
Apply scientific methods to plan, scale up and/or evaluate interventions to improve determinants of health and health systems;
Discuss the major causes of disease burden, the pattern and variability in health issues around the globe, as well as think critically about the magnitude and complex nature of global health challenges and ways to address them;
Identify and describe the world's most significant diseases, injuries and risk factors, including their causes, symptoms, treatment, prevention, and associated risk factors;
Elaborate on specific topics such as: defining and quantifying health, measuring mortality and trends in adult and child mortality, diseases and risk factors in populations, the epidemiological transition, health inequalities, framework for health systems performance assessment, financing of health care;
Compare and contrast the health status of different populations with respect to their disease burden, epidemics, human resources for health, organization and quality of health care delivery, health reforms;
Describe the rationale, conceptual and historical basis of population health measurement;
Critically examine different measures of population health and health system performance;
Compare and contrast the main sources of information on population health and health system performance;
Apply and develop statistical methods and analytic techniques;
Demonstrate proficiency in at least two statistical packages, e.g. STATA, R, etc.;
Demonstrate proficiency in analyzing large survey datasets and compute quantities of interest while taking into consideration complex sampling frames;
Exhibit knowledge and technical acumen of a number of statistical models including, but not limited to: linear regression, logic and profit models, count models, hierarchical models;
Calculate and interpret important health statistics such as disease incidence and prevalence, maternal mortality rates and ratios, disability‐adjusted life years, attributable burden, and avoidable risk;
Analyze systematically the evidence presented in published research on global health problems, potential solutions, system barriers and political/economic dimensions, using appropriate techniques and methods;
Describe and explain the use of health metrics in health policy, planning and priority setting;
State and interpret the concepts and steps in designing impact evaluation studies;
Describe and critique select high‐profiled impact evaluation studies in global health;
Apply appropriate methods to control for confounding in evaluation studies;
Demonstrate ability to implement statistical methods used in evaluation studies including various types of matching, instrumental variables and panel regression;
Distinguish between the various types of evaluation studies and recognize the circumstances that they should be used in;
Describe the key steps in survey design, list the main types of surveys and distinguish the advantages and disadvantages of each one. Categorize the bias present in available data sources for evaluation studies and demonstrate ability to correct for it using statistical techniques;
Demonstrate ability to communicate effectively in oral and written format, and to lay and professional audiences;
Use appropriately on‐line resources to perform comprehensive literature reviews;
Demonstrate ability to organize and construct grant proposals and scientific papers; and
Critique journal articles.

Next, get the text into a format more familiar to our class: a matrix of word counts, with one row per document.



In [14]:

    
import sklearn.feature_extraction
vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df=1)



In [15]:

    
X = vectorizer.fit_transform(corpus_ai4hm + corpus_biostats)
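To make the shape of `X` concrete, here is a small sketch of what `CountVectorizer` produces on three made-up sentences. The toy corpus and the `toy_` names are assumptions for illustration only; the code is written for Python 3 and reads the column order from the fitted `vocabulary_` attribute, which maps each word to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents standing in for the real corpus
toy_corpus = [
    "machine learning for mortality trends",
    "interpret confidence intervals",
    "regression for mortality",
]

toy_vectorizer = CountVectorizer(min_df=1)
toy_X = toy_vectorizer.fit_transform(toy_corpus)  # sparse matrix: documents x words

# vocabulary_ maps word -> column index; sort the words by that index
words = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_X.shape)      # (number of documents, number of distinct words)
print(words)
print(toy_X.toarray())  # each row holds the word counts for one document
```

Each row is one document and each column is one word, so the classifier only ever sees how often each word occurs, not the order of the words.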



In [16]:

    
np.random.choice(vectorizer.get_feature_names(), size=10)









    Out[16]:





array([u'representations', u'clusters', u'ages', u'working', u'adding',
       u'shifts', u'study', u'clinician', u'ihme', u'marie'], 
      dtype='<U17')



In [17]:

    
y = label_ai4hm + label_biostats

I can't resist; let's see what it does:



In [18]:

    
import sklearn.linear_model



In [22]:

    
clf = sklearn.linear_model.LogisticRegression(penalty='l1', C=10)
clf.fit(X, y)
np.where(clf.coef_ != 0)









    Out[22]:





(array([0, 0, 0, 0, 0, 0, 0, 0]),
 array([ 52,  92, 386, 786, 787, 803, 888, 897]))
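The output above is a pair of arrays because `clf.coef_` is two-dimensional: for a two-class problem it has a single row, so the first array is all zeros and the second holds the column (word) indices. The sketch below, on made-up documents, shows how to read the sign of a coefficient. The toy data and variable names are assumptions, the code is written for Python 3, and it uses the default L2 penalty to keep the example simple.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up documents and labels, mirroring the two classes in this notebook
docs = [
    "machine learning tree",
    "machine clustering",
    "hypothesis tests",
    "confidence intervals tests",
]
labels = ["AI4HM", "AI4HM", "Biostats", "Biostats"]

vec = CountVectorizer()
X_toy = vec.fit_transform(docs)
model = LogisticRegression(C=10).fit(X_toy, labels)

# Class labels are stored in sorted order; coef_ has shape (1, n_words)
print(model.classes_, model.coef_.shape)

# np.where on a 2-D array returns (row indices, column indices)
rows, cols = np.where(model.coef_ > 0)

# Positive coefficients push a document toward model.classes_[1]
names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
pos_words = [names[j] for j in cols]
print(model.classes_[1], "words:", pos_words)
```

With the labels used in this notebook, `classes_[1]` is `'Biostats'`, so positive coefficients mark Biostats-like words and negative ones mark AI4HM-like words. Recent scikit-learn releases also need a solver that supports the L1 penalty, such as `solver='liblinear'`, when `penalty='l1'` is requested.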



In [26]:

    
for i in np.where(clf.coef_ > .1)[1]:
    print vectorizer.get_feature_names()[i],



In [24]:

    
np.where(clf.coef_ < -.1)









    Out[24]:





(array([0, 0, 0, 0, 0]), array([ 92, 786, 803, 888, 897]))



In [25]:

    
for i in np.where(clf.coef_ < -.1)[1]:
    print vectorizer.get_feature_names()[i],









    



be that to with would

Well, that did not do anything interesting...

But I will not give up yet.
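One reason the selected words are uninteresting is that they are all common function words. A possible remedy, sketched here as an assumption rather than something this notebook does, is to drop English stop words when building the vocabulary. The two sentences are made up and the code is written for Python 3.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, one in each style
docs = [
    "It would be nice to try clustering with that data",
    "Explain the logic of hypothesis tests",
]

plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words="english").fit(docs)

# The words the L1 model picked out above
function_words = ["be", "that", "to", "with", "would"]
kept = [w for w in function_words if w in filtered.vocabulary_]

print(len(plain.vocabulary_), "words without filtering")
print(len(filtered.vocabulary_), "words with stop words removed")
print("function words still present:", kept)
```

Whether this helps is an empirical question: the way the two sources use function words may itself be a real signal of writing style, even if it says little about Biostats content.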



In [27]:

    
clf = sklearn.linear_model.LogisticRegression(penalty='l2', C=10)
clf.fit(X, y)
np.where(clf.coef_ > .1)[1]









    Out[27]:





array([ 28,  39,  50,  52,  60,  77, 107, 113, 126, 132, 166, 169, 173,
       179, 181, 185, 215, 218, 225, 226, 228, 243, 269, 283, 293, 311,
       328, 340, 343, 368, 401, 414, 416, 444, 456, 468, 477, 480, 484,
       494, 506, 513, 521, 528, 529, 533, 543, 593, 595, 599, 625, 626,
       630, 650, 669, 675, 682, 696, 723, 743, 745, 748, 752, 757, 759,
       766, 781, 784, 787, 838])



In [28]:

    
np.where(clf.coef_ < -.1)[1]









    Out[28]:





array([ 32,  48,  63,  67,  79,  92, 118, 120, 129, 150, 187, 191, 201,
       251, 316, 320, 327, 348, 377, 386, 399, 421, 423, 447, 449, 463,
       523, 535, 555, 569, 735, 786, 792, 797, 801, 814, 815, 842, 850,
       854, 871, 879, 886, 888, 897])



In [29]:

    
for i in np.where(clf.coef_ > .1)[1]:
    print vectorizer.get_feature_names()[i], '|',









    



adjusted | all | analysis | and | appropriate | association | binary | both | categorical | censored | compare | comparisons | confidence | continuous | core | count | degree | depending | designs | determine | develop | displays | employed | event | explain | for | generic | graphical | group | hypothesis | inference | interpret | intervals | learning | logic | make | measures | meet | methods | modeling | mph | needed | nominal | numerical | objectives | of | or | power | precision | predictive | public | purpose | quantitative | regression | results | roles | sample | select | size | sph | standard | statistical | strategies | students | study | summaries | techniques | tests | the | use |



In [30]:

    
for i in np.where(clf.coef_ < -.1)[1]:
    print vectorizer.get_feature_names()[i], '|',









    



age | an | are | as | at | be | by | can | causes | clustering | countries | covariates | currently | do | from | functions | generate | have | if | in | individual | is | it | level | like | machine | not | on | over | patterns | sources | that | these | this | time | tree | trends | using | validation | variety | we | when | will | with | would |

That was fun. But perhaps useless. But I'll play around a little more.



In [ ]:

    
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus_ai4hm + corpus_biostats)
y = label_ai4hm + label_biostats

clf = sklearn.linear_model.LogisticRegression(penalty='l2', C=100)
clf.fit(X, y)

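# clf.classes_ is sorted, so positive coefficients point toward 'Biostats'
print 'class A'
for i in np.where(clf.coef_ > .1)[1]:
    print vectorizer.get_feature_names()[i], '|',
    
print
print
# negative coefficients point toward 'AI4HM'
print 'class B'
for i in np.where(clf.coef_ < -.1)[1]:
    print vectorizer.get_feature_names()[i], '|',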









    



class A
account | account for | adjusted | adjusted comparisons | all | all mph | analysis | analysis depending | and | and categorical | and censored | and confidence | and continuous | and interpret | and numerical | and precision | appropriate | appropriate graphical | appropriate measures | appropriate methods | appropriate statistical | association | association of | binary | binary count | both | both quantitative | categorical | categorical data | censored | censored time | compare one | comparisons | comparisons to | confidence | confidence intervals | confounding | confounding or | continuous | continuous binary | continuous variables | core | core specific | count | count and | degree | depending | depending on | designs | determine | determine the | develop | develop predictive | different modeling | displays | displays and | each other | employed | employed in | event | event outcomes | explain | explain different | explain the | for | for all | for both | for confounding | for statistical | for study | for the | gain | gain precision | generic | generic sph | graphical | graphical displays | group to | groups to | hypothesis | hypothesis tests | in regression | in standard | inference | inference to | to account | to develop | to each | to event | to gain | to make | to perform | to standard | two or | use | use appropriate | whether the |

class B
about | age | are | as | at | be | by | can | causes | clustering | could | countries | covariates | currently | do | from | gbd | generate | have | if | in | in the | individual | is | it | level | like | machine | machine learning | not | on | over | over time | patterns | so | that | these | this | time | to the | tree | trends | using | variety | variety of | we | when | will | will be | with | would | would be |
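Long alphabetical lists like these are hard to scan. A minimal sketch of an alternative is to sort the features by coefficient and show only the strongest few for each class, labeled with the actual class names. The helper `top_features`, the toy documents, and the choice of `k` are assumptions for illustration; the code is written for Python 3.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up documents and labels
docs = [
    "machine learning over time",
    "machine learning tree",
    "hypothesis tests",
    "confidence intervals",
]
labels = ["AI4HM", "AI4HM", "Biostats", "Biostats"]

vec = CountVectorizer(ngram_range=(1, 2))  # single words and word pairs
X_toy = vec.fit_transform(docs)
model = LogisticRegression(C=10).fit(X_toy, labels)

def top_features(model, vec, k=3):
    """Return the k strongest features for each of the two classes."""
    names = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
    order = np.argsort(model.coef_[0])  # most negative coefficient first
    return {
        str(model.classes_[0]): [str(n) for n in names[order[:k]]],
        str(model.classes_[1]): [str(n) for n in names[order[::-1][:k]]],
    }

ranked = top_features(model, vec)
for label, words in ranked.items():
    print(label, "|", " | ".join(words))
```

Ranking by coefficient size instead of applying a fixed threshold of 0.1 keeps the lists short no matter how `C` is set.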



In [66]:

    
i=0



In [67]:

    
print i
print corpus_ai4hm[i]









    



0


In a perfect world, I would do:

--Spectral clustering (option 2), with
-- a fully nested distance metric (option 6), and
-- change-over-time covariates included, with the cluster analysis rerun for each year (of ICD10 data), option 9.

Realistically, I don't know that I'll have the time or the capacity to do that.  The first hurdle will definitely be a distance metric for the causes of death, and I would love your input on that.  I'll probably wind up just using k-means, but I would like to try including change-over-time covariates to see if they have a lasting effect on outputs.



In [64]:

    
i += 1
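Instead of stepping through the project ideas by hand, one could ask a fitted model which words in a given idea pull it toward each class. The sketch below multiplies a document's word counts by the coefficients to get a per-word contribution to the decision score. The training sentences, the sample idea, and the helper `word_contributions` are all made up for illustration, and the code is written for Python 3.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data standing in for the two corpora
train_docs = [
    "machine learning tree clustering",
    "trends over time",
    "hypothesis tests and confidence intervals",
    "sample size and power",
]
train_labels = ["AI4HM", "AI4HM", "Biostats", "Biostats"]

vec = CountVectorizer()
model = LogisticRegression(C=10).fit(vec.fit_transform(train_docs), train_labels)
names = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))

def word_contributions(text):
    """Map each known word in `text` to count * coefficient."""
    counts = vec.transform([text]).toarray()[0]
    contrib = counts * model.coef_[0]
    present = np.nonzero(counts)[0]
    return {str(names[j]): float(contrib[j]) for j in present}

idea = "clustering causes of death with hypothesis tests over time"
contrib = word_contributions(idea)

# Positive contributions push toward model.classes_[1] ('Biostats')
biostats_words = sorted(w for w, c in contrib.items() if c > 0)
other_words = sorted(w for w, c in contrib.items() if c < 0)
p_biostats = model.predict_proba(vec.transform([idea]))[0, 1]

print("Biostats-like words:", biostats_words)
print("AI4HM-like words:", other_words)
print("P(Biostats) = %.2f" % p_biostats)
```

Words that never appeared in the training text are ignored by `transform`, so with a corpus this small many words in a new idea contribute nothing at all.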



In [ ]: