Genentech Cervical Cancer Screening - Insights

https://www.kaggle.com/c/cervical-cancer-screening/

Cervical cancer is the third most common cancer in women worldwide, affecting over 500,000 women and resulting in approximately 275,000 deaths every year. Most women in the US have access to cervical cancer screening, yet 4,000 women die every year from cervical cancer in the US and it is estimated that 30% of US women do not receive regular pap screenings.

Prior research suggests that lower screening rates are associated with low income, low education, lack of interaction with the healthcare system, and lack of health insurance. But research also shows that even in women with access to healthcare fail to get this preventive test, indicating that barriers like lack of education and not being comfortable with the procedure are influencing their behavior.

If one could better identify women who are not being screened, education campaigns could target them with content that speaks directly to their unique risk factors. Identifying predictors of not receiving pap smears will provide important information to stakeholders in cervical cancer prevention who run awareness programs.

Below is one analysis for the Kaggle Genentech Cervical Cancer Screening competition.

The focus is on identifying healthy women who are not being screened.

By Paul Perry, Elena Cuoco, and Zygmunt Zając.

Load

This notebook is almost runnable in Kaggle scripts, but some features extractions would take too long, and we make use of geopandas for mapping functions. We use Postgres, but it would be easily swappable for Kaggle's sqlite3.



In [1]:

    
# imports
import sys # for stderr
import numpy as np
import pandas as pd
import sklearn as skl
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline



In [2]:

    
# settings 
plt.style.use('ggplot')
# plt.rcParams['figure.figsize'] = (10.0, 10.0)
# pd.set_option('display.max_rows', 50)
# pd.set_option('display.max_columns', 50)



In [3]:

    
# versions 
import sys
print(pd.datetime.now())
print('Python: '+sys.version)
print('numpy: '+np.__version__)
print('pandas: '+pd.__version__)
print('sklearn: '+skl.__version__)









    



2016-02-08 13:43:29.988998
Python: 2.7.11 |Anaconda 2.4.0 (x86_64)| (default, Dec  6 2015, 18:57:58) 
[GCC 4.2.1 (Apple Inc. build 5577)]
numpy: 1.10.2
pandas: 0.17.1
sklearn: 0.17



In [4]:

    
import psycopg2
# import sqlite3



In [5]:

    
db = psycopg2.connect("dbname='ccancer' user='paulperry' host='localhost' password=''")
#db = sqlite3.connect('../input/database.sqlite')



In [6]:

    
import datetime
start = datetime.datetime.now()
print(start)









    



2016-02-08 13:43:33.322108



In [7]:

    
fdir = './features/'



In [8]:

    
train_file = './input/patients_train.csv.gz'
train = pd.read_csv(train_file)
train.set_index('patient_id', inplace=True)
train.drop('patient_gender', axis = 1, inplace = True )
train_exclude = pd.read_csv('./input/train_patients_to_exclude.csv', header=None, names=['patient_id'])
train.drop(train_exclude.patient_id, inplace=True)
original_train_rows = train.shape[0]
print(train.shape)









    



(1157817, 6)



In [9]:

    
train[:5]









    Out[9]:






  
    
      
      patient_age_group
      patient_state
      ethinicity
      household_income
      education_level
      is_screener
    
    
      patient_id
      
      
      
      
      
      
    
  
  
    
      336201912
      51-53
      SD
      ALL OTHER
      UNKNOWN
      UNKNOWN
      1
    
    
      94237712
      39-41
      NE
      ALL OTHER
      UNKNOWN
      UNKNOWN
      1
    
    
      186124512
      24-26
      CA
      ALL OTHER
      UNKNOWN
      UNKNOWN
      0
    
    
      767144212
      27-29
      NY
      ALL OTHER
      UNKNOWN
      UNKNOWN
      1
    
    
      104743512
      30-32
      HI
      ALL OTHER
      UNKNOWN
      UNKNOWN
      0



In [10]:

    
test_file = './input/patients_test.csv.gz'
test = pd.read_csv(test_file)
test.set_index('patient_id', inplace=True)
test.drop( 'patient_gender', axis = 1, inplace = True )
test_exclude = pd.read_csv('./input/test_patients_to_exclude.csv', header=None, names=['patient_id'])
test.drop(test_exclude.patient_id, inplace=True)
original_test_rows = test.shape[0]
print(test.shape)









    



(1701813, 5)



In [11]:

    
test[:5]









    Out[11]:






  
    
      
      patient_age_group
      patient_state
      ethinicity
      household_income
      education_level
    
    
      patient_id
      
      
      
      
      
    
  
  
    
      148341312
      66-68
      TX
      ALL OTHER
      UNKNOWN
      UNKNOWN
    
    
      130010912
      45-47
      IN
      ALL OTHER
      UNKNOWN
      UNKNOWN
    
    
      103994412
      27-29
      CA
      ALL OTHER
      UNKNOWN
      UNKNOWN
    
    
      318658812
      27-29
      TN
      ALL OTHER
      UNKNOWN
      UNKNOWN
    
    
      93750312
      54-56
      AZ
      ALL OTHER
      UNKNOWN
      UNKNOWN

Overall screening percentage

The average screening of patients overall is 55%:



In [12]:

    
train.is_screener.mean()









    Out[12]:





0.5577850385682711

Screening by Age, Ethnicity, Household income, and Education level

At high level, younger patients have a higher likelihood of being screened, and while a higher household income has a higher likelihood of being screened, ethnicity and education level are not distinguishing features at the overall aggregate level.



In [27]:

    
age = pd.DataFrame([train.groupby('patient_age_group').is_screener.count(),
                    train.groupby('patient_age_group').is_screener.mean()])
age = age.T
age.columns = ['count','screened']
age









    Out[27]:






  
    
      
      count
      screened
    
    
      patient_age_group
      
      
    
  
  
    
      24-26
      71355
      0.718884
    
    
      27-29
      72822
      0.704663
    
    
      30-32
      70283
      0.678044
    
    
      33-35
      66622
      0.656510
    
    
      36-38
      67729
      0.641572
    
    
      39-41
      74664
      0.624558
    
    
      42-44
      72312
      0.602500
    
    
      45-47
      81554
      0.581247
    
    
      48-50
      86363
      0.558978
    
    
      51-53
      87038
      0.536628
    
    
      54-56
      84658
      0.509556
    
    
      57-59
      79807
      0.489719
    
    
      60-62
      75017
      0.460322
    
    
      63-65
      63527
      0.407858
    
    
      66-68
      57399
      0.344901
    
    
      69-71
      46667
      0.285234



In [178]:

    
# train.groupby('ethinicity').is_screener.count()
train.groupby('ethinicity').is_screener.mean()









    Out[178]:





ethinicity
AFRICAN AMERICAN    0.550724
ALL OTHER           0.563541
CAUCASIAN           0.552519
HISPANIC            0.588702
Name: is_screener, dtype: float64



In [176]:

    
#train.groupby('household_income').is_screener.count()
train.groupby('household_income').is_screener.mean()









    Out[176]:





household_income
$100K+      0.600612
<$50-99K    0.563834
<=$49K      0.517505
UNKNOWN     0.565792
Name: is_screener, dtype: float64



In [177]:

    
#train.groupby('education_level').is_screener.count()
train.groupby('education_level').is_screener.mean()









    Out[177]:





education_level
ASSOCIATE DEGREE AND ABOVE    0.592762
HIGH SCHOOL OR LESS           0.542836
SOME COLLEGE                  0.552917
UNKNOWN                       0.559441
Name: is_screener, dtype: float64

Interaction with the Medical System: Have had a Medical Exam

As a measure of the patient's interaction with the medical system, we distinguish whether the patient has ever had medical exam ( diagnosis_code in ('V70', 'V70.0', 'V70.1', 'V70.2', 'V70.3', 'V70.4', 'V70.5', 'V70.6', 'V70.7', 'V70.8', 'V70.9'), and calculate the likelihood of screening for the patient.



In [145]:

    
medical_exam = pd.read_sql_query("select * from diagnosis where diagnosis_code in \
    ('V70', 'V70.0', 'V70.1', 'V70.2', 'V70.3', 'V70.4', 'V70.5', 'V70.6', 'V70.7', 'V70.8', 'V70.9' ) \
    and patient_id in (select patient_id from patients_train );", db)



In [146]:

    
medical_exam = pd.read_csv('./features/train_medical_exam.csv')

There are 1157817 patients in the train set, of which, 498157 have had a medical exam during the entire 7 years of the data.



In [147]:

    
medical_exam.drop_duplicates('patient_id', inplace=True)
medical_exam.shape[0]









    Out[147]:





498157

So less than half of the population has had a regular medical exam:



In [148]:

    
medical_exam.shape[0] / float(train.shape[0])









    Out[148]:





0.43025538578203637

But the average is 66% for patients who have had any medical exam:



In [149]:

    
train.loc[medical_exam.patient_id].is_screener.mean()









    Out[149]:





0.668441876757729

And only 47% if they never have had one:



In [150]:

    
train.loc[~train.index.isin(medical_exam.patient_id)].is_screener.mean()









    Out[150]:





0.4742200527544493

Patient was referred to a Gynecological Exam

If the patient had one of these codes, then they may have been referred to a gynecological exam:

 diagnosis_code |  diagnosis_description                                                 
----------------+-----------------------------------
 V72.3          | GYNECOLOGICAL EXAMINATION
 V72.31         | ROUTINE GYNECOLOGICAL EXAMINATION



In [39]:

    
gyn_exam = pd.read_sql_query("select t1.patient_id, claim_type, diagnosis_date, diagnosis_code from diagnosis t1 \
                             right join patients_train t2 on (t1.patient_id=t2.patient_id ) \
                             where diagnosis_code in ('V72.3', 'V72.31');", db)



In [40]:

    
gyn_exam[:5]









    Out[40]:






  
    
      
      patient_id
      claim_type
      diagnosis_date
      diagnosis_code
    
  
  
    
      0
      112956268
      MX
      201310
      V72.31
    
    
      1
      183576058
      MX
      201310
      V72.31
    
    
      2
      187476589
      MX
      201310
      V72.31
    
    
      3
      163205483
      MX
      201310
      V72.31
    
    
      4
      170954263
      MX
      201310
      V72.31



In [41]:

    
gyn_exam.drop_duplicates('patient_id', inplace=True)
gyn_exam.shape[0]









    Out[41]:





648737

More than have the patients in this database have had a gynecological exam:



In [43]:

    
gyn_exam.shape[0] / float(train.shape[0])









    Out[43]:





0.5603104808445549

And if they have had an exam then the likelihood that they have been screened is high:



In [44]:

    
train.loc[gyn_exam.patient_id].is_screener.mean()









    Out[44]:





0.8079591575630802

And quite low if they have not had a gynecological exam:



In [50]:

    
no_gyn_exam = train.loc[~train.index.isin(gyn_exam.patient_id)]
no_gyn_exam.is_screener.mean()









    Out[50]:





0.23898012100259292

We can look at whether a patient has had a medical exam and a gynecological exam



In [187]:

    
medical_and_gyn_exam_df = train.loc[medical_exam.loc[medical_exam.patient_id.isin(gyn_exam.patient_id)].patient_id]
medical_and_gyn_exam = medical_and_gyn_exam_df.is_screener.mean()



In [180]:

    
medical_and_no_gyn_exam_ids = set(medical_exam.patient_id) - set(gyn_exam.patient_id) 
medical_and_no_gyn_exam = train.loc[medical_and_no_gyn_exam_ids].is_screener.mean()



In [182]:

    
no_medical_and_gyn_exam_ids = set(gyn_exam.patient_id) - set(medical_exam.patient_id)
no_medical_and_gyn_exam = train.loc[no_medical_and_gyn_exam_ids].is_screener.mean()



In [181]:

    
no_medical_and_no_gyn_exam_ids = set(train.index) - (set(medical_exam.patient_id) | set(gyn_exam.patient_id))
no_medical_and_no_gyn_exam = train.loc[no_medical_and_no_gyn_exam_ids].is_screener.mean()



In [189]:

    
mat = [[medical_and_gyn_exam, medical_and_no_gyn_exam], [no_medical_and_gyn_exam, no_medical_and_no_gyn_exam]]



In [141]:

    
# Pick a colormap
# I'd like to set alpha=0.2 , but haven't figured it out yet
cmap='RdYlGn'



In [194]:

    
mat_df = pd.DataFrame(mat, index=['Medical Exam', 'No Medical Exam'], columns=['Gyn Exam', 'No Gyn Exam'])
mat_df









    Out[194]:






  
    
      
      Gyn Exam
      No Gyn Exam
    
  
  
    
      Medical Exam
      0.831919
      0.336702
    
    
      No Medical Exam
      0.782578
      0.192350

Looking at the each of the demographics separately, at the aggregate level, for the patients that did not have a medical exam or a gynecological exam, we don't see any trends other than younger patients are more likely to be screened:



In [192]:

    
df = train.loc[no_medical_and_no_gyn_exam_ids].groupby(['patient_age_group',
                                                        'household_income']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0, low=.5, high=1)



In [196]:

    
df = train.loc[no_medical_and_no_gyn_exam_ids].groupby(['patient_age_group',
                                                        'ethinicity']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0, low=.5, high=1)









    Out[196]:





        

        
        

        
            
            
                
                
                
                AFRICAN AMERICAN
                
                ALL OTHER
                
                CAUCASIAN
                
                HISPANIC
                
            
            
        
        
            
            
                
                
                    
                        24-26
                    
                
                
                    
                        0.448831
                    
                
                
                    
                        0.427834
                    
                
                
                    
                        0.424127
                    
                
                
                    
                        0.417922
                    
                
            
            
            
                
                
                    
                        27-29
                    
                
                
                    
                        0.393543
                    
                
                
                    
                        0.373111
                    
                
                
                    
                        0.375114
                    
                
                
                    
                        0.366501
                    
                
            
            
            
                
                
                    
                        30-32
                    
                
                
                    
                        0.332603
                    
                
                
                    
                        0.31666
                    
                
                
                    
                        0.318951
                    
                
                
                    
                        0.302847
                    
                
            
            
            
                
                
                    
                        33-35
                    
                
                
                    
                        0.275481
                    
                
                
                    
                        0.274533
                    
                
                
                    
                        0.277941
                    
                
                
                    
                        0.268778
                    
                
            
            
            
                
                
                    
                        36-38
                    
                
                
                    
                        0.232281
                    
                
                
                    
                        0.238983
                    
                
                
                    
                        0.237715
                    
                
                
                    
                        0.239593
                    
                
            
            
            
                
                
                    
                        39-41
                    
                
                
                    
                        0.195408
                    
                
                
                    
                        0.194208
                    
                
                
                    
                        0.207307
                    
                
                
                    
                        0.212946
                    
                
            
            
            
                
                
                    
                        42-44
                    
                
                
                    
                        0.166923
                    
                
                
                    
                        0.172685
                    
                
                
                    
                        0.188342
                    
                
                
                    
                        0.168142
                    
                
            
            
            
                
                
                    
                        45-47
                    
                
                
                    
                        0.151502
                    
                
                
                    
                        0.167876
                    
                
                
                    
                        0.170555
                    
                
                
                    
                        0.150207
                    
                
            
            
            
                
                
                    
                        48-50
                    
                
                
                    
                        0.147607
                    
                
                
                    
                        0.14553
                    
                
                
                    
                        0.162721
                    
                
                
                    
                        0.140216
                    
                
            
            
            
                
                
                    
                        51-53
                    
                
                
                    
                        0.149037
                    
                
                
                    
                        0.141427
                    
                
                
                    
                        0.154014
                    
                
                
                    
                        0.141261
                    
                
            
            
            
                
                
                    
                        54-56
                    
                
                
                    
                        0.127381
                    
                
                
                    
                        0.131833
                    
                
                
                    
                        0.137601
                    
                
                
                    
                        0.131681
                    
                
            
            
            
                
                
                    
                        57-59
                    
                
                
                    
                        0.123374
                    
                
                
                    
                        0.126517
                    
                
                
                    
                        0.133971
                    
                
                
                    
                        0.107062
                    
                
            
            
            
                
                
                    
                        60-62
                    
                
                
                    
                        0.120066
                    
                
                
                    
                        0.120992
                    
                
                
                    
                        0.123824
                    
                
                
                    
                        0.130714
                    
                
            
            
            
                
                
                    
                        63-65
                    
                
                
                    
                        0.112058
                    
                
                
                    
                        0.113335
                    
                
                
                    
                        0.123171
                    
                
                
                    
                        0.102389
                    
                
            
            
            
                
                
                    
                        66-68
                    
                
                
                    
                        0.113145
                    
                
                
                    
                        0.108734
                    
                
                
                    
                        0.131278
                    
                
                
                    
                        0.113861
                    
                
            
            
            
                
                
                    
                        69-71
                    
                
                
                    
                        0.106726
                    
                
                
                    
                        0.098173
                    
                
                
                    
                        0.116529
                    
                
                
                    
                        0.111314



In [197]:

    
df = train.loc[no_medical_and_no_gyn_exam_ids].groupby(['patient_age_group',
                                                        'education_level']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0,  low=.5, high=1)









    Out[197]:





        

        
        

        
            
            
                
                
                
                ASSOCIATE DEGREE AND ABOVE
                
                HIGH SCHOOL OR LESS
                
                SOME COLLEGE
                
                UNKNOWN
                
            
            
        
        
            
            
                
                
                    
                        24-26
                    
                
                
                    
                        0.357949
                    
                
                
                    
                        0.449572
                    
                
                
                    
                        0.419912
                    
                
                
                    
                        0.427118
                    
                
            
            
            
                
                
                    
                        27-29
                    
                
                
                    
                        0.349741
                    
                
                
                    
                        0.384125
                    
                
                
                    
                        0.372601
                    
                
                
                    
                        0.375372
                    
                
            
            
            
                
                
                    
                        30-32
                    
                
                
                    
                        0.298061
                    
                
                
                    
                        0.325328
                    
                
                
                    
                        0.320943
                    
                
                
                    
                        0.316006
                    
                
            
            
            
                
                
                    
                        33-35
                    
                
                
                    
                        0.281627
                    
                
                
                    
                        0.273682
                    
                
                
                    
                        0.280928
                    
                
                
                    
                        0.272813
                    
                
            
            
            
                
                
                    
                        36-38
                    
                
                
                    
                        0.247107
                    
                
                
                    
                        0.235989
                    
                
                
                    
                        0.238261
                    
                
                
                    
                        0.23637
                    
                
            
            
            
                
                
                    
                        39-41
                    
                
                
                    
                        0.205592
                    
                
                
                    
                        0.20536
                    
                
                
                    
                        0.204496
                    
                
                
                    
                        0.194661
                    
                
            
            
            
                
                
                    
                        42-44
                    
                
                
                    
                        0.181771
                    
                
                
                    
                        0.18483
                    
                
                
                    
                        0.179961
                    
                
                
                    
                        0.17301
                    
                
            
            
            
                
                
                    
                        45-47
                    
                
                
                    
                        0.166161
                    
                
                
                    
                        0.169943
                    
                
                
                    
                        0.15937
                    
                
                
                    
                        0.169606
                    
                
            
            
            
                
                
                    
                        48-50
                    
                
                
                    
                        0.169666
                    
                
                
                    
                        0.152824
                    
                
                
                    
                        0.158492
                    
                
                
                    
                        0.146316
                    
                
            
            
            
                
                
                    
                        51-53
                    
                
                
                    
                        0.149171
                    
                
                
                    
                        0.152483
                    
                
                
                    
                        0.15318
                    
                
                
                    
                        0.140466
                    
                
            
            
            
                
                
                    
                        54-56
                    
                
                
                    
                        0.136911
                    
                
                
                    
                        0.13476
                    
                
                
                    
                        0.13785
                    
                
                
                    
                        0.129056
                    
                
            
            
            
                
                
                    
                        57-59
                    
                
                
                    
                        0.138577
                    
                
                
                    
                        0.133143
                    
                
                
                    
                        0.126523
                    
                
                
                    
                        0.123337
                    
                
            
            
            
                
                
                    
                        60-62
                    
                
                
                    
                        0.120542
                    
                
                
                    
                        0.118644
                    
                
                
                    
                        0.131788
                    
                
                
                    
                        0.118175
                    
                
            
            
            
                
                
                    
                        63-65
                    
                
                
                    
                        0.134359
                    
                
                
                    
                        0.117697
                    
                
                
                    
                        0.119039
                    
                
                
                    
                        0.10992
                    
                
            
            
            
                
                
                    
                        66-68
                    
                
                
                    
                        0.151788
                    
                
                
                    
                        0.11603
                    
                
                
                    
                        0.132014
                    
                
                
                    
                        0.10293
                    
                
            
            
            
                
                
                    
                        69-71
                    
                
                
                    
                        0.121727
                    
                
                
                    
                        0.111138
                    
                
                
                    
                        0.115672
                    
                
                
                    
                        0.096601

Our models showed the following diagnosis were strong predictors of being screened:

646.83 OTHER SPECIFIED ANTEPARTUM COMPLICATIONS
648.93 OTHER CURRENT CONDITIONS CLASSIFIABLE ELSEWHERE OF MOTHER, ANTEPARTUM
650 NORMAL DELIVERY
V22.0 SUPERVISION OF NORMAL FIRST PREGNANCY
V22.1 SUPERVISION OF OTHER NORMAL PREGNANCY
V22.2 PREGNANT STATE, INCIDENTAL
V24.2 ROUTINE POSTPARTUM FOLLOW-UP
V25.2 STERILIZATION
V27.0 MOTHER WITH SINGLE LIVEBORN
V28.3 ENCOUNTER FOR ROUTINE SCREENING FOR MALFORMATION USING ULTRASONICS
V74.5 SCREENING EXAMINATION FOR VENEREAL DISEASE



In [94]:

    
pregnancy = pd.read_sql_query("select t1.patient_id, claim_type, diagnosis_date, diagnosis_code from diagnosis t1 \
                             right join patients_train t2 on (t1.patient_id=t2.patient_id) where diagnosis_code in \
                             ('V22.0','V22.1','V22.2','V24.2','V25.2','V27.0','V28.3','V70.0','V74.5');", db)
pregnancy.shape









    Out[94]:





(3905637, 4)



In [95]:

    
pregnancy.to_csv('train_pregnant.csv', index=False)



In [96]:

    
pregnancy.drop_duplicates('patient_id', inplace=True)
pregnancy.shape









    Out[96]:





(627113, 4)

Patients that had one of the above diagnosis were screened with the following likelihood:



In [97]:

    
train.loc[pregnancy.patient_id].is_screener.mean()









    Out[97]:





0.6839947505473495

And those that were not:



In [98]:

    
train.loc[~train.index.isin(pregnancy.patient_id)].is_screener.mean()









    Out[98]:





0.40864775844915435

Looking at these pregnancy diagnosis by demographics:



In [198]:

    
df = train.loc[pregnancy.patient_id].groupby(['patient_age_group',
                                              'household_income']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0, low=.5, high=1)

Versus the patients who has no such diagnosis:



In [199]:

    
df = train.loc[~train.index.isin(pregnancy.patient_id)].groupby(['patient_age_group',
                                              'household_income']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0,  low=.5, high=1)

Again, by ethnicity, for patients that had such diagnosis:



In [200]:

    
df = train.loc[pregnancy.patient_id].groupby(['patient_age_group',
                                              'ethinicity']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0,  low=.5, high=1)









    Out[200]:





        

        
        

        
            
            
                
                
                
                AFRICAN AMERICAN
                
                ALL OTHER
                
                CAUCASIAN
                
                HISPANIC
                
            
            
        
        
            
            
                
                
                    
                        24-26
                    
                
                
                    
                        0.777142
                    
                
                
                    
                        0.745553
                    
                
                
                    
                        0.780252
                    
                
                
                    
                        0.7613
                    
                
            
            
            
                
                
                    
                        27-29
                    
                
                
                    
                        0.778385
                    
                
                
                    
                        0.739889
                    
                
                
                    
                        0.774871
                    
                
                
                    
                        0.762374
                    
                
            
            
            
                
                
                    
                        30-32
                    
                
                
                    
                        0.760115
                    
                
                
                    
                        0.724043
                    
                
                
                    
                        0.76089
                    
                
                
                    
                        0.752357
                    
                
            
            
            
                
                
                    
                        33-35
                    
                
                
                    
                        0.750954
                    
                
                
                    
                        0.722142
                    
                
                
                    
                        0.752208
                    
                
                
                    
                        0.743203
                    
                
            
            
            
                
                
                    
                        36-38
                    
                
                
                    
                        0.725958
                    
                
                
                    
                        0.728929
                    
                
                
                    
                        0.749281
                    
                
                
                    
                        0.734131
                    
                
            
            
            
                
                
                    
                        39-41
                    
                
                
                    
                        0.714376
                    
                
                
                    
                        0.722218
                    
                
                
                    
                        0.744253
                    
                
                
                    
                        0.746135
                    
                
            
            
            
                
                
                    
                        42-44
                    
                
                
                    
                        0.687624
                    
                
                
                    
                        0.713423
                    
                
                
                    
                        0.725379
                    
                
                
                    
                        0.746662
                    
                
            
            
            
                
                
                    
                        45-47
                    
                
                
                    
                        0.661436
                    
                
                
                    
                        0.700341
                    
                
                
                    
                        0.705973
                    
                
                
                    
                        0.739829
                    
                
            
            
            
                
                
                    
                        48-50
                    
                
                
                    
                        0.654453
                    
                
                
                    
                        0.680173
                    
                
                
                    
                        0.686444
                    
                
                
                    
                        0.708137
                    
                
            
            
            
                
                
                    
                        51-53
                    
                
                
                    
                        0.611948
                    
                
                
                    
                        0.658698
                    
                
                
                    
                        0.661367
                    
                
                
                    
                        0.68171
                    
                
            
            
            
                
                
                    
                        54-56
                    
                
                
                    
                        0.5851
                    
                
                
                    
                        0.635671
                    
                
                
                    
                        0.638175
                    
                
                
                    
                        0.644328
                    
                
            
            
            
                
                
                    
                        57-59
                    
                
                
                    
                        0.566601
                    
                
                
                    
                        0.603451
                    
                
                
                    
                        0.62352
                    
                
                
                    
                        0.639644
                    
                
            
            
            
                
                
                    
                        60-62
                    
                
                
                    
                        0.545287
                    
                
                
                    
                        0.583766
                    
                
                
                    
                        0.588393
                    
                
                
                    
                        0.618273
                    
                
            
            
            
                
                
                    
                        63-65
                    
                
                
                    
                        0.482042
                    
                
                
                    
                        0.519207
                    
                
                
                    
                        0.539853
                    
                
                
                    
                        0.584084
                    
                
            
            
            
                
                
                    
                        66-68
                    
                
                
                    
                        0.435464
                    
                
                
                    
                        0.450258
                    
                
                
                    
                        0.473522
                    
                
                
                    
                        0.479015
                    
                
            
            
            
                
                
                    
                        69-71
                    
                
                
                    
                        0.394518
                    
                
                
                    
                        0.383451
                    
                
                
                    
                        0.413671
                    
                
                
                    
                        0.452184

And those patients that had no such diagnosis:



In [201]:

    
df = train.loc[~train.index.isin(pregnancy.patient_id)].groupby(['patient_age_group',
                                              'ethinicity']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0,  low=.5, high=1)









    Out[201]:





        

        
        

        
            
            
                
                
                
                AFRICAN AMERICAN
                
                ALL OTHER
                
                CAUCASIAN
                
                HISPANIC
                
            
            
        
        
            
            
                
                
                    
                        24-26
                    
                
                
                    
                        0.466558
                    
                
                
                    
                        0.435958
                    
                
                
                    
                        0.52791
                    
                
                
                    
                        0.475155
                    
                
            
            
            
                
                
                    
                        27-29
                    
                
                
                    
                        0.450423
                    
                
                
                    
                        0.430262
                    
                
                
                    
                        0.515258
                    
                
                
                    
                        0.46798
                    
                
            
            
            
                
                
                    
                        30-32
                    
                
                
                    
                        0.464433
                    
                
                
                    
                        0.432626
                    
                
                
                    
                        0.499036
                    
                
                
                    
                        0.463415
                    
                
            
            
            
                
                
                    
                        33-35
                    
                
                
                    
                        0.439522
                    
                
                
                    
                        0.451082
                    
                
                
                    
                        0.505461
                    
                
                
                    
                        0.435041
                    
                
            
            
            
                
                
                    
                        36-38
                    
                
                
                    
                        0.459504
                    
                
                
                    
                        0.457486
                    
                
                
                    
                        0.514641
                    
                
                
                    
                        0.477531
                    
                
            
            
            
                
                
                    
                        39-41
                    
                
                
                    
                        0.429807
                    
                
                
                    
                        0.45304
                    
                
                
                    
                        0.523231
                    
                
                
                    
                        0.487695
                    
                
            
            
            
                
                
                    
                        42-44
                    
                
                
                    
                        0.411437
                    
                
                
                    
                        0.446224
                    
                
                
                    
                        0.51544
                    
                
                
                    
                        0.488999
                    
                
            
            
            
                
                
                    
                        45-47
                    
                
                
                    
                        0.41694
                    
                
                
                    
                        0.43738
                    
                
                
                    
                        0.496827
                    
                
                
                    
                        0.453608
                    
                
            
            
            
                
                
                    
                        48-50
                    
                
                
                    
                        0.405517
                    
                
                
                    
                        0.413524
                    
                
                
                    
                        0.472597
                    
                
                
                    
                        0.437195
                    
                
            
            
            
                
                
                    
                        51-53
                    
                
                
                    
                        0.389627
                    
                
                
                    
                        0.394016
                    
                
                
                    
                        0.456199
                    
                
                
                    
                        0.423846
                    
                
            
            
            
                
                
                    
                        54-56
                    
                
                
                    
                        0.353197
                    
                
                
                    
                        0.371726
                    
                
                
                    
                        0.426389
                    
                
                
                    
                        0.407437
                    
                
            
            
            
                
                
                    
                        57-59
                    
                
                
                    
                        0.337867
                    
                
                
                    
                        0.353929
                    
                
                
                    
                        0.40547
                    
                
                
                    
                        0.376395
                    
                
            
            
            
                
                
                    
                        60-62
                    
                
                
                    
                        0.319601
                    
                
                
                    
                        0.335297
                    
                
                
                    
                        0.377718
                    
                
                
                    
                        0.373289
                    
                
            
            
            
                
                
                    
                        63-65
                    
                
                
                    
                        0.283481
                    
                
                
                    
                        0.293562
                    
                
                
                    
                        0.334995
                    
                
                
                    
                        0.32523
                    
                
            
            
            
                
                
                    
                        66-68
                    
                
                
                    
                        0.260491
                    
                
                
                    
                        0.247044
                    
                
                
                    
                        0.28846
                    
                
                
                    
                        0.273973
                    
                
            
            
            
                
                
                    
                        69-71
                    
                
                
                    
                        0.215208
                    
                
                
                    
                        0.208198
                    
                
                
                    
                        0.236592
                    
                
                
                    
                        0.240337

Patient has other highly predictive diagnosis

Our modelling revealed the following diagnosis were also highly predictive of screening, and the same analysis as above could be performed, but these diagnosis look like the patient is already down a path that required that they be screened, but we are looking for patients that are not being screened.

462 ACUTE PHARYNGITIS
496 CHRONIC AIRWAY OBSTRUCTION, NOT ELSEWHERE CLASSIFIED
585.3 CHRONIC KIDNEY DISEASE, STAGE III (MODERATE)
616 CERVICITIS AND ENDOCERVICITIS
616.1 VAGINITIS AND VULVOVAGINITIS, UNSPECIFIED
620.2 OTHER AND UNSPECIFIED OVARIAN CYST
622.1 DYSPLASIA OF CERVIX, UNSPECIFIED
622.11 MILD DYSPLASIA OF CERVIX
623.5 LEUKORRHEA, NOT SPECIFIED AS INFECTIVE
625.3 DYSMENORRHEA
625.9 UNSPECIFIED SYMPTOM ASSOCIATED WITH FEMALE GENITAL ORGANS
626 ABSENCE OF MENSTRUATION
626.2 EXCESSIVE OR FREQUENT MENSTRUATION
626.4 IRREGULAR MENSTRUAL CYCLE
626.8 OTHER DISORDERS OF MENSTRUATION AND OTHER ABNORMAL BLEEDING FROM FEMALE GENITAL
795 ABNORMAL GLANDULAR PAPANICOLAOU SMEAR OF CERVIX

Patient had procedures predictive of not being screened

We performed a L1 (Lasso) feature selection using Vowpal Wabbit on procedures and found the following procedures positively correlated with a patient being screened:

procedure_code   procedure_description                            RelScore
57454            COLPOSCOPY CERVIX BX CERVIX & ENDOCRV CURRETAGE   100.00%
1252             GJB2 GENE ANALYSIS FULL GENE SEQUENCE              96.93%
57456            COLPOSCOPY CERVIX ENDOCERVICAL CURETTAGE           95.00%
57455            COLPOSCOPY CERVIX UPPR/ADJCNT VAGINA W/CERVIX BX   91.42%
S4020            IN VITRO FERTILIZATION PROCEDURE CANCELLED BEFOR   85.76%
S0605            DIGITAL RECTAL EXAMINATION, MALE, ANNUAL           83.64%
G0143            SCREENING CYTOPATHOLOGY, CERVICAL OR VAGINAL (AN   78.39%
90696            DTAP-IPV VACCINE CHILD 4-6 YRS FOR IM USE          76.98%
S4023            DONOR EGG CYCLE, INCOMPLETE, CASE RATE             76.67%
69710            IMPLTJ/RPLCMT EMGNT BONE CNDJ DEV TEMPORAL BONE    72.06%

and the following procedures negatively correlated:

procedure_code        procedure_description             RelScore
K0735  SKIN PROTECTION WHEELCHAIR SEAT CUSHION, ADJUSTA  -62.48%
34805  EVASC RPR AAA AORTO-UNIILIAC/AORTO-UNIFEM PROSTH  -64.68%
L5975  ALL LOWER EXTREMITY PROSTHESIS, COMBINATION SING  -65.39%
89321  SEMEN ANALYSIS SPERM PRESENCE&/MOTILITY SPRM      -65.51%
S9145  INSULIN PUMP INITIATION, INSTRUCTION IN INITIAL   -69.96%
00632  ANESTHESIA LUMBAR REGION LUMBAR SYMPATHECTOMY     -77.13%
27756  PRQ SKELETAL FIXATION TIBIAL SHAFT FRACTURE       -78.61%
3303F  AJCC CANCER STAGE IA, DOCUMENTED (ONC), (ML)      -82.77%
23675  CLTX SHOULDER DISLC W/SURG/ANTMCL NECK FX W/MANJ  -83.14%
Q4111  GAMMAGRAFT, PER SQUARE CENTIMETER                 -85.49%

It seems clear that a number of negatively correlated procedures identify patients that have bigger problems than the need for a cervical exam, and it is possible physicians have not recommended a screening for these reasons. Therefore the analysis above should be rerun without these patients skewing the likelihood percentages.

Where are the patients ?

We can review all slices of screener likleihoods by geography.



In [153]:

    
from mpl_toolkits.basemap import Basemap
from shapely.geometry import Point, Polygon, MultiPoint, MultiPolygon
from matplotlib.collections import PatchCollection
import geopandas as gpd
print('geopandas: '+gpd.__version__)









    



geopandas: 0.1.0.dev-



In [168]:

    
# settings  
plt.rcParams['figure.figsize']=(16,14)



In [154]:

    
# US Bounding box
minx, miny, maxx, maxy = -125,22,-65,50  # USA bounding box
usbbox = Polygon([(minx,miny),(minx,maxy),(maxx, maxy),(maxx,miny)])

Map States

Looking at the likelyhood by state:



In [155]:

    
# We grabbed the USA shape files from : https://github.com/matplotlib/basemap/tree/master/examples
# or state files from here: https://www.census.gov/geo/maps-data/data/cbf/cbf_state.html
# We took the CBSA files from https://www.census.gov/geo/maps-data/data/cbf/cbf_msa.html



In [157]:

    
state_pct = pd.read_csv('./features/state_screen_percent.csv')
state_pct.set_index('patient_state', inplace=True)
state_pct[:5]









    Out[157]:






  
    
      
      state_pct
    
    
      patient_state
      
    
  
  
    
      AK
      0.496903
    
    
      AL
      0.417596
    
    
      AR
      0.437814
    
    
      AZ
      0.504682
    
    
      CA
      0.526563



In [162]:

    
states_key = pd.read_csv('./datastudy/state.csv')
states_key.set_index('name_long', inplace=True)
states_key[:5]









    Out[162]:






  
    
      
      name_short
    
    
      name_long
      
    
  
  
    
      Alabama
      AL
    
    
      Alaska
      AK
    
    
      Arizona
      AZ
    
    
      Arkansas
      AR
    
    
      California
      CA



In [163]:

    
state_file = './datastudy/st99_d00.shp'
#bbox=(-119,22,64,49)
bbox=(-125,22,-65,50)
kw = dict(bbox=bbox)
states_map = gpd.GeoDataFrame.from_file(state_file, **kw)

#state_file = 'cb_2014_us_state_500k.shp'
# states = gpd.GeoDataFrame.from_file(state_file)

if gpd.__version__ != '0.1.0.dev-':  # temp fix 
    states_map.geometry = convert_3D_2D(states.geometry)



In [164]:

    
states_map = states_map.merge(states_key, left_on='NAME', right_index=True, how='left')
states_map = states_map.merge(state_pct, left_on='name_short', right_index=True, how='left')



In [169]:

    
states_map = states_map[states_map.geometry.within(usbbox)]
# plt.figure(figsize=(12,10))
ax = states_map[states_map.state_pct.notnull()].plot(column='state_pct', scheme='QUANTILES', k=5, colormap='RdYlGn', 
                                             alpha=0.9, linewidth=0.1, legend=True)
ax.axis('off')
plt.title('Percentage screened by State')
plt.show()

By CBSA

Every patient was mapped to the CBSA where they had the most diagnosis. We can then look at the likelihoods by these areas. Combining this map as screened by the features above would render a better map for targeting patients who are not being screened.



In [171]:

    
cbsa_pct = pd.read_csv('./features/cbsa_pct.csv.gz')
cbsa_pct.set_index('cbsa', inplace=True)
cbsa_pct[:5]









    Out[171]:






  
    
      
      pct_screened
    
    
      cbsa
      
    
  
  
    
      10100
      0.528302
    
    
      10140
      0.312376
    
    
      10180
      0.301802
    
    
      10220
      0.591623
    
    
      10300
      0.681319



In [172]:

    
cbsa = gpd.GeoDataFrame.from_file('./datastudy/cb_2014_us_cbsa_500k.shp')
cbsa.GEOID = cbsa.GEOID.astype(float)
cbsa.set_index('GEOID', inplace=True)
cbsa.shape









    Out[172]:





(929, 8)



In [173]:

    
cbsa['pct_screened'] = cbsa_pct.pct_screened



In [175]:

    
cbsa = cbsa[cbsa.geometry.within(usbbox)]
ax = states_map.plot(column=None, color='white', linewidth=0.1)
ax = cbsa[cbsa.pct_screened.notnull()].plot(column='pct_screened', scheme='QUANTILES', k=5, colormap='RdYlGn', 
               alpha=0.9, ax=ax, linewidth=0.1, legend=True)
ax.axis('off')
plt.title('Percentage screened by CBSA')
plt.show()

Further Work

There are many physicians performing a routine gynecological exam that have a specialty_code other than OBG or GYN. Looking into these differences by geography might be interesting.
We have started to look at the percentage of screening by the primary_practitioner referring or performing the GYN exam and there are some stark differences. Understanding these differences will help target the physicians rather than the patients.

	$100K+	<$50-99K	<=$49K	UNKNOWN
24-26	0.35969	0.423193	0.440111	0.425877
27-29	0.355485	0.367428	0.38366	0.374631
30-32	0.312824	0.32666	0.318838	0.316347
33-35	0.306968	0.268364	0.274752	0.272195
36-38	0.251977	0.23088	0.237742	0.236991
39-41	0.203505	0.209778	0.204159	0.193592
42-44	0.176207	0.187968	0.177838	0.177167
45-47	0.17234	0.169952	0.157674	0.168947
48-50	0.164665	0.156355	0.152788	0.149411
51-53	0.156243	0.152659	0.14912	0.141805
54-56	0.137091	0.142986	0.130019	0.129569
57-59	0.129941	0.135702	0.126996	0.126115
60-62	0.125631	0.125445	0.122352	0.119756
63-65	0.130214	0.120849	0.11828	0.111176
66-68	0.14774	0.137996	0.119116	0.105384
69-71	0.140173	0.12709	0.10566	0.097047

	$100K+	<$50-99K	<=$49K	UNKNOWN
24-26	0.797293	0.790567	0.769439	0.743877
27-29	0.768755	0.781171	0.77039	0.739053
30-32	0.754581	0.766104	0.75922	0.720977
33-35	0.750239	0.75418	0.749749	0.719814
36-38	0.750427	0.751125	0.731734	0.726982
39-41	0.747668	0.74766	0.7277	0.71825
42-44	0.725005	0.731678	0.71186	0.709507
45-47	0.713178	0.706094	0.688185	0.699114
48-50	0.693725	0.682825	0.672379	0.68079
51-53	0.665283	0.654345	0.646761	0.659756
54-56	0.646083	0.638881	0.602627	0.637787
57-59	0.630167	0.622405	0.598021	0.605849
60-62	0.61221	0.590204	0.557609	0.577913
63-65	0.575651	0.545377	0.505536	0.514887
66-68	0.509791	0.486594	0.442522	0.447966
69-71	0.474187	0.43068	0.392409	0.375991

	$100K+	<$50-99K	<=$49K	UNKNOWN
24-26	0.57907	0.533294	0.484793	0.434208
27-29	0.573262	0.537277	0.463075	0.427531
30-32	0.549265	0.514217	0.455711	0.43464
33-35	0.551695	0.506888	0.450314	0.45063
36-38	0.583227	0.513025	0.454395	0.453069
39-41	0.567245	0.524049	0.436784	0.459003
42-44	0.561044	0.509442	0.426593	0.45488
45-47	0.541899	0.491072	0.411236	0.446415
48-50	0.51024	0.472053	0.394319	0.422343
51-53	0.496547	0.455047	0.38144	0.399983
54-56	0.465046	0.425922	0.354866	0.37905
57-59	0.444062	0.409069	0.346257	0.356588
60-62	0.420429	0.386057	0.322683	0.338219
63-65	0.389248	0.353181	0.28851	0.289513
66-68	0.349532	0.306871	0.25485	0.242549
69-71	0.286869	0.262884	0.213248	0.203547

	patient_age_group	patient_state	ethinicity	household_income	education_level	is_screener
patient_id
336201912	51-53	SD	ALL OTHER	UNKNOWN	UNKNOWN	1
94237712	39-41	NE	ALL OTHER	UNKNOWN	UNKNOWN	1
186124512	24-26	CA	ALL OTHER	UNKNOWN	UNKNOWN	0
767144212	27-29	NY	ALL OTHER	UNKNOWN	UNKNOWN	1
104743512	30-32	HI	ALL OTHER	UNKNOWN	UNKNOWN	0

	patient_age_group	patient_state	ethinicity	household_income	education_level
patient_id
148341312	66-68	TX	ALL OTHER	UNKNOWN	UNKNOWN
130010912	45-47	IN	ALL OTHER	UNKNOWN	UNKNOWN
103994412	27-29	CA	ALL OTHER	UNKNOWN	UNKNOWN
318658812	27-29	TN	ALL OTHER	UNKNOWN	UNKNOWN
93750312	54-56	AZ	ALL OTHER	UNKNOWN	UNKNOWN

	count	screened
patient_age_group
24-26	71355	0.718884
27-29	72822	0.704663
30-32	70283	0.678044
33-35	66622	0.656510
36-38	67729	0.641572
39-41	74664	0.624558
42-44	72312	0.602500
45-47	81554	0.581247
48-50	86363	0.558978
51-53	87038	0.536628
54-56	84658	0.509556
57-59	79807	0.489719
60-62	75017	0.460322
63-65	63527	0.407858
66-68	57399	0.344901
69-71	46667	0.285234

	patient_id	claim_type	diagnosis_date	diagnosis_code
0	112956268	MX	201310	V72.31
1	183576058	MX	201310	V72.31
2	187476589	MX	201310	V72.31
3	163205483	MX	201310	V72.31
4	170954263	MX	201310	V72.31

	AFRICAN AMERICAN	ALL OTHER	CAUCASIAN	HISPANIC
24-26	0.448831	0.427834	0.424127	0.417922
27-29	0.393543	0.373111	0.375114	0.366501
30-32	0.332603	0.31666	0.318951	0.302847
33-35	0.275481	0.274533	0.277941	0.268778
36-38	0.232281	0.238983	0.237715	0.239593
39-41	0.195408	0.194208	0.207307	0.212946
42-44	0.166923	0.172685	0.188342	0.168142
45-47	0.151502	0.167876	0.170555	0.150207
48-50	0.147607	0.14553	0.162721	0.140216
51-53	0.149037	0.141427	0.154014	0.141261
54-56	0.127381	0.131833	0.137601	0.131681
57-59	0.123374	0.126517	0.133971	0.107062
60-62	0.120066	0.120992	0.123824	0.130714
63-65	0.112058	0.113335	0.123171	0.102389
66-68	0.113145	0.108734	0.131278	0.113861
69-71	0.106726	0.098173	0.116529	0.111314

	ASSOCIATE DEGREE AND ABOVE	HIGH SCHOOL OR LESS	SOME COLLEGE	UNKNOWN
24-26	0.357949	0.449572	0.419912	0.427118
27-29	0.349741	0.384125	0.372601	0.375372
30-32	0.298061	0.325328	0.320943	0.316006
33-35	0.281627	0.273682	0.280928	0.272813
36-38	0.247107	0.235989	0.238261	0.23637
39-41	0.205592	0.20536	0.204496	0.194661
42-44	0.181771	0.18483	0.179961	0.17301
45-47	0.166161	0.169943	0.15937	0.169606
48-50	0.169666	0.152824	0.158492	0.146316
51-53	0.149171	0.152483	0.15318	0.140466
54-56	0.136911	0.13476	0.13785	0.129056
57-59	0.138577	0.133143	0.126523	0.123337
60-62	0.120542	0.118644	0.131788	0.118175
63-65	0.134359	0.117697	0.119039	0.10992
66-68	0.151788	0.11603	0.132014	0.10293
69-71	0.121727	0.111138	0.115672	0.096601

	AFRICAN AMERICAN	ALL OTHER	CAUCASIAN	HISPANIC
24-26	0.777142	0.745553	0.780252	0.7613
27-29	0.778385	0.739889	0.774871	0.762374
30-32	0.760115	0.724043	0.76089	0.752357
33-35	0.750954	0.722142	0.752208	0.743203
36-38	0.725958	0.728929	0.749281	0.734131
39-41	0.714376	0.722218	0.744253	0.746135
42-44	0.687624	0.713423	0.725379	0.746662
45-47	0.661436	0.700341	0.705973	0.739829
48-50	0.654453	0.680173	0.686444	0.708137
51-53	0.611948	0.658698	0.661367	0.68171
54-56	0.5851	0.635671	0.638175	0.644328
57-59	0.566601	0.603451	0.62352	0.639644
60-62	0.545287	0.583766	0.588393	0.618273
63-65	0.482042	0.519207	0.539853	0.584084
66-68	0.435464	0.450258	0.473522	0.479015
69-71	0.394518	0.383451	0.413671	0.452184

	AFRICAN AMERICAN	ALL OTHER	CAUCASIAN	HISPANIC
24-26	0.466558	0.435958	0.52791	0.475155
27-29	0.450423	0.430262	0.515258	0.46798
30-32	0.464433	0.432626	0.499036	0.463415
33-35	0.439522	0.451082	0.505461	0.435041
36-38	0.459504	0.457486	0.514641	0.477531
39-41	0.429807	0.45304	0.523231	0.487695
42-44	0.411437	0.446224	0.51544	0.488999
45-47	0.41694	0.43738	0.496827	0.453608
48-50	0.405517	0.413524	0.472597	0.437195
51-53	0.389627	0.394016	0.456199	0.423846
54-56	0.353197	0.371726	0.426389	0.407437
57-59	0.337867	0.353929	0.40547	0.376395
60-62	0.319601	0.335297	0.377718	0.373289
63-65	0.283481	0.293562	0.334995	0.32523
66-68	0.260491	0.247044	0.28846	0.273973
69-71	0.215208	0.208198	0.236592	0.240337

	state_pct
patient_state
AK	0.496903
AL	0.417596
AR	0.437814
AZ	0.504682
CA	0.526563

	pct_screened
cbsa
10100	0.528302
10140	0.312376
10180	0.301802
10220	0.591623
10300	0.681319