Genentech Cervical Cancer Screening - Insights

https://www.kaggle.com/c/cervical-cancer-screening/

Cervical cancer is the third most common cancer in women worldwide, affecting over 500,000 women and resulting in approximately 275,000 deaths every year. Most women in the US have access to cervical cancer screening, yet 4,000 women die every year from cervical cancer in the US and it is estimated that 30% of US women do not receive regular pap screenings.

Prior research suggests that lower screening rates are associated with low income, low education, lack of interaction with the healthcare system, and lack of health insurance. But research also shows that even in women with access to healthcare fail to get this preventive test, indicating that barriers like lack of education and not being comfortable with the procedure are influencing their behavior.

If one could better identify women who are not being screened, education campaigns could target them with content that speaks directly to their unique risk factors. Identifying predictors of not receiving pap smears will provide important information to stakeholders in cervical cancer prevention who run awareness programs.

Below is one analysis for the Kaggle Genentech Cervical Cancer Screening competition.

The focus is on identifying healthy women who are not being screened.

  • By Paul Perry, Elena Cuoco, and Zygmunt Zając.

Load

This notebook is almost runnable in Kaggle scripts, but some features extractions would take too long, and we make use of geopandas for mapping functions. We use Postgres, but it would be easily swappable for Kaggle's sqlite3.


In [1]:
# imports
import sys # for stderr
import numpy as np
import pandas as pd
import sklearn as skl
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# settings 
plt.style.use('ggplot')
# plt.rcParams['figure.figsize'] = (10.0, 10.0)
# pd.set_option('display.max_rows', 50)
# pd.set_option('display.max_columns', 50)

In [3]:
# versions 
import sys
print(pd.datetime.now())
print('Python: '+sys.version)
print('numpy: '+np.__version__)
print('pandas: '+pd.__version__)
print('sklearn: '+skl.__version__)


2016-02-08 13:43:29.988998
Python: 2.7.11 |Anaconda 2.4.0 (x86_64)| (default, Dec  6 2015, 18:57:58) 
[GCC 4.2.1 (Apple Inc. build 5577)]
numpy: 1.10.2
pandas: 0.17.1
sklearn: 0.17

In [4]:
import psycopg2
# import sqlite3

In [5]:
db = psycopg2.connect("dbname='ccancer' user='paulperry' host='localhost' password=''")
#db = sqlite3.connect('../input/database.sqlite')

In [6]:
import datetime
start = datetime.datetime.now()
print(start)


2016-02-08 13:43:33.322108

In [7]:
fdir = './features/'

In [8]:
train_file = './input/patients_train.csv.gz'
train = pd.read_csv(train_file)
train.set_index('patient_id', inplace=True)
train.drop('patient_gender', axis = 1, inplace = True )
train_exclude = pd.read_csv('./input/train_patients_to_exclude.csv', header=None, names=['patient_id'])
train.drop(train_exclude.patient_id, inplace=True)
original_train_rows = train.shape[0]
print(train.shape)


(1157817, 6)

In [9]:
train[:5]


Out[9]:
patient_age_group patient_state ethinicity household_income education_level is_screener
patient_id
336201912 51-53 SD ALL OTHER UNKNOWN UNKNOWN 1
94237712 39-41 NE ALL OTHER UNKNOWN UNKNOWN 1
186124512 24-26 CA ALL OTHER UNKNOWN UNKNOWN 0
767144212 27-29 NY ALL OTHER UNKNOWN UNKNOWN 1
104743512 30-32 HI ALL OTHER UNKNOWN UNKNOWN 0

In [10]:
test_file = './input/patients_test.csv.gz'
test = pd.read_csv(test_file)
test.set_index('patient_id', inplace=True)
test.drop( 'patient_gender', axis = 1, inplace = True )
test_exclude = pd.read_csv('./input/test_patients_to_exclude.csv', header=None, names=['patient_id'])
test.drop(test_exclude.patient_id, inplace=True)
original_test_rows = test.shape[0]
print(test.shape)


(1701813, 5)

In [11]:
test[:5]


Out[11]:
patient_age_group patient_state ethinicity household_income education_level
patient_id
148341312 66-68 TX ALL OTHER UNKNOWN UNKNOWN
130010912 45-47 IN ALL OTHER UNKNOWN UNKNOWN
103994412 27-29 CA ALL OTHER UNKNOWN UNKNOWN
318658812 27-29 TN ALL OTHER UNKNOWN UNKNOWN
93750312 54-56 AZ ALL OTHER UNKNOWN UNKNOWN

Overall screening percentage

The average screening of patients overall is 55%:


In [12]:
train.is_screener.mean()


Out[12]:
0.5577850385682711

Screening by Age, Ethnicity, Household income, and Education level

At high level, younger patients have a higher likelihood of being screened, and while a higher household income has a higher likelihood of being screened, ethnicity and education level are not distinguishing features at the overall aggregate level.


In [27]:
age = pd.DataFrame([train.groupby('patient_age_group').is_screener.count(),
                    train.groupby('patient_age_group').is_screener.mean()])
age = age.T
age.columns = ['count','screened']
age


Out[27]:
count screened
patient_age_group
24-26 71355 0.718884
27-29 72822 0.704663
30-32 70283 0.678044
33-35 66622 0.656510
36-38 67729 0.641572
39-41 74664 0.624558
42-44 72312 0.602500
45-47 81554 0.581247
48-50 86363 0.558978
51-53 87038 0.536628
54-56 84658 0.509556
57-59 79807 0.489719
60-62 75017 0.460322
63-65 63527 0.407858
66-68 57399 0.344901
69-71 46667 0.285234

In [178]:
# train.groupby('ethinicity').is_screener.count()
train.groupby('ethinicity').is_screener.mean()


Out[178]:
ethinicity
AFRICAN AMERICAN    0.550724
ALL OTHER           0.563541
CAUCASIAN           0.552519
HISPANIC            0.588702
Name: is_screener, dtype: float64

In [176]:
#train.groupby('household_income').is_screener.count()
train.groupby('household_income').is_screener.mean()


Out[176]:
household_income
$100K+      0.600612
<$50-99K    0.563834
<=$49K      0.517505
UNKNOWN     0.565792
Name: is_screener, dtype: float64

In [177]:
#train.groupby('education_level').is_screener.count()
train.groupby('education_level').is_screener.mean()


Out[177]:
education_level
ASSOCIATE DEGREE AND ABOVE    0.592762
HIGH SCHOOL OR LESS           0.542836
SOME COLLEGE                  0.552917
UNKNOWN                       0.559441
Name: is_screener, dtype: float64

Interaction with the Medical System: Have had a Medical Exam

As a measure of the patient's interaction with the medical system, we distinguish whether the patient has ever had medical exam ( diagnosis_code in ('V70', 'V70.0', 'V70.1', 'V70.2', 'V70.3', 'V70.4', 'V70.5', 'V70.6', 'V70.7', 'V70.8', 'V70.9'), and calculate the likelihood of screening for the patient.


In [145]:
medical_exam = pd.read_sql_query("select * from diagnosis where diagnosis_code in \
    ('V70', 'V70.0', 'V70.1', 'V70.2', 'V70.3', 'V70.4', 'V70.5', 'V70.6', 'V70.7', 'V70.8', 'V70.9' ) \
    and patient_id in (select patient_id from patients_train );", db)

In [146]:
medical_exam = pd.read_csv('./features/train_medical_exam.csv')

There are 1157817 patients in the train set, of which, 498157 have had a medical exam during the entire 7 years of the data.


In [147]:
medical_exam.drop_duplicates('patient_id', inplace=True)
medical_exam.shape[0]


Out[147]:
498157

So less than half of the population has had a regular medical exam:


In [148]:
medical_exam.shape[0] / float(train.shape[0])


Out[148]:
0.43025538578203637

But the average is 66% for patients who have had any medical exam:


In [149]:
train.loc[medical_exam.patient_id].is_screener.mean()


Out[149]:
0.668441876757729

And only 47% if they never have had one:


In [150]:
train.loc[~train.index.isin(medical_exam.patient_id)].is_screener.mean()


Out[150]:
0.4742200527544493

Patient was referred to a Gynecological Exam

If the patient had one of these codes, then they may have been referred to a gynecological exam:

 diagnosis_code |  diagnosis_description                                                 
----------------+-----------------------------------
 V72.3          | GYNECOLOGICAL EXAMINATION
 V72.31         | ROUTINE GYNECOLOGICAL EXAMINATION

In [39]:
gyn_exam = pd.read_sql_query("select t1.patient_id, claim_type, diagnosis_date, diagnosis_code from diagnosis t1 \
                             right join patients_train t2 on (t1.patient_id=t2.patient_id ) \
                             where diagnosis_code in ('V72.3', 'V72.31');", db)

In [40]:
gyn_exam[:5]


Out[40]:
patient_id claim_type diagnosis_date diagnosis_code
0 112956268 MX 201310 V72.31
1 183576058 MX 201310 V72.31
2 187476589 MX 201310 V72.31
3 163205483 MX 201310 V72.31
4 170954263 MX 201310 V72.31

In [41]:
gyn_exam.drop_duplicates('patient_id', inplace=True)
gyn_exam.shape[0]


Out[41]:
648737

More than have the patients in this database have had a gynecological exam:


In [43]:
gyn_exam.shape[0] / float(train.shape[0])


Out[43]:
0.5603104808445549

And if they have had an exam then the likelihood that they have been screened is high:


In [44]:
train.loc[gyn_exam.patient_id].is_screener.mean()


Out[44]:
0.8079591575630802

And quite low if they have not had a gynecological exam:


In [50]:
no_gyn_exam = train.loc[~train.index.isin(gyn_exam.patient_id)]
no_gyn_exam.is_screener.mean()


Out[50]:
0.23898012100259292

We can look at whether a patient has had a medical exam and a gynecological exam


In [187]:
medical_and_gyn_exam_df = train.loc[medical_exam.loc[medical_exam.patient_id.isin(gyn_exam.patient_id)].patient_id]
medical_and_gyn_exam = medical_and_gyn_exam_df.is_screener.mean()

In [180]:
medical_and_no_gyn_exam_ids = set(medical_exam.patient_id) - set(gyn_exam.patient_id) 
medical_and_no_gyn_exam = train.loc[medical_and_no_gyn_exam_ids].is_screener.mean()

In [182]:
no_medical_and_gyn_exam_ids = set(gyn_exam.patient_id) - set(medical_exam.patient_id)
no_medical_and_gyn_exam = train.loc[no_medical_and_gyn_exam_ids].is_screener.mean()

In [181]:
no_medical_and_no_gyn_exam_ids = set(train.index) - (set(medical_exam.patient_id) | set(gyn_exam.patient_id))
no_medical_and_no_gyn_exam = train.loc[no_medical_and_no_gyn_exam_ids].is_screener.mean()

In [189]:
mat = [[medical_and_gyn_exam, medical_and_no_gyn_exam], [no_medical_and_gyn_exam, no_medical_and_no_gyn_exam]]

In [141]:
# Pick a colormap
# I'd like to set alpha=0.2 , but haven't figured it out yet
cmap='RdYlGn'

In [194]:
mat_df = pd.DataFrame(mat, index=['Medical Exam', 'No Medical Exam'], columns=['Gyn Exam', 'No Gyn Exam'])
mat_df


Out[194]:
Gyn Exam No Gyn Exam
Medical Exam 0.831919 0.336702
No Medical Exam 0.782578 0.192350

Looking at the each of the demographics separately, at the aggregate level, for the patients that did not have a medical exam or a gynecological exam, we don't see any trends other than younger patients are more likely to be screened:


In [192]:
df = train.loc[no_medical_and_no_gyn_exam_ids].groupby(['patient_age_group',
                                                        'household_income']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0, low=.5, high=1)


Out[192]:
$100K+ <$50-99K <=$49K UNKNOWN
24-26 0.35969 0.423193 0.440111 0.425877
27-29 0.355485 0.367428 0.38366 0.374631
30-32 0.312824 0.32666 0.318838 0.316347
33-35 0.306968 0.268364 0.274752 0.272195
36-38 0.251977 0.23088 0.237742 0.236991
39-41 0.203505 0.209778 0.204159 0.193592
42-44 0.176207 0.187968 0.177838 0.177167
45-47 0.17234 0.169952 0.157674 0.168947
48-50 0.164665 0.156355 0.152788 0.149411
51-53 0.156243 0.152659 0.14912 0.141805
54-56 0.137091 0.142986 0.130019 0.129569
57-59 0.129941 0.135702 0.126996 0.126115
60-62 0.125631 0.125445 0.122352 0.119756
63-65 0.130214 0.120849 0.11828 0.111176
66-68 0.14774 0.137996 0.119116 0.105384
69-71 0.140173 0.12709 0.10566 0.097047

In [196]:
df = train.loc[no_medical_and_no_gyn_exam_ids].groupby(['patient_age_group',
                                                        'ethinicity']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0, low=.5, high=1)


Out[196]:
AFRICAN AMERICAN ALL OTHER CAUCASIAN HISPANIC
24-26 0.448831 0.427834 0.424127 0.417922
27-29 0.393543 0.373111 0.375114 0.366501
30-32 0.332603 0.31666 0.318951 0.302847
33-35 0.275481 0.274533 0.277941 0.268778
36-38 0.232281 0.238983 0.237715 0.239593
39-41 0.195408 0.194208 0.207307 0.212946
42-44 0.166923 0.172685 0.188342 0.168142
45-47 0.151502 0.167876 0.170555 0.150207
48-50 0.147607 0.14553 0.162721 0.140216
51-53 0.149037 0.141427 0.154014 0.141261
54-56 0.127381 0.131833 0.137601 0.131681
57-59 0.123374 0.126517 0.133971 0.107062
60-62 0.120066 0.120992 0.123824 0.130714
63-65 0.112058 0.113335 0.123171 0.102389
66-68 0.113145 0.108734 0.131278 0.113861
69-71 0.106726 0.098173 0.116529 0.111314

In [197]:
df = train.loc[no_medical_and_no_gyn_exam_ids].groupby(['patient_age_group',
                                                        'education_level']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0,  low=.5, high=1)


Out[197]:
ASSOCIATE DEGREE AND ABOVE HIGH SCHOOL OR LESS SOME COLLEGE UNKNOWN
24-26 0.357949 0.449572 0.419912 0.427118
27-29 0.349741 0.384125 0.372601 0.375372
30-32 0.298061 0.325328 0.320943 0.316006
33-35 0.281627 0.273682 0.280928 0.272813
36-38 0.247107 0.235989 0.238261 0.23637
39-41 0.205592 0.20536 0.204496 0.194661
42-44 0.181771 0.18483 0.179961 0.17301
45-47 0.166161 0.169943 0.15937 0.169606
48-50 0.169666 0.152824 0.158492 0.146316
51-53 0.149171 0.152483 0.15318 0.140466
54-56 0.136911 0.13476 0.13785 0.129056
57-59 0.138577 0.133143 0.126523 0.123337
60-62 0.120542 0.118644 0.131788 0.118175
63-65 0.134359 0.117697 0.119039 0.10992
66-68 0.151788 0.11603 0.132014 0.10293
69-71 0.121727 0.111138 0.115672 0.096601

Our models showed the following diagnosis were strong predictors of being screened:

  • 646.83 OTHER SPECIFIED ANTEPARTUM COMPLICATIONS
  • 648.93 OTHER CURRENT CONDITIONS CLASSIFIABLE ELSEWHERE OF MOTHER, ANTEPARTUM
  • 650 NORMAL DELIVERY
  • V22.0 SUPERVISION OF NORMAL FIRST PREGNANCY
  • V22.1 SUPERVISION OF OTHER NORMAL PREGNANCY
  • V22.2 PREGNANT STATE, INCIDENTAL
  • V24.2 ROUTINE POSTPARTUM FOLLOW-UP
  • V25.2 STERILIZATION
  • V27.0 MOTHER WITH SINGLE LIVEBORN
  • V28.3 ENCOUNTER FOR ROUTINE SCREENING FOR MALFORMATION USING ULTRASONICS
  • V74.5 SCREENING EXAMINATION FOR VENEREAL DISEASE

In [94]:
pregnancy = pd.read_sql_query("select t1.patient_id, claim_type, diagnosis_date, diagnosis_code from diagnosis t1 \
                             right join patients_train t2 on (t1.patient_id=t2.patient_id) where diagnosis_code in \
                             ('V22.0','V22.1','V22.2','V24.2','V25.2','V27.0','V28.3','V70.0','V74.5');", db)
pregnancy.shape


Out[94]:
(3905637, 4)

In [95]:
pregnancy.to_csv('train_pregnant.csv', index=False)

In [96]:
pregnancy.drop_duplicates('patient_id', inplace=True)
pregnancy.shape


Out[96]:
(627113, 4)

Patients that had one of the above diagnosis were screened with the following likelihood:


In [97]:
train.loc[pregnancy.patient_id].is_screener.mean()


Out[97]:
0.6839947505473495

And those that were not:


In [98]:
train.loc[~train.index.isin(pregnancy.patient_id)].is_screener.mean()


Out[98]:
0.40864775844915435

Looking at these pregnancy diagnosis by demographics:


In [198]:
df = train.loc[pregnancy.patient_id].groupby(['patient_age_group',
                                              'household_income']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0, low=.5, high=1)


Out[198]:
$100K+ <$50-99K <=$49K UNKNOWN
24-26 0.797293 0.790567 0.769439 0.743877
27-29 0.768755 0.781171 0.77039 0.739053
30-32 0.754581 0.766104 0.75922 0.720977
33-35 0.750239 0.75418 0.749749 0.719814
36-38 0.750427 0.751125 0.731734 0.726982
39-41 0.747668 0.74766 0.7277 0.71825
42-44 0.725005 0.731678 0.71186 0.709507
45-47 0.713178 0.706094 0.688185 0.699114
48-50 0.693725 0.682825 0.672379 0.68079
51-53 0.665283 0.654345 0.646761 0.659756
54-56 0.646083 0.638881 0.602627 0.637787
57-59 0.630167 0.622405 0.598021 0.605849
60-62 0.61221 0.590204 0.557609 0.577913
63-65 0.575651 0.545377 0.505536 0.514887
66-68 0.509791 0.486594 0.442522 0.447966
69-71 0.474187 0.43068 0.392409 0.375991

Versus the patients who has no such diagnosis:


In [199]:
df = train.loc[~train.index.isin(pregnancy.patient_id)].groupby(['patient_age_group',
                                              'household_income']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0,  low=.5, high=1)


Out[199]:
$100K+ <$50-99K <=$49K UNKNOWN
24-26 0.57907 0.533294 0.484793 0.434208
27-29 0.573262 0.537277 0.463075 0.427531
30-32 0.549265 0.514217 0.455711 0.43464
33-35 0.551695 0.506888 0.450314 0.45063
36-38 0.583227 0.513025 0.454395 0.453069
39-41 0.567245 0.524049 0.436784 0.459003
42-44 0.561044 0.509442 0.426593 0.45488
45-47 0.541899 0.491072 0.411236 0.446415
48-50 0.51024 0.472053 0.394319 0.422343
51-53 0.496547 0.455047 0.38144 0.399983
54-56 0.465046 0.425922 0.354866 0.37905
57-59 0.444062 0.409069 0.346257 0.356588
60-62 0.420429 0.386057 0.322683 0.338219
63-65 0.389248 0.353181 0.28851 0.289513
66-68 0.349532 0.306871 0.25485 0.242549
69-71 0.286869 0.262884 0.213248 0.203547

Again, by ethnicity, for patients that had such diagnosis:


In [200]:
df = train.loc[pregnancy.patient_id].groupby(['patient_age_group',
                                              'ethinicity']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0,  low=.5, high=1)


Out[200]:
AFRICAN AMERICAN ALL OTHER CAUCASIAN HISPANIC
24-26 0.777142 0.745553 0.780252 0.7613
27-29 0.778385 0.739889 0.774871 0.762374
30-32 0.760115 0.724043 0.76089 0.752357
33-35 0.750954 0.722142 0.752208 0.743203
36-38 0.725958 0.728929 0.749281 0.734131
39-41 0.714376 0.722218 0.744253 0.746135
42-44 0.687624 0.713423 0.725379 0.746662
45-47 0.661436 0.700341 0.705973 0.739829
48-50 0.654453 0.680173 0.686444 0.708137
51-53 0.611948 0.658698 0.661367 0.68171
54-56 0.5851 0.635671 0.638175 0.644328
57-59 0.566601 0.603451 0.62352 0.639644
60-62 0.545287 0.583766 0.588393 0.618273
63-65 0.482042 0.519207 0.539853 0.584084
66-68 0.435464 0.450258 0.473522 0.479015
69-71 0.394518 0.383451 0.413671 0.452184

And those patients that had no such diagnosis:


In [201]:
df = train.loc[~train.index.isin(pregnancy.patient_id)].groupby(['patient_age_group',
                                              'ethinicity']).is_screener.mean().unstack()
df.style.background_gradient(cmap, axis=0,  low=.5, high=1)


Out[201]:
AFRICAN AMERICAN ALL OTHER CAUCASIAN HISPANIC
24-26 0.466558 0.435958 0.52791 0.475155
27-29 0.450423 0.430262 0.515258 0.46798
30-32 0.464433 0.432626 0.499036 0.463415
33-35 0.439522 0.451082 0.505461 0.435041
36-38 0.459504 0.457486 0.514641 0.477531
39-41 0.429807 0.45304 0.523231 0.487695
42-44 0.411437 0.446224 0.51544 0.488999
45-47 0.41694 0.43738 0.496827 0.453608
48-50 0.405517 0.413524 0.472597 0.437195
51-53 0.389627 0.394016 0.456199 0.423846
54-56 0.353197 0.371726 0.426389 0.407437
57-59 0.337867 0.353929 0.40547 0.376395
60-62 0.319601 0.335297 0.377718 0.373289
63-65 0.283481 0.293562 0.334995 0.32523
66-68 0.260491 0.247044 0.28846 0.273973
69-71 0.215208 0.208198 0.236592 0.240337

Patient has other highly predictive diagnosis

Our modelling revealed the following diagnosis were also highly predictive of screening, and the same analysis as above could be performed, but these diagnosis look like the patient is already down a path that required that they be screened, but we are looking for patients that are not being screened.

  • 462 ACUTE PHARYNGITIS
  • 496 CHRONIC AIRWAY OBSTRUCTION, NOT ELSEWHERE CLASSIFIED
  • 585.3 CHRONIC KIDNEY DISEASE, STAGE III (MODERATE)
  • 616 CERVICITIS AND ENDOCERVICITIS
  • 616.1 VAGINITIS AND VULVOVAGINITIS, UNSPECIFIED
  • 620.2 OTHER AND UNSPECIFIED OVARIAN CYST
  • 622.1 DYSPLASIA OF CERVIX, UNSPECIFIED
  • 622.11 MILD DYSPLASIA OF CERVIX
  • 623.5 LEUKORRHEA, NOT SPECIFIED AS INFECTIVE
  • 625.3 DYSMENORRHEA
  • 625.9 UNSPECIFIED SYMPTOM ASSOCIATED WITH FEMALE GENITAL ORGANS
  • 626 ABSENCE OF MENSTRUATION
  • 626.2 EXCESSIVE OR FREQUENT MENSTRUATION
  • 626.4 IRREGULAR MENSTRUAL CYCLE
  • 626.8 OTHER DISORDERS OF MENSTRUATION AND OTHER ABNORMAL BLEEDING FROM FEMALE GENITAL
  • 795 ABNORMAL GLANDULAR PAPANICOLAOU SMEAR OF CERVIX

Patient had procedures predictive of not being screened

We performed a L1 (Lasso) feature selection using Vowpal Wabbit on procedures and found the following procedures positively correlated with a patient being screened:

procedure_code   procedure_description                            RelScore
57454            COLPOSCOPY CERVIX BX CERVIX & ENDOCRV CURRETAGE   100.00%
1252             GJB2 GENE ANALYSIS FULL GENE SEQUENCE              96.93%
57456            COLPOSCOPY CERVIX ENDOCERVICAL CURETTAGE           95.00%
57455            COLPOSCOPY CERVIX UPPR/ADJCNT VAGINA W/CERVIX BX   91.42%
S4020            IN VITRO FERTILIZATION PROCEDURE CANCELLED BEFOR   85.76%
S0605            DIGITAL RECTAL EXAMINATION, MALE, ANNUAL           83.64%
G0143            SCREENING CYTOPATHOLOGY, CERVICAL OR VAGINAL (AN   78.39%
90696            DTAP-IPV VACCINE CHILD 4-6 YRS FOR IM USE          76.98%
S4023            DONOR EGG CYCLE, INCOMPLETE, CASE RATE             76.67%
69710            IMPLTJ/RPLCMT EMGNT BONE CNDJ DEV TEMPORAL BONE    72.06%

and the following procedures negatively correlated:

procedure_code        procedure_description             RelScore
K0735  SKIN PROTECTION WHEELCHAIR SEAT CUSHION, ADJUSTA  -62.48%
34805  EVASC RPR AAA AORTO-UNIILIAC/AORTO-UNIFEM PROSTH  -64.68%
L5975  ALL LOWER EXTREMITY PROSTHESIS, COMBINATION SING  -65.39%
89321  SEMEN ANALYSIS SPERM PRESENCE&/MOTILITY SPRM      -65.51%
S9145  INSULIN PUMP INITIATION, INSTRUCTION IN INITIAL   -69.96%
00632  ANESTHESIA LUMBAR REGION LUMBAR SYMPATHECTOMY     -77.13%
27756  PRQ SKELETAL FIXATION TIBIAL SHAFT FRACTURE       -78.61%
3303F  AJCC CANCER STAGE IA, DOCUMENTED (ONC), (ML)      -82.77%
23675  CLTX SHOULDER DISLC W/SURG/ANTMCL NECK FX W/MANJ  -83.14%
Q4111  GAMMAGRAFT, PER SQUARE CENTIMETER                 -85.49%

It seems clear that a number of negatively correlated procedures identify patients that have bigger problems than the need for a cervical exam, and it is possible physicians have not recommended a screening for these reasons. Therefore the analysis above should be rerun without these patients skewing the likelihood percentages.

Where are the patients ?

We can review all slices of screener likleihoods by geography.


In [153]:
from mpl_toolkits.basemap import Basemap
from shapely.geometry import Point, Polygon, MultiPoint, MultiPolygon
from matplotlib.collections import PatchCollection
import geopandas as gpd
print('geopandas: '+gpd.__version__)


geopandas: 0.1.0.dev-

In [168]:
# settings  
plt.rcParams['figure.figsize']=(16,14)

In [154]:
# US Bounding box
minx, miny, maxx, maxy = -125,22,-65,50  # USA bounding box
usbbox = Polygon([(minx,miny),(minx,maxy),(maxx, maxy),(maxx,miny)])

Map States

Looking at the likelyhood by state:


In [155]:
# We grabbed the USA shape files from : https://github.com/matplotlib/basemap/tree/master/examples
# or state files from here: https://www.census.gov/geo/maps-data/data/cbf/cbf_state.html
# We took the CBSA files from https://www.census.gov/geo/maps-data/data/cbf/cbf_msa.html

In [157]:
state_pct = pd.read_csv('./features/state_screen_percent.csv')
state_pct.set_index('patient_state', inplace=True)
state_pct[:5]


Out[157]:
state_pct
patient_state
AK 0.496903
AL 0.417596
AR 0.437814
AZ 0.504682
CA 0.526563

In [162]:
states_key = pd.read_csv('./datastudy/state.csv')
states_key.set_index('name_long', inplace=True)
states_key[:5]


Out[162]:
name_short
name_long
Alabama AL
Alaska AK
Arizona AZ
Arkansas AR
California CA

In [163]:
state_file = './datastudy/st99_d00.shp'
#bbox=(-119,22,64,49)
bbox=(-125,22,-65,50)
kw = dict(bbox=bbox)
states_map = gpd.GeoDataFrame.from_file(state_file, **kw)

#state_file = 'cb_2014_us_state_500k.shp'
# states = gpd.GeoDataFrame.from_file(state_file)

if gpd.__version__ != '0.1.0.dev-':  # temp fix 
    states_map.geometry = convert_3D_2D(states.geometry)

In [164]:
states_map = states_map.merge(states_key, left_on='NAME', right_index=True, how='left')
states_map = states_map.merge(state_pct, left_on='name_short', right_index=True, how='left')

In [169]:
states_map = states_map[states_map.geometry.within(usbbox)]
# plt.figure(figsize=(12,10))
ax = states_map[states_map.state_pct.notnull()].plot(column='state_pct', scheme='QUANTILES', k=5, colormap='RdYlGn', 
                                             alpha=0.9, linewidth=0.1, legend=True)
ax.axis('off')
plt.title('Percentage screened by State')
plt.show()


By CBSA

Every patient was mapped to the CBSA where they had the most diagnosis. We can then look at the likelihoods by these areas. Combining this map as screened by the features above would render a better map for targeting patients who are not being screened.


In [171]:
cbsa_pct = pd.read_csv('./features/cbsa_pct.csv.gz')
cbsa_pct.set_index('cbsa', inplace=True)
cbsa_pct[:5]


Out[171]:
pct_screened
cbsa
10100 0.528302
10140 0.312376
10180 0.301802
10220 0.591623
10300 0.681319

In [172]:
cbsa = gpd.GeoDataFrame.from_file('./datastudy/cb_2014_us_cbsa_500k.shp')
cbsa.GEOID = cbsa.GEOID.astype(float)
cbsa.set_index('GEOID', inplace=True)
cbsa.shape


Out[172]:
(929, 8)

In [173]:
cbsa['pct_screened'] = cbsa_pct.pct_screened

In [175]:
cbsa = cbsa[cbsa.geometry.within(usbbox)]
ax = states_map.plot(column=None, color='white', linewidth=0.1)
ax = cbsa[cbsa.pct_screened.notnull()].plot(column='pct_screened', scheme='QUANTILES', k=5, colormap='RdYlGn', 
               alpha=0.9, ax=ax, linewidth=0.1, legend=True)
ax.axis('off')
plt.title('Percentage screened by CBSA')
plt.show()


Further Work

  • There are many physicians performing a routine gynecological exam that have a specialty_code other than OBG or GYN. Looking into these differences by geography might be interesting.
  • We have started to look at the percentage of screening by the primary_practitioner referring or performing the GYN exam and there are some stark differences. Understanding these differences will help target the physicians rather than the patients.