Unsupervised Bayesian Anomaly Detection Using the Beta-Binomial Distribution

This notebook goes through an example of how one could use a Bayesian framework to do unsupervised anomaly detection. In this particular example, we consider medical coding behavior from doctor office visits for evaluation and management (E/M) services.

Note that all data and results in this notebook are synthetic! To see the data generation process, please see /data/generate_data.ipynb.

What is an E/M code?

  1. E/M coding is the process by which physician-patient encounters are translated into five-digit CPT codes to facilitate billing.
  2. There are many attributes of the encounter that factor into the coding level:
    1. History
    2. Examination
    3. Medical decision making
    4. Counseling
    5. Coordination of care
    6. Nature of presenting problem
    7. Time (spent with patient)

The first three (history, examination, and medical decision making) are considered the key components that determine the level of an E/M code.

Setup


In [2]:
import os
import sys

import numpy as np
import pandas as pd
import sqlalchemy as sa
from random import randint
import statsmodels.formula.api as sm

%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.lines as mlines
matplotlib.rcParams['figure.figsize'] = (10, 6)

sys.path.append('../../../')
import analysis as an
import em_codes as em

pd.options.mode.chained_assignment = None  # default='warn'

In [3]:
# Create DB engine (SQLAlchemy 1.4+ requires the 'postgresql://' scheme, not 'postgres://')
db_url = 'postgresql://localhost:5432/VincentLa'
db_engine = sa.create_engine(db_url)
conn = db_engine.connect()
conn.execute('CREATE SCHEMA IF NOT EXISTS tutorial_data_ingest;')
metadata = sa.MetaData(bind=conn)

In [4]:
# Define E/M code groups: synthetic stand-ins for the office/established-patient
# levels 1-5 (analogous to the real CPT codes 99211-99215)
OFFICE_EST_PATIENT = [1, 2, 3, 4, 5]

Read in the data! (The data generation process is described in /data/generate_data.ipynb.)


In [5]:
claim_lines = sa.Table(
    'claim_lines',
    metadata,
    schema='tutorial_data_ingest',
    autoload=True
)

Read in the data frame so we can explore!


In [6]:
claims = pd.read_sql(em.get_claims_base_data(claim_lines, OFFICE_EST_PATIENT), conn)

Run the analysis; we will look at its output later in the notebook!


In [7]:
result_output, distribution_values_output = an.run(conn, claim_lines, OFFICE_EST_PATIENT)

Exploration

Let's explore the data intuitively!


In [8]:
claims.head()


Out[8]:
personid servicing_provider_npi procedure_code procedure_name procedure_name_array
0 11324 Dr. Darlene Allen 5 Evaluation and Management, Highest Intensity [Evaluation and Management, Highest Intensity]
1 1633 Dr. Darlene Allen 5 Evaluation and Management, Highest Intensity [Evaluation and Management, Highest Intensity]
2 18100 Dr. Darlene Allen 4 Evaluation and Management, High Intensity [Evaluation and Management, High Intensity]
3 19251 Dr. Darlene Allen 5 Evaluation and Management, Highest Intensity [Evaluation and Management, Highest Intensity]
4 16174 Dr. Darlene Allen 5 Evaluation and Management, Highest Intensity [Evaluation and Management, Highest Intensity]

How many patient-doctor interactions are there?


In [9]:
claims.shape


Out[9]:
(102246, 5)

What are the different types of procedures in this data set?


In [10]:
claims[['procedure_code', 'procedure_name']].drop_duplicates()


Out[10]:
procedure_code procedure_name
0 5 Evaluation and Management, Highest Intensity
2 4 Evaluation and Management, High Intensity
9 3 Evaluation and Management, Medium Intensity
15 1 Evaluation and Management, Lowest Intensity
25 2 Evaluation and Management, Second Lowest Inten...

How many patients do doctors see?


In [11]:
number_of_encounters = claims.groupby('servicing_provider_npi').size()
plt.hist(number_of_encounters)
plt.xlabel('Number of Encounters per Provider', fontsize=16)
plt.ylabel('Number of Providers', fontsize=16)
plt.title('Distribution of Number of Encounters', fontsize=20)
plt.show()


Let's now look at the overall distribution of intensities


In [12]:
codes = claims.groupby('procedure_code').size().index
number_of_encounters = claims.groupby('procedure_code').size()
plt.bar(codes, number_of_encounters, align='center')
plt.xticks(codes, codes)
plt.ylabel('Number of Claims', fontsize=16)
plt.title('Overall Distribution of Procedure Codes', fontsize=20)
plt.show()


Let's look at the distribution of proportion of level 5 procedure codes billed by a provider.


In [13]:
plt.hist(result_output['All'].pct_high_encounters)
plt.xlabel('Proportion of Level 5 Codes', fontsize=16)
plt.ylabel('Number of Doctors', fontsize=16)
plt.title('Proportion of Codes that are Level 5', fontsize=20)
plt.show()


Results

Let's actually look at our results now! Things we will cover in this section:

  1. What is the output of the Bayesian implementation?
  2. What is the expected value of the proportion of level 5 codes billed by each doctor?
  3. What is the probability that a given doctor is "above average"?

In [14]:
results = result_output['All']

Notice the relationship between regressed_prob and naive_perc, and how it depends on the number of observations: the fewer observations a provider has, the more strongly their estimate is shrunk toward the prior mean. Compare Dr. Chad Trott (48 observations, naive 0.708 regressed to 0.672) with Dr. Rachael Gilliam (141 observations, naive 0.702 regressed only to 0.689).
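
To make the shrinkage concrete, here is a minimal sketch of the Beta-Binomial update. The prior parameters below are inferred from the a and b columns in the tables that follow (a = prior alpha + successes, b = prior beta + failures); they illustrate the mechanics and are not taken from the analysis code directly.

# Prior inferred from the a/b columns below: a = alpha0 + successes, b = beta0 + failures
alpha0, beta0 = 0.623519, 2.867835

def regressed_prob(successes, observations):
    """Posterior mean of the Beta-Binomial update: Beta(alpha0 + k, beta0 + n - k)."""
    a = alpha0 + successes
    b = beta0 + (observations - successes)
    return a / (a + b)  # shrunk toward the prior mean alpha0 / (alpha0 + beta0) ~ 0.18

# Dr. Hattie Turner: 33 of 44 encounters coded level 5
print(regressed_prob(33, 44))    # ~0.708, vs. the naive 33/44 = 0.750
# A similar naive rate backed by more data is shrunk less:
print(regressed_prob(99, 141))   # Dr. Rachael Gilliam: ~0.689, vs. naive 0.702

The prob_above_average column is then the posterior probability that a provider's rate exceeds the average provider's rate; the reference rate it uses is computed inside an.run and is not reproduced here.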


In [28]:
results.loc[(results.pct_high_encounters >= 0.7)]


Out[28]:
observations successes pct_high_encounters naive_perc group a b regressed_prob prob_above_average
servicing_provider_npi
Dr. Hattie Turner 44.0 33.0 0.750000 0.750000 All 33.623519 13.867835 0.707992 0.98382
Dr. Chad Trott 48.0 34.0 0.708333 0.708333 All 34.623519 16.867835 0.672414 0.97753
Dr. Catherine Kiely 72.0 52.0 0.722222 0.722222 All 52.623519 22.867835 0.697080 0.98221
Dr. Howard Thomas 51.0 36.0 0.705882 0.705882 All 36.623519 17.867835 0.672098 0.97673
Dr. Cynthia Simmons 130.0 91.0 0.700000 0.700000 All 91.623519 41.867835 0.686363 0.98083
Dr. Debrah Mccray 90.0 63.0 0.700000 0.700000 All 63.623519 29.867835 0.680528 0.98008
Dr. Charles Neufeld 90.0 63.0 0.700000 0.700000 All 63.623519 29.867835 0.680528 0.97973
Dr. Rachael Gilliam 141.0 99.0 0.702128 0.702128 All 99.623519 44.867835 0.689477 0.98188

In [29]:
results.sort_values(['prob_above_average'], ascending=False).head(20)


Out[29]:
observations successes pct_high_encounters naive_perc group a b regressed_prob prob_above_average
servicing_provider_npi
Dr. Hattie Turner 44.0 33.0 0.750000 0.750000 All 33.623519 13.867835 0.707992 0.98382
Dr. Catherine Kiely 72.0 52.0 0.722222 0.722222 All 52.623519 22.867835 0.697080 0.98221
Dr. Rachael Gilliam 141.0 99.0 0.702128 0.702128 All 99.623519 44.867835 0.689477 0.98188
Dr. Cynthia Simmons 130.0 91.0 0.700000 0.700000 All 91.623519 41.867835 0.686363 0.98083
Dr. Erna Tripp 124.0 86.0 0.693548 0.693548 All 86.623519 40.867835 0.679446 0.98009
Dr. Debrah Mccray 90.0 63.0 0.700000 0.700000 All 63.623519 29.867835 0.680528 0.98008
Dr. Charles Neufeld 90.0 63.0 0.700000 0.700000 All 63.623519 29.867835 0.680528 0.97973
Dr. Beverly Smith 106.0 73.0 0.688679 0.688679 All 73.623519 35.867835 0.672414 0.97958
Dr. Charles Haynes 86.0 60.0 0.697674 0.697674 All 60.623519 28.867835 0.677423 0.97901
Dr. Joseph Benigno 107.0 73.0 0.682243 0.682243 All 73.623519 36.867835 0.666328 0.97796
Dr. Priscilla Edington 69.0 48.0 0.695652 0.695652 All 48.623519 23.867835 0.670749 0.97776
Dr. Maurice Sims 111.0 76.0 0.684685 0.684685 All 76.623519 37.867835 0.669252 0.97761
Dr. Chad Trott 48.0 34.0 0.708333 0.708333 All 34.623519 16.867835 0.672414 0.97753
Dr. Debra Hall 74.0 51.0 0.689189 0.689189 All 51.623519 25.867835 0.666184 0.97700
Dr. Michelle Mathis 162.0 109.0 0.672840 0.672840 All 109.623519 55.867835 0.662412 0.97682
Dr. Raymond Farah 68.0 47.0 0.691176 0.691176 All 47.623519 23.867835 0.666144 0.97678
Dr. Howard Thomas 51.0 36.0 0.705882 0.705882 All 36.623519 17.867835 0.672098 0.97673
Dr. Glenda Lewis 129.0 87.0 0.674419 0.674419 All 87.623519 44.867835 0.661353 0.97660
Dr. Angelo Redman 124.0 84.0 0.677419 0.677419 All 84.623519 42.867835 0.663759 0.97658
Dr. Jesse Blomdahl 102.0 69.0 0.676471 0.676471 All 69.623519 35.867835 0.659993 0.97625

Visualization

Let's make some visualizations!

In this section, we will actually visualize the distribution of the posterior over p that we are estimating. That is, for each provider we will visualize our updated opinion of the proportion of level 5 codes they will bill.
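
The distribution_values returned by an.run come precomputed as (x, density) pairs. Under the Beta-Binomial model they can also be reconstructed directly from each provider's a and b columns; a minimal sketch using scipy.stats, with the prior parameters inferred earlier from the results tables:

import numpy as np
from scipy import stats

alpha0, beta0 = 0.623519, 2.867835  # prior inferred from the results tables above

x = np.linspace(0, 1, 500)
row = results.loc['Dr. Hattie Turner']
plt.plot(x, stats.beta.pdf(x, alpha0, beta0), color='blue', label='Prior')
plt.plot(x, stats.beta.pdf(x, row.a, row.b), color='green', label='Posterior')
plt.xlabel('Pr(Code is a Level 5)', fontsize=15)
plt.legend(fontsize=15)
plt.show()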


In [20]:
fig = plt.figure(figsize=(15, 35))
gs1 = gridspec.GridSpec(4, 2)
gs1.update(hspace=.4)
ax_list = [fig.add_subplot(ss) for ss in gs1]

distribution_values = distribution_values_output['All']

# Plot the prior and posterior for the 8 providers most likely to be above average
npis = result_output['All'].sort_values(['prob_above_average'], ascending=False).head(8).index
for i, npi in enumerate(npis):
    ax = ax_list[i]
    # Explicit colors so the curves match the legend proxies on any matplotlib version
    ax.plot(distribution_values['prior'][0], distribution_values['prior'][1], color='blue')
    ax.plot(distribution_values[npi][0], distribution_values[npi][1], color='green')
    ax.set_title(npi, fontsize=20)
    ax.set_xlabel('Pr(Code is a Level 5)', fontsize=15)
    posterior = mlines.Line2D([], [], color='green', label='Posterior: Updated Opinion of Level 5s', linewidth=4)
    prior = mlines.Line2D([], [], color='blue', label='Prior: Overall Distribution of Level 5s', linewidth=4)
    ax.legend(handles=[posterior, prior], fontsize=15)


Validating Results

How do we validate the results of our analysis? For example, how do we know whether the rankings reflect real signal rather than just spurious noise?

In this section, we discuss one intuitive approach to validating our results. Since this is unsupervised learning, a traditional cross-validation approach is not possible; however, one can do something similar in spirit. Below, we split our data set in half by separating all even personids from odd personids. We then run the analysis over each half separately and rank the providers in each. Finally, we correlate the rankings obtained on the two halves. If the rankings are highly correlated, that is decent evidence of real signal; if they are not, one might conclude that the results are due to statistical noise.


In [21]:
columns = [
    claim_lines.c.personid,
    claim_lines.c.servicing_provider_npi,
    claim_lines.c.procedure_code,
    claim_lines.c.procedure_name
]
claim_lines_1 = (
    sa.select(columns).
    select_from(claim_lines).
    where(claim_lines.c.personid % 2 == 1)
).correlate(None).alias('claim_lines_1')

claim_lines_2 = (
    sa.select(columns).
    select_from(claim_lines).
    where(claim_lines.c.personid % 2 == 0)
).correlate(None).alias('claim_lines_2')

In [22]:
result_output_1, distribution_values_output_1 = an.run(conn, claim_lines_1, OFFICE_EST_PATIENT)
result_output_2, distribution_values_output_2 = an.run(conn, claim_lines_2, OFFICE_EST_PATIENT)

In [23]:
output1 = result_output_1['All']
output2 = result_output_2['All']

In [24]:
combined = output1.merge(output2, left_index=True, right_index=True)
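
The scatter plot below shows the agreement visually. To put a single number on it, one could also compute the rank correlation described above; a minimal sketch (this step is not part of the original analysis):

from scipy import stats

rho, pval = stats.spearmanr(combined.prob_above_average_x, combined.prob_above_average_y)
print('Spearman rank correlation: {:.3f} (p = {:.3g})'.format(rho, pval))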

In [35]:
plt.scatter(combined.prob_above_average_x, combined.prob_above_average_y)
plt.xlabel('Odd Personids', fontsize=16)
plt.ylabel('Even Personids', fontsize=16)
plt.title('Correlation across two randomly divided subsets', fontsize=20)
plt.show()


We plot the probability above average for each doctor across the two randomly divided subsets (odd vs. even personids). There are two main clusters, which is largely a result of the data generation process: it produces only two types of providers, upcoders and non-upcoders, which makes the pattern easy to see. The cluster in the bottom left contains the non-upcoders. Notice how, for these providers, the probability above average in one subset is essentially uncorrelated with the score in the other subset.

However, for the points in the top-right cluster, which represent the upcoders, the relationship is much tighter and less spread out. This is evidence that these providers really are associated with more level 5 codes, rather than the pattern being generated by random noise.
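
To check that the agreement really is driven by the upcoder cluster, one can correlate within each cluster separately. A minimal sketch; the 0.5 cutoff is a hypothetical threshold read off the plot, not anything the analysis code defines:

# Hypothetical cutoff separating the two clusters, chosen by eye from the scatter plot
is_upcoder = (combined.prob_above_average_x > 0.5) & (combined.prob_above_average_y > 0.5)
for label, grp in combined.groupby(is_upcoder):
    r = grp.prob_above_average_x.corr(grp.prob_above_average_y)
    print('upcoders={}: n={}, Pearson r={:.3f}'.format(label, len(grp), r))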

