In October 2012, the US government's Centers for Medicare & Medicaid Services (CMS) began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio: a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia divided by the number that would be “expected,” based on an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.
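As a small illustration of how the ratio is read, the sketch below divides a predicted readmission count by an expected one. The function name and both counts are hypothetical, chosen only to show the arithmetic; they are not CMS figures.

```python
def excess_readmission_ratio(predicted, expected):
    """Ratio of predicted to expected 30-day readmissions (hypothetical helper)."""
    if expected <= 0:
        raise ValueError("expected readmissions must be positive")
    return predicted / expected

# Made-up counts, for illustration only
ratio = excess_readmission_ratio(predicted=23.0, expected=20.0)
has_excess = ratio > 1  # a ratio above 1 indicates excess readmissions
print(round(ratio, 2), has_excess)
```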
In this exercise, you will:
More instructions are provided below. Include your work in this notebook and submit it to your GitHub account.
In [34]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import bokeh.plotting as bkp
import seaborn as sns
from mpl_toolkits.axes_grid1 import make_axes_locatable
In [35]:
# read in readmissions data provided
hospital_read_df = pd.read_csv('data/cms_hospital_readmissions.csv')
In [36]:
# deal with missing and inconvenient portions of data
# copy the filtered frame so the dtype conversion below does not act on a view of the original
clean_hospital_read_df = hospital_read_df[hospital_read_df['Number of Discharges'] != 'Not Available'].copy()
clean_hospital_read_df['Number of Discharges'] = clean_hospital_read_df['Number of Discharges'].astype(int)
clean_hospital_read_df = clean_hospital_read_df.sort_values('Number of Discharges')
clean_hospital_read_df.head()
Out[36]:
In [37]:
# generate a scatterplot for number of discharges vs. excess rate of readmissions
# lists work better with matplotlib scatterplot function
x = list(clean_hospital_read_df['Number of Discharges'][81:-3])
y = list(clean_hospital_read_df['Excess Readmission Ratio'][81:-3])
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(x, y, alpha=0.2)
ax.fill_between([0, 350], 1.15, 2, facecolor='red', alpha=.15, interpolate=True)
ax.fill_between([800, 2500], .5, .95, facecolor='green', alpha=.15, interpolate=True)
ax.set_xlim([0, max(x)])
ax.set_xlabel('Number of discharges', fontsize=12)
ax.set_ylabel('Excess rate of readmissions', fontsize=12)
ax.set_title('Scatterplot of number of discharges vs. excess rate of readmissions', fontsize=14)
ax.grid(True)
fig.tight_layout()
Read the following results/report. While you are reading it, consider whether the conclusions are correct, incorrect, misleading, or unfounded. Think about what you would change or what additional analyses you would perform.
A. Initial observations based on the plot above
B. Statistics
C. Conclusions
D. Regulatory policy recommendations
Include your work on the following in this notebook and submit it to your GitHub account.
A. Do you agree with the above analysis and recommendations? Why or why not?
B. Provide support for your arguments and your own recommendations with a statistically sound analysis:
You can compose in notebook cells using Markdown:
I don't agree with the analysis, because it doesn't test whether the difference in means is statistically significant. Without such a test, it is hard to say whether the recommendations are justified.
The null hypothesis states that there is no difference in mean excess readmission ratio between hospitals with number of discharges < 100 and hospitals with number of discharges > 1000. The alternative hypothesis states that there is a difference.
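Written out, the hypotheses and a two-sample z statistic with unpooled variances can be stated as follows. The symbols are introduced here for convenience: $\mu$ is a group's true mean excess readmission ratio, and $\bar{x}$, $s^2$, and $n$ are that group's sample mean, sample variance, and size.

$$
H_0:\ \mu_{\text{low}} = \mu_{\text{high}}
\qquad
H_1:\ \mu_{\text{low}} \neq \mu_{\text{high}}
$$

$$
z = \frac{\bar{x}_{\text{low}} - \bar{x}_{\text{high}}}{\sqrt{\dfrac{s_{\text{low}}^2}{n_{\text{low}}} + \dfrac{s_{\text{high}}^2}{n_{\text{high}}}}}
$$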
In [38]:
df = clean_hospital_read_df
low_discharges = df[df['Number of Discharges'] < 100]['Excess Readmission Ratio'].dropna()
high_discharges = df[df['Number of Discharges'] > 1000]['Excess Readmission Ratio'].dropna()
(len(low_discharges.index), len(high_discharges.index))
Out[38]:
In [39]:
low_mean = low_discharges.mean()
high_mean = high_discharges.mean()
d = np.sqrt(low_discharges.var() / len(low_discharges.index) + high_discharges.var() / len(high_discharges.index))
z = (low_mean - high_mean) / d
# d is the standard error of the difference in means computed above
(low_mean, high_mean, d, low_mean - high_mean, z)
Out[39]:
The z-score is 7.6, which gives a p-value far below 0.001%. (With a two-tailed distribution, there is a less than 0.001% chance of sampling a difference this large or larger if the null hypothesis is true.)
To report statistical significance at α = .01, I looked up the critical z-value for a two-tailed test, that is, the 0.995 quantile of the standard normal distribution, which is 2.58. The z-score computed in the analysis is 7.6, which is much greater than the required 2.58, so the difference between the means is statistically significant at α = .01.
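The lookup described above can also be reproduced in code. The following is a minimal sketch using only the Python standard library; it takes the z-score of 7.6 as given rather than recomputing it from the data, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

z = 7.6        # z-score reported in the analysis above
alpha = 0.01   # significance level

# Two-tailed critical value: the (1 - alpha/2) quantile of the standard normal
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Two-tailed p-value: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"critical value = {z_crit:.2f}")
print(f"p-value = {p_value:.1e}")
print("reject H0" if abs(z) > z_crit else "fail to reject H0")
```

Using `math.erfc` for the tail probability avoids the precision loss of computing `1 - cdf(z)` when the p-value is this small.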
A general problem with traditional significance testing is that, with large enough samples, almost any difference or correlation will be statistically significant.
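The effect of sample size can be seen in a short sketch. It assumes two groups of equal size and equal standard deviation, and the difference in means and the standard deviation are made-up values, not estimates from the CMS data.

```python
import math

def two_sample_z(mean_diff, sd, n):
    """z statistic for two groups of equal size n and equal standard deviation sd."""
    return mean_diff / math.sqrt(2 * sd**2 / n)

# Hypothetical values: a tiny difference in means relative to the spread
mean_diff, sd = 0.01, 0.1

for n in (100, 1_000, 10_000, 100_000):
    z = two_sample_z(mean_diff, sd, n)
    print(f"n = {n:>7}: z = {z:.2f}, significant at alpha = .01: {abs(z) > 2.58}")
```

The difference in means never changes, yet the z statistic grows with the square root of the sample size and eventually crosses any fixed critical value.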
Practical significance, on the other hand, looks at whether the difference is large enough to be of value in a practical sense.
In this analysis, the difference between the means of the two samples seems very small in real-life terms, so it has little practical value.
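One common way to put a number on practical significance is a standardized effect size such as Cohen's d. This is an addition to the analysis above, not something it computed. The sketch below uses only the standard library, and the two example lists are made-up ratios, not values from the CMS data.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Made-up ratios, for illustration only (not taken from the CMS data)
low_example = [1.02, 1.05, 0.98, 1.01, 1.04]
high_example = [0.97, 1.00, 0.95, 0.99, 0.96]
print(round(cohens_d(low_example, high_example), 2))
```

In this notebook it could be applied as `cohens_d(list(low_discharges), list(high_discharges))`. By Cohen's rule of thumb, values near 0.2 are usually read as small, 0.5 as medium, and 0.8 as large, though what counts as meaningful depends on the policy context.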
The advantage of the scatterplot above is that it's easy to see how the excess rate of readmissions relates to the number of discharges. The disadvantage is that it's hard to compare the two groups we are mainly interested in, because the data are plotted across the whole range of discharges.
In [40]:
pd.DataFrame({'low': low_discharges, 'high': high_discharges}).plot.hist(alpha=0.5, bins=20)
Out[40]: