In October 2012, the US government's Centers for Medicare & Medicaid Services (CMS) began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio: a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia divided by the number that would be “expected,” based on an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.
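As a minimal sketch of the ratio described above, the following uses purely hypothetical counts (they are not taken from the CMS data), and the helper name `excess_readmission_ratio` is an assumption made for illustration:

```python
def excess_readmission_ratio(predicted, expected):
    """Return predicted / expected 30-day readmissions."""
    if expected <= 0:
        raise ValueError("expected readmissions must be positive")
    return predicted / expected

# Hypothetical hospital: 23 predicted vs. 20 expected readmissions.
ratio = excess_readmission_ratio(23.0, 20.0)
has_excess = ratio > 1  # a ratio above 1 indicates excess readmissions
print(f"ratio = {ratio:.2f}, excess readmissions: {has_excess}")
```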
In this exercise, you will:
More instructions are provided below. Include your work in this notebook and submit it to your GitHub account.
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh.plotting as bkp
from mpl_toolkits.axes_grid1 import make_axes_locatable
In [2]:
# read in readmissions data provided
hospital_read_df = pd.read_csv('data/cms_hospital_readmissions.csv')
In [3]:
# deal with missing and inconvenient portions of data
# take an explicit copy so the dtype conversion below does not act on a view of the original frame
clean_hospital_read_df = hospital_read_df[hospital_read_df['Number of Discharges'] != 'Not Available'].copy()
clean_hospital_read_df['Number of Discharges'] = clean_hospital_read_df['Number of Discharges'].astype(int)
clean_hospital_read_df = clean_hospital_read_df.sort_values('Number of Discharges')
In [4]:
# generate a scatterplot for number of discharges vs. excess rate of readmissions
# lists work better with matplotlib scatterplot function
x = [a for a in clean_hospital_read_df['Number of Discharges'][81:-3]]
y = list(clean_hospital_read_df['Excess Readmission Ratio'][81:-3])
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(x, y,alpha=0.2)
ax.fill_between([0,350], 1.15, 2, facecolor='red', alpha = .15, interpolate=True)
ax.fill_between([800,2500], .5, .95, facecolor='green', alpha = .15, interpolate=True)
ax.set_xlim([0, max(x)])
ax.set_xlabel('Number of discharges', fontsize=12)
ax.set_ylabel('Excess rate of readmissions', fontsize=12)
ax.set_title('Scatterplot of number of discharges vs. excess rate of readmissions', fontsize=14)
ax.grid(True)
fig.tight_layout()
Read the following results/report. While you are reading it, think about if the conclusions are correct, incorrect, misleading or unfounded. Think about what you would change or what additional analyses you would perform.
A. Initial observations based on the plot above
B. Statistics
C. Conclusions
D. Regulatory policy recommendations
Include your work on the following in this notebook and submit it to your GitHub account.
A. Do you agree with the above analysis and recommendations? Why or why not?
B. Provide support for your arguments and your own recommendations with a statistically sound analysis:
You can compose in notebook cells using Markdown:
In [5]:
# Your turn
In [6]:
clean_hospital_read_df.tail()
Out[6]:
In [7]:
tot_hosp_valid = clean_hospital_read_df.loc[clean_hospital_read_df['Excess Readmission Ratio'] > 0]['Hospital Name'].count()
excess_read_ratio_max = clean_hospital_read_df['Excess Readmission Ratio'].max()
excess_read_ratio_min = clean_hospital_read_df['Excess Readmission Ratio'].min()
excess_read_ratio_mean = clean_hospital_read_df['Excess Readmission Ratio'].mean()
ranges = [0, 1, 2]
h1, h2 = clean_hospital_read_df['Excess Readmission Ratio'].groupby(pd.cut(clean_hospital_read_df['Excess Readmission Ratio'], ranges)).count()
print("+---------------------------------------------------|---------------+")
print("| Total hospitals with excess readmission ratio > 0 | %s |" %(format(tot_hosp_valid, ',')))
print("|---------------------------------------------------|---------------|")
print("| Excess readmission ratio: max | %.2f |" % excess_read_ratio_max)
print("| min | %.2f |" % excess_read_ratio_min)
print("| mean | %.2f |" % excess_read_ratio_mean)
print("|---------------------------------------------------|---------------|")
print("| Hospitals with excess readmission ratio <= 1 | %s (%.2f%%)|" %(format(h1, ','), (100*(h1/(h1+h2)))))
print("| Hospitals with excess readmission ratio > 1 & <=2 | %s (%.2f%%)|" %(format(h2, ','), (100*(h2/(h1+h2)))))
print("+---------------------------------------------------|---------------+")
In [8]:
tot_h100 = clean_hospital_read_df['Excess Readmission Ratio'].loc[clean_hospital_read_df['Number of Discharges'] < 100].count()
med_h100 = clean_hospital_read_df['Excess Readmission Ratio'].loc[clean_hospital_read_df['Number of Discharges'] < 100].mean()
tot_h100_exc_gt_one = clean_hospital_read_df['Hospital Name'].loc[(clean_hospital_read_df['Number of Discharges'] < 100) & (clean_hospital_read_df['Excess Readmission Ratio'] > 1)].count()
print("+---------------------------------------------------+--------------+")
print("| Hospitals with discharges < 100 Total | %s |" %(format(tot_h100,',')))
print("| Mean | %.3f |" %med_h100)
print("| Have excess readmission rate > 1 | %.2f%% |" %(100*(tot_h100_exc_gt_one/tot_h100)))
print("+---------------------------------------------------+--------------+")
In [9]:
tot_h1000 = clean_hospital_read_df['Excess Readmission Ratio'].loc[clean_hospital_read_df['Number of Discharges'] > 1000].count()
med_h1000 = clean_hospital_read_df['Excess Readmission Ratio'].loc[clean_hospital_read_df['Number of Discharges'] > 1000].mean()
tot_h1000_exc_gt_one = clean_hospital_read_df['Hospital Name'].loc[(clean_hospital_read_df['Number of Discharges'] > 1000) & (clean_hospital_read_df['Excess Readmission Ratio'] > 1)].count()
print("+---------------------------------------------------+--------------+")
print("| Hospitals with discharges > 1000 Total | %d |" %tot_h1000)
print("| Mean | %.3f |" %med_h1000)
print("| Have excess readmission rate > 1 | %.2f%% |" %(100*(tot_h1000_exc_gt_one/tot_h1000)))
print("+---------------------------------------------------+--------------+")
In [10]:
from scipy.stats import pearsonr
In [11]:
df_temp = clean_hospital_read_df[['Number of Discharges', 'Excess Readmission Ratio']].dropna()
In [12]:
df_temp.head()
Out[12]:
"The Pearson correlation coefficient measures the linear relationship between two datasets. ... Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.".
Source: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html
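To make the quoted definition concrete, here is a small sketch that computes the coefficient directly from its formula and compares it with `scipy.stats.pearsonr`. The helper `pearson_r` and the demo values are assumptions made for illustration; they are not part of the readmissions data.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_r(x, y):
    """Pearson r: sum of products of deviations over the product of their norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm = x - x.mean()
    ym = y - y.mean()
    return float(np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2)))

# Hypothetical, nearly linear demo data.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]

r_manual = pearson_r(x_demo, y_demo)
r_scipy, p_scipy = pearsonr(x_demo, y_demo)
print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}")
```

Both values should agree up to floating-point error, and an exactly decreasing linear input such as `pearson_r([1, 2, 3], [3, 2, 1])` gives -1.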
In [36]:
# pass 1-D Series so that pearsonr returns scalar values
pearson, pvalue = pearsonr(df_temp['Number of Discharges'], df_temp['Excess Readmission Ratio'])
pvalue1 = pvalue
pearson1 = pearson
print("+----------------------------------------------------------+")
print("| 'Number of Discharges' versus 'Excess Readmission Ratio' |")
print("| for all hospitals: |")
print("|----------------------------------------------------------|")
print("| Pearson Correlation = %.4f |" % pearson)
print("| p-value = %.30f |" % pvalue)
print("+----------------------------------------------------------+")
In [14]:
sns.set(color_codes=True)
sns.regplot(x="Number of Discharges", y="Excess Readmission Ratio", data=df_temp)
Out[14]:
In [33]:
df_temp_a = df_temp.loc[(df_temp['Number of Discharges'] < 2000) & (df_temp['Excess Readmission Ratio'] < 1.6)]
# pass 1-D Series so that pearsonr returns scalar values
pearson, pvalue = pearsonr(df_temp_a['Number of Discharges'], df_temp_a['Excess Readmission Ratio'])
print("+----------------------------------------------------------+")
print("| 'Number of Discharges' versus 'Excess Readmission Ratio' |")
print("| for all hospitals with: |")
print("| a) Excess Readmission Ratio < 1.6 |")
print("| b) Number of Discharges < 2,000 |")
print("|----------------------------------------------------------|")
print("| Pearson Correlation = %.4f |" % pearson)
print("| p-value = %.30f |" % pvalue)
print("+----------------------------------------------------------+")
In [16]:
sns.set(color_codes=True)
sns.regplot(x="Number of Discharges", y="Excess Readmission Ratio", data=df_temp_a)
Out[16]:
In [37]:
df_temp_dischg100 = df_temp.loc[df_temp['Number of Discharges'] < 100]
# pass 1-D Series so that pearsonr returns scalar values
pearson, pvalue = pearsonr(df_temp_dischg100['Number of Discharges'], df_temp_dischg100['Excess Readmission Ratio'])
pvalue2 = pvalue
pearson2 = pearson
print("+----------------------------------------------------------+")
print("| 'Number of Discharges' versus 'Excess Readmission Ratio' |")
print("| for hospitals with discharges < 100: |")
print("|----------------------------------------------------------|")
print("| Pearson Correlation = %.4f |" % pearson)
print("| p-value = %.20f |" % pvalue)
print("+----------------------------------------------------------+")
In [18]:
df_dischg_lt100 = clean_hospital_read_df[['Number of Discharges', 'Excess Readmission Ratio']].loc[(clean_hospital_read_df['Number of Discharges'] < 100) & (clean_hospital_read_df['Excess Readmission Ratio'] > 0)]
sns.regplot(x="Number of Discharges", y="Excess Readmission Ratio", data=df_dischg_lt100)
Out[18]:
In [34]:
df_temp_b = df_temp_dischg100.loc[(df_temp_dischg100['Number of Discharges'] > 40) & (df_temp_dischg100['Excess Readmission Ratio'] < 1.2)]
# pass 1-D Series so that pearsonr returns scalar values
pearson, pvalue = pearsonr(df_temp_b['Number of Discharges'], df_temp_b['Excess Readmission Ratio'])
print("+----------------------------------------------------------+")
print("| 'Number of Discharges' versus 'Excess Readmission Ratio' |")
print("| for all hospitals with: |")
print("| a) Excess Readmission Ratio < 1.2 |")
print("| b) Number of Discharges > 40 and < 100 |")
print("|----------------------------------------------------------|")
print("| Pearson Correlation = %.4f |" % pearson)
print("| p-value = %.30f |" % pvalue)
print("+----------------------------------------------------------+")
In [20]:
sns.set(color_codes=True)
sns.regplot(x="Number of Discharges", y="Excess Readmission Ratio", data=df_temp_b)
Out[20]:
In [38]:
df_temp_dischg1000 = df_temp.loc[df_temp['Number of Discharges'] > 1000]
# pass 1-D Series so that pearsonr returns scalar values
pearson, pvalue = pearsonr(df_temp_dischg1000['Number of Discharges'], df_temp_dischg1000['Excess Readmission Ratio'])
pvalue3 = pvalue
pearson3 = pearson
print("+----------------------------------------------------------+")
print("| 'Number of Discharges' versus 'Excess Readmission Ratio' |")
print("| for hospitals with discharges > 1000: |")
print("|----------------------------------------------------------|")
print("| Pearson Correlation = %.4f |" % pearson)
print("| p-value = %.20f |" % pvalue)
print("+----------------------------------------------------------+")
In [22]:
df_dischg_gt1000 = clean_hospital_read_df[['Number of Discharges', 'Excess Readmission Ratio']].loc[(clean_hospital_read_df['Number of Discharges'] > 1000) & (clean_hospital_read_df['Excess Readmission Ratio'] > 0)]
sns.regplot(x="Number of Discharges", y="Excess Readmission Ratio", data=df_dischg_gt1000)
Out[22]:
In [23]:
df_temp_c = df_dischg_gt1000.loc[(df_dischg_gt1000['Number of Discharges'] < 2000) & (df_dischg_gt1000['Excess Readmission Ratio'] < 1.2)]
# pass 1-D Series so that pearsonr returns scalar values
pearson, pvalue = pearsonr(df_temp_c['Number of Discharges'], df_temp_c['Excess Readmission Ratio'])
print("+----------------------------------------------------------+")
print("| 'Number of Discharges' versus 'Excess Readmission Ratio' |")
print("| for all hospitals with: |")
print("| a) Excess Readmission Ratio < 1.2 |")
print("| b) Number of Discharges > 1,000 and < 2,000 |")
print("|----------------------------------------------------------|")
print("| Pearson Correlation = %.4f |" % pearson)
print("| p-value = %.30f |" % pvalue)
print("+----------------------------------------------------------+")
In [24]:
sns.regplot(x="Number of Discharges", y="Excess Readmission Ratio", data=df_temp_c)
Out[24]:
In [25]:
df_temp2 = df_temp.loc[(df_temp['Number of Discharges'] >= 100) & (df_temp['Number of Discharges'] <= 1000)]
# pass 1-D Series so that pearsonr returns scalar values
pearson, pvalue = pearsonr(df_temp2['Number of Discharges'], df_temp2['Excess Readmission Ratio'])
pvalue4 = pvalue
print("+----------------------------------------------------------+")
print("| 'Number of Discharges' versus 'Excess Readmission Ratio' |")
print("| for hospitals with discharges >= 100 and <= 1,000: |")
print("|----------------------------------------------------------|")
print("| Pearson Correlation = %.4f |" % pearson)
print("| p-value = %.20f |" % pvalue)
print("+----------------------------------------------------------+")
In [26]:
df_dischg_medium = clean_hospital_read_df[['Number of Discharges', 'Excess Readmission Ratio']].loc[(clean_hospital_read_df['Number of Discharges'] >= 100) & (clean_hospital_read_df['Number of Discharges'] <= 1000) & (clean_hospital_read_df['Excess Readmission Ratio'] > 0)]
sns.regplot(x="Number of Discharges", y="Excess Readmission Ratio", data=df_dischg_medium)
Out[26]:
Read the following results/report. While you are reading it, think about if the conclusions are correct, incorrect, misleading or unfounded. Think about what you would change or what additional analyses you would perform.
A. Initial observations based on the plot above
B. Statistics
C. Conclusions
D. Regulatory policy recommendations
No.
In [42]:
print("+-----------------------------------------------------------+-------------+")
print("| Hospital/facilities | Correlation*|")
print("|-----------------------------------------------------------|-------------|")
print("| i) All hospitals | %.4f |" %pearson1)
print("| ii) Hospitals/facilities with number of discharges < 100 | %.4f |" %pearson2)
print("| iii) Hospitals/facilities with number of discharges > 1,000| %.4f |" %pearson3)
print("+-----------------------------------------------------------+-------------+")
print("* Pearson Correlation for: Hospital capacity(number of discharges) / readmission rates")
In [43]:
print("+--------------------------------------------------------------+")
print("| Null Hypothesis: |")
print("| Ho: There is *not* a significant correlation between |")
print("| hospital capacity (discharges) and readmission rates|")
print("|--------------------------------------------------------------|")
print("| Alternative Hypothesis: |")
print("| Ha: There is a significant correlation between |")
print("| hospital capacity (discharges) and readmission rates|")
print("+--------------------------------------------------------------+")
In [28]:
print("+--------------------------------------------------------------+")
print("| Scenario 1: |")
print("| All Hospitals: P-Value = %.30f |" %pvalue1)
print("|--------------------------------------------------------------|")
print("| Scenario 2: |")
print("| Hospitals discharges < 100: P-Value = %.20f |" %pvalue2)
print("|--------------------------------------------------------------|")
print("| Scenario 3: |")
print("| Hospitals with discharges > 1,000: P-Value = %.4f |" %pvalue3)
print("|--------------------------------------------------------------|")
print("| Scenario 4: |")
print("| Hospitals discharges>=100 and <=1,000: P-Value = %.12f|" %pvalue4)
print("+--------------------------------------------------------------+")
In [ ]:
4. Discuss statistical significance and practical significance. Do they differ here? How does this change your recommendation to the client?
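The distinction can be illustrated on synthetic data (not the CMS dataset): with a large sample, even a weak correlation yields a very small p-value, while r² shows that the relationship explains only a small share of the variance. The sample size, slope, and seed below are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 10_000                      # assumed large sample size

# Synthetic variables with a deliberately weak negative linear relationship.
x = rng.normal(size=n)
y = -0.1 * x + rng.normal(size=n)

r, p = pearsonr(x, y)
r_squared = r ** 2  # share of variance in y explained by a linear fit on x

print(f"r = {r:.4f}, p-value = {p:.3g}, r^2 = {r_squared:.4f}")
```

A tiny p-value here only says the correlation is unlikely to be exactly zero; the small r² is what speaks to practical significance.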
In [ ]:
Some advantages:
a) It shows how strongly two variables are related; this relationship is called their correlation;
b) Outliers: the maximum and minimum values can usually be identified easily;
c) It is possible to show many variables in a single plot.
Some disadvantages:
d) Discretization: individual (x, y) values are hard to distinguish when many points lie very close together.
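One possible way around the last disadvantage (many overlapping points) is to summarize the points in bins rather than plot each one. The sketch below is only an illustration: the helper `binned_means`, the bin edges, and the synthetic data are all assumptions, not part of the original analysis.

```python
import numpy as np

def binned_means(x, y, edges):
    """Mean of y within each half-open bin [edges[i], edges[i+1]) of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.digitize(x, edges) - 1  # bin index of each point
    means = []
    for b in range(len(edges) - 1):
        in_bin = idx == b
        # NaN marks a bin that received no points
        means.append(y[in_bin].mean() if in_bin.any() else np.nan)
    return np.array(means)

# Synthetic stand-in for discharges (x) and a ratio scattered around 1 (y).
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 2000, size=5000)
y_demo = 1.0 + rng.normal(scale=0.08, size=5000)

edges = np.array([0, 100, 500, 1000, 2000])  # hypothetical bin edges
means = binned_means(x_demo, y_demo, edges)
for lo, hi, m in zip(edges[:-1], edges[1:], means):
    print(f"[{lo}, {hi}): mean = {m:.3f}")
```

Plotting one value per bin (for example as a line or bar chart) avoids overplotting, at the cost of hiding the spread within each bin.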
In [ ]: