The recent release of data on payments made by drug and medical device companies to physicians and teaching hospitals provides for a unique opportunity to develop insights on trends in the health care industry. For example, what classes of drugs are heavily promoted? in what states are companies investing the most on promoting their products? how does industry spending on health care providers equate to sales within a region? These are just some of the questions that I believe can be answered from these disclosures.
The analysis outlined here utilizes the General Payments data for payments made to physicians and teaching hospitals, provided through the Open Payments website hosted by CMS, and covering the period January 2014 to December 2014. In addition, the analysis also includes data on total retail sales for prescription drugs in 2014 provided by the Henry J. Kaiser family foundation, and tables on state population and physician numbers per state for 2014 published by the Federation of State Medical Board. The following is a summary of the tables.
In [21]:
import MySQLdb
import pandas as pd
import csv
import numpy as np
import scipy as sp
from sqlalchemy import create_engine
from IPython.display import display
import datetime as dt
import matplotlib.pyplot as plt
pd.set_option('max_columns', 50)
%matplotlib inline
import plotly.plotly as py
from plotly.graph_objs import Bar, Scatter, Marker, Layout
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import colorlover as cl #colorscale
from IPython.display import HTML
In [22]:
init_notebook_mode() #use plotly offline
In [23]:
# Connect to database
conn = MySQLdb.connect(host="localhost", user="****", passwd="*****",
unix_socket= "/var/run/mysqld/mysqld.sock", db="pharma")
cursor = conn.cursor()
In [24]:
'''Create a data frame from sql queries where we select for each state
the total dollar amount spent by industry in the state, the highest single
contribution, and the physician specialty associated with the highest
dollar contribution. Since we are only interested in payments made to providers
in mainland US (i.e., excluding territories, military bases), we make that requirement
explicit in the sql query'''
pay_df = pd.read_sql_query('SELECT DISTINCT Recipient_State, '
'ROUND(SUM(Total_Amount_of_Payment_USDollars)/1E6, 1) AS `Total_Dollar_Millions`, '
'MAX(Total_Amount_of_Payment_USDollars) AS `Max_Dollar`, '
'Physician_Specialty '
'FROM GeneralPay '
'WHERE Recipient_State IS NOT NULL '
'AND Recipient_Country = "United States" '
'AND Recipient_State IN ("AL", "AK", "AZ", "AR", "CA", "CO", '
'"CT", "DE", "DC", "FL", "GA", '
'"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", '
'"ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", '
'"NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", '
'"OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", '
'"TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY") '
'GROUP BY Recipient_State '
'ORDER BY SUM(Total_Amount_of_Payment_USDollars)', conn)
pay_df = pay_df.sort_values(by= 'Recipient_State', ascending = True )
pay_df.head(4)
Out[24]:
Note: Here I am only showing 4 rows of the total 51 generated
We can take this data and create a chloropeth map where states that received the highest payments from industry will have a darker shade and vice-versa, therefore giving as visual representation of what states industry is spending most of their money in. I create the plot using plotly, a handy online graphing and visualization tool based on the D3.js JavaScript visualization library.
In [25]:
import plotly.graph_objs as go
go.Choropleth
color = cl.scales['6']['seq']['Oranges'] #select rgb color scale
colorbns = cl.interp( color, 20 ) #map color scale to 100 bins
colorbns = cl.to_rgb( colorbns ) #convert back to RGB
bin_array = np.linspace(0,1,len(colorbns)) # array of equal-sized bins
bin_array.tolist()
scheme = [] #This will be a list of lists containing the bin and color scheme
for i in range(0, len(colorbns)):
temp = []
temp.append(bin_array[i])
temp.append(colorbns[i])
scheme.insert(i, temp)
data = [
go.Choropleth(
colorscale = scheme,
autocolorscale = False,
locations = pay_df['Recipient_State'],
z = pay_df['Total_Dollar_Millions'], #value-to-color mapping
locationmode = 'USA-states',
text= pay_df['Recipient_State'],
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
)
),
colorbar = dict(
title = '<b>Industry spending in US$ (millions)</b>'
),
#color is picked from z value
showscale = True, #Shows the colorbar (I will set attributes later)
),
]
layout = dict(
title="Health Care Industry Spending<br> on Physicians and Teaching Hospitals, 2014",
geo = dict(
scope = "usa",
projection = dict( type= "albers usa" ),#The Albers USA projection
showcoastlines = False,
showlakes = False,
),
)
figure = dict(data=data, layout=layout)
iplot(figure) #plot offline
What we find is that pharmaceutical companies and medical device manufacturers spent over 0.5 billion USD promoting their products with health care providers in the state of California. This is more than twice the second highest state, New York. While one can be tempted to draw conclusions from this map on the state of the health care market in different states, the plot does not convey all the information. For one, California is the most populous state and you would therefore expect the total dollar amount spent in that market to be comparably bigger than the other states. Also, the plot does not tell us how many physicians we have per unit of population in each state, something that would have an influence on the total industry spending in a particular state. Finally, in this analysis I have not excluded large teaching hospitals whose prescence in some states can help skew the data.
A more descriptive plot would have the total spending in each state normalized by the physicians per 100,000 of the population, and excluding the teaching hospitals in the state. To do this, it is necessary to join tables for data obtained from payments made to health care providers to tables on data about the number of physicians in each state, as well as the total population of each state at the time this report was compiled
In [26]:
'''From the General Pay table, select for each state the total dollar amount spent
as well as the maximum single payment made to a health care provider for all states
in the intercontinental US. Here, we exclude any payments made to any other entities
i.e., teaching hospitals and only select for payments made to physicians'''
payphys_df = pd.read_sql_query('SELECT DISTINCT Recipient_State AS State_Code, '
'SUM(Total_Amount_of_Payment_USDollars) AS Total_Dollar, '
'MAX(Total_Amount_of_Payment_USDollars) AS Max_Dollar '
'FROM GeneralPay '
'WHERE Recipient_State IS NOT NULL '
'AND Recipient_Country = "United States" '
'AND Covered_Recipient_Type = "Covered Recipient Physician" '
'AND Recipient_State IN ("AL", "AK", "AZ", "AR", "CA", "CO", '
'"CT", "DE", "DC", "FL", "GA", '
'"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", '
'"ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", '
'"NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", '
'"OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", '
'"TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY") '
'GROUP BY Recipient_State '
'ORDER BY Recipient_State', conn)
#force the DF to index from 1 to match indexing in mysql (makes life easier when doing JOIN)
payphys_df.index = payphys_df.index + 1
payphys_df.head()
Out[26]:
This table shows for each state, the total dollar amount spent by industry as well as the highest dollar amount paid to a physician in that state in a single payment. To be able to use this data with population and physician data from other tables, I write the dataframe to a MySQL database
In [27]:
#write dataframe to MySQL database. Will then JOIN this table to several others
#payphys_df.to_sql(con=conn, name='PhysicianPay', if_exists='replace', flavor='mysql')
To extract further insight from the data, I JOIN the table on physician pay in each state, to a table on statistics on number of physicians in a state as well as the total population, and a table on drug sales in each state.
In [28]:
'''JOIN TABLES on Physician pay, the number of physicians in the state and the states
total population, and the drug sales, in total dollar amounts, in each state'''
payandsales_df = pd.read_sql_query('SELECT StateName.State, PhysicianPay.State_Code, '
'ROUND(PhysicianPay.Total_Dollar/1E6, 1) '
'AS Total_Dollar_Millions, PhysicianPay.Max_Dollar, '
'ROUND(DrugSales.Total_Drug_Sales/1E9, 1) AS Drug_Sales_Billions, '
'PhysicianCount.Total_Pop, PhysicianCount.Physician_per_100000, '
'ROUND(PhysicianPay.Total_Dollar/Physician_per_100000, 1) '
'AS Pay_per_PhysicianPer100000 '
'FROM (StateName '
'JOIN PhysicianPay '
'ON StateName.State_Code = PhysicianPay.State_Code) '
'JOIN DrugSales '
'ON (StateName.State = DrugSales.State) '
'JOIN PhysicianCount '
'ON (StateName.State = PhysicianCount.State)', conn)
payandsales_df.head(4)
Out[28]:
In [29]:
print(payandsales_df.describe())
One thing I was curious about is how payments to physicians by drug companies and medical device manufacturers relate to drug sales. Would we see a correlation, an anticorrelation or no correlation at all? My null hypothesis is that there is no correlation between the two. To find out, I look at the Pearson correlation coefficient for the different columns.
In [30]:
print(payandsales_df.corr())
In [31]:
import scipy as sp
a = payandsales_df['Total_Dollar_Millions']
b = payandsales_df['Drug_Sales_Billions']
sp.stats.ttest_ind(a, b, axis=0, equal_var=True)
Out[31]:
Giving as a p-value of 1.6e-05 which allows us to reject the null hypothesis. We can therefore see that the payment to physicians and the sales from retail precription drugs are very highly correlated, and that this correlation is statistically significant.
Re-plotting the total payments to health care providers, with the exclusion of teaching hospitals, and having normalized the payments to the number of physicians in each state per 100,000 of the population.
In [32]:
import plotly.graph_objs as go
go.Choropleth
color = cl.scales['6']['seq']['Reds']
colorbns = cl.interp( color, 20 )
colorbns = cl.to_rgb( colorbns )
bin_array = np.linspace(0,1,len(colorbns))
bin_array.tolist()
scheme = []
for i in range(0, len(colorbns)):
temp = []
temp.append(bin_array[i])
temp.append(colorbns[i])
scheme.insert(i, temp)
data = [
go.Choropleth(
colorscale = scheme,
autocolorscale = False,
locations = payandsales_df['State_Code'],
z = payandsales_df['Pay_per_PhysicianPer100000'],
locationmode = 'USA-states',
text= payandsales_df['State_Code'],
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
)
),
colorbar = dict(
title = '<b>Industry spending in US$</b>'
),
#color is picked from z value
showscale = True,
),
]
layout = dict(
title="Health Care Industry Spending on Physicians<br> "
"Normalized to Number of Physicians per 100000, 2014",
geo = dict(
scope = "usa",
projection = dict( type= "albers usa" ),
showcoastlines = False,
showlakes = False,
),
)
figure = dict(data=data, layout=layout)
iplot(figure) #plot offline
In [33]:
payperphy_df = payandsales_df.sort_values('Pay_per_PhysicianPer100000', axis=0, ascending=True, inplace=False)
import plotly.graph_objs as go
data = [
go.Bar(
x = payperphy_df['State'],
y = payperphy_df['Pay_per_PhysicianPer100000'],
hoverinfo = "all",
marker=dict(
color= '#04BCE0',
line=dict(
color='#013A45',
width=1.5,
)
),
)
]
layout = dict(
title = "Drug Companies Spending in US$ <br> Scaled to Number of Physicians Per 100000, 2014",
xaxis = dict(
#title = 'Manufacturer',
tickangle = 40
),
yaxis = dict(
title = 'Total Amount in USD per Capita'
),
)
figure = dict(data=data, layout=layout)
iplot(figure) #plot offline
With the data normalized to the number of physicians per 100,000, we find that California still saw the highest spending by industry in 2014, with the state of Texas coming in at second place.
In [34]:
'''Here, we select for each drug promoted to care providers
and rank them by the total dollar amount spent by their respective
manufacturers in promoting them to physicians and teaching hospitals.'''
pd.set_option('display.max_rows', 2000)
df = pd.read_sql_query('SELECT DISTINCT UCASE(Name_of_Associated_Covered_Drug_or_Biological1) AS Drug, '
'SUM(Total_Amount_of_Payment_USDollars) AS Total_Dollar, '
'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name AS Manufacturer '
'FROM GeneralPay '
'WHERE Name_of_Associated_Covered_Drug_or_Biological1 IS NOT NULL '
'AND Name_of_Associated_Covered_Drug_or_Biological2 IS NULL '
'AND Name_of_Associated_Covered_Device_or_Medical_Supply1 IS NULL '
'AND Name_of_Associated_Covered_Drug_or_Biological1 NOT IN '
'("No Product Discussed", "Non-Product", "AccurusVitEquip", "NONCOVERED PRODUCT", '
'"NON BRAND", "None", "Product Candidate", "Spine Instrumentation", '
'"Business Development Activity", "OTHER PRODUCT", '
'"Specimen Indentification", "NO PRODUCT", "NOT APPLICABLE", "ENT Medical Devices" '
'"NA", "BioSurgery - Non Prod Related", "Non-Covered-Product", '
'"Non Franchise - Non Prod Related" '
'"RefractiveEquipment", "No Associated Product", "Product Development", '
'"Non Franchise - Research & Develop") '
'AND Name_of_Associated_Covered_Drug_or_Biological1 NOT LIKE "%TAP%" '
'GROUP BY Name_of_Associated_Covered_Drug_or_Biological1 '
'ORDER BY Total_Dollar DESC LIMIT 20', conn)
df = df.sort_values('Total_Dollar', axis=0, ascending=True, inplace=False,kind='quicksort',na_position='last')
df.head(4)
Out[34]:
In [35]:
import plotly.graph_objs as go
data = [
go.Bar(
x = df['Drug'],
y = df['Total_Dollar'],
hoverinfo = "all",
marker=dict(
color= '#F05513',
line=dict(
color='#662002',
width=1.5,
)
),
)
]
layout = dict(
title = "Top 20 Most Promoted Drugs By Dollar Value, 2014",
xaxis = dict(
#title = 'Drug Name',
),
yaxis = dict(
title = 'Total Amount in USD'
),
)
figure = dict(data=data, layout=layout)
iplot(figure) #plot offline
Of these drugs, five of them -- Humira, Rituxan, Avastin, Herceptin, and Copaxone, were also among the top 20 best-selling drugs of 2014 with combined sales of over 39 billion dollars. Interestingly, the first three highly promoted drugs, Rituxan, Avastin and Herceptin, are cancer drugs marketed by Roche pharmaceutical division (Rituxan is co-marketed by Biogen Idec too) with total combined sales in 2014 of 22.4 billion USD.
What about the drug and medical device manufacturers? How do the different companies rank in terms of total dollar amount spent on physicians and teaching hospitals while promoting their products?
In [36]:
'''Selection of drug and device manufacturers ranked by
total dollar amount spent on physicians and teaching hospitals'''
pd.set_option('display.max_rows', 2000)
df = pd.read_sql_query('SELECT DISTINCT '
'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name AS Manufacturer, '
'SUM(Total_Amount_of_Payment_USDollars) AS Total_Dollar '
'FROM GeneralPay '
'WHERE Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name IS NOT NULL '
'GROUP BY Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name '
'ORDER BY Total_Dollar DESC LIMIT 20', conn)
df = df.sort_values('Total_Dollar', axis=0, ascending=True, inplace=False,kind='quicksort',na_position='last')
df.head(4)
Out[36]:
In [37]:
import plotly.graph_objs as go
data = [
go.Bar(
x = df['Manufacturer'],
y = df['Total_Dollar'],
hoverinfo = "all",
marker=dict(
color= '#04BCE0',
line=dict(
color='#013A45',
width=1.5,
)
),
)
]
layout = dict(
title = "Top 20 Drug and Medical Device Manufacturers<br> "
"By Dollars Spent on Physicians and Teaching Hospitals, 2014",
xaxis = dict(
#title = 'Manufacturer',
),
yaxis = dict(
title = 'Total Amount in USD'
),
)
figure = dict(data=data, layout=layout)
iplot(figure) #plot offline
Unsurprisingly, Genentech, Inc. which is now a subsidiary of Roche spent almost three times as much on health care providers as its closest rival, Topera, Inc. Topera, Inc. is a recent startup that has developed technology to map the electric signals of the heart.
How about retail sales of prescription drugs in each state?
In [38]:
pd.set_option('display.max_rows', 2000)
sales_df = pd.read_sql_query('SELECT DISTINCT State, ROUND(Total_Drug_Sales/1e9, 2) AS Total_Sales_Billion '
'FROM DrugSales '
'ORDER BY Total_Sales_Billion ASC',conn)
sales_df.head()
Out[38]:
In [39]:
import plotly.graph_objs as go
data = [
go.Bar(
x = sales_df['State'],
y = sales_df['Total_Sales_Billion'],
hoverinfo = "all",
marker=dict(
color= '#04BCE0',
line=dict(
color='#013A45',
width=1.5,
)
),
)
]
layout = dict(
title = "Retail Sales for Prescription Drugs<br>Filled at Pharmacies in US$ (Billions), 2014",
xaxis = dict(
#title = 'Manufacturer',
tickangle = 40
),
yaxis = dict(
title = 'Total Amount in USD (Billions)'
),
)
figure = dict(data=data, layout=layout)
iplot(figure) #plot offline
Here again, California, the most populous state, saw the highest sales in total dollar amount of prescription drugs. A better way to look at these numbers though would be to scale them by the population in each state so that we can get an idea of the spending in prescription drugs per capita.
In [40]:
salespercap_df = pd.read_sql_query('SELECT DrugSales.State, '
'ROUND(DrugSales.Total_Drug_Sales/PhysicianCount.Total_Pop, 1) '
'AS SalesPerCapita '
'FROM PhysicianCount '
'JOIN DrugSales '
'ON PhysicianCount.State = DrugSales.State '
'ORDER BY SalesPerCapita', conn)
salespercap_df.head()
Out[40]:
In [41]:
import plotly.graph_objs as go
data = [
go.Bar(
x = salespercap_df['State'],
y = salespercap_df['SalesPerCapita'],
hoverinfo = "all",
marker=dict(
color= '#037b93',
line=dict(
color='#013b46',
width=1.5,
)
),
)
]
layout = dict(
title = "Retail Sales for Prescription Drugs<br>Filled at Pharmacies Per Capita in US$, 2014",
xaxis = dict(
#title = 'Manufacturer',
tickangle = 40
),
yaxis = dict(
title = 'Total Amount in USD per Capita'
),
)
figure = dict(data=data, layout=layout)
iplot(figure) #plot offline
Here we now find that the sales of precription drugs per capita were highest in West Virginia, meaning that in West Virginia, on average, each resident spent about 1,287 USD on prescription medicine in 2014. That is 11% more than the next highest state, Maine, and 64% more than the state of New Mexico. This numbers could be indicative of the cost of prescription drugs in each state. The disparity in spending per state is interesting as is shown by the following statistics on the data where we see a standard deviation of USD 163.
In [42]:
#Statistics on the disitribution of drug sales in the different states
print(salespercap_df.describe())
In [43]:
pd.set_option('display.max_rows', 2000)
teaching_df = pd.read_sql_query('SELECT DISTINCT UCASE(Teaching_Hospital_Name) AS Teaching_Hospital, '
'SUM(Total_Amount_of_Payment_USDollars) AS Total_Dollar, '
'MAX(Total_Amount_of_Payment_USDollars) AS Max_Dollar '
'FROM GeneralPay '
'WHERE Teaching_Hospital_Name IS NOT NULL '
'AND Covered_Recipient_Type = "Covered Recipient Teaching Hospital" '
'GROUP BY Teaching_Hospital_Name '
'ORDER BY Total_Dollar DESC LIMIT 20', conn)
teachin_df = teaching_df.sort_values('Total_Dollar', axis=0, ascending=False,
inplace=False,kind='quicksort',na_position='last')
teaching_df.head()
Out[43]:
In [44]:
pd.set_option('display.max_rows', 2000)
max_df = pd.read_sql_query('SELECT DISTINCT Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name '
'AS Manufacturer, '
'Teaching_Hospital_Name AS Teaching_Hospital, '
'MAX(Total_Amount_of_Payment_USDollars) AS Max_Dollar '
'FROM GeneralPay '
'WHERE Teaching_Hospital_Name = "CITY OF HOPE NATIONAL MEDICAL CENTER" '
'GROUP BY Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name '
'ORDER BY Max_Dollar DESC', conn)
max_df.head()
Out[44]:
In [45]:
import plotly.graph_objs as go
data = [
go.Bar(
x = teaching_df['Teaching_Hospital'],
y = teaching_df['Total_Dollar'],
hoverinfo = "all",
marker=dict(
color= '#A830E6',
line=dict(
color='#240236',
width=1.5,
)
),
)
]
layout = dict(
title = "Ranking of Teaching Hospitals by Amount of Dollars Recieved"
"<br>From Drug and Medical Device Manufacturers, 2014",
xaxis = dict(
#title = 'Manufacturer',
),
yaxis = dict(
title = 'Total Amount in USD'
),
)
figure = dict(data=data, layout=layout)
iplot(figure) #plot offline
The rankings of the different teaching hospitals show that the City of Hope National Medical Center, a major cancer treatment center, obtained more than USD 250 million from the pharmaceutical and medical device industry, with the highest single payment made being a USD 11.6 million contribution by Genentech, Inc.