Motivation

The recent release of data on payments made by drug and medical device companies to physicians and teaching hospitals provides for a unique opportunity to develop insights on trends in the health care industry. For example, what classes of drugs are heavily promoted? in what states are companies investing the most on promoting their products? how does industry spending on health care providers equate to sales within a region? These are just some of the questions that I believe can be answered from these disclosures.

Data in this analysis

The analysis outlined here utilizes the General Payments data for payments made to physicians and teaching hospitals, provided through the Open Payments website hosted by CMS, and covering the period January 2014 to December 2014. In addition, the analysis also includes data on total retail sales for prescription drugs in 2014 provided by the Henry J. Kaiser family foundation, and tables on state population and physician numbers per state for 2014 published by the Federation of State Medical Board. The following is a summary of the tables.

General Payments Data:

  • 10.78 million total number of records
  • 2.52 billion USD in total value of disclosures
  • 607,000 physicians receiving payments
  • 1,442 companies that made payments
  • 1,122 teaching hospitals receiving payments
  • 5.3 G file size

Drug Sales Data:

  • 259 billion USD in total value of sales
  • 51 total number of records

Physican Data on Demographics:

  • 318 million total US population
  • 916,264 licensed physicians
  • 51 states covered The tables are contained in a MySQL database

Analysis


In [21]:
import MySQLdb
import pandas as pd
import csv
import numpy as np
import scipy as sp
from sqlalchemy import create_engine
from IPython.display import display
import datetime as dt
import matplotlib.pyplot as plt
pd.set_option('max_columns', 50)
%matplotlib inline

import plotly.plotly as py
from plotly.graph_objs import Bar, Scatter, Marker, Layout
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot

import colorlover as cl #colorscale
from IPython.display import HTML

In [22]:
init_notebook_mode() #use plotly offline



In [23]:
# Connect to database
conn = MySQLdb.connect(host="localhost", user="****", passwd="*****", 
                       unix_socket= "/var/run/mysqld/mysqld.sock", db="pharma")
cursor = conn.cursor()

In [24]:
'''Create a data frame from sql queries where we select for each state
the total dollar amount spent by industry in the state, the highest single
contribution, and the physician specialty associated with the highest
dollar contribution. Since we are only interested in payments made to providers
in mainland US (i.e., excluding territories, military bases), we make that requirement
explicit in the sql query'''
pay_df = pd.read_sql_query('SELECT DISTINCT Recipient_State, '
                           'ROUND(SUM(Total_Amount_of_Payment_USDollars)/1E6, 1) AS `Total_Dollar_Millions`, '
                           'MAX(Total_Amount_of_Payment_USDollars) AS `Max_Dollar`, '
                           'Physician_Specialty '
                           'FROM GeneralPay '
                           'WHERE Recipient_State IS NOT NULL '
                           'AND Recipient_Country = "United States" ' 
                           'AND Recipient_State IN ("AL", "AK", "AZ", "AR", "CA", "CO", '
                           '"CT", "DE", "DC", "FL", "GA", '
                           '"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", '
                           '"ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", '
                           '"NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", '
                           '"OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", '
                           '"TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY") '
                           'GROUP BY Recipient_State '
                           'ORDER BY SUM(Total_Amount_of_Payment_USDollars)', conn)
pay_df = pay_df.sort_values(by= 'Recipient_State', ascending = True )
pay_df.head(4)


Out[24]:
Recipient_State Total_Dollar_Millions Max_Dollar Physician_Specialty
0 AK 0.7 159062 Dental Providers/ Dentist/ General Practice
25 AL 19.5 327285 Dental Providers/ Dentist/ Oral and Maxillofac...
15 AR 7.9 471456 Dental Providers/ Dentist/ Oral and Maxillofac...
42 AZ 78.8 28361143 Dental Providers/ Dentist/ Oral and Maxillofac...

Note: Here I am only showing 4 rows of the total 51 generated

Data visualization

We can take this data and create a chloropeth map where states that received the highest payments from industry will have a darker shade and vice-versa, therefore giving as visual representation of what states industry is spending most of their money in. I create the plot using plotly, a handy online graphing and visualization tool based on the D3.js JavaScript visualization library.


In [25]:
import plotly.graph_objs as go
go.Choropleth


color = cl.scales['6']['seq']['Oranges'] #select rgb color scale
colorbns = cl.interp( color, 20 ) #map color scale to 100 bins
colorbns = cl.to_rgb( colorbns ) #convert back to RGB

bin_array = np.linspace(0,1,len(colorbns)) # array of equal-sized bins 
bin_array.tolist()

scheme = [] #This will be a list of lists containing the bin and color scheme

for i in range(0, len(colorbns)):
    temp = []
    temp.append(bin_array[i])
    temp.append(colorbns[i])
    scheme.insert(i, temp)


data = [
        go.Choropleth(
            colorscale = scheme,
            autocolorscale = False,
            locations = pay_df['Recipient_State'],
            z = pay_df['Total_Dollar_Millions'], #value-to-color mapping
            locationmode = 'USA-states',
            text= pay_df['Recipient_State'],
            marker = dict(
                line = dict (
                    color = 'rgb(255,255,255)',
                    width = 2
                )
            ),
            colorbar = dict(          
                title = '<b>Industry spending in US$ (millions)</b>'
            ),            
             #color is picked from z value
            showscale = True, #Shows the colorbar (I will set attributes later)
        ),

]



layout = dict(           
              title="Health Care Industry Spending<br> on Physicians and Teaching Hospitals, 2014", 
              geo = dict(
                         scope = "usa",
                         projection = dict( type= "albers usa" ),#The Albers USA projection
                         showcoastlines = False,
                         showlakes = False,
              ),
)
    

figure = dict(data=data, layout=layout)
iplot(figure) #plot offline


What we find is that pharmaceutical companies and medical device manufacturers spent over 0.5 billion USD promoting their products with health care providers in the state of California. This is more than twice the second highest state, New York. While one can be tempted to draw conclusions from this map on the state of the health care market in different states, the plot does not convey all the information. For one, California is the most populous state and you would therefore expect the total dollar amount spent in that market to be comparably bigger than the other states. Also, the plot does not tell us how many physicians we have per unit of population in each state, something that would have an influence on the total industry spending in a particular state. Finally, in this analysis I have not excluded large teaching hospitals whose prescence in some states can help skew the data.

A more descriptive plot would have the total spending in each state normalized by the physicians per 100,000 of the population, and excluding the teaching hospitals in the state. To do this, it is necessary to join tables for data obtained from payments made to health care providers to tables on data about the number of physicians in each state, as well as the total population of each state at the time this report was compiled

Normalization of pharmaceutical and medical device spending per state to the number of physicians in the state, and the total population of the state.


In [26]:
'''From the General Pay table, select for each state the total dollar amount spent
 as well as the maximum single payment made to a health care provider for all states
 in the intercontinental US. Here, we exclude any payments made to any other entities
 i.e., teaching hospitals and only select for payments made to physicians'''

payphys_df = pd.read_sql_query('SELECT DISTINCT Recipient_State AS State_Code, '
                               'SUM(Total_Amount_of_Payment_USDollars) AS Total_Dollar, '
                               'MAX(Total_Amount_of_Payment_USDollars) AS Max_Dollar '                                              
                               'FROM GeneralPay '
                               'WHERE Recipient_State IS NOT NULL '
                               'AND Recipient_Country = "United States" '
                               'AND Covered_Recipient_Type = "Covered Recipient Physician" '
                               'AND Recipient_State IN ("AL", "AK", "AZ", "AR", "CA", "CO", '
                               '"CT", "DE", "DC", "FL", "GA", '
                               '"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", '
                               '"ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", '
                               '"NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", '
                               '"OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", '
                               '"TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY") '
                               'GROUP BY Recipient_State '
                               'ORDER BY Recipient_State', conn)

#force the DF to index from 1 to match indexing in mysql (makes life easier when doing JOIN)
payphys_df.index = payphys_df.index + 1 
payphys_df.head()


Out[26]:
State_Code Total_Dollar Max_Dollar
1 AK 712426 159062
2 AL 17805617 327285
3 AR 6634120 135005
4 AZ 76578934 28361143
5 CA 295743014 41414329

This table shows for each state, the total dollar amount spent by industry as well as the highest dollar amount paid to a physician in that state in a single payment. To be able to use this data with population and physician data from other tables, I write the dataframe to a MySQL database


In [27]:
#write dataframe to MySQL database. Will then JOIN this table to several others
#payphys_df.to_sql(con=conn, name='PhysicianPay', if_exists='replace', flavor='mysql')

To extract further insight from the data, I JOIN the table on physician pay in each state, to a table on statistics on number of physicians in a state as well as the total population, and a table on drug sales in each state.


In [28]:
'''JOIN TABLES on Physician pay, the number of physicians in the state and the states
total population, and the drug sales, in total dollar amounts, in each state'''

payandsales_df = pd.read_sql_query('SELECT StateName.State, PhysicianPay.State_Code, '
                                   'ROUND(PhysicianPay.Total_Dollar/1E6, 1) '
                                   'AS Total_Dollar_Millions, PhysicianPay.Max_Dollar, '
                                   'ROUND(DrugSales.Total_Drug_Sales/1E9, 1) AS Drug_Sales_Billions, '
                                   'PhysicianCount.Total_Pop, PhysicianCount.Physician_per_100000, '
                                   'ROUND(PhysicianPay.Total_Dollar/Physician_per_100000, 1) '
                                   'AS Pay_per_PhysicianPer100000 '
                                   'FROM (StateName '
                                       'JOIN PhysicianPay '
                                           'ON StateName.State_Code = PhysicianPay.State_Code) '
                                                'JOIN DrugSales '
                                                   'ON (StateName.State = DrugSales.State) '
                                                       'JOIN PhysicianCount '
                                                           'ON (StateName.State = PhysicianCount.State)', conn)
                                            
payandsales_df.head(4)


Out[28]:
State State_Code Total_Dollar_Millions Max_Dollar Drug_Sales_Billions Total_Pop Physician_per_100000 Pay_per_PhysicianPer100000
0 Alabama AL 17.8 327285 5.2 4849377 331 53793.4
1 Alaska AK 0.7 159062 0.5 736732 514 1386.0
2 Arizona AZ 76.6 28361143 4.3 6731484 370 206970.1
3 Arkansas AR 6.6 135005 2.6 2966369 321 20667.0

Some quick statistics on the data

With the additional features from joining the three tables, we can run some quick statistics to get a feel for the data


In [29]:
print(payandsales_df.describe())


       Total_Dollar_Millions       Max_Dollar  Drug_Sales_Billions  \
count              51.000000        51.000000            51.000000   
mean               39.378431   3372566.431373             5.080392   
std                53.690019   7918220.577706             5.606105   
min                 0.700000     50527.000000             0.500000   
25%                 5.850000    201414.500000             1.300000   
50%                18.300000    472946.000000             3.400000   
75%                50.150000   1635467.500000             5.850000   
max               295.700000  41414329.000000            29.700000   

       Pay_per_PhysicianPer100000  
count                   51.000000  
mean                104865.892157  
std                 149286.528199  
min                   1386.000000  
25%                  13410.450000  
50%                  51353.100000  
75%                 121030.950000  
max                 799305.400000  

One thing I was curious about is how payments to physicians by drug companies and medical device manufacturers relate to drug sales. Would we see a correlation, an anticorrelation or no correlation at all? My null hypothesis is that there is no correlation between the two. To find out, I look at the Pearson correlation coefficient for the different columns.

Pearson Correlation Coefficients


In [30]:
print(payandsales_df.corr())


                            Total_Dollar_Millions  Max_Dollar  \
Total_Dollar_Millions                    1.000000    0.782909   
Max_Dollar                               0.782909    1.000000   
Drug_Sales_Billions                      0.967341    0.650062   
Pay_per_PhysicianPer100000               0.981374    0.748141   

                            Drug_Sales_Billions  Pay_per_PhysicianPer100000  
Total_Dollar_Millions                  0.967341                    0.981374  
Max_Dollar                             0.650062                    0.748141  
Drug_Sales_Billions                    1.000000                    0.952157  
Pay_per_PhysicianPer100000             0.952157                    1.000000  

The total amount of dollars spent by industry (Total_Dollar_Millions) is strong correlated to the drug sales (Drug_Sales_Billions) with a correlation coefficient of 0.967. To accept or reject the null hypothesis we can run a t-test.

Running a t-test to determine the statistical significance


In [31]:
import scipy as sp
a = payandsales_df['Total_Dollar_Millions']
b = payandsales_df['Drug_Sales_Billions']
sp.stats.ttest_ind(a, b, axis=0, equal_var=True)


Out[31]:
Ttest_indResult(statistic=4.5373903110244127, pvalue=1.5878870617766979e-05)

Giving as a p-value of 1.6e-05 which allows us to reject the null hypothesis. We can therefore see that the payment to physicians and the sales from retail precription drugs are very highly correlated, and that this correlation is statistically significant.

Re-plotting the total payments to health care providers, with the exclusion of teaching hospitals, and having normalized the payments to the number of physicians in each state per 100,000 of the population.


In [32]:
import plotly.graph_objs as go
go.Choropleth


color = cl.scales['6']['seq']['Reds']
colorbns = cl.interp( color, 20 )
colorbns = cl.to_rgb( colorbns )

bin_array = np.linspace(0,1,len(colorbns))
bin_array.tolist()

scheme = []

for i in range(0, len(colorbns)):
    temp = []
    temp.append(bin_array[i])
    temp.append(colorbns[i])
    scheme.insert(i, temp)


data = [
        go.Choropleth(
            colorscale = scheme,
            autocolorscale = False,
            locations = payandsales_df['State_Code'],
            z = payandsales_df['Pay_per_PhysicianPer100000'],
            locationmode = 'USA-states',
            text= payandsales_df['State_Code'],
            marker = dict(
                line = dict (
                    color = 'rgb(255,255,255)',
                    width = 2
                )
            ),
            colorbar = dict(          
                title = '<b>Industry spending in US$</b>'
            ),            
             #color is picked from z value
            showscale = True,
        ),

]



layout = dict(           
              title="Health Care Industry Spending on Physicians<br> "
                    "Normalized to Number of Physicians per 100000, 2014", 
              geo = dict(
                         scope = "usa",
                         projection = dict( type= "albers usa" ),
                         showcoastlines = False,
                         showlakes = False,
              ),
)
    

figure = dict(data=data, layout=layout)
iplot(figure) #plot offline



In [33]:
payperphy_df = payandsales_df.sort_values('Pay_per_PhysicianPer100000', axis=0, ascending=True, inplace=False)
import plotly.graph_objs as go
data = [
    go.Bar(
        x = payperphy_df['State'],
        y = payperphy_df['Pay_per_PhysicianPer100000'],
   
        hoverinfo = "all",
        marker=dict(
            color= '#04BCE0',
            line=dict(
                color='#013A45',
                width=1.5,
            )
        ),
    )
]

layout = dict(
    title = "Drug Companies Spending in US$ <br> Scaled to Number of Physicians Per 100000, 2014", 
    xaxis = dict(
        #title = 'Manufacturer',
        tickangle = 40
    ),
    yaxis = dict(
        title = 'Total Amount in USD per Capita'
    ),

)
    

figure = dict(data=data, layout=layout)
iplot(figure) #plot offline


With the data normalized to the number of physicians per 100,000, we find that California still saw the highest spending by industry in 2014, with the state of Texas coming in at second place.

Exploring the Data Further

What were the top marketed drugs in 2014?


In [34]:
'''Here, we select for each drug promoted to care providers
and rank them by the total dollar amount spent by their respective
manufacturers in promoting them to physicians and teaching hospitals.'''
pd.set_option('display.max_rows', 2000)
df = pd.read_sql_query('SELECT DISTINCT UCASE(Name_of_Associated_Covered_Drug_or_Biological1) AS Drug, '
                       'SUM(Total_Amount_of_Payment_USDollars) AS Total_Dollar, '
                       'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name AS Manufacturer '
                       'FROM GeneralPay '
                       'WHERE Name_of_Associated_Covered_Drug_or_Biological1 IS NOT NULL '
                       'AND Name_of_Associated_Covered_Drug_or_Biological2 IS NULL '
                       'AND Name_of_Associated_Covered_Device_or_Medical_Supply1 IS NULL '
                       'AND Name_of_Associated_Covered_Drug_or_Biological1 NOT IN '
                       '("No Product Discussed", "Non-Product", "AccurusVitEquip", "NONCOVERED PRODUCT", '
                       '"NON BRAND", "None", "Product Candidate", "Spine Instrumentation", '
                       '"Business Development Activity", "OTHER PRODUCT", '
                       '"Specimen Indentification", "NO PRODUCT", "NOT APPLICABLE", "ENT Medical Devices" '
                       '"NA", "BioSurgery - Non Prod Related", "Non-Covered-Product", ' 
                       '"Non Franchise - Non Prod Related" '
                       '"RefractiveEquipment", "No Associated Product", "Product Development", '
                       '"Non Franchise - Research & Develop") '
                       'AND Name_of_Associated_Covered_Drug_or_Biological1 NOT LIKE "%TAP%" '
                       'GROUP BY Name_of_Associated_Covered_Drug_or_Biological1 '
                       'ORDER BY Total_Dollar DESC LIMIT 20', conn)

df = df.sort_values('Total_Dollar', axis=0, ascending=True, inplace=False,kind='quicksort',na_position='last')
df.head(4)


Out[34]:
Drug Total_Dollar Manufacturer
19 BOTOX THERAPEUTIC 8673624 Allergan Inc.
18 COPAXONE 8734827 Teva Pharmaceuticals USA, Inc.
17 BELVIQ 9064101 Eisai Inc.
16 PERJETA 9262371 Genentech, Inc.

Ranking the top 20 marketed drugs in 2014 by dollar value spent on health care providers


In [35]:
import plotly.graph_objs as go
data = [
    go.Bar(
        x = df['Drug'],
        y = df['Total_Dollar'],
   
        hoverinfo = "all",
        marker=dict(
            color= '#F05513',
            line=dict(
                color='#662002',
                width=1.5,
            )
        ),
    )
]

layout = dict(
    title = "Top 20 Most Promoted Drugs By Dollar Value, 2014", 
    xaxis = dict(
        #title = 'Drug Name',
    ),
    yaxis = dict(
        title = 'Total Amount in USD'
    ),

)
    

figure = dict(data=data, layout=layout)
iplot(figure) #plot offline


Of these drugs, five of them -- Humira, Rituxan, Avastin, Herceptin, and Copaxone, were also among the top 20 best-selling drugs of 2014 with combined sales of over 39 billion dollars. Interestingly, the first three highly promoted drugs, Rituxan, Avastin and Herceptin, are cancer drugs marketed by Roche pharmaceutical division (Rituxan is co-marketed by Biogen Idec too) with total combined sales in 2014 of 22.4 billion USD.


What about the drug and medical device manufacturers? How do the different companies rank in terms of total dollar amount spent on physicians and teaching hospitals while promoting their products?


In [36]:
'''Selection of drug and device manufacturers ranked by 
total dollar amount spent on physicians and teaching hospitals'''
pd.set_option('display.max_rows', 2000)
df = pd.read_sql_query('SELECT DISTINCT ' 
                       'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name AS Manufacturer, '
                       'SUM(Total_Amount_of_Payment_USDollars) AS Total_Dollar '
                       'FROM GeneralPay '
                       'WHERE Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name IS NOT NULL '
                       'GROUP BY Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name '
                       'ORDER BY Total_Dollar DESC LIMIT 20', conn)

df = df.sort_values('Total_Dollar', axis=0, ascending=True, inplace=False,kind='quicksort',na_position='last')
df.head(4)


Out[36]:
Manufacturer Total_Dollar
19 Intuitive Surgical, Inc. 28024245
18 Gilead Sciences Inc 31393905
17 Medtronic USA, Inc. 34539664
16 GlaxoSmithKline, LLC. 36038864

In [37]:
import plotly.graph_objs as go
data = [
    go.Bar(
        x = df['Manufacturer'],
        y = df['Total_Dollar'],
   
        hoverinfo = "all",
        marker=dict(
            color= '#04BCE0',
            line=dict(
                color='#013A45',
                width=1.5,
            )
        ),
    )
]

layout = dict(
    title = "Top 20 Drug and Medical Device Manufacturers<br> "
            "By Dollars Spent on Physicians and Teaching Hospitals, 2014", 
    xaxis = dict(
        #title = 'Manufacturer',
    ),
    yaxis = dict(
        title = 'Total Amount in USD'
    ),

)
    

figure = dict(data=data, layout=layout)
iplot(figure) #plot offline


Unsurprisingly, Genentech, Inc. which is now a subsidiary of Roche spent almost three times as much on health care providers as its closest rival, Topera, Inc. Topera, Inc. is a recent startup that has developed technology to map the electric signals of the heart.


How about retail sales of prescription drugs in each state?


In [38]:
pd.set_option('display.max_rows', 2000)
sales_df = pd.read_sql_query('SELECT DISTINCT State, ROUND(Total_Drug_Sales/1e9, 2) AS Total_Sales_Billion '
                             'FROM DrugSales '
                             'ORDER BY Total_Sales_Billion ASC',conn) 
sales_df.head()


Out[38]:
State Total_Sales_Billion
0 Wyoming 0.45
1 Alaska 0.47
2 District of Columbia 0.61
3 Montana 0.68
4 North Dakota 0.69

In [39]:
import plotly.graph_objs as go
data = [
    go.Bar(
        x = sales_df['State'],
        y = sales_df['Total_Sales_Billion'],
   
        hoverinfo = "all",
        marker=dict(
            color= '#04BCE0',
            line=dict(
                color='#013A45',
                width=1.5,
            )
        ),
    )
]

layout = dict(
    title = "Retail Sales for Prescription Drugs<br>Filled at Pharmacies in US$ (Billions), 2014", 
    xaxis = dict(
        #title = 'Manufacturer',
    tickangle = 40
    ),
    yaxis = dict(
        title = 'Total Amount in USD (Billions)'
    ),

)
    

figure = dict(data=data, layout=layout)
iplot(figure) #plot offline


Here again, California, the most populous state, saw the highest sales in total dollar amount of prescription drugs. A better way to look at these numbers though would be to scale them by the population in each state so that we can get an idea of the spending in prescription drugs per capita.

Spending in retail prescription drugs per capita in each state


In [40]:
salespercap_df = pd.read_sql_query('SELECT DrugSales.State, '
                                   'ROUND(DrugSales.Total_Drug_Sales/PhysicianCount.Total_Pop, 1) '
                                   'AS SalesPerCapita '
                                   'FROM PhysicianCount '
                                   'JOIN DrugSales '
                                   'ON PhysicianCount.State = DrugSales.State '
                                   'ORDER BY SalesPerCapita', conn)
salespercap_df.head()


Out[40]:
State SalesPerCapita
0 New Mexico 469.0
1 Utah 603.1
2 Colorado 613.7
3 Arizona 635.6
4 Alaska 636.7

In [41]:
import plotly.graph_objs as go
data = [
    go.Bar(
        x = salespercap_df['State'],
        y = salespercap_df['SalesPerCapita'],
   
        hoverinfo = "all",
        marker=dict(
            color= '#037b93',
            line=dict(
                color='#013b46',
                width=1.5,
            )
        ),
    )
]

layout = dict(
    title = "Retail Sales for Prescription Drugs<br>Filled at Pharmacies Per Capita in US$, 2014", 
    xaxis = dict(
        #title = 'Manufacturer',
    tickangle = 40
    ),
    yaxis = dict(
        title = 'Total Amount in USD per Capita'
    ),

)
    

figure = dict(data=data, layout=layout)
iplot(figure) #plot offline


Here we now find that the sales of precription drugs per capita were highest in West Virginia, meaning that in West Virginia, on average, each resident spent about 1,287 USD on prescription medicine in 2014. That is 11% more than the next highest state, Maine, and 64% more than the state of New Mexico. This numbers could be indicative of the cost of prescription drugs in each state. The disparity in spending per state is interesting as is shown by the following statistics on the data where we see a standard deviation of USD 163.


In [42]:
#Statistics on the disitribution of drug sales in the different states
print(salespercap_df.describe())


       SalesPerCapita
count       51.000000
mean       831.947059
std        163.151829
min        469.000000
25%        725.700000
50%        813.100000
75%        915.300000
max       1287.300000

Teaching Hospitals

A ranking of the spending by industry on teaching hospitals


In [43]:
pd.set_option('display.max_rows', 2000)
teaching_df = pd.read_sql_query('SELECT DISTINCT UCASE(Teaching_Hospital_Name) AS Teaching_Hospital, '
                                'SUM(Total_Amount_of_Payment_USDollars) AS Total_Dollar, '
                                'MAX(Total_Amount_of_Payment_USDollars) AS Max_Dollar ' 
                                'FROM GeneralPay '
                                'WHERE Teaching_Hospital_Name IS NOT NULL '
                                'AND Covered_Recipient_Type = "Covered Recipient Teaching Hospital" '
                                'GROUP BY Teaching_Hospital_Name '
                                'ORDER BY Total_Dollar DESC LIMIT 20', conn)

teachin_df = teaching_df.sort_values('Total_Dollar', axis=0, ascending=False,
                                     inplace=False,kind='quicksort',na_position='last')
teaching_df.head()


Out[43]:
Teaching_Hospital Total_Dollar Max_Dollar
0 CITY OF HOPE NATIONAL MEDICAL CENTER 251199365 11646800
1 MASSACHUSETTS GENERAL HOSPITAL 33799060 5029420
2 THE UNITY HOSPITAL OF ROCHESTER 20958502 12196417
3 DENVER HEALTH MEDICAL CENTER 12929518 250000
4 CLEVELAND CLINIC HOSPITAL 12876299 1218462

In [44]:
pd.set_option('display.max_rows', 2000)
max_df = pd.read_sql_query('SELECT DISTINCT Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name '
                           'AS Manufacturer, '
                           'Teaching_Hospital_Name AS Teaching_Hospital, '
                           'MAX(Total_Amount_of_Payment_USDollars) AS Max_Dollar '
                           'FROM GeneralPay '
                           'WHERE Teaching_Hospital_Name = "CITY OF HOPE NATIONAL MEDICAL CENTER" '
                           'GROUP BY Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name '
                           'ORDER BY Max_Dollar DESC', conn)
max_df.head()


Out[44]:
Manufacturer Teaching_Hospital Max_Dollar
0 Genentech, Inc. CITY OF HOPE NATIONAL MEDICAL CENTER 11646800
1 Abbott Laboratories CITY OF HOPE NATIONAL MEDICAL CENTER 34809
2 NeuWave Medical, Inc. CITY OF HOPE NATIONAL MEDICAL CENTER 30720
3 MRI Interventions, Inc. City of Hope National Medical Center 25803
4 ACCURAY INCORPORATED CITY OF HOPE NATIONAL MEDICAL CENTER 22286

In [45]:
import plotly.graph_objs as go
data = [
    go.Bar(
        x = teaching_df['Teaching_Hospital'],
        y = teaching_df['Total_Dollar'],
   
        hoverinfo = "all",
        marker=dict(
            color= '#A830E6',
            line=dict(
                color='#240236',
                width=1.5,
            )
        ),
    )
]

layout = dict(
    title = "Ranking of Teaching Hospitals by Amount of Dollars Recieved"
             "<br>From Drug and Medical Device Manufacturers, 2014", 
    xaxis = dict(
        #title = 'Manufacturer',
    ),
    yaxis = dict(
        title = 'Total Amount in USD'
    ),

)
    

figure = dict(data=data, layout=layout)
iplot(figure) #plot offline


The rankings of the different teaching hospitals show that the City of Hope National Medical Center, a major cancer treatment center, obtained more than USD 250 million from the pharmaceutical and medical device industry, with the highest single payment made being a USD 11.6 million contribution by Genentech, Inc.