This notebook sets out our initial exploratory analysis of the new ABPI Disclosure data on payments from drug companies to individual UK doctors and institutions. An iPython notebook allows full transparency on how the data was analysed, with results presented alongside the analytic code. This notebook was written by Anna Powell-Smith of www.ebmdatalab.net at the University of Oxford, with input from the investigations team at the BMJ and Ben Goldacre at the DataLab. You are welcome to use any of the findings below, but please note that the code has not been peer-reviewed and may contain errors; use these results at your own peril!
The ABPI data is released in two Excel sheets: a "payments" sheet containing every payment to an individual Healthcare Organisation (HCO) or individual Healthcare Professional (HCP) that has not been redacted, and an "aggregates" sheet containing total redacted payments by pharmaceutical company and category. For ease, we include CSV copies of these in the accompanying data directory. Please refer to the ABPI for licence details.
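For reference, here is a minimal sketch of how the two Excel sheets might be converted into those CSVs. The workbook filename and sheet names are assumptions, not taken from the ABPI download, so check them against the real file first.
In [ ]:
# Sketch only, not executed here: convert the ABPI Excel sheets into the CSVs used below.
# The filename and sheet names are assumptions; inspect xl.sheet_names to find the real ones.
import pandas as pd
xl = pd.ExcelFile('./data/abpi_disclosure.xlsx')  # hypothetical filename
print xl.sheet_names
pd.read_excel(xl, 'Payments').to_csv('./data/payments.csv', index=False)
pd.read_excel(xl, 'Aggregates').to_csv('./data/aggregates.csv', index=False)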
Journalists were told at the press conference that around 50% of the payments had been redacted, representing about 70% of the total value of all payments. Also, details of research funding are not included in the Disclosure database.
Useful links:
The ABPI licence (included with the raw data) seems to permit analysis. However, it appears that it may explicitly forbid redistribution of the data. The explanatory notes for the data, provided by the ABPI, are unclear; and neither the ABPI nor their data team have responded to questions in the 24 hours since the data was released.
In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.2f' % x)
dtype = {
'Title': str,
'First Name': str,
'Last Name': str,
'Speciality': str,
'Institution Name': str
}
df = pd.read_csv('./data/payments.csv', dtype=dtype)
In [2]:
print "{:,} rows".format(len(df))
print "£{:,} total declared".format(df['Amount'].sum())
print len(df['Company Name'].unique()), 'companies'
print len(df['Organisation Name'].unique()), 'organisations'
print len(df['Speciality'].unique()), 'specialities'
print len(df['Institution Name'].unique()), 'institutions'
# df.describe(include='all')
Quickly calculate the breakdown between payments to individuals and payments to organisations.
In [3]:
df_hcps = df[pd.isnull(df['Organisation Name'])]
df_hcos = df[~pd.isnull(df['Organisation Name'])]
a = "{:,} payments to HCOs ({:,.2f}% of total payments)"
print a.format(len(df_hcos), 100 * len(df_hcos) / float(len(df)))
a = "{:,} payments to HCPs ({:,.2f}% of total payments)"
print a.format(len(df_hcps), 100 * len(df_hcps) / float(len(df)))
print
a = "£{:,} paid to HCOs ({:,.2f}% of total payments)"
print a.format(df_hcos['Amount'].sum(), 100 * df_hcos['Amount'].sum() / df['Amount'].sum())
a = "£{:,} paid to HCPs ({:,.2f}% of total payments)"
print a.format(df_hcps['Amount'].sum(), 100 * df_hcps['Amount'].sum() / df['Amount'].sum())
In [4]:
# This is what the raw payments data looks like!
df.head()
Out[4]:
Most payments are small: 50% are below £355 and 75% are below £800. Only 1,164 payments are more than £10,000.
In [5]:
print df['Amount'].describe()
print len(df[df['Amount'] > 10000]), 'payments overall are more than £10,000'
print '\nPayments to organisations:'
print df_hcos['Amount'].describe()
print '\nPayments to individuals:'
print df_hcps['Amount'].describe()
In [6]:
df_hcps.sort_values('Amount', ascending=False).head()
Out[6]:
Why are there a few negative payments?
In [7]:
df.sort_values('Amount').head()[['Company Name', 'Organisation Name', 'TOV Type', 'Amount']]
Out[7]:
Payment distribution by each TOV (Transfer of Value) category and type, in the payments data.
In [8]:
df.groupby(('TOV Category', 'TOV Type')).sum().sort_values('Amount', ascending=False)
Out[8]:
In [9]:
df_aggregate = pd.read_csv('./data/aggregates.csv')
df_aggregate.head()
Out[9]:
In [48]:
undeclared_payments = df_aggregate['Amount'].sum()
declared_payments = df['Amount'].sum()
total_payments = undeclared_payments + declared_payments
print 'Total payments undeclared:', "£{:,}".format(undeclared_payments)
print 'Total payments declared:', "£{:,}".format(declared_payments)
print 'Percentage undeclared: {:.2f}%'.format(undeclared_payments / total_payments * 100)
df_agg_ex_rd = df_aggregate[df_aggregate['TOV Category'] != 'Research & Development']
undeclared_payments_ex_rd = df_agg_ex_rd['Amount'].sum()
print 'Total payments on R&D (none of which are declared):', "£{:,}".format(undeclared_payments - undeclared_payments_ex_rd)
print 'Total payments undeclared ex R&D:', "£{:,}".format(undeclared_payments_ex_rd)
We know that rows with a blank 'Organisation Name' column relate to HCPs, both in the aggregate data and in the declared payments data.
Both sets of data are grouped by company, HCP/HCO, and payment category.
We can compare the numbers by company, and overall, to get a sense of how many HCP payments are missing overall.
However, we can't work this out per individual HCP/HCO, because it's possible that the same HCP/HCO is listed twice in different payment categories in the aggregate data.
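As a quick illustration of this caveat, here is a sketch (not part of the analysis above) that counts, for each company, how many TOV categories contain redacted HCP rows. Any company with more than one such category could have the same individual counted once per category in 'No of HCP HCO in Aggregate'.
In [ ]:
# Sketch only: show why the aggregate counts can't be treated as distinct individuals.
# Redacted HCP rows are those with a blank 'Organisation Name' in the aggregate data.
agg_hcp_rows = df_aggregate[pd.isnull(df_aggregate['Organisation Name'])]
cats_per_co = agg_hcp_rows.groupby('Company Name')['TOV Category'].nunique()
print len(cats_per_co[cats_per_co > 1]), 'companies have redacted HCP rows in more than one TOV category'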
In [42]:
# Get aggregate figures for HCPs only.
df_agg_hcps = df_agg_ex_rd[pd.isnull(df_agg_ex_rd['Organisation Name'])].groupby('Company Name').sum().reset_index()
df_agg_hcps = df_agg_hcps[['Company Name', 'Amount', 'No of HCP HCO in Aggregate']]
df_agg_hcps.rename(columns={'No of HCP HCO in Aggregate': 'No of HCP'}, inplace=True)
# print df_agg_hcps.head()
# Count total undeclared payments by company/category/HCP
undeclared_hcps = df_agg_hcps['No of HCP'].sum()
print 'No of HCP payments by company/category, undeclared:', undeclared_hcps
declared_hcps = len(df_hcps)
print 'No of HCP payments by company/category, declared:', declared_hcps
v = undeclared_hcps / float(undeclared_hcps + declared_hcps)
print 'Overall % of payments to HCPs undeclared: {:.2f}%'.format(v * 100)
In [48]:
# Group the declared HCP payments by company, then merge with the aggregate data.
df_hcps_by_co = df_hcps.groupby('Company Name')['Amount'].agg(['sum','count']).reset_index()
df_hcps_by_co.rename(columns={'sum': 'Amount', 'count': 'No of HCP'}, inplace=True)
df_m = pd.merge(df_agg_hcps, df_hcps_by_co, on='Company Name', suffixes=(' Undeclared', ' Declared'))
df_m['Total HCPs'] = df_m['No of HCP Undeclared'] + df_m['No of HCP Declared']
df_m['% HCP Payments Undeclared'] = df_m['No of HCP Undeclared'] / df_m['Total HCPs'] * 100
df_m.sort_values('% HCP Payments Undeclared', ascending=False)
Out[48]:
With the limited and redacted data we have, perhaps the most meaningful thing we can look at is the amount of data that is undeclared!
The only things we know about the undeclared payments are the total amount for each company in each TOV category, and the number of HCOs/HCPs that the undeclared payments cover.
We can therefore derive, by comparing this with the payments that are declared, per-company values for the total amount that was undeclared, and the total number of HCOs/HCPs that chose not to declare themselves.
Perhaps this will reveal interesting differences.
In [19]:
# Get all aggregate payments but EXCLUDE R&D.
# This is because NO R&D payments are included in the payments sheet, so
# including them in our per-company analyses will distort results.
# Lump both HCPs and HCOs together.
df_agg_by_co = df_agg_ex_rd.groupby(('Company Name')).sum().reset_index()
df_agg_by_co.rename(columns={'No of HCP HCO in Aggregate': 'No of HCP HCO'}, inplace=True)
df_agg = df_agg_by_co[['Company Name', 'Amount', 'No of HCP HCO']]
# We've already created dataframes for HCOs and HCPs: now get the sums
# and counts of these, and merge them. There's probably a better way to do this.
df_hcos_by_co = df_hcos.groupby('Company Name')['Amount'].agg(['sum','count']).reset_index()
df_hcps_by_co = df_hcps.groupby('Company Name')['Amount'].agg(['sum','count']).reset_index()
df_m = pd.merge(df_hcos_by_co, df_hcps_by_co, on='Company Name', suffixes=('_hco', '_hcp'))
df_m['Amount'] = df_m['sum_hco'] + df_m['sum_hcp']
df_m['No of HCP HCO'] = df_m['count_hco'] + df_m['count_hcp']
df_declared = df_m[['Company Name', 'Amount', 'No of HCP HCO']]
df_combined = pd.merge(df_declared, df_agg, on='Company Name',
suffixes=(' Declared', ' Undeclared'))
# Calculate totals, proportions etc.
df_combined['Total Amount'] = df_combined['Amount Declared'] + df_combined['Amount Undeclared']
df_combined['Total HCOs/HCPs'] = df_combined['No of HCP HCO Declared'] + df_combined['No of HCP HCO Undeclared']
df_combined['Proportion of names undeclared'] = df_combined['No of HCP HCO Undeclared'] / \
df_combined['Total HCOs/HCPs'] * 100
df_combined['Proportion of total amount undeclared'] = df_combined['Amount Undeclared'] / \
df_combined['Total Amount'] * 100
# Print summary stats about the proportion of undeclared payments.
# Note that this number is lower than the 77% above, because we've excluded R&D.
print 'Total % names undeclared:', "{:.2f}%".format(df_combined['No of HCP HCO Undeclared'].sum() / \
df_combined['Total HCOs/HCPs'].sum() * 100)
print 'Total % amount undeclared:', "{:.2f}%".format(df_combined['Amount Undeclared'].sum() / \
df_combined['Total Amount'].sum() * 100)
In [12]:
# Show all companies for which total amount > £1 million,
# sorted by the total amount undeclared.
# Again, these are flattering numbers because we've excluded R&D.
# In most cases, it looks as though the payments that are undeclared are the
# smaller payments - i.e. probably those to individuals. Sanofi seems to be an exception.
# Note how GSK is an outlier too.
df_combined.sort_values('Proportion of total amount undeclared', inplace=True, ascending=False)
df_combined[df_combined['Total Amount'] > 1000000]
Out[12]:
In [13]:
# Double-check a sample row in the table above, against the raw data,
# to make sure we've got things right....
print df_agg_by_co[df_agg_by_co['Company Name'] == 'Napp Pharmaceuticals Ltd'][['Amount', 'No of HCP HCO']]
print df[df['Company Name'] == 'Napp Pharmaceuticals Ltd'].sum()['Amount']
print len(df_hcps[df_hcps['Company Name'] == 'Napp Pharmaceuticals Ltd']['Last Name'])
print len(df_hcos[df_hcos['Company Name'] == 'Napp Pharmaceuticals Ltd']['Organisation Name'])
In [14]:
# Save all data to CSV.
df_combined.to_csv('declared_vs_aggregated_payments_by_company.csv')
In [15]:
# As above, EXCLUDE R&D.
df_agg_by_cat = df_agg_ex_rd.groupby('TOV Category').sum().reset_index()
df_agg_cat = df_agg_by_cat[['TOV Category', 'Amount']]
df_by_cat = df.groupby(('TOV Category')).sum().reset_index()
df_merged_cat = pd.merge(df_agg_cat, df_by_cat, on=['TOV Category'],
suffixes=(' Aggregate', ' Declared'))
df_merged_cat['Total Amount'] = df_merged_cat['Amount Declared'] + \
df_merged_cat['Amount Aggregate']
df_merged_cat['Proportion Undeclared'] = df_merged_cat['Amount Aggregate'] / \
df_merged_cat['Total Amount']
df_merged_cat.sort_values('Proportion Undeclared', ascending=False, inplace=True)
df_merged_cat.to_csv('declared_vs_aggregated_payments_by_category.csv')
df_merged_cat
Out[15]:
Anyone using this should check manually that each of these is truly the same person, rather than several people who share the same name. We could reduce this possibility further by grouping by institution too, but then we might miss individuals who work across different institutions, so it is best to check manually (a sketch of the stricter grouping appears after the table below).
Note that individuals who don't remove themselves from the dataset are being transparent, and should be celebrated!
In [16]:
df_individual = df.groupby(('Last Name', 'First Name', 'Speciality')).sum().reset_index()
df_individual.sort_values(by='Amount', ascending=False).head(20)
Out[16]:
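As mentioned above, here is a sketch of the stricter grouping (not used for the table above): adding 'Institution Name' to the grouping keys reduces the chance of merging different people who share a name, at the cost of splitting individuals who work at more than one institution.
In [ ]:
# Sketch only: the stricter grouping discussed above, adding 'Institution Name'.
# This reduces the risk of merging namesakes, but may split people who work
# across more than one institution.
df_individual_strict = df.groupby(
    ['Last Name', 'First Name', 'Speciality', 'Institution Name']).sum().reset_index()
df_individual_strict.sort_values(by='Amount', ascending=False).head(20)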
In [17]:
df_by_co = df.groupby('Company Name').agg(['sum', 'mean', 'median', 'count'])
df_by_co.sort_values(by=('Amount', 'sum'), ascending=False).head(10)
Out[17]:
In [18]:
# df[['Company Name', 'TOV Category', 'Amount']].head()
df.groupby(('Company Name', 'TOV Category')).sum().sort_values('Amount', ascending=False).head(10)
Out[18]:
In [19]:
df_by_co_and_speciality = df.groupby(('Company Name', 'Speciality')).sum()\
.sort_values('Amount', ascending=False).reset_index()
total = df_by_co_and_speciality.groupby('Company Name')['Amount'].transform('sum')
df_by_co_and_speciality['% of Co Spend'] = df_by_co_and_speciality['Amount']/total * 100
print df_by_co_and_speciality.head(10)
# Validate % calculation
# print df_by_co_and_speciality[df_by_co_and_speciality['Company Name'] == 'AstraZeneca']['% of Co Spend'].sum()
df_by_co_and_speciality.to_csv('payments_by_co_and_speciality.csv')
In [20]:
df_by_org = df.groupby('Organisation Name').agg(['sum', 'mean', 'median', 'count'])
df_by_org.sort_values(by=('Amount', 'sum'), ascending=False).head(10)
Out[20]:
In [21]:
df_by_spec = df.groupby('Speciality').agg(['sum', 'mean', 'median', 'count']).sort_values(by=('Amount', 'sum'), ascending=False)
print df_by_spec.head(10)
df_by_spec.to_csv('by_speciality.csv')
In [44]:
# Normalise the specialties
def specialty_classifier(row):
    other = ['Healthcare Administration',
             'Microbiology',
             'Laboratory - Medical Analysis',
             'Research',
             'Occupational Therapist',
             'Wholesaler',
             'Miscellaneous']
    pharmacists = ['Clinical Pharmacology', 'Pharmacist']
    nurses = ['Nurse']
    classification = 'HCO'
    if row['Speciality'] in other:
        classification = 'Other'
    elif row['Speciality'] in pharmacists:
        classification = 'Pharmacists'
    elif row['Speciality'] in nurses:
        classification = 'Nurses'
    elif str(row['Speciality']) != "nan":
        # XXX there must be a nicer way to filter out pd.nan values...
        classification = 'Doctors'
    return classification
df['Normalised specialty'] = df.apply(specialty_classifier, axis=1)
# Re-run the analysis
df.groupby('Normalised specialty').agg(['sum', 'mean', 'median', 'count']).sort_values(by=('Amount', 'sum'), ascending=False)
Out[44]:
In [56]:
df_individual = df.groupby(('Last Name', 'First Name', 'Speciality')).sum().reset_index()
df_individual.groupby('Speciality').count()[['Amount']].sort_values("Amount", ascending=False)
Out[56]:
In [55]:
df_individual = df.groupby(('Last Name', 'First Name', 'Normalised specialty')).sum().reset_index()
df_individual.groupby('Normalised specialty').count()[['Amount']].sort_values("Amount", ascending=False)
Out[55]:
In [24]:
df_sample = df[(df['Company Name'] == 'Genzyme') & (df['TOV Type'] == 'Fees')]
df_sample.describe(include='all')
# print len(df_sample), 'rows'
# df_sample['Amount'].sum()
Out[24]:
In [25]:
%pylab inline
plt.figure()
bin_range = np.arange(0, 1100000, 10000)
df['Amount'].plot(kind='hist', bins=bin_range, color='blue', alpha=0.6)
plt.ylabel("Number of payments")
plt.xlabel("Payment size")
plt.yscale('log', nonposy='clip')
plt.title("Payments by amount")
plt.grid()
In [ ]: