Reporting of clinical trial results became mandatory for many trials in 2008. However, this paper and this investigation both find that substantial numbers of clinical trials have not reported results, even trials for which the FDAAA makes reporting mandatory.
This notebook examines how many trials on ClinicalTrials.gov have had their results publicly reported. We use a broader definition than the FDAAA of a trial that should report its results. We count a trial as eligible for our analysis if:

- it is an interventional trial;
- its overall status is 'Completed';
- it is Phase 2 or later;
- its completion date falls between 1 January 2006 and two years before today;
- it has no results disposition (i.e. no certification to delay results) filed.
We then classify it as overdue if it has no summary results attached on ClinicalTrials.gov, and no results on PubMed that are linked by NCT ID (see below).
This is substantially broader than the FDAAA, which covers only US-based trials of FDA-approved drugs. However, we think all trials should report their results, not just US-based trials or trials of FDA-approved drugs. In addition, the FDAAA requires results to be reported within 12 months of completion, while we allow 24 months.
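As a minimal sketch (the field names and dict layout here are illustrative, not the notebook's actual data structures), the eligibility and overdue rules amount to a pair of predicates:

```python
from datetime import datetime

START = datetime(2006, 1, 1)  # earliest completion date we consider

def is_due_results(trial, cutoff):
    """True if a trial should have reported results by `cutoff`.

    Mirrors the filters applied to the dataframe later in this notebook:
    completed, interventional, Phase 2+, finished between START and
    `cutoff` (two years ago), with no results disposition filed.
    """
    return (trial['overall_status'] == 'Completed'
            and trial['study_type'].startswith('Interventional')
            and START <= trial['completion_date'] <= cutoff
            and trial['phase'] >= 2
            and trial['disposition_date'] is None)

def is_overdue(trial, cutoff):
    # Due results, but none on ClinicalTrials.gov and none linked on PubMed.
    return (is_due_results(trial, cutoff)
            and trial['results_date'] is None
            and not trial['pubmed_results'])
```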
ClinicalTrials.gov supplies notes on how to find studies with results, and on results more generally.
In [1]:
import csv
from datetime import datetime
from dateutil.relativedelta import relativedelta
import glob
from pprint import pprint
from slugify import slugify
import sqlite3
import numpy as np
import pandas as pd
import utils
The raw XML trial summaries from ClinicalTrials.gov are supplied as a single very large zip file, containing more than 200,000 XML files. This section assumes that these have already been downloaded and unzipped into the search_result directory.
Extract the fields of interest from the XML summaries, and save them to a CSV file, which we'll use as our source data for the rest of this exercise. ClinicalTrials.gov supplies field definitions.
Toggle REGENERATE_SUMMARY to False for the purposes of development.
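For reference, here is a minimal sketch of the kind of per-record extraction that `utils.extract_ctgov_xml` performs, assuming the legacy ClinicalTrials.gov XML layout (`<clinical_study>` root, `<id_info><nct_id>`); the real helper pulls out many more fields and handles missing elements:

```python
import xml.etree.ElementTree as ET

def extract_basic_fields(xml_text):
    """Pull a few headline fields out of one ClinicalTrials.gov record."""
    root = ET.fromstring(xml_text)
    def text(path):
        el = root.find(path)
        return el.text if el is not None else ''
    return {
        'nct_id': text('id_info/nct_id'),
        'title': text('brief_title'),
        'overall_status': text('overall_status'),
        'phase': text('phase'),
    }

sample = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>An example trial</brief_title>
  <overall_status>Completed</overall_status>
  <phase>Phase 2</phase>
</clinical_study>"""
```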
In [2]:
fname = './data/trials.csv'
REGENERATE_SUMMARY = False  # Set to True to re-extract from the XML files.
if REGENERATE_SUMMARY:
files = glob.glob('./search_result/*.xml')
print len(files), 'files found'
fieldnames = ['nct_id', 'title', 'overall_status',
'study_type', 'completion_date',
'lead_sponsor', 'lead_sponsor_class',
'collaborator', 'collaborator_class',
'phase', 'locations', 'has_drug_intervention', 'drugs',
'disposition_date', 'results_date',
'enrollment']
trials = csv.DictWriter(open(fname, 'wb'), fieldnames=fieldnames)
trials.writeheader()
for i, f in enumerate(files):
if i % 50000 == 0:
print i, f
text = open(f, 'r').read()
data = utils.extract_ctgov_xml(text)
trials.writerow(data)
print 'done'
In [17]:
dtype = {'has_drug_intervention': bool,
'phase': str}
converters = {'enrollment': lambda x: x and int(x) or 0}
datefields = ['completion_date', 'results_date', 'disposition_date']
df = pd.read_csv(fname,
parse_dates=datefields,
infer_datetime_format=True,
keep_default_na=False,
na_values=None,
converters=converters,
dtype=dtype)
In [18]:
df['phase_normalised'] = df.phase.apply(utils.normalise_phase)
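`utils.normalise_phase` converts the free-text `phase` field into an integer so it can be filtered numerically below. A plausible sketch (how the real helper treats combined phases like 'Phase 2/Phase 3' is an assumption; here we take the highest phase named):

```python
import re

def normalise_phase(phase):
    """Map e.g. 'Phase 2', 'Phase 2/Phase 3' or 'N/A' to an integer.

    Combined phases map to the highest phase named; strings with no
    phase number (e.g. 'N/A') map to 0.
    """
    numbers = [int(n) for n in re.findall(r'\d', phase or '')]
    return max(numbers) if numbers else 0
```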
In [19]:
df.tail()
Out[19]:
In [20]:
startdate = datetime.strptime('01 January 2006', '%d %B %Y')
cutoff = datetime.now() - relativedelta(years=2)
print 'Cutoff date', cutoff
df['is_completed'] = (df.overall_status == 'Completed') & \
(df.study_type.str.startswith('Interventional')) & \
(df.completion_date >= startdate) & \
(df.completion_date <= cutoff) & \
(df.phase_normalised >= 2) & \
(df.disposition_date.isnull())
df['is_overdue'] = (df.is_completed & \
df.results_date.isnull())
df_completed = df[df.is_completed]
df_overdue = df[df.is_overdue]
print len(df), 'total trials found'
print len(df[~df.disposition_date.isnull()]), 'trials have dispositions filed'
print len(df_completed), 'are completed and due results, by our definition'
print len(df[df.is_completed & ~df.results_date.isnull()]), \
'trials due results have submitted results on clinicaltrials.gov'
print len(df_overdue), \
'trials due results have not submitted results on clinicaltrials.gov'
If trials have reported their results on PubMed, and if it's possible to find them on PubMed using a linked NCT ID, then we count those trials as having submitted results.
So, for all trials that we regard as completed and due results, and that haven't already reported results on clinicaltrials.gov, we search PubMed, looking for the NCT ID either as a Secondary Source ID, or in the title/abstract. We look for anything published between the completion date and now, that doesn't have the words "study protocol" in the title, and that is classified as results of a trial (using the "therapy" clinical keyword, broad version).
At the time of writing, about 9,000 of the 34,000 trials have results on PubMed. An example of an NCT ID with results on PubMed: NCT02460380. (TODO: Update this).
Note 1: we know from the BMJ paper that there are trials that do have results on PubMed, but that aren't linked using the NCT ID. The BMJ authors found these using a manual search. Some examples: NCT00002762: 19487378, NCT00002879: 18470909, NCT00003134: 19066728, NCT00003596: 18430910. We do not count these as reported, because such results can only be found via an exhaustive manual search. We only count results as published for our purposes if they are either (i) submitted on ClinicalTrials.gov or (ii) retrievable on PubMed using the NCT ID. See more on this below.
Note 2: we know there are some trials that have results PMIDs directly in ClinicalTrials.gov, in the results_reference field of the XML. After discussion with Jess here, and Annice at ClinicalTrials.gov, I decided that these references are too often meaningless to be useful: much of the time they aren't truly results, but studies from years earlier.
Note 3: we also experimented with retrieving the results using the narrow version of the "therapy" clinical keyword, and using no clinical keyword at all. We evaluated these using multiple PubMed matches as a surrogate measure for false identification. At the time of writing (2016/10/24), we had examined 34,677 trial registry IDs: the broad keyword yielded 7,815 matches with 1,706 multiple matches; the narrow keyword yielded 6,448 matches with 1,238 multiple matches; and no keyword yielded 7,981 matches with 1,860 multiple matches. We chose the broad keyword for our final results.
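As a quick check on those figures, the multiple-match rate (our surrogate for false identification) implied by each strategy:

```python
def multiple_match_rate(matches, multiple):
    """Per cent of PubMed matches that were multiple matches."""
    return 100.0 * multiple / matches

# (matches, multiple matches) per strategy, from the 2016/10/24 run.
strategies = {'broad': (7815, 1706),
              'narrow': (6448, 1238),
              'no keyword': (7981, 1860)}
for name, (matches, multiple) in sorted(strategies.items()):
    print('%-10s %.1f%% multiple matches' % (
        name, multiple_match_rate(matches, multiple)))
```

The narrow filter gives the lowest multiple-match rate but also the fewest matches overall; as noted above, we settled on the broad filter.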
In [21]:
# Store results locally.
conn = sqlite3.connect('./data/trials-abstract.db')
cur = conn.cursor()
c = "CREATE TABLE IF NOT EXISTS trials(nct_id TEXT PRIMARY KEY, "
c += "pubmed_results INT, pubmed_results_broad INT, pubmed_results_narrow INT)"
cur.execute(c)
conn.commit()
REGENERATE_PUBMED_LINKS = False
count = 0
df['pubmed_results'] = False
for i, row in df_overdue.iterrows():
if count % 10000 == 0:
print count, row.nct_id
count += 1
# First, check for results stored in the local db.
c = "SELECT nct_id, pubmed_results, pubmed_results_broad, "
c += "pubmed_results_narrow FROM trials WHERE nct_id=?"
cur.execute(c, (row.nct_id,))
data = cur.fetchone()
has_results = False
if data and (not REGENERATE_PUBMED_LINKS):
has_results = bool(int(data[2]))
else:
# No local results, or we want to regenerate them: check PubMed.
broad_results = \
utils.get_pubmed_linked_articles(row.nct_id,
row.completion_date, 'broad')
# Used in the past (see note 3 above).
simple_results = \
utils.get_pubmed_linked_articles(row.nct_id,
row.completion_date, '')
narrow_results = \
utils.get_pubmed_linked_articles(row.nct_id,
row.completion_date, 'narrow')
c = "INSERT OR REPLACE INTO trials VALUES(?, ?, ?, ?)"
cur.execute(c, (row.nct_id, len(simple_results),
len(broad_results), len(narrow_results)))
conn.commit()
has_results = len(broad_results) > 0
df.set_value(i, 'pubmed_results', has_results)
cur.close()
conn.close()
print 'done'
In [22]:
# Reset dataframes now we have the results from PubMed.
df['is_overdue'] = (df.is_completed & df.results_date.isnull() & ~df.pubmed_results)
print 'How many of the unreported trials were found on PubMed:'
print df[df.is_completed & df.results_date.isnull()].pubmed_results.value_counts()
df_completed = df[df.is_completed]
df_overdue = df[df.is_overdue]
In [25]:
# Print summary stats for the entire dataset.
print len(df_completed), 'trials should have published results'
print len(df_overdue), 'trials have not published results'
percent_submitted = (1 - (len(df_overdue) / float(len(df_completed)))) * 100
print '%s%% of completed trials have published results' % \
'{:,.2f}'.format(percent_submitted)
print int(df_overdue.enrollment.sum()), 'total patients are enrolled in overdue trials'
In [26]:
# Print summary stats for major trial sponsors only.
NUM_TRIALS = 30
df_major = df_completed[
df_completed.groupby('lead_sponsor').nct_id.transform(len) >= NUM_TRIALS]
print len(df_major), 'trials by major sponsors should have published results'
print len(df_major[df_major.is_overdue]), 'trials by major sponsors have not published results'
percent_submitted = (1 - (len(df_major[df_major.is_overdue]) / float(len(df_major)))) * 100
print '%s%% of completed trials by major sponsors have published results' % \
'{:,.2f}'.format(percent_submitted)
print int(df_major[df_major.is_overdue].enrollment.sum()), 'total patients are enrolled in overdue trials'
In [27]:
df_completed.groupby('lead_sponsor_class').sum()[['is_overdue', 'is_completed']]
Out[27]:
In [28]:
# Calculate publication rates by sector (raw data)
df_by_sector = df_completed.groupby('lead_sponsor_class').sum()[['is_overdue', 'is_completed']]
df_by_sector['percent_overdue'] = df_by_sector.is_overdue / df_by_sector.is_completed * 100
df_by_sector
Out[28]:
In [30]:
# Calculate publication rates by sector (major sponsors only)
df_major_gp = df_major.groupby('lead_sponsor_class').sum()[['is_overdue', 'is_completed']]
df_major_gp['percent_overdue'] = df_major_gp.is_overdue / df_major_gp.is_completed * 100
df_major_gp
Out[30]:
In [31]:
df_completed['year_completed'] = df_completed['completion_date'].dt.year.dropna().astype(int)
# Drop all sponsors with fewer than N completed trials.
df_final = df_completed[
df_completed.groupby('lead_sponsor').nct_id.transform(len) >= NUM_TRIALS]
# Now reshape the data: a row for each sponsor, columns by year:
# lead_sponsor,2008_overdue,2008_total,2009_overdue,2009_total...
df_temp = df_final.set_index(['lead_sponsor', 'lead_sponsor_class', 'year_completed'])
gb = df_temp.groupby(level=[0, 1, 2]).is_overdue
df2 = gb.agg({'overdue': 'sum', 'total': 'count'}) \
.unstack().swaplevel(0, 1, 1).sort_index(1)
df2.columns = df2.columns.to_series().apply(lambda x: '{}_{}'.format(*x))
df3 = df2.reset_index()
df3['lead_sponsor_slug'] = df3.lead_sponsor.apply(slugify)
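The reshape above is fiddly, so here it is on a toy dataframe (written with named aggregation, which newer pandas versions require in place of the dict form used above):

```python
import pandas as pd

toy = pd.DataFrame({
    'lead_sponsor': ['A', 'A', 'A', 'B'],
    'lead_sponsor_class': ['Industry', 'Industry', 'Industry', 'Other'],
    'year_completed': [2008, 2008, 2009, 2008],
    'is_overdue': [True, False, True, False],
})
gb = toy.groupby(
    ['lead_sponsor', 'lead_sponsor_class', 'year_completed']).is_overdue
# One row per sponsor, an (overdue, total) column pair per year.
wide = (gb.agg(overdue='sum', total='count')
          .unstack()
          .swaplevel(0, 1, axis=1)
          .sort_index(axis=1))
wide.columns = ['{}_{}'.format(*c) for c in wide.columns]
print(wide.columns.tolist())
# ['2008_overdue', '2008_total', '2009_overdue', '2009_total']
```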
In [33]:
df3.to_csv('./data/completed.csv', index=None)
print len(df3), 'sponsors found with cutoff point at %s trials' % NUM_TRIALS
# Write the raw output to a full spreadsheet.
df.to_csv('./data/all.csv', index=None)
TODO: Make this a separate notebook?
A 2016 BMJ paper found that around two-thirds of trials reported results: "Overall, 2892 of the 4347 clinical trials (66.5%) had been published or reported results as of July 2014."
Excellently, the BMJ authors publish their raw data on DataDryad so we can compare our results with theirs, to get an idea of the difference between our automated strategy and their partially manual strategy. (However, in their reported data it looks to me like the matched PMID rate is 59.9% of all NCT IDs.)
The BMJ authors were looking at a much smaller set of trials than ours, because they focussed on academic medical centres. Their inclusion criteria also differ slightly from ours: they include pre-Phase-2 trials, and 'Terminated' as well as 'Completed' trials. And they used a manual search strategy, which involved searching Scopus and comparing results by hand.
In [15]:
from openpyxl import load_workbook
import sys
bmj_results = load_workbook(filename = './data/chen-bmj.xlsx')
In [16]:
nct_ids = {}
count = 0
has_pmid = 0
# The Excel data has multiple worksheets, sigh.
# And NCT IDs can occur more than once with different results, sigh.
# We only care about where there's at least one result.
# Fiddle about and reshape the data so that we know whether
# each NCT ID has a result.
for sheet in bmj_results.worksheets:
for i, row in enumerate(sheet.rows):
if i == 0:
continue
if row[0].value:
count += 1
if isinstance(row[6].value, long):
val = str(row[6].value)
else:
val = row[6].value
if val:
has_pmid += 1
# Always set val if it exists.
# Otherwise, only set val if there's no current value
# for this NCT ID.
if val:
nct_ids[row[0].value] = val
else:
if not row[0].value in nct_ids:
nct_ids[row[0].value] = val
print count, 'rows found in total'
print has_pmid, 'of those rows have a PMID'
print has_pmid / float(count) * 100, 'per cent of their NCT IDs have a PMID, including duplicates'
print
unique_nct_ids = len(nct_ids.keys())
print unique_nct_ids, '*unique* NCT IDs found in all rows'
pmids_found = sum(1 for x in nct_ids.values() if x)
print pmids_found, 'of these have PMIDs'
print pmids_found / float(unique_nct_ids) * 100, 'per cent of unique NCT IDs have a PMID'
In [17]:
df_bmj = pd.Series(nct_ids).to_frame(name='pmid')
df_bmj['pubmed_results'] = ~df_bmj.pmid.isnull()
df_bmj.index.name = 'nct_id'
df_bmj.reset_index(inplace=True)
print len(df_bmj), 'NCT IDs in the full BMJ dataset'
# df_bmj.head(20)
merged_results = \
pd.merge(df_bmj, df_completed, #[['nct_id', 'pubmed_results']],
on='nct_id', how='inner', suffixes=('_bmj', '_ours'))
# NB I tried this first with a left join: but 1521 out of the 4500 papers
# don't appear in our dataset, because the BMJ authors' inclusion criteria are
# different from ours. To get a sample after a left join...
# merged_results[merged_results.we_have_results.isnull()].head()
merged_results['we_have_results'] = ~merged_results.is_overdue
merged_results.we_have_results.value_counts(dropna=False)
# merged_results.head()
print len(merged_results), 'NCT IDs are in both the BMJ dataset and ours'
papers_both_find_pm_results = \
merged_results[merged_results.pubmed_results_bmj & merged_results.we_have_results]
papers_both_find_pm_results.head()
print len(papers_both_find_pm_results), 'we both find results for'
papers_only_they_find_results = \
merged_results[merged_results.pubmed_results_bmj & ~merged_results.we_have_results]
print len(papers_only_they_find_results), 'only they find results for'
papers_only_we_find_results = \
merged_results[~merged_results.pubmed_results_bmj & merged_results.we_have_results]
print len(papers_only_we_find_results), 'only we find results for'
noone_finds_results = \
merged_results[~merged_results.pubmed_results_bmj & ~merged_results.we_have_results]
print len(noone_finds_results), 'neither of us find results for'
In [18]:
# Examine a sample of the papers only they find results for.
cols = ['nct_id', 'title', 'pubmed_results_bmj', 'pmid', 'we_have_results']
papers_only_they_find_results.sample(10)[cols]
Out[18]:
In [19]:
# Papers only we find results for. If the `results_date` field exists, it
# means that the results are published on ClinicalTrials.gov. Otherwise
# we found results on PubMed but they did not - perhaps because
# it's been a couple of years since they did their search.
# We find 43 papers on PubMed that the BMJ authors don't:
print len(papers_only_we_find_results), 'papers for which only we find results'
print len(papers_only_we_find_results[papers_only_we_find_results.results_date.isnull()]),\
'of those we find on PubMed, the rest on ClinicalTrials.gov'
cols = ['nct_id', 'title', 'completion_date', 'pubmed_results_bmj',
'pmid', 'we_have_results', 'results_date']
# papers_only_we_find_results.sample(20)[cols]
papers_only_we_find_results[papers_only_we_find_results.results_date.isnull()].sample(10)[cols]
Out[19]:
In [20]: