Matching Publications to PubMed IDs

2017-12-10

Earlier this week, I got an e-mail from Lucia Santamaria of the "Gender Gap in Science" project at the International Council for Science. They are trying to systematically measure the gender gap in a number of ways, including looking at publication records. One of the important parts of their effort is to find ways to validate methods of gender assignment or inference.

Lucia was writing about some data found in the paper I wrote with Melanie Stefan about women in computational biology. In particular, she wanted the dataset that we got from Filardo et al. that had a list of ~3000 journal articles with known first-author genders. This is a great dataset to have, since the authors of that paper did all the hard work of actually going one-by-one through papers and finding the author genders by hand (often searching institutional websites and social media profiles for pictures etc).

This allowed us to validate our gender inference based on author first name against a dataset where the truth is known (it did pretty well).

And that's what Lucia wants to do as well. There's just one problem: the form of the data from Filardo and colleagues is human-readable, but not machine-readable. For example, the first two rows of the data look like this:

Article_ID | Full_Title | Link_to_Article | Journal | Year | Month | First Author Gender
Aaby et al. (2010) | Non-specific effects of standard measles vaccine at 4.5 and 9 months of age on childhood mortality: randomised controlled trial | NA | BMJ | 2010 | 12 - Dec | Male
Aaron et al. (2007) | Tiotropium in Combination with Placebo Salmeterol or Fluticasone–Salmeterol for Treatment of Chronic Obstructive Pulmonary Disease: A Randomized Trial | http://annals.org/article.aspx?articleid=734106 | Annals of Internal Medicine | 2007 | 04 - Apr | Male

What we'd really like to have is some unique identifier (e.g. the PubMed ID or DOI) associated with each record. That would make it much easier to cross-reference with other datasets, including the table of hundreds of thousands of publications that we downloaded as part of this study. I had this when we published (it's how we validated our method), but I'm embarrassed to say it didn't get included with the paper, and I recently had to delete my old computer backups, which were the only place it lived.

I told her how we initially did this, and it sounds like she managed to make it work herself, but I thought it would be worth documenting how I did it originally (and maybe improve it slightly). So we've got a table with titles, author last names, years, and journals, and what we want is a PubMed ID or DOI.

I'm using Julia as my go-to language these days, and the first step is to load in the table as a dataframe that we can manipulate:


In [4]:
using DataFrames
using CSV

df = CSV.read("../data/known-gender.csv")

# get an array with the "article id" as a string
ids = Vector{String}(df[:Article_ID])
@show ids[1]


ids[1] = "Aaby et al. (2010)"
Out[4]:
"Aaby et al. (2010)"

We can see that the current form of the :Article_ID has both the author name and the year, but we want to be able to handle these separately. Luckily, they all take the same form: the author name, sometimes followed by "et al.", followed by the year in parentheses. So I'm going to use regular expressions, which can look like gobbledegook if you're not familiar with them. It's beyond the scope of this post to describe them, but I highly recommend regexr as a resource for learning and testing. The regex I used for this is composed of 3 parts:

  • The author name, which can contain letters, spaces, -, and ': ([\w\s\-']+)
    • the name is at the beginning of the line, so I add a ^ (just to be safe)
  • Then " et al." (including the space before it):  et al\.
    • the . is a special character, so it's escaped with \
    • this is also optional, so I wrap it in parentheses and add ?
  • The year, which is 4 digits wrapped in parentheses and preceded by a space:  \((\d{4})\)
    • the outer parentheses are literal, so they're escaped with \
    • the inner parentheses are so I can grab the year as a group
    • it should be the end, so I finish with $

So, the complete regex: ^([\w\s\-']+)( et al\.)? \((\d{4})\)$
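
To see what the three capture groups hold, here is a minimal sketch that applies the same regex to a few strings. It is written for Julia 1.x and is not the code from the original analysis; the names `ID_PATTERN` and `parse_article_id`, and every sample string except the first, are made up for illustration.

```julia
# Same pattern as above: author, optional " et al.", then the year in parentheses.
const ID_PATTERN = r"^([\w\s\-']+)( et al\.)? \((\d{4})\)$"

# Hypothetical helper: split an article ID into its parts, or return
# `nothing` if the string does not have the expected form.
function parse_article_id(id::AbstractString)
    m = match(ID_PATTERN, id)
    m === nothing && return nothing
    # captures[1] = author, captures[2] = " et al." or nothing, captures[3] = year
    return (author = String(m.captures[1]),
            etal   = m.captures[2] !== nothing,
            year   = parse(Int, m.captures[3]))
end

first_row = parse_article_id("Aaby et al. (2010)")   # from the table above
solo      = parse_article_id("O'Brien (1999)")       # made-up single-author ID
bad       = parse_article_id("not an id")            # no year, so no match
```

Because the author group is greedy and also accepts spaces, it first swallows "Aaby et al" and then backtracks until the optional " et al." group can match, which is why the first capture ends up as just "Aaby".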

And now I want to apply that search to everything in the Article_ID column:


In [7]:
# confirm this matches every row
sum(ismatch.(r"^([\w\s\-']+)( et al\.)? \((\d{4})\)$", ids)) # 3204

# apply match across the whole ids array with `match.()`
matches = match.(r"^([\w\s\-']+)( et al\.)? \((\d{4})\)$", ids)

# get the last name of first authors
firstauthors = [m.captures[1] for m in matches]
# get the year... yeah there's a column for this, but since we have it...
years = [parse(Int, m.captures[3]) for m in matches]


Out[7]:
3204-element Array{Int64,1}:
 2010
 2007
 2004
 2008
 2007
 2011
 2002
 2005
 2003
 2013
 2013
 2005
 1998
    ⋮
 2012
 2013
 2000
 2000
 2010
 2013
 2008
 2005
 2001
 1999
 2007
 2011

We also have a column for the journal names, but PubMed searching is a bit idiosyncratic and works better if you use the right abbreviations. So next, I built a dictionary of abbreviations, and used an array comprehension to get a new array with the journal names:


In [9]:
journals = Vector{String}(df[:Journal])

replacements = Dict(
    "Archives of Internal Medicine" => "Arch Intern Med",
    "Annals of Internal Medicine" => "Ann Intern Med",
    "The Lancet" => "Lancet",
    "NEJM" => "N Engl J Med"
    )

# look up each journal, falling back to the original name if there's no abbreviation
journals = [get(replacements, x, x) for x in journals]


Out[9]:
3204-element Array{String,1}:
 "BMJ"            
 "Ann Intern Med" 
 "Arch Intern Med"
 "N Engl J Med"   
 "JAMA"           
 "Ann Intern Med" 
 "N Engl J Med"   
 "BMJ"            
 "Ann Intern Med" 
 "BMJ"            
 "BMJ"            
 "BMJ"            
 "Arch Intern Med"
 ⋮                
 "Arch Intern Med"
 "Lancet"         
 "Arch Intern Med"
 "Arch Intern Med"
 "Ann Intern Med" 
 "Arch Intern Med"
 "JAMA"           
 "Arch Intern Med"
 "JAMA"           
 "Lancet"         
 "Lancet"         
 "Ann Intern Med" 

Next, words like "and", "in", and "over" don't help much when searching, and the titles currently have special characters (like : or ()) that also don't help, or could even hurt, our ability to search. So I took the title column and built new strings using only words that are 5 characters or longer. I did this all in one go, but to explain: the matchall() function finds all of the words with 5 or more characters (that's \w{5,} in regex) and returns an array of matches. Then the join() function puts them together into a single string (separated by a space, since I passed ' ' as an argument):


In [10]:
titles = join.(matchall.(r"\w{5,}", Vector{String}(df[:Full_Title])), ' ')


Out[10]:
3204-element Array{String,1}:
 "specific effects standard measles vaccine months childhood mortality randomised controlled trial"                                                     
 "Tiotropium Combination Placebo Salmeterol Fluticasone Salmeterol Treatment Chronic Obstructive Pulmonary Disease Randomized Trial"                    
 "Blocker Dialysis Patients Association Hospitalized Heart Failure Mortality"                                                                           
 "Safety Immunogenicity AS02D Malaria Vaccine Infants"                                                                                                  
 "Strain Acute Recurrent Coronary Heart Disease Events"                                                                                                 
 "Comparative Effectiveness Management Interventions Fracture Systematic Review"                                                                        
 "Cardiac Resynchronization Chronic Heart Failure"                                                                                                      
 "Utility testing monoclonal bands serum patients suspected osteoporosis retrospective cross sectional study"                                           
 "Short Effects Cannabinoids Patients Infection Randomized Placebo Controlled Clinical Trial"                                                           
 "Effect lower sodium intake health systematic review analyses"                                                                                         
 "Effect increased potassium intake cardiovascular factors disease systematic review analyses"                                                          
 "Rectal artemether versus intravenous quinine treatment cerebral malaria children Uganda randomised clinical trial"                                    
 "Hyperkalemia Hospitalized Patients Causes Adequacy Treatment Results Attempt Improve Physician Compliance Published Therapy Guidelines"               
 ⋮                                                                                                                                                      
 "Supratherapeutic Dosing Acetaminophen Among Hospitalized Patients"                                                                                    
 "Efficacy safety immunology inactivated adjuvant enterovirus vaccine children China multicentre randomised double blind placebo controlled phase trial"
 "Frequency Major Hemorrhage Patients Treated Unfractionated Intravenous Heparin Venous Thrombosis Pulmonary Embolism Study Routine Clinical Practice"  
 "Patients Depression Likely Follow Recommendations Reduce Cardiac During Recovery Myocardial Infarction"                                               
 "Glucose Independent Black White Differences Hemoglobin Levels Cross sectional Analysis Studies"                                                       
 "Health Associated Infections analysis Costs Financial Impact Health System"                                                                           
 "Effectiveness Specialized Palliative Systematic Review"                                                                                               
 "Regional Institutional Variation Initiation Early Resuscitate Orders"                                                                                 
 "Association Between Polymorphism Transforming Growth Factor Breast Cancer Among Elderly White Women"                                                  
 "Effect vitamin frequency reflex sympathetic dystrophy wrist fractures randomised trial"                                                               
 "Artemether lumefantrine versus amodiaquine sulfadoxine pyrimethamine uncomplicated falciparum malaria Burkina randomised inferiority trial"           
 "Patient Interest Sharing Personal Health Record Information Based Survey"                                                                             

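The one-liner above can be unpacked into separate steps. This is a sketch for Julia 1.x, where `matchall()` is no longer available and `eachmatch()` does the same job; the function name `title_keywords` and its `minlen` keyword are my own inventions for illustration.

```julia
# Hypothetical helper: keep only "words" of at least `minlen` word characters
# and join them with single spaces.
function title_keywords(title::AbstractString; minlen::Int = 5)
    pattern = Regex("\\w{$(minlen),}")                     # e.g. \w{5,}
    words = [m.match for m in eachmatch(pattern, title)]   # every match, in order
    return join(words, ' ')
end

# The first title from the table
keywords = title_keywords("Non-specific effects of standard measles vaccine at 4.5 and 9 months of age on childhood mortality: randomised controlled trial")
```

Note that punctuation splits words, so "Non-specific" is treated as "Non" and "specific", and only the second part is long enough to survive.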
So now we've got all the elements of our search, and I just put them into an array of tuples to make them easier to deal with:


In [12]:
searches = collect(zip(titles, firstauthors, journals, years))


Out[12]:
3204-element Array{Tuple{String,SubString{String},String,Int64},1}:
 ("specific effects standard measles vaccine months childhood mortality randomised controlled trial", "Aaby", "BMJ", 2010)                                                                 
 ("Tiotropium Combination Placebo Salmeterol Fluticasone Salmeterol Treatment Chronic Obstructive Pulmonary Disease Randomized Trial", "Aaron", "Ann Intern Med", 2007)                    
 ("Blocker Dialysis Patients Association Hospitalized Heart Failure Mortality", "Abbott", "Arch Intern Med", 2004)                                                                         
 ("Safety Immunogenicity AS02D Malaria Vaccine Infants", "Abdulla", "N Engl J Med", 2008)                                                                                                  
 ("Strain Acute Recurrent Coronary Heart Disease Events", "Aboa-Eboule", "JAMA", 2007)                                                                                                     
 ("Comparative Effectiveness Management Interventions Fracture Systematic Review", "Abou-Setta", "Ann Intern Med", 2011)                                                                   
 ("Cardiac Resynchronization Chronic Heart Failure", "Abraham", "N Engl J Med", 2002)                                                                                                      
 ("Utility testing monoclonal bands serum patients suspected osteoporosis retrospective cross sectional study", "Abrahamsen", "BMJ", 2005)                                                 
 ("Short Effects Cannabinoids Patients Infection Randomized Placebo Controlled Clinical Trial", "Abrams", "Ann Intern Med", 2003)                                                          
 ("Effect lower sodium intake health systematic review analyses", "Aburto", "BMJ", 2013)                                                                                                   
 ("Effect increased potassium intake cardiovascular factors disease systematic review analyses", "Aburto", "BMJ", 2013)                                                                    
 ("Rectal artemether versus intravenous quinine treatment cerebral malaria children Uganda randomised clinical trial", "Aceng", "BMJ", 2005)                                               
 ("Hyperkalemia Hospitalized Patients Causes Adequacy Treatment Results Attempt Improve Physician Compliance Published Therapy Guidelines", "Acker", "Arch Intern Med", 1998)              
 ⋮                                                                                                                                                                                         
 ("Supratherapeutic Dosing Acetaminophen Among Hospitalized Patients", "Zhou", "Arch Intern Med", 2012)                                                                                    
 ("Efficacy safety immunology inactivated adjuvant enterovirus vaccine children China multicentre randomised double blind placebo controlled phase trial", "Zhu", "Lancet", 2013)          
 ("Frequency Major Hemorrhage Patients Treated Unfractionated Intravenous Heparin Venous Thrombosis Pulmonary Embolism Study Routine Clinical Practice", "Zidane", "Arch Intern Med", 2000)
 ("Patients Depression Likely Follow Recommendations Reduce Cardiac During Recovery Myocardial Infarction", "Ziegelstein", "Arch Intern Med", 2000)                                        
 ("Glucose Independent Black White Differences Hemoglobin Levels Cross sectional Analysis Studies", "Ziemer", "Ann Intern Med", 2010)                                                      
 ("Health Associated Infections analysis Costs Financial Impact Health System", "Zimlichman", "Arch Intern Med", 2013)                                                                     
 ("Effectiveness Specialized Palliative Systematic Review", "Zimmermann", "JAMA", 2008)                                                                                                    
 ("Regional Institutional Variation Initiation Early Resuscitate Orders", "Zingmond", "Arch Intern Med", 2005)                                                                             
 ("Association Between Polymorphism Transforming Growth Factor Breast Cancer Among Elderly White Women", "Ziv", "JAMA", 2001)                                                              
 ("Effect vitamin frequency reflex sympathetic dystrophy wrist fractures randomised trial", "Zollinger", "Lancet", 1999)                                                                   
 ("Artemether lumefantrine versus amodiaquine sulfadoxine pyrimethamine uncomplicated falciparum malaria Burkina randomised inferiority trial", "Zongo", "Lancet", 2007)                   
 ("Patient Interest Sharing Personal Health Record Information Based Survey", "Zulman", "Ann Intern Med", 2011)                                                                            

Finally, I iterated through this array and composed searches, using the BioServices.EUtils package to do the search and retrieval.


In [ ]:
using BioServices.EUtils
using EzXML # provides parsexml

# one entry per search: the pmid, 0 for no hits, or 9 for more than one hit
pmids = Int[]

for s in searches
    term = "($(s[1]) [title]) AND ($(s[2]) [author]) AND ($(s[3]) [journal]) AND ($(s[4]) [pdat])"

    #= if you do too many queries in a row, esearch raises an exception. The solution
    is to pause (here for 10 seconds) and then try again. Assigning the value of the
    whole `try` block keeps `res` visible after it. =#
    res = try
        esearch(db="pubmed", term=term)
    catch
        sleep(10)
        esearch(db="pubmed", term=term)
    end

    doc = parsexml(res.data)
    #= this returns an array of pmids, since in the xml, they're separated by
    newlines. The `strip` function removes leading and trailing newlines, but
    not ones in between ids. =#
    i = split(content(findfirst(doc, "//IdList")) |> strip, '\n')

    if length(i) == 1
        #= if no ids are returned, there's an array with just an empty string,
        in which case I add 0 to the array =#
        length(i[1]) == 0 ? push!(pmids, 0) : push!(pmids, parse(Int, i[1]))
    else
        # if more than 1 pmid is returned, I just add 9 to the array
        push!(pmids, 9)
    end
end
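
The search term and the pause-and-retry logic in the loop above could also be pulled out into small helpers. This is only a sketch in Julia 1.x, not what I originally ran: `pubmed_term` and `with_retry` are hypothetical names, and the `esearch` call is shown only in a comment so that the block runs without network access or extra packages.

```julia
# Hypothetical helper: build the PubMed search term from one
# (title, author, journal, year) tuple.
function pubmed_term(s::Tuple)
    title, author, journal, year = s
    return "($title [title]) AND ($author [author]) AND ($journal [journal]) AND ($year [pdat])"
end

# Hypothetical helper: call `f`, and if it throws, pause and try again,
# giving up (and rethrowing the error) after `attempts` tries.
function with_retry(f; attempts::Int = 3, pause::Real = 10)
    for attempt in 1:attempts
        try
            return f()
        catch
            attempt == attempts && rethrow()
            sleep(pause)
        end
    end
end

term = pubmed_term(("Cardiac Resynchronization Chronic Heart Failure", "Abraham", "N Engl J Med", 2002))

# In the loop, the search would then look something like:
#   res = with_retry(() -> esearch(db="pubmed", term=pubmed_term(s)))
```

Unlike the single retry in the loop, this version tries a fixed number of times before giving up, so one stubborn failure doesn't stop the whole run on its second attempt.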

This actually did better than the last time I tried it: 2447 records were associated with exactly 1 pmid, 13 of the searches returned more than 1 pmid, and the rest (744) didn't return any hits.

@show sum(pmids .> 9)
@show sum(pmids .== 9)
@show sum(pmids .== 0)
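
To make the 0 / 9 / pmid bookkeeping easier to check in isolation, here is a small sketch in Julia 1.x. The function name `classify_ids` is hypothetical, and the input strings are made-up stand-ins for the text content of an `IdList` element, not real search results. Unlike the loop above, it splits on any whitespace, which is a slight variation on what I did.

```julia
# Hypothetical helper: map the text content of an IdList to a single Int,
# using the same convention as above (0 = no hits, 9 = more than one hit).
function classify_ids(raw::AbstractString)
    ids = split(raw)              # split on any whitespace, dropping empty pieces
    isempty(ids) && return 0
    length(ids) > 1 && return 9
    return parse(Int, ids[1])
end

# Made-up IdList contents for illustration only
raw_results = ["\n11111111\n", "", "\n22222222\n33333333\n", "\n44444444\n"]
example_pmids = classify_ids.(raw_results)

n_unique   = count(x -> x > 9, example_pmids)    # exactly one pmid
n_multiple = count(x -> x == 9, example_pmids)   # more than one pmid
n_none     = count(x -> x == 0, example_pmids)   # no hits
```

The three counts should always add up to the number of searches, which is a quick sanity check on the real results too (2447 + 13 + 744 = 3204).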

I added the ids to the dataframe and saved it as a new file (so I don't have to do the searching again). Since I have the URLs for many of the papers, I was thinking I could try to identify the DOIs associated with them from their webpages, but that will have to wait until another time.

For now, the last step is to save it as a CSV and send it off to Lucia:


In [ ]:
CSV.write("../data/withpmids.csv", df)