I spend a lot of time reading academic papers, and I've noticed that journals tend to name their files poorly (in my opinion). Usually, a paper's filename is just its publication number, which is practically useless for finding a paper quickly or for scanning a directory for a particular title.

I wrote a short Python script to batch-rename PDF files so that each filename becomes the title of the document instead. To use the script, move all the PDF documents you want to rename into a single directory, and then set the "pdf_dir" variable in the script below to that directory. You will also need to install PyPDF2 from pip, if you don't already have it. The script uses the older PdfFileReader interface, which PyPDF2 3.0 removed, so install a release below 3.0:

pip install "PyPDF2<3"


In [1]:
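import os
import string

import PyPDF2

# Modify this directory
pdf_dir = r"C:\Test"

# Only allow these characters in the new filename
valid = "-_.() %s%s" % (string.ascii_letters, string.digits)

# Iterate over each file in the directory
for filename in os.listdir(pdf_dir):
    # Skip anything that is not a PDF (subdirectories, other file types)
    if not filename.lower().endswith(".pdf"):
        continue
    full_name = os.path.join(pdf_dir, filename)
    # Open each file and read it using PyPDF2; "with" closes the file
    # before the rename, even if reading fails
    with open(full_name, "rb") as f:
        pdf = PyPDF2.PdfFileReader(f)
        info = pdf.getDocumentInfo()
        # The info block or its title can be missing; treat that as empty
        title = ((info.title if info else None) or '') + '.pdf'
        #title = pdf.getOutlines()[0].title + '.pdf'
    # Strip every character that is not in the valid set
    new_filename = ''.join(c for c in title if c in valid)
    # Skip files with no usable title or that already have the right name
    if not new_filename[:-len('.pdf')].strip() or new_filename == filename:
        print("Skipping " + filename)
        continue
    # Make sure the filename is unique
    if os.path.exists(os.path.join(pdf_dir, new_filename)):
        base, ext = os.path.splitext(new_filename)
        ii = 1
        while os.path.exists(os.path.join(pdf_dir, base + "_" + str(ii) + ext)):
            ii += 1
        new_filename = base + "_" + str(ii) + ext
    # Rename the file
    full_new = os.path.join(pdf_dir, new_filename)
    os.rename(full_name, full_new)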
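
The character filtering and the uniqueness check do not depend on PyPDF2, so they can be tried on their own before pointing the script at a real directory. The following is a minimal sketch that pulls those two steps into helpers; the names sanitize_title and unique_name are my own for illustration and are not part of the script above or of any library.

```python
import os
import string
import tempfile

# Same whitelist as the script: letters, digits, and a few punctuation marks
VALID_CHARS = "-_.() " + string.ascii_letters + string.digits


def sanitize_title(title):
    """Drop every character that is not in VALID_CHARS (hypothetical helper)."""
    return "".join(c for c in title if c in VALID_CHARS)


def unique_name(directory, filename):
    """Return filename unchanged if it is free in directory; otherwise
    append _1, _2, ... to its base until an unused name is found."""
    if not os.path.exists(os.path.join(directory, filename)):
        return filename
    base, ext = os.path.splitext(filename)
    ii = 1
    while os.path.exists(os.path.join(directory, "%s_%d%s" % (base, ii, ext))):
        ii += 1
    return "%s_%d%s" % (base, ii, ext)


if __name__ == "__main__":
    # Try the helpers in a throwaway directory instead of a real PDF folder
    with tempfile.TemporaryDirectory() as tmp:
        name = sanitize_title("Deep Learning: A Review?") + ".pdf"
        print(name)                    # Deep Learning A Review.pdf
        open(os.path.join(tmp, name), "w").close()
        print(unique_name(tmp, name))  # Deep Learning A Review_1.pdf
```

Note that the whitelist silently drops accented letters and other non-ASCII characters, so titles in other languages may come out shortened.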

The main caveat to using this script is that the title of the paper is often not correctly stored in the PDF's "title" field. Sometimes, in fact, the "title" of the PDF is the journal name, which is useless for renaming.

I've found that the title of the paper is sometimes contained in the title of the first outline entry instead. To use that as the filename, uncomment this line in the script:

title = pdf.getOutlines()[0].title + '.pdf'

Fortunately, each journal is typically consistent about where it puts the title of the paper, so if you're converting files from the same journal, it's easy to choose between these two methods.
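
If a directory mixes papers from several journals, one possible refinement is to try both sources and keep the first one that looks usable. The sketch below is only an illustration of that idea: pick_title and journal_names are hypothetical names, the two string arguments stand in for whatever the metadata lookup and the first-outline lookup returned, and the example values are made up.

```python
def pick_title(metadata_title, outline_title, journal_names=()):
    """Return the first candidate that looks like a real paper title.

    A candidate is rejected if it is None, blank, or equal (ignoring case
    and surrounding whitespace) to one of the known journal names.
    Returns None when neither candidate is usable.
    """
    rejected = {name.strip().lower() for name in journal_names}
    for candidate in (metadata_title, outline_title):
        if candidate is None:
            continue
        cleaned = candidate.strip()
        if cleaned and cleaned.lower() not in rejected:
            return cleaned
    return None


# Made-up example values, not taken from any real PDF
journals = ["Journal of Examples"]
print(pick_title("Journal of Examples", "A Study of Widgets", journals))  # A Study of Widgets
print(pick_title("A Study of Widgets", None, journals))                   # A Study of Widgets
print(pick_title(None, "", journals))                                     # None
```

The list of journal names has to be maintained by hand, so this only helps for journals you already know put their own name in the title field.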