PDF stands for "Portable Document Format" and PDF files contain images, text, and page layout information. PDF files are actually programs in a very simple programming language and, hence, can display just about anything. Much of what you see inside a PDF file is text, however, and we can grab that text without the layout information using poppler. (I used to use pdfminer, but it somehow no longer works on OS X.) Install it with:
brew install poppler
or
brew upgrade poppler
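(If you're on Linux rather than a Mac, there's no brew, but poppler is typically available from the system package manager; on Debian/Ubuntu, for example, the package is usually called poppler-utils:)
sudo apt-get install poppler-utils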
Then use pdftotext as a command from the command line, which will extract the text and save it in a text file. First download a sample PDF, such as the Tesla Model S brochure, which we can easily do from the command line using curl (which you might have to install):
In [5]:
! curl https://www.tesla.com/sites/default/files/tesla-model-s.pdf > /tmp/tsla.pdf
That command downloads the file and, because of the redirection operator >, the output gets written to tsla.pdf in the /tmp directory.
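(Equivalently, you can skip the redirection and have curl write the file itself with its -o option; adding -s silences the progress meter:)
curl -s https://www.tesla.com/sites/default/files/tesla-model-s.pdf -o /tmp/tsla.pdf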
Once we have the data, we can pass the filename to pdftotext to extract the text:
In [14]:
! pdftotext /tmp/tsla.pdf # saves into /tmp/tsla.txt
(Don't worry about those warnings.)
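By default, pdftotext writes the text next to the input with a .txt extension, as the comment above notes. If you'd rather put the text somewhere else, pdftotext also accepts an explicit output filename as its second argument (the name here is just for illustration):
pdftotext /tmp/tsla.pdf /tmp/tsla-text.txt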
In [7]:
! head -10 /tmp/tsla.txt
Once you have text output, you can perform whatever analysis you'd like without having to worry about the data coming in PDF form. For example, you might want to run some analysis on financial documents, but they are all in PDF form. First, convert to text and then perform your analysis.
In [8]:
with open('/tmp/tsla.txt') as f:
    print(f.read().split()[:100])
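If all you need is a quick summary statistic rather than the tokens themselves, you don't even need Python; for example, wc -w counts the words in the extracted file:
wc -w /tmp/tsla.txt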
It's often the case that we can do a huge amount of cleanup on unstructured text before using Python to process it more formally. We can delete unwanted characters, squeeze repeated characters, reformat, and so on. In this section you will do a number of exercises that get you used to processing files from the command line. If you'd like to dig further, you can see this link.
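For instance, tr's -d option deletes unwanted characters outright; here's a sketch that strips all punctuation from the text (the [:punct:] character class should work in both the BSD and GNU versions of tr):
tr -d '[:punct:]' < /tmp/tsla.txt | head -c 150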
The operating system launches all commands in a pipeline sequence as separate processes, which means they can run on multiple processors simultaneously. This gives us parallel processing without having to write complicated code. As each stage produces output, it passes that output to the next stage of the pipeline and keeps working on its own input; the next stage consumes that output in parallel. Consequently, processing text from the command line can be extremely efficient, often much more so than doing it in Python.
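To see that in action, here's a sketch of a classic word-frequency pipeline: tr turns every run of non-letters into a newline so each word lands on its own line, sort groups identical words, uniq -c counts each group, and a second sort ranks them by count. All four processes run concurrently, each consuming its predecessor's output as it is produced:
tr -sc 'A-Za-z' '\n' < /tmp/tsla.txt | sort | uniq -c | sort -rn | head -5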
In [9]:
! tr -s '\n' ' ' < /tmp/tsla.txt | head -c 200
In [10]:
! tr -s '\n' ' ' < /tmp/tsla.txt | fold -s | head -10
In [11]:
! tr -s '\n' ' ' < /tmp/tsla.txt | fold -s | nl | head -10
Convert the text to all lowercase using tr. Hint: a-z and A-Z are character ranges that describe the lowercase and uppercase English letters.
In [12]:
! tr 'A-Z' 'a-z' < /tmp/tsla.txt | head -c 150
In [13]:
! tr -s '\n' ' ' < /tmp/tsla.txt | tr 'A-Z' 'a-z' | head -c 150
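Putting the pieces together, you can chain the lowercase conversion into the same kind of word-frequency pipeline so that, say, "Tesla" and "tesla" count as one word (again, just a sketch):
tr 'A-Z' 'a-z' < /tmp/tsla.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -rn | head -10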