PDF stands for "Portable Document Format" and PDF files contain images, text, and page layout information. PDF files are actually programs in a very simple programming language and, hence, can display just about anything. Much of what you see inside a PDF file is text, however, and we can grab that text without the layout information using poppler. (I used to use pdfminer, but it somehow no longer works on OS X.) Install it with:
brew install poppler
or
brew upgrade poppler
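(If you're on Linux rather than a Mac, there's no brew, but poppler is typically available from the system package manager; on Debian/Ubuntu, for example, the package is usually called poppler-utils:)
sudo apt-get install poppler-utils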
Then use pdftotext as a command from the command line, which will extract the text and save it in a text file. First download a sample PDF, such as the Tesla Model S brochure, which we can easily do from the command line using curl (which you might have to install):
In [5]:
! curl https://www.tesla.com/sites/default/files/tesla-model-s.pdf > /tmp/tsla.pdf
That command downloads the file and, because of the redirection operator >, the output gets written to tsla.pdf in the /tmp directory.
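(Equivalently, you can skip the redirection and have curl write the file itself with its -o option; adding -s silences the progress meter:)
curl -s https://www.tesla.com/sites/default/files/tesla-model-s.pdf -o /tmp/tsla.pdf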
Once we have the data, we can pass the filename to pdftotext to extract the text:
In [14]:
! pdftotext /tmp/tsla.pdf # saves into /tmp/tsla.txt
(Don't worry about those warnings.)
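By default, pdftotext writes the text next to the input with a .txt extension, as the comment above notes. If you'd rather put the text somewhere else, pdftotext also accepts an explicit output filename as its second argument (the name here is just for illustration):
pdftotext /tmp/tsla.pdf /tmp/tsla-text.txt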
In [7]:
! head -10 /tmp/tsla.txt
Once you have text output, you can perform whatever analysis you'd like without having to worry about the data coming in PDF form. For example, you might want to run some analysis on financial documents, but they are all in PDF form. First, convert to text and then perform your analysis.
In [8]:
with open('/tmp/tsla.txt') as f:
    print(f.read().split()[:100])
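If all you need is a quick summary statistic rather than the tokens themselves, you don't even need Python; for example, wc -w counts the words in the extracted file:
wc -w /tmp/tsla.txt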
It's often the case that we can do a huge amount of cleanup on unstructured text before using Python to process it more formally. We can delete unwanted characters, squeeze repeated characters, reformat, and so on. In this section you will do a number of exercises that get you used to processing files from the command line. If you'd like to dig further, you can see this link.
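For instance, tr's -d option deletes unwanted characters outright; here's a sketch that strips all punctuation from the text (the [:punct:] character class should work in both the BSD and GNU versions of tr):
tr -d '[:punct:]' < /tmp/tsla.txt | head -c 150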
The operating system launches all commands in a pipeline sequence as separate processes, which means they can run on multiple processors simultaneously. This gives us parallel processing without having to write complicated code. As each stage produces output, it passes that output to the next stage of the pipeline and keeps working on its own input; the next stage consumes that output in parallel. Consequently, processing text from the command line can be extremely efficient, often much more so than doing it in Python.
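To see that in action, here's a sketch of a classic word-frequency pipeline: tr turns every run of non-letters into a newline so each word lands on its own line, sort groups identical words, uniq -c counts each group, and a second sort ranks them by count. All four processes run concurrently, each consuming its predecessor's output as it is produced:
tr -sc 'A-Za-z' '\n' < /tmp/tsla.txt | sort | uniq -c | sort -rn | head -5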
In [9]:
! tr -s '\n' ' ' < /tmp/tsla.txt | head -c 200
In [10]:
! tr -s '\n' ' ' < /tmp/tsla.txt | fold -s | head -10
In [11]:
! tr -s '\n' ' ' < /tmp/tsla.txt | fold -s | nl | head -10
Convert the text to all lowercase using tr. Hint: a-z and A-Z are character ranges that describe the lowercase and uppercase English letters.
In [12]:
! tr 'A-Z' 'a-z' < /tmp/tsla.txt | head -c 150
In [13]:
! tr -s '\n' ' ' < /tmp/tsla.txt | tr 'A-Z' 'a-z' | head -c 150
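Putting the pieces together, you can chain the lowercase conversion into the same kind of word-frequency pipeline so that, say, "Tesla" and "tesla" count as one word (again, just a sketch):
tr 'A-Z' 'a-z' < /tmp/tsla.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -rn | head -10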