In [1]:
%pylab inline
In [3]:
import slate
import codecs
import re
import os
import numpy as np
import unidecode
In Module 1, we briefly touched on the problem of extracting usable plain-text content from PDF documents. In this notebook, we will use a Python package called slate to extract text from PDF documents that have embedded text (see Module 1 for details on what that means).
As with most tasks, it's best to work through the whole procedure end-to-end on a single case until you're fairly certain of it, and only then scale up. So, we'll start with a single file.
First we have to open the file. The codecs.open() function creates a file
object. You can open a file like this:
In [4]:
f = codecs.open('../data/example.pdf')
But you have to remember to close the file!! If you don't, Bad Things Will Happen.
In [5]:
f.close()
A better way to open a file is to use a with
statement. The basic idea is that we can open the file, do some procedures (in the indented block), and then the file will be automatically closed after the end of the indented block. So:
In [32]:
with codecs.open('../data/example.pdf') as f:
    pass # do something
To read the PDF document with slate, we use slate's PDF class.
In [8]:
with codecs.open('../data/example.pdf') as f:
    extracted_text = slate.PDF(f)
Now we have a PDF object.
In [9]:
extracted_text
Out[9]:
Hey, it's text! Notice that the text is actually represented by a list of strings. Each page is represented as a separate string; how nice! If I wanted to keep these pages separate, I could save each one as a separate text file. Or I could stitch them together into a single string, and save the document as one text file.
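For example, a minimal sketch of the first option (one text file per page) might look like the snippet below; the output filenames here are invented purely for illustration.

for i, page in enumerate(extracted_text):
    # Write each page to its own UTF-8 text file (illustrative filenames).
    with codecs.open('../data/example_page_{0}.txt'.format(i), 'w', encoding='utf-8') as out:
        out.write(page.decode('utf-8'))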
To stitch the pages together, I could use the join() method.
In [11]:
# This concatenates the pages, with two newlines intervening between them.
joined = "\n\n".join(extracted_text)
joined
Out[11]:
Notice all of those pesky special characters, e.g. \xe2\x80\x99? \xe2\x80\x99 is a fancy-looking apostrophe. That means that there was some unicode data in the document -- that is, characters that aren't represented in the default ASCII encoding (basically just the keys on your keyboard). When we use the default open command to read a file, we just read everything in as raw bytes, as if it were plain ASCII. If we don't decode this string into its proper encoding, we'll run into Big Problems in the future.
We will decode the string that we retrieved from our document using the decode() method. We'll first try the UTF-8 encoding.
In [15]:
joined.decode('utf-8')
Out[15]:
Now instead of \xe2\x80\x99, we see \u2019. Success! If we had gotten the encoding wrong, those bytes would have been decoded into the wrong characters (or not at all). For example:
In [16]:
joined.decode('latin-1')
Out[16]:
Our last step is to write this text to a file. We can open()
a new file (one that doesn't exist yet), and write to it. Note the 'w'
that we are passing to open()
-- that means that we want to open the file for writing. We also state that we want to use UTF-8 encoding.
In [22]:
with codecs.open('../data/example.txt', 'w', encoding='utf-8') as f:
    f.write(joined.decode('utf-8'))
You should now be able to find and open the text file at the path above.
Now let's scale up. We'll assume that we have a folder containing all of our PDFs. First we have to generate a list of all of the files in that folder. Then we have to iterate over all of those files, and extract the text (as above), and write the text into a new set of files that we can use in our computational workflow.
The os package has a handy function called listdir() that will give us a list of files in a directory.
In [18]:
import os
In [19]:
os.listdir('../data/PDFs/')
Out[19]:
We can generate the full paths to these files using the os.path.join() function.
In [4]:
base_path = '../data/PDFs/'
for filename in os.listdir(base_path):
    print os.path.join(base_path, filename)
So now we want to apply our procedure to each of these files: open each one, extract the text with slate, decode it, and write it out as a new text file. It's a good idea to create a new folder to hold the new text files.
In [5]:
text_basepath = '../data/PDFs_extracted'
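If that folder doesn't exist yet, we should create it before writing any files. A quick sketch using os.makedirs:

if not os.path.exists(text_basepath):
    # Create the folder that will hold the extracted text files.
    os.makedirs(text_basepath)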
Now the action.
Update: Most of the problems that we encountered were the result of errors in an older version of slate
. Make sure that you have the most recent version of slate by running the following on your command line:
$ pip install -U slate
...
$ pip show slate
If that doesn't work, you can also install the most recent version of the package directly from GitHub:
$ pip install git+git://github.com/timClicks/slate.git
The code below has been modified to correct a few common errors. This takes considerably longer to run, but should result in much cleaner extractions.
In [15]:
# 9, 10, and 13 are \t, \n, and \r, respectively.
m = dict.fromkeys([c for c in range(32) if c not in [9, 10, 13]])
problems = []
basepath = '../data/PDFs/' # Folder containing PDF files.
for filename in os.listdir(basepath):
    filepath = os.path.join(basepath, filename)
    try:
        # Open the file.
        with codecs.open(filepath, 'r') as f:
            # Extract the text.
            extracted_text = slate.PDF(f)
    except:
        problems.append((filename, "Slate couldn't open the PDF"))
        continue

    # Correct over-spaced text (i.e. individual characters are separated by spaces).
    for i, page in enumerate(extracted_text):
        if np.mean([len(c) for c in page.split()]) < 2.:
            extracted_text[i] = re.sub('[ \t]{2,}', '_', page).replace(' ', '').replace('_', ' ')
            extracted_text[i] = re.sub('[ \t]{2,}', ' ', re.sub('[\n]{2,}', ' ', extracted_text[i]))

    # Join the pages, and decode.
    joined_text = '\n\n\n'.join(extracted_text)

    # The ``.translate(m)`` bit removes control characters.
    joined_text = joined_text.decode('utf-8').translate(m)

    # Figure out where to put the new text file.
    # Updated to clean the filename -- NLTK chokes on non-ASCII filenames.
    textname = unidecode.unidecode(filename.replace('pdf', 'txt').decode('utf-8').translate(m))
    textpath = os.path.join(text_basepath, textname)

    # Open (create) the new text file.
    with codecs.open(textpath, 'w', encoding='utf-8') as f:
        # Write the string to the file.
        f.write(joined_text)
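Since any PDFs that slate couldn't open were collected in the problems list, it's worth glancing at that list once the loop finishes. A minimal check might look like this:

# Report any PDFs that could not be processed.
print '{0} file(s) could not be processed'.format(len(problems))
for filename, reason in problems:
    print filename, '--', reason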
Now we can try to open the corpus with NLTK.
In [16]:
import nltk
In [17]:
documents = nltk.corpus.PlaintextCorpusReader(text_basepath, '.*\.txt')
In [18]:
whoops = nltk.Text(documents.words())
In [21]:
whoops.concordance('change')
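As a quick sanity check, we can also ask the corpus reader which files it actually picked up; a one-line sketch:

# Confirm that NLTK found our extracted text files.
print len(documents.fileids()), 'documents in the corpus'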