In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [3]:
import slate
import codecs
import re
import os
import numpy as np
import unidecode

Extracting plain text from text-embedded PDFs

In Module 1, we briefly touched on the problem of extracting usable plain-text content from PDF documents. In this notebook, we will use a Python package called slate to extract text from PDF documents that have embedded text (see Module 1 for details on what that means).

Start with one file

Like most things in life, it's best to work through the whole procedure on a small scale until you're fairly confident in it, and only then scale up. So, we'll start with a single file.

First we have to open the file. The codecs.open() function creates a file object. You can open a file like this:


In [4]:
f = codecs.open('../data/example.pdf')

But you have to remember to close the file! If you don't, you can leak file handles or lose buffered data -- Bad Things Will Happen.


In [5]:
f.close()

A better way to open a file is to use a with statement. The basic idea is that we open the file, do some work in the indented block, and the file is closed automatically when the block ends. So:


In [32]:
with codecs.open('../data/example.pdf') as f:
    pass    # do something
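To see that the with statement really does close the file for us, we can check the file object's closed attribute after the block ends. This sketch uses a throwaway temporary file rather than the notebook's data files:

```python
import codecs
import os
import tempfile

# Create a throwaway file so the example is self-contained.
tmp_path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with codecs.open(tmp_path, 'w', encoding='utf-8') as f:
    f.write(u'hello')

print(f.closed)   # -> True: the with block closed the file for us
```
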

To read the PDF document with slate, we pass the open file object to slate's PDF class.


In [8]:
with codecs.open('../data/example.pdf') as f:
    extracted_text = slate.PDF(f)

Now we have a PDF object.


In [9]:
extracted_text


Out[9]:
['I must go down to the seas again, to the lonely sea and the sky,And all I ask is a tall ship and a star to steer her by;And the wheel\xe2\x80\x99s kick and the wind\xe2\x80\x99s song and the white sail\xe2\x80\x99s shaking,And a grey mist on the sea\xe2\x80\x99s face, and a grey dawn breaking.\x0c',
 'I must go down to the seas again, for the call of the running tideIs a wild call and a clear call that may not be denied; And all I ask is a windy day with the white clouds \xef\xac\x82ying,And the \xef\xac\x82ung spray and the blown spume, and the sea-gulls crying. \x0c',
 'I must go down to the seas again, to the vagrant gypsy life,To the gull\xe2\x80\x99s way and the whale\xe2\x80\x99s way where the wind\xe2\x80\x99s like a whetted knife;And all I ask is a merry yarn from a laughing fellow-rover,And quiet sleep and a sweet dream when the long trick\xe2\x80\x99s over.\x0c']

Hey, it's text! Notice that the extracted text is represented as a list of strings, one per page; how nice! If I wanted to keep the pages separate, I could save each one as its own text file. Or I could stitch them together into a single string, and save the document as one text file.

To stitch the pages together, I can use the join() method.


In [11]:
# This concatenates the pages, with two newlines intervening between them.
joined = "\n\n".join(extracted_text)    
joined


Out[11]:
'I must go down to the seas again, to the lonely sea and the sky,And all I ask is a tall ship and a star to steer her by;And the wheel\xe2\x80\x99s kick and the wind\xe2\x80\x99s song and the white sail\xe2\x80\x99s shaking,And a grey mist on the sea\xe2\x80\x99s face, and a grey dawn breaking.\x0c\n\nI must go down to the seas again, for the call of the running tideIs a wild call and a clear call that may not be denied; And all I ask is a windy day with the white clouds \xef\xac\x82ying,And the \xef\xac\x82ung spray and the blown spume, and the sea-gulls crying. \x0c\n\nI must go down to the seas again, to the vagrant gypsy life,To the gull\xe2\x80\x99s way and the whale\xe2\x80\x99s way where the wind\xe2\x80\x99s like a whetted knife;And all I ask is a merry yarn from a laughing fellow-rover,And quiet sleep and a sweet dream when the long trick\xe2\x80\x99s over.\x0c'

Encoding issues

Notice all of those pesky special characters, e.g. \xe2\x80\x99? That three-byte sequence is the UTF-8 encoding of a fancy curly apostrophe. It means that there was some Unicode data in the document -- that is, characters that can't be represented in plain ASCII (basically just the keys on your keyboard). When we open a file without specifying an encoding, we just read in the raw bytes. If we don't decode this string into its proper encoding, we'll run into Big Problems in the future.
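To make this concrete, here's a minimal sketch (using a made-up fragment, not the notebook's data) showing that those three bytes are just the UTF-8 encoding of the single character U+2019:

```python
raw = b'the wheel\xe2\x80\x99s kick'   # raw bytes, as read from the file

# Decoding collapses the three-byte sequence into one curly apostrophe.
decoded = raw.decode('utf-8')
print(repr(decoded))
```
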

We will decode the string that we retrieved from our document using the decode() method. We'll first try the UTF-8 encoding.


In [15]:
joined.decode('utf-8')


Out[15]:
u'I must go down to the seas again, to the lonely sea and the sky,And all I ask is a tall ship and a star to steer her by;And the wheel\u2019s kick and the wind\u2019s song and the white sail\u2019s shaking,And a grey mist on the sea\u2019s face, and a grey dawn breaking.\x0c\n\nI must go down to the seas again, for the call of the running tideIs a wild call and a clear call that may not be denied; And all I ask is a windy day with the white clouds \ufb02ying,And the \ufb02ung spray and the blown spume, and the sea-gulls crying. \x0c\n\nI must go down to the seas again, to the vagrant gypsy life,To the gull\u2019s way and the whale\u2019s way where the wind\u2019s like a whetted knife;And all I ask is a merry yarn from a laughing fellow-rover,And quiet sleep and a sweet dream when the long trick\u2019s over.\x0c'

Now instead of \xe2\x80\x99, we see \u2019. Success! If we had picked the wrong encoding, the decode might not fail outright, but the bytes would be mapped to the wrong characters and the curly apostrophe would never appear. For example, decoding as Latin-1:


In [16]:
joined.decode('latin-1')


Out[16]:
u'I must go down to the seas again, to the lonely sea and the sky,And all I ask is a tall ship and a star to steer her by;And the wheel\xe2\x80\x99s kick and the wind\xe2\x80\x99s song and the white sail\xe2\x80\x99s shaking,And a grey mist on the sea\xe2\x80\x99s face, and a grey dawn breaking.\x0c\n\nI must go down to the seas again, for the call of the running tideIs a wild call and a clear call that may not be denied; And all I ask is a windy day with the white clouds \xef\xac\x82ying,And the \xef\xac\x82ung spray and the blown spume, and the sea-gulls crying. \x0c\n\nI must go down to the seas again, to the vagrant gypsy life,To the gull\xe2\x80\x99s way and the whale\xe2\x80\x99s way where the wind\xe2\x80\x99s like a whetted knife;And all I ask is a merry yarn from a laughing fellow-rover,And quiet sleep and a sweet dream when the long trick\xe2\x80\x99s over.\x0c'

Our last step is to write this text to a file. We can codecs.open() a new file (one that doesn't exist yet) and write to it. Note the 'w' that we are passing to codecs.open() -- that means that we want to open the file for writing. We also specify that we want to use UTF-8 encoding.


In [22]:
with codecs.open('../data/example.txt', 'w', encoding='utf-8') as f:
    f.write(joined.decode('utf-8'))
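As a sanity check, we can read the file back with the same encoding and confirm that we get the identical text. Sketched here with a temporary file, since the notebook's paths are specific to its data folder:

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'roundtrip.txt')
text = u'the wheel\u2019s kick'   # a Unicode string with a curly apostrophe

# Write the text out as UTF-8...
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# ...and read it back with the same encoding.
with codecs.open(path, 'r', encoding='utf-8') as f:
    roundtripped = f.read()

print(roundtripped == text)   # -> True: the round trip preserves the text
```
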

You should now be able to find and open the text file at the path above.

Converting a whole bunch of PDFs

Now let's scale up. We'll assume that we have a folder containing all of our PDFs. First we have to generate a list of all of the files in that folder. Then we have to iterate over all of those files, and extract the text (as above), and write the text into a new set of files that we can use in our computational workflow.

The os package has a handy function called listdir() that will give us a list of the files in a directory.


In [18]:
import os

In [19]:
os.listdir('../data/PDFs/')


Out[19]:
['example2.pdf', 'example3.pdf', 'example4.pdf', 'example5.pdf']

We can generate the full paths to these files using the os.path.join() function.


In [4]:
base_path = '../data/PDFs/'
for filename in os.listdir(base_path):
    print os.path.join(base_path, filename)


../data/PDFs/example2.pdf
../data/PDFs/example3.pdf
../data/PDFs/example4.pdf
../data/PDFs/example5.pdf

So now we want to apply our procedure to each of these files:

  • open the file,
  • extract the text with slate,
  • join the pages together into a single string,
  • create a new text file,
  • and write the string into that file.

It's a good idea to create a new folder to hold the new text files.


In [5]:
text_basepath = '../data/PDFs_extracted'

Now for the action.


Update: Most of the problems that we encountered were the result of errors in an older version of slate. Make sure that you have the most recent version of slate by running the following on your command line:

$ pip install -U slate
...
$ pip show slate

If that doesn't work, you can also install the most recent version of the package directly from GitHub:

$ pip install git+git://github.com/timClicks/slate.git

The code below has been modified to correct a few common errors. This takes considerably longer to run, but should result in much cleaner extractions.



In [15]:
# 9, 10, and 13 are \t, \n, and \r, respectively.
m = dict.fromkeys([c for c in range(32) if c not in [9, 10, 13]])
problems = []
basepath = '../data/PDFs/'   # Folder containing PDF files.
for filename in os.listdir(basepath):
    filepath = os.path.join(basepath, filename)
    
    try:
        # Open the file.
        with codecs.open(filepath, 'r') as f:
            # Extract the text.
            extracted_text = slate.PDF(f)
    except Exception:    # a bare except would also swallow KeyboardInterrupt
        problems.append((filename, "Slate couldn't open the PDF"))
        continue
        
    # Correct over-spaced text (i.e. individual characters are separated by spaces).
    for i, page in enumerate(extracted_text):
        if np.mean([len(c) for c in page.split()]) < 2.:
            extracted_text[i] = re.sub('[ \t]{2,}', '_', page).replace(' ', '').replace('_', ' ')
        extracted_text[i] = re.sub('[ \t]{2,}', ' ', re.sub('[\n]{2,}', ' ', extracted_text[i]))
    
    # Join the pages, and decode.
    joined_text = '\n\n\n'.join(extracted_text)
    # The ``.translate(m)`` bit removes control characters.
    joined_text = joined_text.decode('utf-8').translate(m)
    
    # Figure out where to put the new text file.
    #  Updated to clean the filename -- NLTK chokes on non-ASCII filenames.
    textname = unidecode.unidecode(filename.replace('.pdf', '.txt').decode('utf-8').translate(m))
    textpath = os.path.join(text_basepath, textname)
    
    # Open (create) the new text file
    with codecs.open(textpath, 'w', encoding='utf-8') as f:
        # Write the string to the file.
        f.write(joined_text)
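The over-spacing correction above is the least obvious step, so here it is in isolation on a toy string of my own: in an over-spaced page, runs of two or more spaces separate real words, while single spaces sit between individual letters. Protecting the word gaps with a placeholder, stripping the single spaces, and then restoring the gaps reassembles the words.

```python
import re

page = 'T h e  w h e e l s  k i c k'   # a made-up over-spaced page

# Over-spaced pages are detectable because the average "word" is a single
# letter (the notebook uses np.mean for this; plain Python works too).
words = page.split()
assert sum(len(w) for w in words) / float(len(words)) < 2.

# Protect real word boundaries (2+ spaces) with a placeholder, strip the
# single spaces between letters, then restore the word boundaries.
fixed = re.sub('[ \t]{2,}', '_', page).replace(' ', '').replace('_', ' ')
print(fixed)   # -> The wheels kick
```
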

Now we can try to open the corpus with NLTK.


In [16]:
import nltk

In [17]:
documents = nltk.corpus.PlaintextCorpusReader(text_basepath, '.*\.txt')

In [18]:
whoops = nltk.Text(documents.words())

In [21]:
whoops.concordance('change')


Displaying 25 of 38 matches:
 is originally rid of a cover . The change in that is that red weakens an hour
t is that red weakens an hour . The change has come . There is no search . But
ng . A SUBSTANCE IN A CUSHION . The change of color is likely and a difference
nt as many girls as men . Does this change . It shows that dirt is clean when 
over . Supposing you do not like to change , supposing it is very clean that t
g it is very clean that there is no change in appearance , supposing that ther
nd the same red with purple makes a change . It shows that there is no mistake
f inside is let in and there places change then certainly something is upright
nd no coffee , not even a card or a change to incline each way , a plan that h
tering counting . It does , it does change in more water . Supposing a single 
the bottom . A piece of crystal . A change , in a change that is remarkable th
 piece of crystal . A change , in a change that is remarkable there is no reas
re so is relaxed and yet there is a change , a news is pressing . A LITTLE CAL
le steadiness . Is it likely that a change . A table means more than a glass e
ing there is no wagon . There is no change lighter . It was done . And then th
were not all in place . They had no change . They were not respected . They we
ndensed . It was spread there . Any change was in the ends of the centre . A h
e . A heap was heavy . There was no change . Burnt and behind and lifting a te
e and sweet and yet there comes the change , there comes the time to press mor
s perfect denial does make the time change all the time . The sister was not a
there would be if there had been no change . A little sign of an entrance is t
d uselessly . Any little thing is a change that is if nothing is wasted in tha
ot and the insistence is marked . A change is in a current and there is no hab
corners are gathered together . The change is mercenary that settles whitening
it is and where it is in place . No change is not needed . That does show desi
