The HathiTrust Digital Library contains over 14 million volumes scanned from academic libraries around the world (primarily in North America). The HathiTrust Research Center allows researchers to access almost all of those texts in a few different modes for computational text analysis.
This notebook will walk us through getting set up to analyze HTRC Extracted Features for volumes in HathiTrust in a Jupyter/Python environment. Extracted Features are currently (as of August 2017) the most robust way to access in-copyright works from the HT Library for computational analysis.
For more information on HTRC:
To start we'll need to install a few things:
In [ ]:
%%capture
!pip install htrc-feature-reader
import os
from htrc_features import FeatureReader
from datascience import *
import pandas as pd
%matplotlib inline
To build your own corpus, you first need to find the volumes you'd like to include in the HathiTrust Library. Alternately, you can access volumes from existing public HT collections, or use one of the sample datasets included below under the Sample datasets heading. To access extracted features from HathiTrust:
Go to the directory where you plan to do your work.
If you're planning to analyze only a few volumes you can use the following command, replacing {{volume_id}} with your own:
htid2rsync {{volume_id}} | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
If you have your volume IDs in a .txt file, with one ID per line, use --from-file filename (or just -f filename) to point to that file:
htid2rsync --f volumeids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
For example, to download the extracted features for a single volume:
htid2rsync mdp.39015004788835 | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
authors-nigerian.txt includes volume IDs for 30 texts with the Library of Congress subject heading Authors, Nigerian.
htid2rsync --f authors-nigerian.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
sf-history.txt includes volume IDs for 111 texts with the Library of Congress subject heading San Francisco (Calif.) - History.
htid2rsync --f sf-history.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
congressional_record_ids.txt includes the volume ID for every Congressional Record volume that HathiTrust could share with us.
htid2rsync --f congressional_record_ids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
It's also possible to work with the entire library (4TB, so beware):
rsync -rv data.analytics.hathitrust.org::features/ .
Or to use existing lists of public-domain fiction, drama, and poetry (Underwood 2014).
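If you save one of those lists locally as a plain-text file of volume IDs (one per line), the same pipeline applies; the filename below is only a placeholder, not an actual file from those lists:
htid2rsync --f underwood-fiction-ids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/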
In the example below, we have five volume IDs on San Francisco history from HathiTrust, listed in the file vol_ids_5.txt. You can modify the command to include your own list of volume IDs or a single volume ID of your choosing. (If you choose your own volumes here, you will also need to modify the file paths in the next step to point to those files.)
In [ ]:
# clear out any data from previous runs of this notebook
!rm -rf local-folder/
!rm -rf data/coo*
!rm -rf data/mdp*
!rm -rf data/uc1*
download_output = !htid2rsync --f data/vol_ids_5.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ data/
download_output
All of the code examples below are taken directly, or adapted, from the Programming Historian tutorial or the FeatureReader's README.md file.
You'll notice from the output above that the content for each volume is stored in a compressed JSON file, nested in a rather deep directory structure. We can import and initialize FeatureReader with file paths pointing to those JSON files (using the full paths from the output above). If you chose to work with your own volumes in the previous step, you can edit the cell above and re-run the cells below.
First we'll get all the data filepaths from the output of our command above:
In [ ]:
suffix = '.json.bz2'
file_paths = ['data/' + path for path in download_output if path.endswith(suffix)]
file_paths
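As an optional sanity check (not part of the original tutorial), we can confirm that each of those paths actually exists on disk, using the os module imported earlier:
In [ ]:
# optional: verify that every downloaded feature file is really on disk
for path in file_paths:
    print(os.path.exists(path), path)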
Now we'll feed these paths into FeatureReader, which will create a FeatureReader object:
In [ ]:
fr = FeatureReader(file_paths)
We can now cycle through the volumes in the FeatureReader and print some of their properties:
In [ ]:
for vol in fr.volumes():
    print(vol.id, vol.title, vol.author)
    print()
Let's pull out some more metadata about these titles, using the Volume object in FeatureReader. We'll get the HT URL, year, and page count for each volume.
In [ ]:
for vol in fr.volumes():
    print("URL: %s Year: %s Page count: %s " % (vol.handle_url, vol.year, vol.page_count))
The source_institution attribute tells us where the volumes were scanned:
In [ ]:
for vol in fr.volumes():
    print("Source institution: %s " % (vol.source_institution))
Let's take a closer look at the first volume:
In [ ]:
vol = fr.first()
vol.title
The tokens_per_page method will give us the token counts for each page of the volume:
In [ ]:
tokens = vol.tokens_per_page()
tokens.head()
We can easily plot the number of tokens across every page of the book:
In [ ]:
tokens.plot()
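Since tokens_per_page gives us a count for each page, summing it should give a rough total token count for the whole volume. This is just a quick sketch, assuming the counts behave like the output of head() above:
In [ ]:
# rough total number of tokens in the volume
tokens.sum()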
Now let's look at some specific pages, using the Page object in FeatureReader. First we'll gather all of the pages in this volume into a list:
In [ ]:
pages = [page for page in vol.pages()]
Then we'll index the 200th page:
In [ ]:
# Python lists are zero-indexed, so the 200th page is at index 199
page_200 = pages[199]
In [ ]:
print("The body has %s lines, %s empty lines, and %s sentences" % (page_200.line_count(),
page_200.empty_line_count(),
page_200.sentence_count()))
We can get a list of the tokens with the tokenlist method:
In [ ]:
Table.from_df(page_200.tokenlist().reset_index())
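If you want to zero in on a particular word, you can filter that table. This is only a sketch: it assumes the reset index produces token and count columns, as in the output above, and the word 'city' is just a placeholder:
In [ ]:
# filter the page's tokenlist for a single (placeholder) word
page_tokens = Table.from_df(page_200.tokenlist().reset_index())
page_tokens.where('token', 'city')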
We can get the tokenlist for every page and combine them into one huge table!
In [ ]:
all_pages_meta = Table.from_df(pd.concat([p.tokenlist().reset_index() for p in pages]))
all_pages_meta.show(10)
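As a rough sketch of what you can do with that combined table, here is one way to find the most frequent tokens across the whole volume using pandas. It assumes the reset tokenlist index includes token and count columns, as shown above, and it makes no attempt to filter out punctuation or stopwords:
In [ ]:
# sum the counts for each token across every page of the volume (sketch)
all_tokens_df = pd.concat([p.tokenlist().reset_index() for p in pages])
all_tokens_df.groupby('token')['count'].sum().sort_values(ascending=False).head(20)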
Try typing vol. and then hitting Tab to see everything that's provided in the Volume object:
In [ ]:
vol.
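If tab completion isn't available in your environment, Python's built-in dir() offers a rough equivalent; this sketch just filters out the private names:
In [ ]:
# list the public attributes and methods of the Volume object
[name for name in dir(vol) if not name.startswith('_')]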