The HathiTrust Digital Library contains over 14 million volumes scanned from academic libraries around the world (primarily in North America). The HathiTrust Research Center allows researchers to access almost all of those texts in a few different modes for computational text analysis.
This notebook will walk us through getting set up to analyze HTRC Extracted Features for volumes in HathiTrust in a Jupyter/Python environment. Extracted Features are currently (as of August 2017) the most robust way to access in-copyright works from the HT Library for computational analysis.
For more information on HTRC:
To start we'll need to install a few things:
In [ ]:
%%capture
!pip install htrc-feature-reader
import os
from htrc_features import FeatureReader
from datascience import *
import pandas as pd
%matplotlib inline
To build your own corpus, you first need to find the volumes you'd like to include in the HathiTrust Library. Alternately, you can access volumes from existing public HT collections, or use one of the sample datasets included below under the Sample datasets heading. To access extracted features from HathiTrust:
Go to the directory where you plan to do your work.
If you're planning to analyze only a few volumes you can use the following command, replacing {{volume_id}} with your own:
htid2rsync {{volume_id}} | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
If you have your volume IDs in a .txt file, with one ID per line, use --from-file filename (or just -f filename) to point to that file:
htid2rsync --f volumeids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
For example, to download the extracted features for a single volume:
htid2rsync mdp.39015004788835 | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
authors-nigerian.txt includes volume IDs for 30 texts with the Library of Congress subject heading Authors, Nigerian.
htid2rsync --f authors-nigerian.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
sf-history.txt includes volume IDs for 111 texts with the Library of Congress subject heading San Francisco (Calif.) - History.
htid2rsync --f sf-history.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
congressional_record_ids.txt includes the volume ID for every Congressional Record volume that HathiTrust could share with us.
htid2rsync --f congressional_record_ids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
It's also possible to work with the entire library (4TB, so beware):
rsync -rv data.analytics.hathitrust.org::features/ .
Or to use existing lists of public-domain fiction, drama, and poetry (Underwood 2014).
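If you save one of those lists locally as a plain-text file of volume IDs (one per line), the same pipeline applies; the filename below is only a placeholder, not an actual file from those lists:
htid2rsync --f underwood-fiction-ids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/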
In the example below, we have five volume IDs on San Francisco history from HathiTrust, listed in the file vol_ids_5.txt. You can modify the command to include your own list of volume IDs or a single volume ID of your choosing. (If you choose your own volumes here, you will also need to modify the file paths in the next step to point to those files.)
In [ ]:
# clear out any data from previous runs of this notebook
!rm -rf local-folder/
!rm -rf data/coo*
!rm -rf data/mdp*
!rm -rf data/uc1*
download_output = !htid2rsync --f data/vol_ids_5.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ data/
download_output
All of the code examples below are taken directly, or adapted, from the Programming Historian tutorial or the FeatureReader's README.md file.
You'll notice from the output above that the content for each volume is stored in a compressed JSON file, nested in a rather deep directory structure. We can import and initialize FeatureReader with file paths pointing to those JSON files (using the full paths from the output above). If you chose to work with your own volumes in the previous step, you can edit the cell above and re-run the cells below.
First we'll get all the data filepaths from the output of our command above:
In [ ]:
suffix = '.json.bz2'
file_paths = ['data/' + path for path in download_output if path.endswith(suffix)]
file_paths
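As an optional sanity check (not part of the original tutorial), we can confirm that each of those paths actually exists on disk, using the os module imported earlier:
In [ ]:
# optional: verify that every downloaded feature file is really on disk
for path in file_paths:
    print(os.path.exists(path), path)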
Now we'll feed these paths into FeatureReader, which will create a FeatureReader object:
In [ ]:
fr = FeatureReader(file_paths)
We can now cycle through the volumes in the FeatureReader and print some of their properties:
In [ ]:
for vol in fr.volumes():
    print(vol.id, vol.title, vol.author)
    print()
Let's pull out some more metadata about these titles, using the Volume object in FeatureReader. We'll get the HT URL, year, and page count for each volume.
In [ ]:
for vol in fr.volumes():
    print("URL: %s Year: %s Page count: %s " % (vol.handle_url, vol.year, vol.page_count))
The source_institution attribute tells us where the volumes were scanned:
In [ ]:
for vol in fr.volumes():
    print("Source institution: %s " % (vol.source_institution))
Let's take a closer look at the first volume:
In [ ]:
vol = fr.first()
vol.title
The tokens_per_page method will give us the token counts for each page of the volume:
In [ ]:
tokens = vol.tokens_per_page()
tokens.head()
We can easily plot the number of tokens across every page of the book:
In [ ]:
tokens.plot()
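Since tokens_per_page gives us a count for each page, summing it should give a rough total token count for the whole volume. This is just a quick sketch, assuming the counts behave like the output of head() above:
In [ ]:
# rough total number of tokens in the volume
tokens.sum()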
Now let's look at some specific pages, using the Page object in FeatureReader. First we'll gather all of the pages in this volume into a list:
In [ ]:
pages = [page for page in vol.pages()]
Then we'll index the 200th page:
In [ ]:
# Python lists are zero-indexed, so the 200th page is at index 199
page_200 = pages[199]
In [ ]:
print("The body has %s lines, %s empty lines, and %s sentences" % (page_200.line_count(),
page_200.empty_line_count(),
page_200.sentence_count()))
We can get a list of the tokens with the tokenlist method:
In [ ]:
Table.from_df(page_200.tokenlist().reset_index())
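If you want to zero in on a particular word, you can filter that table. This is only a sketch: it assumes the reset index produces token and count columns, as in the output above, and the word 'city' is just a placeholder:
In [ ]:
# filter the page's tokenlist for a single (placeholder) word
page_tokens = Table.from_df(page_200.tokenlist().reset_index())
page_tokens.where('token', 'city')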
We can get the tokenlist for every page and combine them into one huge table!
In [ ]:
all_pages_meta = Table.from_df(pd.concat([p.tokenlist().reset_index() for p in pages]))
all_pages_meta.show(10)
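As a rough sketch of what you can do with that combined table, here is one way to find the most frequent tokens across the whole volume using pandas. It assumes the reset tokenlist index includes token and count columns, as shown above, and it makes no attempt to filter out punctuation or stopwords:
In [ ]:
# sum the counts for each token across every page of the volume (sketch)
all_tokens_df = pd.concat([p.tokenlist().reset_index() for p in pages])
all_tokens_df.groupby('token')['count'].sum().sort_values(ascending=False).head(20)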
Try typing vol. and then hitting Tab to see everything that's provided in the Volume object:
In [ ]:
vol.
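If tab completion isn't available in your environment, Python's built-in dir() offers a rough equivalent; this sketch just filters out the private names:
In [ ]:
# list the public attributes and methods of the Volume object
[name for name in dir(vol) if not name.startswith('_')]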