Much of the preceding NLP code has been collected into a small library, and we'll call functions from that library to keep these notebooks readable. Take a look at the source code in pynlp.py; here is an example of its use:
In [ ]:
import pynlp
html_file = "html/article1.html"
json_file = "a1.json"
pynlp.full_parse(html_file, json_file)
That extracts the text from the HTML of the first article, then stores the parsed and annotated text as JSON, one line per sentence. Let's look at the first two sentences:
In [ ]:
%sx head -n 2 a1.json
# representation of paragraphs, sentences, words, and sentence annotation
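The same peek can be done from Python instead of the shell. The following is a minimal sketch, assuming that each line of the output file is a standalone JSON document (which is how "one line per sentence" reads here); the helper name `read_first_sentences` is hypothetical and is not part of pynlp.

```python
import json

def read_first_sentences(path, n=2):
    """Return up to n parsed records from a file holding one JSON document per line.

    Assumption: every non-blank line is a complete JSON document.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines rather than failing on them
            records.append(json.loads(line))
            if len(records) == n:
                break
    return records
```

Calling `read_first_sentences("a1.json")` would return the first two sentence records as Python objects, which are easier to inspect than raw text.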
Now it's your turn: in the following code block, run the extract/parse/save-to-JSON step for each of the example HTML files:
In [ ]: