In [ ]:
!python -m spacy download en
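# note: on spaCy 3 and later the 'en' shortcut is gone; instead download the
# model by its full name: python -m spacy download en_core_web_sm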

spaCy: Industrial-Strength NLP

The traditional NLP library has always been NLTK. While NLTK is still very useful for linguistic analysis and exploration, spaCy has become a nice option for easy, fast implementation of the NLP pipeline. What's the NLP pipeline? It's the series of common steps computational linguists perform to help them (and the computer) better understand textual data. Digital Humanists are often fond of the pipeline because it gives us more things to count! Let's see what spaCy can give us to count.


In [ ]:
from datascience import *
import spacy

Let's start out with a short string from our reading and see what happens.


In [ ]:
my_string = '''
"What are you going to do with yourself this evening, Alfred?" said Mr.
Royal to his companion, as they issued from his counting-house in New
Orleans. "Perhaps I ought to apologize for not calling you Mr. King,
considering the shortness of our acquaintance; but your father and I
were like brothers in our youth, and you resemble him so much, I can
hardly realize that you are not he himself, and I still a young man.
It used to be a joke with us that we must be cousins, since he was a
King and I was of the Royal family. So excuse me if I say to you, as
I used to say to him. What are you going to do with yourself, Cousin
Alfred?"

"I thank you for the friendly familiarity," rejoined the young man.
"It is pleasant to know that I remind you so strongly of my good
father. My most earnest wish is to resemble him in character as much
as I am said to resemble him in person. I have formed no plans for the
evening. I was just about to ask you what there was best worth seeing
or hearing in the Crescent City."'''.replace("\n", " ")

We've downloaded the English model, and now we just have to load it. This model will do everything for us, but we'll only get a little taste today.


In [ ]:
nlp = spacy.load('en')
# nlp = spacy.load('en', parser=False)  # run this instead if you don't have > 1GB RAM
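# (on spaCy 2+ the equivalent is spacy.load('en_core_web_sm', disable=['parser']))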

To parse an entire text we just call the model on a string.


In [ ]:
parsed_text = nlp(my_string)
parsed_text

That was quick! So what happened? We've talked a lot about tokenizing, whether into words or into sentences.

What about sentences?


In [ ]:
sents_tab = Table()
sents_tab.append_column(label="Sentence", values=[sentence.text for sentence in parsed_text.sents])
sents_tab.show()

Words?


In [ ]:
toks_tab = Table()
toks_tab.append_column(label="Word", values=[word.text for word in parsed_text])
toks_tab.show()

What about parts of speech?


In [ ]:
toks_tab.append_column(label="POS", values=[word.pos_ for word in parsed_text])
toks_tab.show()

Lemmata?


In [ ]:
toks_tab.append_column(label="Lemma", values=[word.lemma_ for word in parsed_text])
toks_tab.show()

What else? Let's just write a function, tablefy, that will build a table of all this information for us:


In [ ]:
def tablefy(parsed_text):
    toks_tab = Table()
    toks_tab.append_column(label="Word", values=[word.text for word in parsed_text])
    toks_tab.append_column(label="POS", values=[word.pos_ for word in parsed_text])
    toks_tab.append_column(label="Lemma", values=[word.lemma_ for word in parsed_text])
    toks_tab.append_column(label="Stop Word", values=[word.is_stop for word in parsed_text])
    toks_tab.append_column(label="Punctuation", values=[word.is_punct for word in parsed_text])
    toks_tab.append_column(label="Space", values=[word.is_space for word in parsed_text])
    toks_tab.append_column(label="Number", values=[word.like_num for word in parsed_text])
    toks_tab.append_column(label="OOV", values=[word.is_oov for word in parsed_text])
    toks_tab.append_column(label="Dependency", values=[word.dep_ for word in parsed_text])
    return toks_tab

In [ ]:
tablefy(parsed_text).show()

Challenge

What's the most common verb? Noun? What if you only include lemmata? What if you remove "stop words"?

How would lemmatizing or removing "stop words" help us better understand a text over regular tokenizing?


In [ ]:


Dependency Parsing

Let's look at our text again:


In [ ]:
parsed_text

Dependency parsing is one of the most useful and interesting NLP tools. A dependency parser draws a tree of grammatical relationships between words. This is how you can find out, for example, which adjectives are attributed to a particular person, which verbs go with a particular subject, and so on.
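
For a quick taste of the first question (which adjectives go with which words?), here is a small sketch using the "amod" (adjectival modifier) dependency label; the subject-verb case is worked through below.


In [ ]:
# pair each adjectival modifier with the word it modifies (its head)
adj_pairs = [(word.text, word.head.text) for word in parsed_text if word.dep_ == "amod"]
adj_pairs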

spaCy provides a visualizer called displaCy (there is an online demo as well) for drawing dependency trees. Let's look at the first sentence.
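
If you're running spaCy 2 or later, displaCy ships with the library and can render right in the notebook; on spaCy 1.x you'd paste the sentence into the online demo instead. A minimal sketch:


In [ ]:
from spacy import displacy

# grab the first sentence and draw its dependency tree inline
first_sentence = next(parsed_text.sents)
displacy.render(first_sentence, style="dep", jupyter=True)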

We can also pull subject-verb pairs out programmatically by looping through the tokens and keeping those whose dependency label is nsubj (nominal subject) and whose head is tagged as a VERB:


In [ ]:
from spacy.symbols import nsubj, VERB

SV = []
for possible_subject in parsed_text:
    # keep (subject, verb) pairs: nominal subjects whose head is a verb
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        SV.append((possible_subject.text, possible_subject.head.text))

In [ ]:
sv_tab = Table()
sv_tab.append_column(label="Subject", values=[x[0] for x in SV])
sv_tab.append_column(label="Verb", values=[x[1] for x in SV])
sv_tab.show()

You can imagine looking over a large corpus to analyze how first-person, second-person, and third-person subjects are characterized. Dependency parsers are also important for understanding and processing natural language in applications such as question answering systems, where these models help the computer work out what question is actually being asked.
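
For example, with the subject-verb pairs we just collected, a couple of lines pick out only the verbs governed by a first-person subject. This is just a sketch of the idea; over a whole corpus you would want something more robust than matching the literal strings "I" and "we".


In [ ]:
# verbs whose grammatical subject is a first-person pronoun
first_person_verbs = [verb for subj, verb in SV if subj.lower() in ("i", "we")]
first_person_verbs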

Limitations

How accurate are the models? What happens if we change the style of English we're working with?


In [ ]:
shakespeare = '''
Tush! Never tell me; I take it much unkindly
That thou, Iago, who hast had my purse
As if the strings were thine, shouldst know of this.
'''

shake_parsed = nlp(shakespeare.strip())
tablefy(shake_parsed).show()

In [ ]:
huck_finn_jim = '''
“Who dah?” “Say, who is you?  Whar is you?  Dog my cats ef I didn’ hear sumf’n.
Well, I know what I’s gwyne to do:  I’s gwyne to set down here and listen tell I hears it agin.”
'''

hf_parsed = nlp(huck_finn_jim.strip())
tablefy(hf_parsed).show()

In [ ]:
text_speech = '''
LOL where r u rn? omg that's sooo funnnnnny. c u in a sec.
'''
ts_parsed = nlp(text_speech.strip())
tablefy(ts_parsed).show()

In [ ]:
old_english = '''
þæt wearð underne      eorðbuendum, 
þæt meotod hæfde      miht and strengðo 
ða he gefestnade      foldan sceatas. 
'''
oe_parsed = nlp(old_english.strip())
tablefy(oe_parsed).show()

NER and Civil War-Era Novels

Wilkens uses a technique called "NER", or "Named Entity Recognition," to have the computer identify all of the geographic place names. Wilkens writes:

Text strings representing named locations in the corpus were identified using the named entity recognizer of the Stanford CoreNLP package with supplied training data. To reduce errors and to narrow the results for human review, only those named-location strings that occurred at least five times in the corpus and were used by at least two different authors were accepted. The remaining unique strings were reviewed by hand against their context in each source volume. [883]

While we don't have the time for a human review right now, spaCy does allow us to annotate place names (among other things!) in the same fashion as Stanford CoreNLP (a native Java library):


In [ ]:
ner_tab = Table()
ner_tab.append_column(label="NER Label", values=[ent.label_ for ent in parsed_text.ents])
ner_tab.append_column(label="NER Text", values=[ent.text for ent in parsed_text.ents])
ner_tab.show()

Cool! It's identified a few types of things for us. We can check what these labels mean in spaCy's documentation; GPE covers countries, cities, and states. Seems like that's what Wilkens was using.
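
If you forget what a label like GPE stands for, spaCy 2 and later also ship a small glossary helper (older installs may not have it):


In [ ]:
# short, human-readable description of an annotation label
spacy.explain("GPE")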

Since we don't have his corpus of 1000 novels, let's just take our reading, A Romance of the Republic, as an example. We can use the requests library to fetch the plain-text file from Project Gutenberg, and the response's .text attribute gives us one nice string.


In [ ]:
import requests

text = requests.get("http://www.gutenberg.org/files/10549/10549.txt").text
text = text[1050:].replace('\r\n', ' ')  # fix formatting and skip title header
print(text[:5000])

We'll leave the chapter headers in for now; they shouldn't affect things much. Now we need to parse this with that nlp function:


In [ ]:
parsed = nlp(text)

Challenge

With this larger string, find the most common noun, verb, and adjective. Then explore the other features of spacy and see what you can discover about our reading:


In [ ]:


Let's continue in the fashion that Wilkens did and extract the named entities, specifically those for "GPE". We can loop through each entity, and if it is labeled as GPE we'll add it to our places list. We'll then make a Counter object out of that to get the frequency of each place name.


In [ ]:
from collections import Counter

places = []

for ent in parsed.ents:
    if ent.label_ == "GPE":
        places.append(ent.text.strip())

places = Counter(places)
places

That looks OK, but it's pretty rough! Keep this in mind when using trained models: they aren't 100% accurate. That's why Wilkens went through the results by hand afterward to get rid of the garbage.
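
We can mimic a tiny piece of that hand review ourselves. The strings in the set below are only hypothetical examples of the kind of thing you might weed out (for instance, "Providence" in a nineteenth-century novel usually means divine providence, not the city); replace them with whatever looks wrong in the output above.


In [ ]:
# drop entries we judge, by hand, not to be real place names (placeholder examples)
false_positives = {"Providence"}
places = Counter({name: count for name, count in places.items()
                  if name not in false_positives})
places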

If you thought NER was cool, wait for this. Now that we have a list of "places", we can send each one to an online database to get back latitude and longitude coordinates (much as Wilkens did with Google's geocoder), along with the US state. To make sure a result is actually a US state, we'll need a list to compare against. So let's load that:


In [ ]:
with open('data/us_states.txt', 'r') as f:
    states = f.read().split('\n')
    states = [x.strip() for x in states]

states

OK, now we're ready. The Nominatim geocoder from the geopy library returns a location object with the properties we want. We'll append a new row to our table for each occurrence of a place. Importantly, we loop over the keys of the places counter because we don't need to ask the database for "New Orleans" ten times to get one location; once we have the information, we just add as many rows as the counter tells us there are.


In [ ]:
from geopy.geocoders import Nominatim
from datascience import *
import time

geolocator = Nominatim(timeout=10)
# note: recent geopy releases also require a user_agent, e.g.
# Nominatim(user_agent="dh-notebook", timeout=10)

geo_tab = Table(["latitude", "longitude", "name", "state"])

for name in places.keys():  # only loop through unique place names so we query once per place
    print("Getting information for " + name + "...")

    # finds the lat and lon of each place name
    location = geolocator.geocode(name)
    time.sleep(1)  # be polite to the free Nominatim service

    try:
        # index the raw response for lat and lon
        lat = float(location.raw["lat"])
        lon = float(location.raw["lon"])

        # string manipulation to find the US state name (if any)
        state = None
        for p in location.address.split(","):
            if p.strip() in states:
                state = p.strip()
                break

        # add one row per occurrence of the place name in the text
        for i in range(places[name]):
            geo_tab.append(Table.from_records([{"name": name,
                                                "latitude": lat,
                                                "longitude": lon,
                                                "state": state}]).row(0))
    except Exception:
        # the geocoder returned None or an unexpected response; skip this name
        pass

In [ ]:
geo_tab.show()

Now we can plot a nice choropleth.


In [ ]:
%matplotlib inline

from scripts.choropleth import us_choropleth
us_choropleth(geo_tab)

Homework:

Find the text of three different Civil War-era (1851-1875) novels on Project Gutenberg (maybe ones mentioned in our reading?!). Make sure you grab the .txt files, and use a GET request from the requests library to fetch the text.

First do some exploration on parts of speech. Then combine the NER location frequencies and plot a choropleth. Look closely at the words plotted. How did the NER model do? How does your choropleth look compared to Wilkens'?


In [ ]: