Named entity recognition is the process of identifying particular elements in text, such as names, places, quantities, percentages, and times/dates. Identifying and quantifying the general content types an article contains seems like a good predictor of what type of article it is. World news articles, for example, might mention more places than opinion articles, and business articles might contain more percentages or dates than other sections. For each article, I'll count the total mentions of people and places in the title, as well as the unique mentions in the body.
The Stanford NLP group has published three named-entity recognizers. The three-class model recognizes locations, persons, and organizations, and at least for now, this is the one I'll be using. Although the recognizers themselves are written in Java, there is the Pyner interface for Python, as well as an NLTK wrapper (which is what I'll be using).
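To get a feel for the NLTK wrapper, here's a minimal sketch of tagging a single sentence. The jar and classifier paths are placeholders for wherever the Stanford NER download lives, and the example sentence and output are illustrative:
In [ ]:
import os
from nltk.tokenize import word_tokenize
from nltk.tag import StanfordNERTagger

# point NLTK at the Stanford NER jar and model directory (placeholder paths)
os.environ['CLASSPATH'] = '/path/to/stanford-ner-2013-11-12/stanford-ner.jar'
os.environ['STANFORD_MODELS'] = '/path/to/stanford-ner-2013-11-12/classifiers'

st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
st.tag(word_tokenize("Michael Jordan lives in Chicago."))
# [('Michael', 'PERSON'), ('Jordan', 'PERSON'), ('lives', 'O'),
#  ('in', 'O'), ('Chicago', 'LOCATION'), ('.', 'O')]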
Although state-of-the-art taggers can achieve near-human levels of accuracy, this one does make a few mistakes. One obvious flaw is that if I feed the tagger unigram terms, two-part names such as "Michael Jordan" will count as ("Michael", "PERSON") and ("Jordan", "PERSON"). I can roughly correct for this by dividing my average named-entity count by two if need be, or by stitching adjacent identically-tagged tokens back together, as sketched below. Additionally, the tagger sometimes mis-tags certain people or places. For instance, it failed to recognize "Cameroon" as a location, but tagged the word "Heartbreak" in the article title "A Personal Trainer for Heartbreak" as a person. That being said, let's see what it can do on my news data.
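The stitching approach would look something like this. The merge_adjacent_entities helper is a hypothetical sketch, not something I use in the counts below, and it will wrongly fuse two distinct names that happen to sit next to each other:
In [ ]:
def merge_adjacent_entities(tagged_tokens):
    # collapse runs of identically-tagged tokens, so that
    # ('Michael', 'PERSON'), ('Jordan', 'PERSON') becomes ('Michael Jordan', 'PERSON')
    merged = []
    for word, tag in tagged_tokens:
        if tag != 'O' and merged and merged[-1][1] == tag:
            merged[-1] = (merged[-1][0] + ' ' + word, tag)
        else:
            merged.append((word, tag))
    # keep only the named entities, dropping untagged ('O') tokens
    return [pair for pair in merged if pair[1] != 'O']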
In [46]:
import articledata
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import operator
In [13]:
data = pd.read_pickle('/Users/teresaborcuch/capstone_project/notebooks/pickled_data.pkl')
In [ ]:
import os
from nltk.tokenize import word_tokenize
from nltk.tag import StanfordNERTagger

def count_entities(data = None, title = True):
    # set up tagger
    os.environ['CLASSPATH'] = "/Users/teresaborcuch/stanford-ner-2013-11-12/stanford-ner.jar"
    os.environ['STANFORD_MODELS'] = '/Users/teresaborcuch/stanford-ner-2013-11-12/classifiers'
    st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')

    # tag every title (or body) with the three-class model
    column = 'title' if title else 'body'
    tagged = []
    for x in data[column]:
        tokens = word_tokenize(x)
        tagged.append(st.tag(tokens))

    # tally PERSON and LOCATION tags for each article
    persons = []
    places = []
    for pair_list in tagged:
        person_count = 0
        place_count = 0
        for pair in pair_list:
            if pair[1] == 'PERSON':
                person_count += 1
            elif pair[1] == 'LOCATION':
                place_count += 1
        persons.append(person_count)
        places.append(place_count)

    data['total_persons_' + column] = persons
    data['total_places_' + column] = places
    return data
In [14]:
# Count people and places in article titles and save as new columns
# Warning - this is super slow!
data = articledata.count_entities(data = data, title = True)
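One likely reason for the slowness: each call to st.tag shells out to a separate Java process. NLTK's tag_sents method takes a list of token lists and tags them all in one invocation, which should amortize the JVM start-up cost. A sketch of that change, assuming st is set up as in count_entities above:
In [ ]:
# batch all titles into a single tag_sents call instead of tagging one at a time
token_lists = [word_tokenize(t) for t in data['title']]
tagged_titles = st.tag_sents(token_lists)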
In [17]:
data.head(1)
Out[17]:
In [18]:
# pickle the file to avoid having to re-run this for future analyses
data.to_pickle('/Users/teresaborcuch/capstone_project/notebooks/ss_entity_data.pkl')
In [42]:
sns.set_style("whitegrid", {'axes.grid' : False})
fig = plt.figure(figsize = (12, 5))
ax1 = fig.add_subplot(1,2,1)
ax1.hist(data['total_persons_title'])
ax1.set_xlabel("Total Person Count in Article Titles ")
ax1.set_ylim(0,2500)
ax1.set_xlim(0,6)
ax2 = fig.add_subplot(1,2,2)
ax2.hist(data['total_places_title'])
ax2.set_xlabel("Total Place Count in Article Titles")
ax2.set_ylim(0, 2500)
ax2.set_xlim(0,6)
plt.show()
These graphs indicate that person and place counts in article titles are both strongly right-skewed. It might be more interesting to compare mean person and place counts across sections.
In [45]:
data.pivot_table(
    index = ['condensed_section'],
    values = ['total_persons_title', 'total_places_title']
).sort_values('total_persons_title', ascending = False)
Out[45]:
From this pivot table, it seems there are a few distinctions to be made between sections. Entertainment and sports titles contain more person mentions on average than any other section, and world news titles contain more places than those of other sections.
In [ ]:
def evaluate_entities(data = None, section = None, source = None):
    # optionally restrict to one section and/or source
    section_mask = (data['condensed_section'] == section)
    source_mask = (data['source'] == source)
    if section and source:
        masked_data = data[section_mask & source_mask]
    elif section:
        masked_data = data[section_mask]
    elif source:
        masked_data = data[source_mask]
    else:
        masked_data = data

    # set up tagger
    os.environ['CLASSPATH'] = "/Users/teresaborcuch/stanford-ner-2013-11-12/stanford-ner.jar"
    os.environ['STANFORD_MODELS'] = '/Users/teresaborcuch/stanford-ner-2013-11-12/classifiers'
    st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')

    # dictionaries mapping each unique entity to its mention count
    person_dict = {}
    place_dict = {}
    for x in masked_data['body']:
        tokens = word_tokenize(x)
        tags = st.tag(tokens)
        for word, tag in tags:
            if tag == 'PERSON':
                person_dict[word] = person_dict.get(word, 0) + 1
            elif tag == 'LOCATION':
                place_dict[word] = place_dict.get(word, 0) + 1
    return person_dict, place_dict
In [19]:
world_persons, world_places = articledata.evaluate_entities(data = data, section = 'world', source = None)
In [47]:
# get top 20 people from world news article bodies
sorted_wp = sorted(world_persons.items(), key=operator.itemgetter(1))
sorted_wp.reverse()
sorted_wp[:20]
Out[47]:
Perhaps as expected, Trump is the most commonly mentioned person in world news, with 1,237 mentions across 467 articles; Obama and Putin come in second and third. It's interesting that most of these names belong to political figures, but since the tagger only receives unigrams, partial names and first names show up as separate entries as well.
In [20]:
entertainment_persons, entertainment_places = articledata.evaluate_entities(data = data, section = 'entertainment', source = None)
In [34]:
sorted_ep = sorted(entertainment_persons.items(), key=operator.itemgetter(1))
sorted_ep.reverse()
sorted_ep[:20]
Out[34]:
Now, I'll compare the top 20 people mentioned in entertainment articles. Trump still takes the number one spot, but interestingly, he's followed by a string of first names. NLTK provides a corpus of male- and female-tagged first names, so counting the number of informal mentions, or even the ratio of men to women mentioned, might be a useful feature for classifying articles; a sketch of that idea follows below.
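Here's a rough sketch of that feature using NLTK's names corpus (this requires nltk.download('names') first). The gender_mention_counts helper is hypothetical, and it will mislabel ambiguous names that appear in both lists:
In [ ]:
from nltk.corpus import names

male_names = set(names.words('male.txt'))
female_names = set(names.words('female.txt'))

def gender_mention_counts(person_dict):
    # sum PERSON mention counts whose token matches each first-name list
    male = sum(count for name, count in person_dict.items() if name in male_names)
    female = sum(count for name, count in person_dict.items() if name in female_names)
    return male, female

gender_mention_counts(entertainment_persons)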
Compared to those from the world news section, the locations in the entertainment section are mostly in the United States: New York City (pieced together from "New", "York", and "City") seems to be the most common, but Los Angeles, Manhattan, and Chicago also appear. There are a few international destinations (fashionable ones like London and Paris, and their respective countries), but nowhere near as many as in the world news section, where, after the U.S., Iran, China, and Russia take the top spots.
In [51]:
# get top 20 places from world news article bodies
sorted_wp = sorted(world_places.items(), key=operator.itemgetter(1))
sorted_wp.reverse()
sorted_wp[:20]
Out[51]:
In [52]:
# get top 20 places from entertainment article bodies
sorted_ep = sorted(entertainment_places.items(), key=operator.itemgetter(1))
sorted_ep.reverse()
sorted_ep[:20]
Out[52]: