Let's look at how the TextRank algorithm builds a graph from raw text, then extracts the top-ranked phrases from that graph. This work is based on "TextRank: Bringing Order into Text" by Rada Mihalcea and Paul Tarau, Empirical Methods in Natural Language Processing (2004).
First we perform some basic housekeeping for Jupyter, then load spaCy
with a language model for English ...
In [1]:
import warnings
warnings.filterwarnings("ignore")
In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")
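If the en_core_web_sm model isn't installed locally, spacy.load will raise an OSError; it can usually be fetched first from a shell (a setup note, not part of the original notebook):

python -m spacy download en_core_web_sm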
Now, to get started, we'll create some text to use.
In [3]:
#text = "When Ada was twelve years old, this future 'Lady Fairy', as Charles Babbage affectionately called her, decided she wanted to fly. Ada Byron went about the project methodically, thoughtfully, with imagination and passion. Her first step, in February 1828, was to construct wings. She investigated different material and sizes. She considered various materials for the wings: paper, oilsilk, wires, and feathers. She examined the anatomy of birds to determine the right proportion between the wings and the body. She decided to write a book, Flyology, illustrating, with plates, some of her findings. She decided what equipment she would need; for example, a compass, to 'cut across the country by the most direct road', so that she could surmount mountains, rivers, and valleys. Her final step was to integrate steam with the 'art of flying."
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."
doc = nlp(text)
How many sentences are in the parsed document and where are their boundaries?
In [4]:
for sent in doc.sents:
    print(">", sent.start, sent.end)
What are the raw noun chunks in the parsed document, as well as its named entities?
In [5]:
for chunk in doc.noun_chunks:
    print(chunk.text)
In [6]:
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)
Given those details about the parsed document, next we use NetworkX to manage an in-memory graph...
In [7]:
import networkx as nx
def increment_edge (graph, node0, node1):
    print("link {} {}".format(node0, node1))

    if graph.has_edge(node0, node1):
        graph[node0][node1]["weight"] += 1.0
    else:
        graph.add_edge(node0, node1, weight=1.0)
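As a quick, hypothetical sanity check (not part of the original notebook), the helper can be exercised on a throwaway graph to confirm that repeated links accumulate weight:

toy = nx.Graph()
increment_edge(toy, 0, 1)    # first call creates the edge with weight 1.0
increment_edge(toy, 0, 1)    # second call bumps the weight to 2.0
print(toy[0][1]["weight"])   # expected: 2.0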
Then construct a graph, sentence by sentence, based on the spaCy part-of-speech tags:
In [8]:
POS_KEPT = ["ADJ", "NOUN", "PROPN", "VERB"]

def link_sentence (doc, sent, lemma_graph, seen_lemma):
    visited_tokens = []
    visited_nodes = []

    for i in range(sent.start, sent.end):
        token = doc[i]

        if token.pos_ in POS_KEPT:
            key = (token.lemma_, token.pos_)

            if key not in seen_lemma:
                seen_lemma[key] = set([token.i])
            else:
                seen_lemma[key].add(token.i)

            node_id = list(seen_lemma.keys()).index(key)

            if node_id not in lemma_graph:
                lemma_graph.add_node(node_id)

            print("visit {} {}".format(visited_tokens, visited_nodes))
            print("range {}".format(list(range(len(visited_tokens) - 1, -1, -1))))

            # link this node back to previously visited nodes that fall within a
            # three-token co-occurrence window, walking backward until the window
            # is exceeded
            for prev_token in range(len(visited_tokens) - 1, -1, -1):
                print("prev_tok {} {}".format(prev_token, (token.i - visited_tokens[prev_token])))

                if (token.i - visited_tokens[prev_token]) <= 3:
                    increment_edge(lemma_graph, node_id, visited_nodes[prev_token])
                else:
                    break

            print(" -- {} {} {} {} {} {}".format(token.i, token.text, token.lemma_, token.pos_, visited_tokens, visited_nodes))

            visited_tokens.append(token.i)
            visited_nodes.append(node_id)
Now iterate through the sentences to construct the lemma graph...
In [9]:
lemma_graph = nx.Graph()
seen_lemma = {}

for sent in doc.sents:
    link_sentence(doc, sent, lemma_graph, seen_lemma)
    #break # only test one sentence

print(seen_lemma)
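Before plotting, a quick optional inspection (not in the original notebook) shows how large the graph turned out:

print(lemma_graph.number_of_nodes(), lemma_graph.number_of_edges())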
Let's visualize the lemma graph. First we need to collect a dictionary of labels for its nodes.
In [10]:
labels = {}
keys = list(seen_lemma.keys())

for i in range(len(seen_lemma)):
    labels[i] = keys[i][0].lower()

labels
Out[10]:
Then use matplotlib to visualize the lemma graph:
In [11]:
%matplotlib inline
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(9, 9))
pos = nx.spring_layout(lemma_graph)
nx.draw(lemma_graph, pos=pos, with_labels=False, font_weight="bold")
nx.draw_networkx_labels(lemma_graph, pos, labels)
Out[11]:
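As an optional variation (not part of the original figure), the edge widths can be scaled by the accumulated co-occurrence weights so that stronger links stand out:

fig = plt.figure(figsize=(9, 9))
weights = [lemma_graph[u][v]["weight"] for u, v in lemma_graph.edges()]
nx.draw(lemma_graph, pos=pos, with_labels=False, width=weights)
nx.draw_networkx_labels(lemma_graph, pos, labels)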
Now to run the algorithm, we use PageRank – a close relative of eigenvector centrality – to calculate ranks for each of the nodes in the lemma graph.
In [12]:
ranks = nx.pagerank(lemma_graph)
ranks
Out[12]:
In [13]:
for node_id, rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    print(node_id, rank, labels[node_id])
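For intuition about why PageRank behaves like a damped form of eigenvector centrality, here is a minimal power-iteration sketch over the weighted lemma graph. It's an illustration under simplifying assumptions (no special handling of dangling nodes), not how nx.pagerank is implemented, and the helper name pagerank_sketch is ours:

import numpy as np

def pagerank_sketch (graph, alpha=0.85, iters=50):
    nodes = list(graph.nodes())
    n = len(nodes)

    # weighted adjacency matrix, normalized so each column sums to 1
    A = nx.to_numpy_array(graph, nodelist=nodes, weight="weight")
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0.0] = 1.0
    M = A / col_sums

    # repeatedly apply the damped transition: r <- (1 - alpha)/n + alpha * M r
    r = np.full(n, 1.0 / n)

    for _ in range(iters):
        r = (1.0 - alpha) / n + alpha * (M @ r)

    return dict(zip(nodes, r / r.sum()))

# scores should land close to nx.pagerank's output for this graph
pagerank_sketch(lemma_graph)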
Define a function to collect the top-ranked phrases from the lemma graph.
In [14]:
import math

def collect_phrases (chunk, phrases, counts):
    chunk_len = chunk.end - chunk.start + 1
    sq_sum_rank = 0.0
    non_lemma = 0
    compound_key = set([])

    for i in range(chunk.start, chunk.end):
        token = doc[i]
        key = (token.lemma_, token.pos_)

        if key in seen_lemma:
            node_id = list(seen_lemma.keys()).index(key)
            rank = ranks[node_id]
            sq_sum_rank += rank
            compound_key.add(key)

            print(" {} {} {} {}".format(token.lemma_, token.pos_, node_id, rank))
        else:
            non_lemma += 1

    # although the noun chunking is greedy, we discount the ranks using a
    # point estimate based on the number of non-lemma tokens within a phrase
    non_lemma_discount = chunk_len / (chunk_len + (2.0 * non_lemma) + 1.0)

    # use root mean square (RMS) to normalize the contributions of all the tokens
    phrase_rank = math.sqrt(sq_sum_rank / (chunk_len + non_lemma))
    phrase_rank *= non_lemma_discount

    # remove spurious punctuation
    phrase = chunk.text.lower().replace("'", "")

    # create a unique key for the phrase based on its lemma components
    compound_key = tuple(sorted(list(compound_key)))

    if compound_key not in phrases:
        phrases[compound_key] = set([ (phrase, phrase_rank) ])
        counts[compound_key] = 1
    else:
        phrases[compound_key].add( (phrase, phrase_rank) )
        counts[compound_key] += 1

    print("{} {} {} {} {} {}".format(phrase_rank, chunk.text, chunk.start, chunk.end, chunk_len, counts[compound_key]))
Collect the top-ranked phrases based on both the noun chunks and the named entities...
In [15]:
phrases = {}
counts = {}

for chunk in doc.noun_chunks:
    collect_phrases(chunk, phrases, counts)

for ent in doc.ents:
    collect_phrases(ent, phrases, counts)
Since noun chunks can be expressed in different ways (e.g., they may have articles or prepositions), we need to find a minimum span for each phrase based on combinations of lemmas...
In [16]:
import operator

min_phrases = {}

for compound_key, rank_tuples in phrases.items():
    l = list(rank_tuples)
    l.sort(key=operator.itemgetter(1), reverse=True)

    phrase, rank = l[0]
    count = counts[compound_key]

    min_phrases[phrase] = (rank, count)
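As a stylistic aside (not from the original notebook), the same selection can be written with max() instead of sorting each set of variants:

min_phrases = {}

for compound_key, rank_tuples in phrases.items():
    # pick the highest-ranked surface form for this set of lemmas
    phrase, rank = max(rank_tuples, key=operator.itemgetter(1))
    min_phrases[phrase] = (rank, counts[compound_key])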
Yield the results of TextRank...
In [17]:
for phrase, (rank, count) in sorted(min_phrases.items(), key=lambda x: x[1][0], reverse=True):
    print(phrase, count, rank)
Just for kicks, compare with the raw rankings of the non-chunked lemma nodes...
In [18]:
for node_id, rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    print(labels[node_id], rank)