Explain PyTextRank: the algorithm

Let's look at the TextRank algorithm used to build a graph from raw text, and then extract the top-ranked phrases from that graph. This work is based on "TextRank: Bringing Order into Text", Rada Mihalcea and Paul Tarau, Empirical Methods in Natural Language Processing (2004).
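At a high level, the pipeline builds a co-occurrence graph over lemmas, then runs PageRank over it. Here's a minimal end-to-end sketch on a toy example; the co-occurrence pairs are assumed stand-ins for what the spaCy parse produces below:

```python
import networkx as nx

# Assumed toy co-occurrence pairs standing in for lemmas that appear
# near each other in a text; in the real pipeline these come from the
# spaCy parse of each sentence.
pairs = [("linear", "system"), ("linear", "equation"),
         ("system", "equation"), ("system", "solution")]

g = nx.Graph()
for a, b in pairs:
    if g.has_edge(a, b):
        g[a][b]["weight"] += 1.0  # repeated co-occurrence bumps the weight
    else:
        g.add_edge(a, b, weight=1.0)

# PageRank scores the nodes; higher-ranked lemmas seed the key phrases
ranks = nx.pagerank(g)
top = max(ranks, key=ranks.get)
```

On this toy graph, the best-connected lemma ("system") ranks highest, which is exactly the effect we'll see on the full abstract below.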


First we perform some basic housekeeping for Jupyter, then load spaCy with a language model for English ...


In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

Now, to get started, we'll create some text to use.


In [3]:
#text = "When Ada was twelve years old, this future 'Lady Fairy', as Charles Babbage affectionately called her, decided she wanted to fly. Ada Byron went about the project methodically, thoughtfully, with imagination and passion. Her first step, in February 1828, was to construct wings. She investigated different material and sizes. She considered various materials for the wings: paper, oilsilk, wires, and feathers. She examined the anatomy of birds to determine the right proportion between the wings and the body. She decided to write a book, Flyology, illustrating, with plates, some of her findings. She decided what equipment she would need; for example, a compass, to 'cut across the country by the most direct road', so that she could surmount mountains, rivers, and valleys. Her final step was to integrate steam with the 'art of flying."

text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

doc = nlp(text)

How many sentences are in the parsed document and where are their boundaries?


In [4]:
for sent in doc.sents:
    print(">", sent.start, sent.end)


> 0 13
> 13 33
> 33 61
> 61 91

What are the raw noun chunks in the parsed document, as well as its named entities?


In [5]:
for chunk in doc.noun_chunks:
    print(chunk.text)


Compatibility
systems
linear constraints
the set
natural numbers
Criteria
compatibility
a system
linear Diophantine equations
strict inequations
nonstrict inequations
Upper bounds
components
a minimal set
solutions
algorithms
construction
minimal generating sets
solutions
all types
systems
These criteria
the corresponding algorithms
a minimal supporting set
solutions
all the considered types systems
systems
mixed types

In [6]:
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)


Diophantine ORG 21 22

Given those details about the parsed document, next we use NetworkX to manage an in-memory graph...


In [7]:
import networkx as nx

def increment_edge (graph, node0, node1):
    print("link {} {}".format(node0, node1))
    
    if graph.has_edge(node0, node1):
        graph[node0][node1]["weight"] += 1.0
    else:
        graph.add_edge(node0, node1, weight=1.0)

Then construct a graph, sentence by sentence, based on the spaCy part-of-speech tags:


In [8]:
POS_KEPT = ["ADJ", "NOUN", "PROPN", "VERB"]

def link_sentence (doc, sent, lemma_graph, seen_lemma):
    visited_tokens = []
    visited_nodes = []

    for i in range(sent.start, sent.end):
        token = doc[i]

        if token.pos_ in POS_KEPT:
            key = (token.lemma_, token.pos_)

            if key not in seen_lemma:
                seen_lemma[key] = set([token.i])
            else:
                seen_lemma[key].add(token.i)

            node_id = list(seen_lemma.keys()).index(key)

            if node_id not in lemma_graph:
                lemma_graph.add_node(node_id)

            print("visit {} {}".format(visited_tokens, visited_nodes))
            print("range {}".format(list(range(len(visited_tokens) - 1, -1, -1))))
            
            for prev_token in range(len(visited_tokens) - 1, -1, -1):
                print("prev_tok {} {}".format(prev_token, (token.i - visited_tokens[prev_token])))
                
                if (token.i - visited_tokens[prev_token]) <= 3:
                    increment_edge(lemma_graph, node_id, visited_nodes[prev_token])
                else:
                    break

            print(" -- {} {} {} {} {} {}".format(token.i, token.text, token.lemma_, token.pos_, visited_tokens, visited_nodes))

            visited_tokens.append(token.i)
            visited_nodes.append(node_id)
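The inner loop above links each kept token back to previously visited tokens that fall within three positions of it, stopping at the first one that is too far back. Here's that windowing rule as a standalone sketch, using hypothetical token positions (an assumed example, not a spaCy doc):

```python
# Hypothetical token indices of kept tokens in one sentence, mirroring
# visited_tokens in link_sentence above.
visited = [0, 2, 4, 5]
new_token = 8  # index of the token being visited next

linked = []
# walk back over earlier tokens, most recent first, linking any within
# a 3-token window; break at the first one that falls outside it
for j in range(len(visited) - 1, -1, -1):
    if new_token - visited[j] <= 3:
        linked.append(visited[j])
    else:
        break
```

With these positions only token 5 gets linked (8 - 5 = 3), matching the trace for token 8 ("set") in the output below.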

Now iterate through the sentences to construct the lemma graph...


In [9]:
lemma_graph = nx.Graph()
seen_lemma = {}

for sent in doc.sents:
    link_sentence(doc, sent, lemma_graph, seen_lemma)
    #break # only test one sentence

print(seen_lemma)


visit [] []
range []
 -- 0 Compatibility compatibility NOUN [] []
visit [0] [0]
range [0]
prev_tok 0 2
link 1 0
 -- 2 systems system NOUN [0] [0]
visit [0, 2] [0, 1]
range [1, 0]
prev_tok 1 2
link 2 1
prev_tok 0 4
 -- 4 linear linear ADJ [0, 2] [0, 1]
visit [0, 2, 4] [0, 1, 2]
range [2, 1, 0]
prev_tok 2 1
link 3 2
prev_tok 1 3
link 3 1
prev_tok 0 5
 -- 5 constraints constraint NOUN [0, 2, 4] [0, 1, 2]
visit [0, 2, 4, 5] [0, 1, 2, 3]
range [3, 2, 1, 0]
prev_tok 3 3
link 4 3
prev_tok 2 4
 -- 8 set set NOUN [0, 2, 4, 5] [0, 1, 2, 3]
visit [0, 2, 4, 5, 8] [0, 1, 2, 3, 4]
range [4, 3, 2, 1, 0]
prev_tok 4 2
link 5 4
prev_tok 3 5
 -- 10 natural natural ADJ [0, 2, 4, 5, 8] [0, 1, 2, 3, 4]
visit [0, 2, 4, 5, 8, 10] [0, 1, 2, 3, 4, 5]
range [5, 4, 3, 2, 1, 0]
prev_tok 5 1
link 6 5
prev_tok 4 3
link 6 4
prev_tok 3 6
 -- 11 numbers number NOUN [0, 2, 4, 5, 8, 10] [0, 1, 2, 3, 4, 5]
visit [] []
range []
 -- 13 Criteria criterion NOUN [] []
visit [13] [7]
range [0]
prev_tok 0 2
link 0 7
 -- 15 compatibility compatibility NOUN [13] [7]
visit [13, 15] [7, 0]
range [1, 0]
prev_tok 1 3
link 1 0
prev_tok 0 5
 -- 18 system system NOUN [13, 15] [7, 0]
visit [13, 15, 18] [7, 0, 1]
range [2, 1, 0]
prev_tok 2 2
link 2 1
prev_tok 1 5
 -- 20 linear linear ADJ [13, 15, 18] [7, 0, 1]
visit [13, 15, 18, 20] [7, 0, 1, 2]
range [3, 2, 1, 0]
prev_tok 3 1
link 8 2
prev_tok 2 3
link 8 1
prev_tok 1 6
 -- 21 Diophantine Diophantine PROPN [13, 15, 18, 20] [7, 0, 1, 2]
visit [13, 15, 18, 20, 21] [7, 0, 1, 2, 8]
range [4, 3, 2, 1, 0]
prev_tok 4 1
link 9 8
prev_tok 3 2
link 9 2
prev_tok 2 4
 -- 22 equations equation NOUN [13, 15, 18, 20, 21] [7, 0, 1, 2, 8]
visit [13, 15, 18, 20, 21, 22] [7, 0, 1, 2, 8, 9]
range [5, 4, 3, 2, 1, 0]
prev_tok 5 2
link 10 9
prev_tok 4 3
link 10 8
prev_tok 3 4
 -- 24 strict strict ADJ [13, 15, 18, 20, 21, 22] [7, 0, 1, 2, 8, 9]
visit [13, 15, 18, 20, 21, 22, 24] [7, 0, 1, 2, 8, 9, 10]
range [6, 5, 4, 3, 2, 1, 0]
prev_tok 6 1
link 11 10
prev_tok 5 3
link 11 9
prev_tok 4 4
 -- 25 inequations inequation NOUN [13, 15, 18, 20, 21, 22, 24] [7, 0, 1, 2, 8, 9, 10]
visit [13, 15, 18, 20, 21, 22, 24, 25] [7, 0, 1, 2, 8, 9, 10, 11]
range [7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 7 3
link 12 11
prev_tok 6 4
 -- 28 nonstrict nonstrict NOUN [13, 15, 18, 20, 21, 22, 24, 25] [7, 0, 1, 2, 8, 9, 10, 11]
visit [13, 15, 18, 20, 21, 22, 24, 25, 28] [7, 0, 1, 2, 8, 9, 10, 11, 12]
range [8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 8 1
link 11 12
prev_tok 7 4
 -- 29 inequations inequation NOUN [13, 15, 18, 20, 21, 22, 24, 25, 28] [7, 0, 1, 2, 8, 9, 10, 11, 12]
visit [13, 15, 18, 20, 21, 22, 24, 25, 28, 29] [7, 0, 1, 2, 8, 9, 10, 11, 12, 11]
range [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 9 2
link 13 11
prev_tok 8 3
link 13 12
prev_tok 7 6
 -- 31 considered consider VERB [13, 15, 18, 20, 21, 22, 24, 25, 28, 29] [7, 0, 1, 2, 8, 9, 10, 11, 12, 11]
visit [] []
range []
 -- 33 Upper upper ADJ [] []
visit [33] [14]
range [0]
prev_tok 0 1
link 15 14
 -- 34 bounds bound NOUN [33] [14]
visit [33, 34] [14, 15]
range [1, 0]
prev_tok 1 2
link 16 15
prev_tok 0 3
link 16 14
 -- 36 components component NOUN [33, 34] [14, 15]
visit [33, 34, 36] [14, 15, 16]
range [2, 1, 0]
prev_tok 2 3
link 17 16
prev_tok 1 5
 -- 39 minimal minimal ADJ [33, 34, 36] [14, 15, 16]
visit [33, 34, 36, 39] [14, 15, 16, 17]
range [3, 2, 1, 0]
prev_tok 3 1
link 4 17
prev_tok 2 4
 -- 40 set set NOUN [33, 34, 36, 39] [14, 15, 16, 17]
visit [33, 34, 36, 39, 40] [14, 15, 16, 17, 4]
range [4, 3, 2, 1, 0]
prev_tok 4 2
link 18 4
prev_tok 3 3
link 18 17
prev_tok 2 6
 -- 42 solutions solution NOUN [33, 34, 36, 39, 40] [14, 15, 16, 17, 4]
visit [33, 34, 36, 39, 40, 42] [14, 15, 16, 17, 4, 18]
range [5, 4, 3, 2, 1, 0]
prev_tok 5 2
link 19 18
prev_tok 4 4
 -- 44 algorithms algorithm NOUN [33, 34, 36, 39, 40, 42] [14, 15, 16, 17, 4, 18]
visit [33, 34, 36, 39, 40, 42, 44] [14, 15, 16, 17, 4, 18, 19]
range [6, 5, 4, 3, 2, 1, 0]
prev_tok 6 2
link 20 19
prev_tok 5 4
 -- 46 construction construction NOUN [33, 34, 36, 39, 40, 42, 44] [14, 15, 16, 17, 4, 18, 19]
visit [33, 34, 36, 39, 40, 42, 44, 46] [14, 15, 16, 17, 4, 18, 19, 20]
range [7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 7 2
link 17 20
prev_tok 6 4
 -- 48 minimal minimal ADJ [33, 34, 36, 39, 40, 42, 44, 46] [14, 15, 16, 17, 4, 18, 19, 20]
visit [33, 34, 36, 39, 40, 42, 44, 46, 48] [14, 15, 16, 17, 4, 18, 19, 20, 17]
range [8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 8 1
link 21 17
prev_tok 7 3
link 21 20
prev_tok 6 5
 -- 49 generating generating NOUN [33, 34, 36, 39, 40, 42, 44, 46, 48] [14, 15, 16, 17, 4, 18, 19, 20, 17]
visit [33, 34, 36, 39, 40, 42, 44, 46, 48, 49] [14, 15, 16, 17, 4, 18, 19, 20, 17, 21]
range [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 9 1
link 4 21
prev_tok 8 2
link 4 17
prev_tok 7 4
 -- 50 sets set NOUN [33, 34, 36, 39, 40, 42, 44, 46, 48, 49] [14, 15, 16, 17, 4, 18, 19, 20, 17, 21]
visit [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50] [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4]
range [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 10 2
link 18 4
prev_tok 9 3
link 18 21
prev_tok 8 4
 -- 52 solutions solution NOUN [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50] [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4]
visit [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52] [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18]
range [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 11 3
link 22 18
prev_tok 10 5
 -- 55 types type NOUN [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52] [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18]
visit [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52, 55] [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18, 22]
range [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 12 2
link 1 22
prev_tok 11 5
 -- 57 systems system NOUN [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52, 55] [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18, 22]
visit [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52, 55, 57] [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18, 22, 1]
range [13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 13 2
link 23 1
prev_tok 12 4
 -- 59 given give VERB [33, 34, 36, 39, 40, 42, 44, 46, 48, 49, 50, 52, 55, 57] [14, 15, 16, 17, 4, 18, 19, 20, 17, 21, 4, 18, 22, 1]
visit [] []
range []
 -- 62 criteria criterion NOUN [] []
visit [62] [7]
range [0]
prev_tok 0 3
link 24 7
 -- 65 corresponding corresponding ADJ [62] [7]
visit [62, 65] [7, 24]
range [1, 0]
prev_tok 1 1
link 19 24
prev_tok 0 4
 -- 66 algorithms algorithm NOUN [62, 65] [7, 24]
visit [62, 65, 66] [7, 24, 19]
range [2, 1, 0]
prev_tok 2 2
link 25 19
prev_tok 1 3
link 25 24
prev_tok 0 6
 -- 68 constructing construct VERB [62, 65, 66] [7, 24, 19]
visit [62, 65, 66, 68] [7, 24, 19, 25]
range [3, 2, 1, 0]
prev_tok 3 2
link 17 25
prev_tok 2 4
 -- 70 minimal minimal ADJ [62, 65, 66, 68] [7, 24, 19, 25]
visit [62, 65, 66, 68, 70] [7, 24, 19, 25, 17]
range [4, 3, 2, 1, 0]
prev_tok 4 1
link 26 17
prev_tok 3 3
link 26 25
prev_tok 2 5
 -- 71 supporting support VERB [62, 65, 66, 68, 70] [7, 24, 19, 25, 17]
visit [62, 65, 66, 68, 70, 71] [7, 24, 19, 25, 17, 26]
range [5, 4, 3, 2, 1, 0]
prev_tok 5 1
link 4 26
prev_tok 4 2
link 4 17
prev_tok 3 4
 -- 72 set set NOUN [62, 65, 66, 68, 70, 71] [7, 24, 19, 25, 17, 26]
visit [62, 65, 66, 68, 70, 71, 72] [7, 24, 19, 25, 17, 26, 4]
range [6, 5, 4, 3, 2, 1, 0]
prev_tok 6 2
link 18 4
prev_tok 5 3
link 18 26
prev_tok 4 4
 -- 74 solutions solution NOUN [62, 65, 66, 68, 70, 71, 72] [7, 24, 19, 25, 17, 26, 4]
visit [62, 65, 66, 68, 70, 71, 72, 74] [7, 24, 19, 25, 17, 26, 4, 18]
range [7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 7 3
link 27 18
prev_tok 6 5
 -- 77 used use VERB [62, 65, 66, 68, 70, 71, 72, 74] [7, 24, 19, 25, 17, 26, 4, 18]
visit [62, 65, 66, 68, 70, 71, 72, 74, 77] [7, 24, 19, 25, 17, 26, 4, 18, 27]
range [8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 8 2
link 28 27
prev_tok 7 5
 -- 79 solving solve VERB [62, 65, 66, 68, 70, 71, 72, 74, 77] [7, 24, 19, 25, 17, 26, 4, 18, 27]
visit [62, 65, 66, 68, 70, 71, 72, 74, 77, 79] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28]
range [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 9 3
link 13 28
prev_tok 8 5
 -- 82 considered consider VERB [62, 65, 66, 68, 70, 71, 72, 74, 77, 79] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28]
visit [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13]
range [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 10 1
link 22 13
prev_tok 9 4
 -- 83 types type NOUN [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13]
visit [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22]
range [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 11 1
link 1 22
prev_tok 10 2
link 1 13
prev_tok 9 5
 -- 84 systems system NOUN [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22]
visit [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1]
range [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 12 2
link 1 1
prev_tok 11 3
link 1 22
prev_tok 10 4
 -- 86 systems system NOUN [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1]
visit [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84, 86] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1, 1]
range [13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 13 2
link 29 1
prev_tok 12 4
 -- 88 mixed mixed ADJ [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84, 86] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1, 1]
visit [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84, 86, 88] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1, 1, 29]
range [14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
prev_tok 14 1
link 22 29
prev_tok 13 3
link 22 1
prev_tok 12 5
 -- 89 types type NOUN [62, 65, 66, 68, 70, 71, 72, 74, 77, 79, 82, 83, 84, 86, 88] [7, 24, 19, 25, 17, 26, 4, 18, 27, 28, 13, 22, 1, 1, 29]
{('compatibility', 'NOUN'): {0, 15}, ('system', 'NOUN'): {2, 18, 84, 86, 57}, ('linear', 'ADJ'): {4, 20}, ('constraint', 'NOUN'): {5}, ('set', 'NOUN'): {8, 40, 72, 50}, ('natural', 'ADJ'): {10}, ('number', 'NOUN'): {11}, ('criterion', 'NOUN'): {13, 62}, ('Diophantine', 'PROPN'): {21}, ('equation', 'NOUN'): {22}, ('strict', 'ADJ'): {24}, ('inequation', 'NOUN'): {25, 29}, ('nonstrict', 'NOUN'): {28}, ('consider', 'VERB'): {82, 31}, ('upper', 'ADJ'): {33}, ('bound', 'NOUN'): {34}, ('component', 'NOUN'): {36}, ('minimal', 'ADJ'): {48, 70, 39}, ('solution', 'NOUN'): {42, 52, 74}, ('algorithm', 'NOUN'): {66, 44}, ('construction', 'NOUN'): {46}, ('generating', 'NOUN'): {49}, ('type', 'NOUN'): {89, 83, 55}, ('give', 'VERB'): {59}, ('corresponding', 'ADJ'): {65}, ('construct', 'VERB'): {68}, ('support', 'VERB'): {71}, ('use', 'VERB'): {77}, ('solve', 'VERB'): {79}, ('mixed', 'ADJ'): {88}}

Let's visualize the lemma graph; to do that, first we need to collect a dictionary of labels for its nodes.


In [10]:
labels = {}
keys = list(seen_lemma.keys())

for i in range(len(seen_lemma)):
    labels[i] = keys[i][0].lower()

labels


Out[10]:
{0: 'compatibility',
 1: 'system',
 2: 'linear',
 3: 'constraint',
 4: 'set',
 5: 'natural',
 6: 'number',
 7: 'criterion',
 8: 'diophantine',
 9: 'equation',
 10: 'strict',
 11: 'inequation',
 12: 'nonstrict',
 13: 'consider',
 14: 'upper',
 15: 'bound',
 16: 'component',
 17: 'minimal',
 18: 'solution',
 19: 'algorithm',
 20: 'construction',
 21: 'generating',
 22: 'type',
 23: 'give',
 24: 'corresponding',
 25: 'construct',
 26: 'support',
 27: 'use',
 28: 'solve',
 29: 'mixed'}

Then use matplotlib to visualize the lemma graph:


In [11]:
%matplotlib inline
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(9, 9))
pos = nx.spring_layout(lemma_graph)

nx.draw(lemma_graph, pos=pos, with_labels=False, font_weight="bold")
nx.draw_networkx_labels(lemma_graph, pos, labels)


Out[11]:
{0: Text(0.0012901698206639074, -0.24488563620557569, 'compatibility'),
 1: Text(-0.244224072417844, -0.33005011070048307, 'system'),
 2: Text(-0.28588876173627664, -0.4785892801098886, 'linear'),
 3: Text(-0.004250675507954382, -0.2927056259906219, 'constraint'),
 4: Text(0.25798107760457045, 0.08359323116936315, 'set'),
 5: Text(0.5614341045628787, -0.03475907259237776, 'natural'),
 6: Text(0.4956226628076055, -0.11906952225278584, 'number'),
 7: Text(0.3055938841559193, -0.16029282486461344, 'criterion'),
 8: Text(-0.4434103507437436, -0.5453429552416424, 'diophantine'),
 9: Text(-0.5625046690114252, -0.5782145949183082, 'equation'),
 10: Text(-0.6738091037489825, -0.5696139410986532, 'strict'),
 11: Text(-0.730296572849753, -0.40427662390703495, 'inequation'),
 12: Text(-0.8065626583538854, -0.2780670433730142, 'nonstrict'),
 13: Text(-0.5384807160505934, -0.20176294539683995, 'consider'),
 14: Text(0.5499271777610776, 1.0, 'upper'),
 15: Text(0.6574525528780653, 0.9522956279393495, 'bound'),
 16: Text(0.4944302978164222, 0.7403101781602907, 'component'),
 17: Text(0.3094440868947073, 0.3372681910005853, 'minimal'),
 18: Text(0.09655405473184611, 0.15363866760526132, 'solution'),
 19: Text(0.38197157731641146, 0.2989552937926355, 'algorithm'),
 20: Text(0.2747578313773389, 0.4918332820043162, 'construction'),
 21: Text(0.17124168168496184, 0.3240907459361337, 'generating'),
 22: Text(-0.23229740292004417, -0.15292900871503107, 'type'),
 23: Text(-0.2367544003192028, -0.6487936224180363, 'give'),
 24: Text(0.49655787516541655, 0.09609814992072034, 'corresponding'),
 25: Text(0.5144035996619168, 0.2899793739981105, 'construct'),
 26: Text(0.3213558333044916, 0.20819127644940214, 'support'),
 27: Text(-0.24941949469493382, 0.20823979838909723, 'use'),
 28: Text(-0.507446444795476, 0.08822154979653159, 'solve'),
 29: Text(-0.3746731443941781, -0.23336255837689104, 'mixed')}

Now to run the algorithm, we use PageRank, which approximates eigenvector centrality, to calculate ranks for each of the nodes in the lemma graph.
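As a quick sanity check that PageRank tracks eigenvector centrality, here's a comparison on a small assumed graph (not the lemma graph itself):

```python
import networkx as nx

# small assumed graph: node 2 has degree 3, node 3 has degree 1
g = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

pr = nx.pagerank(g)
ev = nx.eigenvector_centrality(g)

# both measures order the nodes by how well-connected they are
pr_order = sorted(pr, key=pr.get, reverse=True)
ev_order = sorted(ev, key=ev.get, reverse=True)
```

Both measures put the best-connected node first and the leaf node last; PageRank adds damping and normalizes by out-degree, but on undirected graphs like ours the orderings largely agree.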


In [12]:
ranks = nx.pagerank(lemma_graph)
ranks


Out[12]:
{0: 0.025190055141357165,
 1: 0.09709174565608479,
 2: 0.03658656272138432,
 3: 0.022947339381092696,
 4: 0.07548767636963792,
 5: 0.01884004785394172,
 6: 0.01884004785394172,
 7: 0.019767433918161055,
 8: 0.03093250736456608,
 9: 0.031636552282656216,
 10: 0.0250439175297852,
 11: 0.03969617593153302,
 12: 0.02513276636673567,
 13: 0.0390375393827704,
 14: 0.02428614673389346,
 15: 0.02428614673389346,
 16: 0.031629446298645975,
 17: 0.06334806476862227,
 18: 0.061826419749828485,
 19: 0.03201021345587308,
 20: 0.02404712231242087,
 21: 0.029468555366439973,
 22: 0.04816979699201436,
 23: 0.010894627731045426,
 24: 0.026930354700088012,
 25: 0.03165915652710971,
 26: 0.029382686223731833,
 27: 0.019263849959229938,
 28: 0.01982332786706688,
 29: 0.016743716826448173}

In [13]:
for node_id, rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    print(node_id, rank, labels[node_id])


1 0.09709174565608479 system
4 0.07548767636963792 set
17 0.06334806476862227 minimal
18 0.061826419749828485 solution
22 0.04816979699201436 type
11 0.03969617593153302 inequation
13 0.0390375393827704 consider
2 0.03658656272138432 linear
19 0.03201021345587308 algorithm
25 0.03165915652710971 construct
9 0.031636552282656216 equation
16 0.031629446298645975 component
8 0.03093250736456608 diophantine
21 0.029468555366439973 generating
26 0.029382686223731833 support
24 0.026930354700088012 corresponding
0 0.025190055141357165 compatibility
12 0.02513276636673567 nonstrict
10 0.0250439175297852 strict
14 0.02428614673389346 upper
15 0.02428614673389346 bound
20 0.02404712231242087 construction
3 0.022947339381092696 constraint
28 0.01982332786706688 solve
7 0.019767433918161055 criterion
27 0.019263849959229938 use
5 0.01884004785394172 natural
6 0.01884004785394172 number
29 0.016743716826448173 mixed
23 0.010894627731045426 give

Define a function to collect the top-ranked phrases from the lemma graph.


In [14]:
import math

def collect_phrases (chunk, phrases, counts):
    chunk_len = chunk.end - chunk.start + 1
    sq_sum_rank = 0.0
    non_lemma = 0
    compound_key = set([])

    for i in range(chunk.start, chunk.end):
        token = doc[i]
        key = (token.lemma_, token.pos_)
        
        if key in seen_lemma:
            node_id = list(seen_lemma.keys()).index(key)
            rank = ranks[node_id]
            sq_sum_rank += rank
            compound_key.add(key)
        
            print(" {} {} {} {}".format(token.lemma_, token.pos_, node_id, rank))
        else:
            non_lemma += 1
    
    # although the noun chunking is greedy, we discount the ranks using a
    # point estimate based on the number of non-lemma tokens within a phrase
    non_lemma_discount = chunk_len / (chunk_len + (2.0 * non_lemma) + 1.0)

    # use root mean square (RMS) to normalize the contributions of all the tokens
    phrase_rank = math.sqrt(sq_sum_rank / (chunk_len + non_lemma))
    phrase_rank *= non_lemma_discount

    # remove spurious punctuation
    phrase = chunk.text.lower().replace("'", "")

    # create a unique key for the phrase based on its lemma components
    compound_key = tuple(sorted(list(compound_key)))
    
    if not compound_key in phrases:
        phrases[compound_key] = set([ (phrase, phrase_rank) ])
        counts[compound_key] = 1
    else:
        phrases[compound_key].add( (phrase, phrase_rank) )
        counts[compound_key] += 1

    print("{} {} {} {} {} {}".format(phrase_rank, chunk.text, chunk.start, chunk.end, chunk_len, counts[compound_key]))
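To see how the two normalization terms behave, here's the same arithmetic on made-up numbers: a hypothetical chunk counted as three tokens, one of which has no lemma node:

```python
import math

# assumed example values, not taken from the document above
chunk_len = 3       # token count for the chunk
non_lemma = 1       # tokens with no entry in seen_lemma
sq_sum_rank = 0.09  # sum of the ranks of the matched lemma nodes

# greedy chunks get discounted by the share of non-lemma tokens
non_lemma_discount = chunk_len / (chunk_len + (2.0 * non_lemma) + 1.0)

# RMS-style normalization over all tokens in the chunk
phrase_rank = math.sqrt(sq_sum_rank / (chunk_len + non_lemma))
phrase_rank *= non_lemma_discount
```

Here the discount is 3 / 6 = 0.5 and the RMS term is sqrt(0.09 / 4) = 0.15, giving a phrase rank of 0.075 — each non-lemma token both dilutes the RMS average and shrinks the discount factor.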

Collect the top-ranked phrases based on both the noun chunks and the named entities...


In [15]:
phrases = {}
counts = {}

for chunk in doc.noun_chunks:
    collect_phrases(chunk, phrases, counts)

for ent in doc.ents:
    collect_phrases(ent, phrases, counts)


 compatibility NOUN 0 0.025190055141357165
0.07481838030465979 Compatibility 0 1 2 1
 system NOUN 1 0.09709174565608479
0.14688751982088183 systems 2 3 2 1
 linear ADJ 2 0.03658656272138432
 constraint NOUN 3 0.022947339381092696
0.10565323773654285 linear constraints 4 6 3 1
 set NOUN 4 0.07548767636963792
0.06868755180600318 the set 7 9 3 1
 natural ADJ 5 0.01884004785394172
 number NOUN 6 0.01884004785394172
0.08405366110543994 natural numbers 10 12 3 1
 criterion NOUN 7 0.019767433918161055
0.06627792311867262 Criteria 13 14 2 1
 compatibility NOUN 0 0.025190055141357165
0.07481838030465979 compatibility 15 16 2 2
 system NOUN 1 0.09709174565608479
0.07789887100276421 a system 17 19 3 2
 linear ADJ 2 0.03658656272138432
 Diophantine PROPN 8 0.03093250736456608
 equation NOUN 9 0.031636552282656216
0.1259559430077718 linear Diophantine equations 20 23 4 1
 strict ADJ 10 0.0250439175297852
 inequation NOUN 11 0.03969617593153302
0.11017607509798653 strict inequations 24 26 3 1
 nonstrict NOUN 12 0.02513276636673567
 inequation NOUN 11 0.03969617593153302
0.11025165160180313 nonstrict inequations 28 30 3 1
 upper ADJ 14 0.02428614673389346
 bound NOUN 15 0.02428614673389346
0.09543220119650414 Upper bounds 33 35 3 1
 component NOUN 16 0.031629446298645975
0.08383773520404489 components 36 37 2 1
 minimal ADJ 17 0.06334806476862227
 set NOUN 4 0.07548767636963792
0.0952198714085986 a minimal set 38 41 4 1
 solution NOUN 18 0.061826419749828485
0.1172143523159633 solutions 42 43 2 1
 algorithm NOUN 19 0.03201021345587308
0.0843408606072513 algorithms 44 45 2 1
 construction NOUN 20 0.02404712231242087
0.07310133349204889 construction 46 47 2 1
 minimal ADJ 17 0.06334806476862227
 generating NOUN 21 0.029468555366439973
 set NOUN 4 0.07548767636963792
0.1640996265710316 minimal generating sets 48 51 4 1
 solution NOUN 18 0.061826419749828485
0.1172143523159633 solutions 52 53 2 2
 type NOUN 22 0.04816979699201436
0.05486904693906117 all types 54 56 3 1
 system NOUN 1 0.09709174565608479
0.14688751982088183 systems 57 58 2 3
 criterion NOUN 7 0.019767433918161055
0.035149176660130545 These criteria 61 63 3 2
 corresponding ADJ 24 0.026930354700088012
 algorithm NOUN 19 0.03201021345587308
0.06204175981712335 the corresponding algorithms 64 67 4 1
 minimal ADJ 17 0.06334806476862227
 support VERB 26 0.029382686223731833
 set NOUN 4 0.07548767636963792
0.10465046837630346 a minimal supporting set 69 73 5 1
 solution NOUN 18 0.061826419749828485
0.1172143523159633 solutions 74 75 2 3
 consider VERB 13 0.0390375393827704
 type NOUN 22 0.04816979699201436
 system NOUN 1 0.09709174565608479
0.08278948056400119 all the considered types systems 80 85 6 1
 system NOUN 1 0.09709174565608479
0.14688751982088183 systems 86 87 2 4
 mixed ADJ 29 0.016743716826448173
 type NOUN 22 0.04816979699201436
0.1103235416443912 mixed types 88 90 3 1
 Diophantine PROPN 8 0.03093250736456608
0.082908929105731 Diophantine 21 22 2 1

Since noun chunks can be expressed in different ways (e.g., they may have articles or prepositions), we need to find a minimum span for each phrase based on combinations of lemmas...
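For instance, the surface forms "systems" and "a system" both reduce to the single compound key ('system', 'NOUN'). Picking the best surface form per key can be sketched as follows, using assumed ranks:

```python
# hypothetical variants sharing one compound lemma key, as (text, rank)
variants = {("system", "NOUN"): {("systems", 0.147),
                                 ("a system", 0.078)}}

min_phrases = {}
for key, rank_tuples in variants.items():
    # keep only the highest-ranked surface form for each lemma combination
    phrase, rank = max(rank_tuples, key=lambda t: t[1])
    min_phrases[phrase] = rank
```

So "systems" survives and "a system" is dropped, which is what the `In [16]` cell below does across all the collected phrases.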


In [16]:
import operator

min_phrases = {}

for compound_key, rank_tuples in phrases.items():
    l = list(rank_tuples)
    l.sort(key=operator.itemgetter(1), reverse=True)
    
    phrase, rank = l[0]
    count = counts[compound_key]
    
    min_phrases[phrase] = (rank, count)

Yield the results of TextRank...


In [17]:
for phrase, (rank, count) in sorted(min_phrases.items(), key=lambda x: x[1][0], reverse=True):
    print(phrase, count, rank)


minimal generating sets 1 0.1640996265710316
systems 4 0.14688751982088183
linear diophantine equations 1 0.1259559430077718
solutions 3 0.1172143523159633
mixed types 1 0.1103235416443912
nonstrict inequations 1 0.11025165160180313
strict inequations 1 0.11017607509798653
linear constraints 1 0.10565323773654285
a minimal supporting set 1 0.10465046837630346
upper bounds 1 0.09543220119650414
a minimal set 1 0.0952198714085986
algorithms 1 0.0843408606072513
natural numbers 1 0.08405366110543994
components 1 0.08383773520404489
diophantine 1 0.082908929105731
all the considered types systems 1 0.08278948056400119
compatibility 2 0.07481838030465979
construction 1 0.07310133349204889
the set 1 0.06868755180600318
criteria 2 0.06627792311867262
the corresponding algorithms 1 0.06204175981712335
all types 1 0.05486904693906117

Just for kicks, compare with the raw rankings of the individual lemma nodes, without chunking...


In [18]:
for node_id, rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    print(labels[node_id], rank)


system 0.09709174565608479
set 0.07548767636963792
minimal 0.06334806476862227
solution 0.061826419749828485
type 0.04816979699201436
inequation 0.03969617593153302
consider 0.0390375393827704
linear 0.03658656272138432
algorithm 0.03201021345587308
construct 0.03165915652710971
equation 0.031636552282656216
component 0.031629446298645975
diophantine 0.03093250736456608
generating 0.029468555366439973
support 0.029382686223731833
corresponding 0.026930354700088012
compatibility 0.025190055141357165
nonstrict 0.02513276636673567
strict 0.0250439175297852
upper 0.02428614673389346
bound 0.02428614673389346
construction 0.02404712231242087
constraint 0.022947339381092696
solve 0.01982332786706688
criterion 0.019767433918161055
use 0.019263849959229938
natural 0.01884004785394172
number 0.01884004785394172
mixed 0.016743716826448173
give 0.010894627731045426