First we perform some basic housekeeping for Jupyter, then load spaCy with a language model for English...
In [1]:
import warnings
warnings.filterwarnings("ignore")
In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")
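If the en_core_web_sm model is not installed yet, it can be downloaded first; in Jupyter this is typically done once with a shell command (standard spaCy usage, not shown in the original notebook):

!python -m spacy download en_core_web_sm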
Create some text to use...
In [3]:
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."
Then add PyTextRank into the spaCy pipeline...
In [4]:
import pytextrank
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc = nlp(text)
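Note that this is the pytextrank 2.x API. In pytextrank 3.x (which targets spaCy 3), the component is registered as a spaCy pipeline factory, so the equivalent setup looks roughly like the sketch below (based on the newer API, not part of the original notebook):

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")   # importing pytextrank registers the "textrank" factory
doc = nlp(text)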
Examine the results: a list of top-ranked phrases in the document
In [5]:
for p in doc._.phrases:
    print("{:.4f} {:5d} {}".format(p.rank, p.count, p.text))
    print(p.chunks)
Construct a list of the sentence boundaries with a phrase vector (initialized to empty set) for each...
In [6]:
sent_bounds = [ [s.start, s.end, set([])] for s in doc.sents ]
sent_bounds
Out[6]:
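Each entry in sent_bounds is [start_token_index, end_token_index, set_of_phrase_ids]; the set starts out empty and gets filled in below. As a quick sanity check (a small sketch, not in the original notebook), each token span can be mapped back to its sentence text:

for start, end, vector in sent_bounds:
    # doc[start:end] is the spaCy Span covering the sentence
    print(start, end, doc[start:end].text[:60], vector)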
Iterate through the top-ranked phrases, adding them to the phrase vector for each sentence...
In [7]:
limit_phrases = 4

phrase_id = 0
unit_vector = []

for p in doc._.phrases:
    print(phrase_id, p.text, p.rank)
    unit_vector.append(p.rank)

    # record, for each sentence, which of the top-ranked phrases occur inside it
    for chunk in p.chunks:
        print(" ", chunk.start, chunk.end)

        for sent_start, sent_end, sent_vector in sent_bounds:
            if chunk.start >= sent_start and chunk.start <= sent_end:
                print(" ", sent_start, chunk.start, chunk.end, sent_end)
                sent_vector.add(phrase_id)
                break

    phrase_id += 1

    if phrase_id == limit_phrases:
        break
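At this point each sentence vector holds the ids of the top-ranked phrases that occur in that sentence. To make those sets easier to read (a small sketch, not in the original notebook), the ids can be mapped back to phrase text:

for sent_start, sent_end, sent_vector in sent_bounds:
    print([doc._.phrases[pid].text for pid in sent_vector])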
Let's take a look at the results...
In [8]:
sent_bounds
Out[8]:
In [9]:
for sent in doc.sents:
    print(sent)
We also construct a unit_vector for all of the phrases, up to the limit requested...
In [10]:
unit_vector
Out[10]:
In [11]:
sum_ranks = sum(unit_vector)
unit_vector = [ rank/sum_ranks for rank in unit_vector ]
unit_vector
Out[11]:
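After normalization the components of unit_vector sum to 1.0, so each component is the fraction of the total rank mass carried by that phrase. A quick check (not in the original notebook):

sum(unit_vector)   # should be approximately 1.0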
Iterate through each sentence, calculating its Euclidean distance from the unit vector...
In [12]:
from math import sqrt

sent_rank = {}
sent_id = 0

for sent_start, sent_end, sent_vector in sent_bounds:
    print(sent_vector)
    sum_sq = 0.0

    # accumulate the squared rank of every top phrase that is NOT in this sentence
    for phrase_id in range(len(unit_vector)):
        print(phrase_id, unit_vector[phrase_id])

        if phrase_id not in sent_vector:
            sum_sq += unit_vector[phrase_id]**2.0

    sent_rank[sent_id] = sqrt(sum_sq)
    sent_id += 1

print(sent_rank)
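In other words, for a sentence s the score is the Euclidean norm of the unit-vector components belonging to phrases that do not appear in s:

d_s = \sqrt{\sum_{i \notin V_s} u_i^2}

where u_i is the normalized rank of phrase i and V_s is the set of phrase ids collected for sentence s. A sentence containing all of the top phrases would have distance 0, so lower distances indicate more representative sentences.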
Sort the sentence indexes by distance, in ascending order (the sentence closest to the unit vector comes first)...
In [13]:
from operator import itemgetter
sorted(sent_rank.items(), key=itemgetter(1))
Out[13]:
Extract the sentences with the lowest distance, up to the limit requested...
In [14]:
limit_sentences = 2

sent_text = {}
sent_id = 0

# collect the text of each sentence, keyed by its index
for sent in doc.sents:
    sent_text[sent_id] = sent.text
    sent_id += 1

num_sent = 0

# print sentences in order of increasing distance, up to the limit
for sent_id, rank in sorted(sent_rank.items(), key=itemgetter(1)):
    print(sent_id, sent_text[sent_id])
    num_sent += 1

    if num_sent == limit_sentences:
        break
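To turn the selected sentences into a readable summary, one common follow-up (a sketch, not part of the original notebook) is to take the closest sentences and restore their original document order before joining them:

# select the ids of the closest sentences, then re-sort them into document order
top_ids = sorted(
    sorted(sent_rank.items(), key=itemgetter(1))[:limit_sentences],
    key=itemgetter(0),
)
summary = " ".join(sent_text[sent_id] for sent_id, rank in top_ids)
print(summary)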