First we perform some basic housekeeping for Jupyter, then load spaCy with a language model for English...
In [1]:
import warnings
warnings.filterwarnings("ignore")
In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")
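If the en_core_web_sm model is not installed yet, it can be downloaded first; in Jupyter this is typically done once with a shell command (standard spaCy usage, not shown in the original notebook):

!python -m spacy download en_core_web_sm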
Create some text to use...
In [3]:
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."
Then add PyTextRank into the spaCy pipeline...
In [4]:
import pytextrank
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc = nlp(text)
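Note that this is the pytextrank 2.x API. In pytextrank 3.x (which targets spaCy 3), the component is registered as a spaCy pipeline factory, so the equivalent setup looks roughly like the sketch below (based on the newer API, not part of the original notebook):

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")   # importing pytextrank registers the "textrank" factory
doc = nlp(text)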
Examine the results: a list of top-ranked phrases in the document
In [5]:
for p in doc._.phrases:
    print("{:.4f} {:5d} {}".format(p.rank, p.count, p.text))
    print(p.chunks)
Construct a list of the sentence boundaries with a phrase vector (initialized to empty set) for each...
In [6]:
sent_bounds = [ [s.start, s.end, set([])] for s in doc.sents ]
sent_bounds
Out[6]:
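Each entry in sent_bounds is [start_token_index, end_token_index, set_of_phrase_ids]; the set starts out empty and gets filled in below. As a quick sanity check (a small sketch, not in the original notebook), each token span can be mapped back to its sentence text:

for start, end, vector in sent_bounds:
    # doc[start:end] is the spaCy Span covering the sentence
    print(start, end, doc[start:end].text[:60], vector)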
Iterate through the top-ranked phrases, adding them to the phrase vector for each sentence...
In [7]:
limit_phrases = 4

phrase_id = 0
unit_vector = []

for p in doc._.phrases:
    print(phrase_id, p.text, p.rank)
    unit_vector.append(p.rank)

    # record, for each sentence, which of the top-ranked phrases occur inside it
    for chunk in p.chunks:
        print(" ", chunk.start, chunk.end)

        for sent_start, sent_end, sent_vector in sent_bounds:
            if chunk.start >= sent_start and chunk.start <= sent_end:
                print(" ", sent_start, chunk.start, chunk.end, sent_end)
                sent_vector.add(phrase_id)
                break

    phrase_id += 1

    if phrase_id == limit_phrases:
        break
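At this point each sentence vector holds the ids of the top-ranked phrases that occur in that sentence. To make those sets easier to read (a small sketch, not in the original notebook), the ids can be mapped back to phrase text:

for sent_start, sent_end, sent_vector in sent_bounds:
    print([doc._.phrases[pid].text for pid in sent_vector])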
Let's take a look at the results...
In [8]:
sent_bounds
Out[8]:
In [9]:
for sent in doc.sents:
    print(sent)
We also construct a unit_vector for all of the phrases, up to the limit requested...
In [10]:
unit_vector
Out[10]:
In [11]:
sum_ranks = sum(unit_vector)
unit_vector = [ rank/sum_ranks for rank in unit_vector ]
unit_vector
Out[11]:
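After normalization the components of unit_vector sum to 1.0, so each component is the fraction of the total rank mass carried by that phrase. A quick check (not in the original notebook):

sum(unit_vector)   # should be approximately 1.0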
Iterate through each sentence, calculating its Euclidean distance from the unit vector...
In [12]:
from math import sqrt

sent_rank = {}
sent_id = 0

for sent_start, sent_end, sent_vector in sent_bounds:
    print(sent_vector)
    sum_sq = 0.0

    # accumulate the squared rank of every top phrase that is NOT in this sentence
    for phrase_id in range(len(unit_vector)):
        print(phrase_id, unit_vector[phrase_id])

        if phrase_id not in sent_vector:
            sum_sq += unit_vector[phrase_id]**2.0

    sent_rank[sent_id] = sqrt(sum_sq)
    sent_id += 1

print(sent_rank)
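In other words, for a sentence s the score is the Euclidean norm of the unit-vector components belonging to phrases that do not appear in s:

d_s = \sqrt{\sum_{i \notin V_s} u_i^2}

where u_i is the normalized rank of phrase i and V_s is the set of phrase ids collected for sentence s. A sentence containing all of the top phrases would have distance 0, so lower distances indicate more representative sentences.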
Sort the sentence indexes by distance, in ascending order (the sentence closest to the unit vector comes first)...
In [13]:
from operator import itemgetter
sorted(sent_rank.items(), key=itemgetter(1))
Out[13]:
Extract the sentences with the lowest distance, up to the limit requested...
In [14]:
limit_sentences = 2

sent_text = {}
sent_id = 0

# collect the text of each sentence, keyed by its index
for sent in doc.sents:
    sent_text[sent_id] = sent.text
    sent_id += 1

num_sent = 0

# print sentences in order of increasing distance, up to the limit
for sent_id, rank in sorted(sent_rank.items(), key=itemgetter(1)):
    print(sent_id, sent_text[sent_id])
    num_sent += 1

    if num_sent == limit_sentences:
        break
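To turn the selected sentences into a readable summary, one common follow-up (a sketch, not part of the original notebook) is to take the closest sentences and restore their original document order before joining them:

# select the ids of the closest sentences, then re-sort them into document order
top_ids = sorted(
    sorted(sent_rank.items(), key=itemgetter(1))[:limit_sentences],
    key=itemgetter(0),
)
summary = " ".join(sent_text[sent_id] for sent_id, rank in top_ids)
print(summary)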