Stage 1:

Perform statistical parsing/tagging on a document in JSON format

INPUTS: JSON doc for the text input
OUTPUT: JSON format ParsedGraf(id, sha1, graf)


In [6]:
import pytextrank

path_stage0 = "dat/mih.json"
path_stage1 = "o1.json"

with open(path_stage1, 'w') as f:
    for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
        f.write("%s\n" % pytextrank.pretty_print(graf._asdict()))
        # to view output in this notebook
        print(pytextrank.pretty_print(graf))


["777", "7b982e54fa330a6854a0ed5397d49223fdc70645", [[1, "Compatibility", "compatibility", "NN", 1, 0], [0, "of", "of", "IN", 0, 1], [2, "systems", "system", "NNS", 1, 2], [0, "of", "of", "IN", 0, 3], [3, "linear", "linear", "JJ", 1, 4], [4, "constraints", "constraint", "NNS", 1, 5], [0, "over", "over", "IN", 0, 6], [0, "the", "the", "DT", 0, 7], [5, "set", "set", "NN", 1, 8], [0, "of", "of", "IN", 0, 9], [6, "natural", "natural", "JJ", 1, 10], [7, "numbers", "number", "NNS", 1, 11], [0, ".", ".", ".", 0, 12]]]
["777", "dfa572a4a2d2c0fd9254172d95b574b3f6067f63", [[8, "Criteria", "criteria", "NNP", 1, 13], [0, "of", "of", "IN", 0, 14], [1, "compatibility", "compatibility", "NN", 1, 15], [0, "of", "of", "IN", 0, 16], [0, "a", "a", "DT", 0, 17], [2, "system", "system", "NN", 1, 18], [0, "of", "of", "IN", 0, 19], [3, "linear", "linear", "NN", 1, 20], [9, "Diophantine", "diophantine", "NNP", 1, 21], [10, "equations", "equation", "NNS", 1, 22], [0, ",", ",", ".", 0, 23], [11, "strict", "strict", "JJ", 1, 24], [12, "inequations", "inequation", "NNS", 1, 25], [0, ",", ",", ".", 0, 26], [0, "and", "and", "CC", 0, 27], [13, "nonstrict", "nonstrict", "NN", 1, 28], [12, "inequations", "inequation", "NNS", 1, 29], [14, "are", "be", "VBP", 1, 30], [15, "considered", "consider", "VBN", 1, 31], [0, ".", ".", ".", 0, 32]]]
["777", "cb9235d7c8b21321b88462fca3a0480e29aa8ec7", [[16, "Upper", "upper", "JJ", 1, 33], [17, "bounds", "bound", "NNS", 1, 34], [0, "for", "for", "IN", 0, 35], [18, "components", "component", "NNS", 1, 36], [0, "of", "of", "IN", 0, 37], [0, "a", "a", "DT", 0, 38], [19, "minimal", "minimal", "JJ", 1, 39], [5, "set", "set", "NN", 1, 40], [0, "of", "of", "IN", 0, 41], [20, "solutions", "solution", "NNS", 1, 42], [0, "and", "and", "CC", 0, 43], [21, "algorithms", "algorithm", "NNS", 1, 44], [0, "of", "of", "IN", 0, 45], [22, "construction", "construction", "NN", 1, 46], [0, "of", "of", "IN", 0, 47], [19, "minimal", "minimal", "JJ", 1, 48], [23, "generating", "generating", "NN", 1, 49], [5, "sets", "set", "NNS", 1, 50], [0, "of", "of", "IN", 0, 51], [20, "solutions", "solution", "NNS", 1, 52], [0, "for", "for", "IN", 0, 53], [0, "all", "all", "DT", 0, 54], [24, "types", "type", "NNS", 1, 55], [0, "of", "of", "IN", 0, 56], [2, "systems", "system", "NNS", 1, 57], [14, "are", "be", "VBP", 1, 58], [25, "given", "give", "VBN", 1, 59], [0, ".", ".", ".", 0, 60]]]
["777", "64db26d2b1979694d377776d5e53c9254ee6f85e", [[0, "These", "these", "DT", 0, 61], [26, "criteria", "criterion", "NNS", 1, 62], [0, "and", "and", "CC", 0, 63], [0, "the", "the", "DT", 0, 64], [27, "corresponding", "correspond", "VBG", 1, 65], [28, "algorithms", "algorithms", "NN", 1, 66], [0, "for", "for", "IN", 0, 67], [29, "constructing", "construct", "VBG", 1, 68], [0, "a", "a", "DT", 0, 69], [19, "minimal", "minimal", "JJ", 1, 70], [30, "supporting", "support", "VBG", 1, 71], [5, "set", "set", "NN", 1, 72], [0, "of", "of", "IN", 0, 73], [20, "solutions", "solution", "NNS", 1, 74], [0, "can", "can", "MD", 0, 75], [14, "be", "be", "VB", 1, 76], [31, "used", "use", "VBN", 1, 77], [0, "in", "in", "IN", 0, 78], [32, "solving", "solve", "VBG", 1, 79], [0, "all", "all", "PDT", 0, 80], [0, "the", "the", "DT", 0, 81], [15, "considered", "consider", "VBN", 1, 82], [24, "types", "type", "NNS", 1, 83], [2, "systems", "system", "NNS", 1, 84], [0, "and", "and", "CC", 0, 85], [2, "systems", "system", "NNS", 1, 86], [0, "of", "of", "IN", 0, 87], [33, "mixed", "mixed", "JJ", 1, 88], [24, "types", "type", "NNS", 1, 89], [0, ".", ".", ".", 0, 90]]]
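
Each output line above is one ParsedGraf record: a document id, a SHA-1 digest for the sentence, and a list of token entries. The sketch below reads one such line (abbreviated to its first three tokens) and lists the lemmas flagged for keeping. The field names on `Token` are an assumption inferred from the printed output, not taken from the library source.

```python
import json
from collections import namedtuple

# Assumed field layout, inferred from the printed records:
# [word_id, raw text, lemma, POS tag, keep flag, token position]
Token = namedtuple("Token", "word_id raw root pos keep idx")

# First record of the stage 1 output, abbreviated to three tokens
line = (
    '["777", "7b982e54fa330a6854a0ed5397d49223fdc70645", '
    '[[1, "Compatibility", "compatibility", "NN", 1, 0], '
    '[0, "of", "of", "IN", 0, 1], '
    '[2, "systems", "system", "NNS", 1, 2]]]'
)

doc_id, sha1, graf = json.loads(line)
tokens = [Token(*entry) for entry in graf]

# Tokens with keep == 1 carry a non-zero word_id; stop words such as "of" do not
kept = [t.root for t in tokens if t.keep == 1]
print(doc_id, kept)
```

In the output above, repeated lemmas share a word id across sentences (for example, `system` is always id 2), which suggests the ids serve as node labels in the next stage.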

Stage 2:

Collect and normalize the key phrases from a parsed document

INPUTS: <stage1>
OUTPUT: JSON format RankedLexeme(text, rank, ids, pos)


In [7]:
path_stage1 = "o1.json"
path_stage2 = "o2.json"

graph, ranks = pytextrank.text_rank(path_stage1)
pytextrank.render_ranks(graph, ranks)

with open(path_stage2, 'w') as f:
    for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
        f.write("%s\n" % pytextrank.pretty_print(rl._asdict()))
        # to view output in this notebook
        print(pytextrank.pretty_print(rl))


["types systems", 0.12580188866089437, [24, 2], "np", 1]
["mixed types", 0.08891323868549979, [33, 24], "np", 1]
["minimal set", 0.07071383636856185, [19, 5], "np", 1]
["systems", 0.06290094433044718, [2], "np", 1]
["strict inequations", 0.05170005955659954, [11, 12], "np", 1]
["considered", 0.04535170212808048, [15], "vbn", 2]
["types", 0.044456619342749894, [24], "nns", 3]
["natural numbers", 0.035680974352343166, [6, 7], "np", 1]
["set", 0.035356918184280925, [5], "nn", 4]
["minimal generating sets", 0.035356918184280925, [19, 23, 5], "np", 1]
["solutions", 0.03516111710876194, [20], "nns", 3]
["linear diophantine equations", 0.031027760122128312, [3, 9, 10], "np", 1]
["diophantine", 0.027937472512821634, [9], "np", 1]
["linear constraints", 0.027937472512821634, [3, 4], "np", 1]
["solving", 0.027584454272763282, [32], "vbg", 1]
["nonstrict inequations", 0.02585002977829977, [13, 12], "np", 1]
["inequations", 0.02585002977829977, [12], "nns", 2]
["numbers", 0.017840487176171583, [7], "nns", 1]
["given", 0.01741836357524441, [25], "vbn", 1]
["linear", 0.015513880061064156, [3], "nn", 1]
["constraints", 0.013968736256410817, [4], "nns", 1]
["equations", 0.013964687396243649, [10], "nns", 1]
["construction", 0.013071555591451742, [22], "nn", 1]
["generating", 0.01236559416376835, [23], "nn", 1]
["constructing", 0.012003834512632964, [29], "vbg", 1]
["supporting", 0.011911768549549907, [30], "vbg", 1]
["algorithms", 0.011104588624268798, [21], "nns", 1]
["nonstrict", 0.010956410339326364, [13], "nn", 1]
["upper bounds", 0.010352566711446532, [16, 17], "np", 1]
["components", 0.009576138633057875, [18], "nns", 1]
["compatibility", 0.006720084796696289, [1], "nn", 2]
["corresponding", 0.006720084796696289, [27], "vbg", 1]
["algorithms", 0.006488535751111831, [28], "nn", 1]
["bounds", 0.005176283355723266, [17], "nns", 1]
["criteria", 0.0036324819147502438, [26], "nns", 1]
["criteria", 0.0036324819147502438, [8], "nnp", 1]

In [8]:
import networkx as nx
import matplotlib.pyplot as plt

# draw the lemma graph built by text_rank() in the previous cell
nx.draw(graph, with_labels=True)
plt.show()
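
To illustrate the idea behind the ranking in this stage, here is a minimal pure-Python sketch: link lemmas that occur near each other, then score the nodes with a plain power-iteration PageRank. It is not pytextrank's implementation; the helper names, the window size, and the damping factor are assumptions chosen for illustration.

```python
from collections import defaultdict

# Lemmas kept from the first sentence of the stage 1 output
lemmas = ["compatibility", "system", "linear", "constraint",
          "set", "natural", "number"]

def cooccurrence_graph(words, window=2):
    """Hypothetical helper: undirected edges between words at most `window` apart."""
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    return neighbors

def pagerank(neighbors, damping=0.85, iterations=50):
    """Power iteration; assumes every node has at least one neighbor."""
    nodes = list(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor passes on an equal share of its current rank
            incoming = sum(rank[nb] / len(neighbors[nb]) for nb in neighbors[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

toy_graph = cooccurrence_graph(lemmas)
toy_ranks = pagerank(toy_graph)

for word, score in sorted(toy_ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.4f}  {word}")
```

Lemmas in the middle of the sentence have more neighbors than those at either end, so they receive higher scores in this toy graph.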


Stage 3:

Calculate a significance weight for each sentence, using MinHash to approximate the Jaccard distance between the sentence and the key phrases determined by TextRank

INPUTS: <stage1> <stage2>
OUTPUT: JSON format SummarySent(dist, idx, text)


In [9]:
path_stage1 = "o1.json"
path_stage2 = "o2.json"
path_stage3 = "o3.json"

kernel = pytextrank.rank_kernel(path_stage2)

with open(path_stage3, 'w') as f:
    for s in pytextrank.top_sentences(kernel, path_stage1):
        f.write(pytextrank.pretty_print(s._asdict()))
        f.write("\n")
        # to view output in this notebook
        print(pytextrank.pretty_print(s._asdict()))


{"dist": 0.06495815088405221, "idx": 0, "text": "Compatibility of systems of linear constraints over the set of natural numbers ."}
{"dist": 0.059241314822135904, "idx": 2, "text": "Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given ."}
{"dist": 0.05775806902914533, "idx": 3, "text": "These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types ."}
{"dist": 0.05064812344999486, "idx": 1, "text": "Criteria of compatibility of a system of linear Diophantine equations , strict inequations , and nonstrict inequations are considered ."}
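
The MinHash step can be sketched in pure Python as follows. The sketch builds a signature from salted SHA-1 hashes and compares the estimated Jaccard similarity with the exact value; the Jaccard distance is one minus the similarity. This illustrates the technique only and is not how pytextrank computes `dist`; the function names, the number of hash functions, and the two example sets are assumptions.

```python
import hashlib

def minhash_signature(items, num_perm=256):
    """Hypothetical helper: one minimum per salted hash function.

    Assumes `items` is a non-empty set of strings.
    """
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf8")).digest()[:8], "big"
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)

# Illustrative sets: lemmas from some key phrases vs. lemmas of the first sentence
key_lemmas = {"type", "system", "minimal", "set", "natural", "number"}
sent_lemmas = {"compatibility", "system", "linear", "constraint",
               "set", "natural", "number"}

estimate = estimated_jaccard(minhash_signature(key_lemmas),
                             minhash_signature(sent_lemmas))
exact = exact_jaccard(key_lemmas, sent_lemmas)

print("exact similarity:    ", round(exact, 3))
print("estimated similarity:", round(estimate, 3))
print("estimated distance:  ", round(1 - estimate, 3))
```

More hash functions tighten the estimate at the cost of more hashing per set.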

Stage 4:

Summarize a document based on its most significant sentences and key phrases

INPUTS: <stage2> <stage3>
OUTPUT: Markdown format


In [10]:
path_stage2 = "o2.json"
path_stage3 = "o3.json"

phrases = ", ".join(set(pytextrank.limit_keyphrases(path_stage2, phrase_limit=12)))
sent_iter = sorted(pytextrank.limit_sentences(path_stage3, word_limit=150), key=lambda x: x[1])
s = [pytextrank.make_sentence(sent_text) for sent_text, idx in sent_iter]

graf_text = " ".join(s)
print("**excerpts:** %s\n\n**keywords:** %s" % (graf_text, phrases,))


**excerpts:** Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.

**keywords:** natural numbers, systems, set, types, types systems, diophantine, linear constraints, linear diophantine equations, mixed types, solutions, strict inequations, minimal generating sets, minimal set
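
The selection logic of this stage can be sketched without the library: take sentences in ranked order until a word budget is reached, restore document order, and close the gap before punctuation. The helpers `select_sentences` and `tidy` are hypothetical stand-ins, and counting words by splitting on whitespace is an assumption; the library's own rules may differ.

```python
import re

# Stage 3 records, in the ranked order shown earlier
ranked = [
    {"dist": 0.06495815088405221, "idx": 0, "text": "Compatibility of systems of linear constraints over the set of natural numbers ."},
    {"dist": 0.059241314822135904, "idx": 2, "text": "Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given ."},
    {"dist": 0.05775806902914533, "idx": 3, "text": "These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types ."},
    {"dist": 0.05064812344999486, "idx": 1, "text": "Criteria of compatibility of a system of linear Diophantine equations , strict inequations , and nonstrict inequations are considered ."},
]

def select_sentences(records, word_limit):
    """Hypothetical helper: keep ranked sentences while they fit the word budget."""
    picked, used = [], 0
    for rec in records:
        n_words = len(rec["text"].split())
        if used + n_words > word_limit:
            break
        picked.append(rec)
        used += n_words
    # Restore document order for readability
    return sorted(picked, key=lambda rec: rec["idx"])

def tidy(text):
    """Hypothetical helper: remove the space that tokenizing left before punctuation."""
    return re.sub(r"\s+([.,;:!?])", r"\1", text)

summary = " ".join(tidy(rec["text"]) for rec in select_sentences(ranked, word_limit=50))
print(summary)
```

With a budget of 50 words only the two top-ranked sentences fit; with the 150-word limit used in the cell above, all four are kept.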