There are two main approaches to automatic summarization: extraction and abstraction. Abstractive methods generate a summary from scratch, so the summary may contain words that don't appear in the original text. This approach looks very promising, but it is still considered too complex for practical use. As a result, extractive methods are more commonly used: they build a summary by selecting certain words or sentences directly from the text.
Unsupervised approaches are usually used, as they require no training data and can summarize a given text without any additional information, and their quality is often good enough in practice.
In [1]:
from nltk import FreqDist
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()
from nltk.corpus import stopwords
stop = stopwords.words('english')
from bs4 import BeautifulSoup
from urllib.request import urlopen
from gensim.models import Phrases
from gensim.models.phrases import Phraser
import os
from collections import Counter
import string
punctuations = list(string.punctuation)
#Add some more punctuation, as the list doesn't cover all cases.
punctuations.extend(['”', '–', '``', "''"])
stop = stop + punctuations
The basic idea behind unsupervised summarization is the following: split the text into sentences, assign a score to each sentence, select the top-scoring sentences and output them in the order in which they appear in the text.
The main point, obviously, is assigning scores to sentences. There are several ways to do this; in this notebook I'll first score sentences by their similarity to the other sentences, and later by the frequencies of the words they contain.
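To make this pipeline concrete before diving into a specific scoring method, here is a minimal sketch of the score-and-select skeleton. score_sentence is a placeholder of my own (not part of the original article); the real scoring functions are implemented in the cells below.
from nltk.tokenize import sent_tokenize

def score_sentence(sentence, all_sentences):
    #Placeholder: the sections below replace this with similarity-based
    #and frequency-based scoring.
    return len(sentence)

def naive_summary(text, limit=3):
    sentences = sent_tokenize(text)
    #Score every sentence.
    scored = [(sent, score_sentence(sent, sentences)) for sent in sentences]
    #Keep the top-scoring sentences.
    top = set(sent for sent, score in sorted(scored, key=lambda x: x[1], reverse=True)[:limit])
    #Output them in their original order of appearance.
    return [sent for sent in sentences if sent in top]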
In this notebook I'll use the following news article:
In [2]:
url = urlopen('http://news.sky.com/story/snap-election-to-be-held-in-march-after-northern-ireland-government-collapses-10731488')
soup = BeautifulSoup(url.read().decode('utf8'), "lxml")
text = '\n\n'.join(map(lambda p: p.text, soup.find_all('p')))
text = text[text.find('An early election'):]
title = soup.find('h1').text.strip()
print(title, '\n', '_' * 60, '\n', text)
This method goes through the following steps: compute a normalized word overlap (intersection) for every pair of sentences, build a similarity matrix from these values, score each sentence as the sum of its similarities to all other sentences, and select the top-scoring sentences.
At first I'll simply split sentences into words, using spaces as separators.
In [3]:
def intersection(sent1, sent2):
    s1 = sent1.split(' ')
    s2 = sent2.split(' ')
    intersection = [i for i in s1 if i in s2]
    #Normalization: divide by the average number of words in the two sentences.
    return len(intersection) / ((len(s1) + len(s2)) / 2)
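As a quick sanity check, here is the function applied to two short example sentences of my own (not from the article):
#Two made-up example sentences, not taken from the article.
#Four of the six tokens of the first sentence also occur in the second,
#and both sentences contain six tokens, so the score is 4 / 6.
intersection('the cat sat on the mat', 'the cat lay on the rug')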
Now I create a matrix of similarities between each pair of sentences. This is a square matrix whose number of rows and columns equals the number of sentences.
In [4]:
sentences = sent_tokenize(text)
matrix = [[intersection(sentences[i], sentences[j]) for i in range(0,len(sentences))] for j in range(0,len(sentences))]
matrix[:2]
Out[4]:
Now I calculate the score for each sentence, which is the sum of its similarity scores with the other sentences.
In [5]:
scores = {sentences[i]: sum(matrix[i]) for i in range(len(matrix))}
scores
Out[5]:
Now I'll select the five best sentences.
In [6]:
sents = sorted(scores, key=scores.__getitem__, reverse=True)[:5]
sents
Out[6]:
The selected sentences should be put back into the order in which they appear in the text, so I sort them by their position in the text (there may be a more elegant way to do this, but it works).
In [7]:
tuples = [(i, text.find(i)) for i in sents]
#Sort by the position of each sentence in the text.
sorted_tuples = sorted(tuples, key=lambda x: x[1])
#Leave only the sentences.
best_sents = [i[0] for i in sorted_tuples]
best_sents
Out[7]:
Now I'll put everything together with a nice output.
In [8]:
def intersection(sent1, sent2):
    s1 = sent1.split(' ')
    s2 = sent2.split(' ')
    intersection = [i for i in s1 if i in s2]
    return len(intersection) / ((len(s1) + len(s2)) / 2)

def get_summary(text, limit=3):
    sentences = sent_tokenize(text)
    matrix = [[intersection(sentences[i], sentences[j]) for i in range(0, len(sentences))] for j in range(0, len(sentences))]
    scores = {sentences[i]: sum(matrix[i]) for i in range(len(matrix))}
    sents = sorted(scores, key=scores.__getitem__, reverse=True)[:limit]
    #Sort the selected sentences by their position in the text.
    best_sents = [i[0] for i in sorted([(i, text.find(i)) for i in sents], key=lambda x: x[1])]
    return best_sents

def summarize(text, limit=3):
    summary = get_summary(text, limit)
    print(title)
    print()
    print(' '.join(summary))
In [9]:
summarize(text,5)
So, this is our summary. The number of sentences in the summary is arbitrary and can be changed to get the desired length.
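For example, a shorter summary can be produced simply by lowering the limit:
summarize(text, 3)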
How can this algorithm be improved? I think the way sentences are split while calculating intersections should be changed. Splitting on spaces leaves punctuation attached to the words, which leads to mistakes when evaluating the similarity between sentences. So instead I'll tokenize sentences with nltk and remove stopwords and punctuation. Taking lemmas of words could also help, though it didn't help in this particular case - I tried (a sketch of that variant is shown after the next cell).
In [10]:
def intersection(sent1, sent2):
    s1 = [i for i in word_tokenize(sent1) if i not in punctuations and i not in stop]
    s2 = [i for i in word_tokenize(sent2) if i not in punctuations and i not in stop]
    intersection = [i for i in s1 if i in s2]
    return len(intersection) / ((len(s1) + len(s2)) / 2)
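For completeness, here is a minimal sketch of the lemmatized variant mentioned above; intersection_lemmas is a hypothetical helper of my own that uses the WordNetLemmatizer imported in the first cell, and, as noted, it didn't improve the summary for this particular article.
def intersection_lemmas(sent1, sent2):
    #Lemmatize the tokens after removing stopwords and punctuation,
    #so that e.g. 'elections' and 'election' count as the same word.
    s1 = [lemma.lemmatize(i) for i in word_tokenize(sent1) if i not in stop]
    s2 = [lemma.lemmatize(i) for i in word_tokenize(sent2) if i not in stop]
    intersection = [i for i in s1 if i in s2]
    return len(intersection) / ((len(s1) + len(s2)) / 2)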
In [11]:
summarize(text,5)
We see that the summary has changed. As one last change I'll increase the complexity of the model even further. Tokenizing sentences is good, but a better idea is to use n-grams. For this I use gensim's Phrases, which detects collocations in a text and so can be used to find n-grams.
In [12]:
sents = sent_tokenize(text)
#Phrases needs its input as a list of lists of tokens.
sentence_stream = [[i for i in word_tokenize(sent) if i not in stop] for sent in sents]
bigram = Phrases(sentence_stream, min_count=2, threshold=2, delimiter=b' ')
#Create Phraser object.
bigram_phraser = Phraser(bigram)
bigram_tokens = bigram_phraser[sentence_stream]
trigram = Phrases(bigram_tokens, min_count=2, threshold=2, delimiter=b' ')
trigram_phraser = Phraser(trigram)
trigram_tokens = trigram_phraser[bigram_tokens]
all_words = [i for j in trigram_tokens for i in j]
Counter(all_words).most_common(20)
Out[12]:
We can see that there are bigrams and trigrams among the most common tokens. Now I'll use this in the summarizer.
In [13]:
def intersection(sent1, sent2):
    #As sentences are already lists of tokens, there is no need to split them.
    intersection = [i for i in sent1 if i in sent2]
    return len(intersection) / ((len(sent1) + len(sent2)) / 2)

def split_sentences(sents):
    sentence_stream = [[i for i in word_tokenize(sent) if i not in stop] for sent in sents]
    bigram = Phrases(sentence_stream, min_count=2, threshold=2, delimiter=b'_')
    bigram_phraser = Phraser(bigram)
    bigram_tokens = bigram_phraser[sentence_stream]
    trigram = Phrases(bigram_tokens, min_count=2, threshold=2, delimiter=b'_')
    trigram_phraser = Phraser(trigram)
    trigram_tokens = trigram_phraser[bigram_tokens]
    return [i for i in trigram_tokens]

def get_summary(text, limit=3):
    sents = sent_tokenize(text)
    sentences = split_sentences(sents)
    matrix = [[intersection(sentences[i], sentences[j]) for i in range(0, len(sentences))] for j in range(0, len(sentences))]
    scores = {sents[i]: sum(matrix[i]) for i in range(len(matrix))}
    sents = sorted(scores, key=scores.__getitem__, reverse=True)[:limit]
    #Sort the selected sentences by their position in the text.
    best_sents = [i[0] for i in sorted([(i, text.find(i)) for i in sents], key=lambda x: x[1])]
    return best_sents
In [14]:
summarize(text,5)
The summary changed again. Various ways to split sentences may work better on some types of texts and worse on others.
This method goes through the following steps: split the text into sentences and the sentences into n-gram tokens, count how often each token occurs in the text, score each sentence by summing the counts of the frequent tokens it contains (with double weight for tokens that also appear in the title), and select the top-scoring sentences.
In [15]:
def score_sentences(words, sentences):
    #Return scores for sentences.
    scores = Counter()
    #words is a list of (word, count) tuples: the first element is the word, the second is its frequency.
    for word in words:
        for i in range(0, len(sentences)):
            #If the word is also in the title, give the sentence double score.
            if word[0] in sentences[i] and word[0] in title:
                scores[i] += 2 * word[1]
            elif word[0] in sentences[i]:
                scores[i] += word[1]
    #Sort sentence indices by their scores in descending order.
    sentence_scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return sentence_scores

def split_sentences(sents):
    sentence_stream = [[i for i in word_tokenize(sent) if i not in stop] for sent in sents]
    bigram = Phrases(sentence_stream, min_count=2, threshold=2, delimiter=b'_')
    bigram_phraser = Phraser(bigram)
    bigram_tokens = bigram_phraser[sentence_stream]
    trigram = Phrases(bigram_tokens, min_count=2, threshold=2, delimiter=b'_')
    trigram_phraser = Phraser(trigram)
    trigram_tokens = trigram_phraser[bigram_tokens]
    all_words = [i for j in trigram_tokens for i in j]
    #Keep only the tokens that occur more than once.
    frequent_words = [i for i in Counter(all_words).most_common() if i[1] > 1]
    sentences = [i for i in trigram_tokens]
    return frequent_words, sentences

def get_summary(text, limit=3):
    sents = sent_tokenize(text)
    frequent_words, sentences = split_sentences(sents)
    sentence_scores = score_sentences(frequent_words, sentences)
    limited_sents = [sents[num] for num, count in sentence_scores[:limit]]
    #Sort the selected sentences by their position in the text.
    best_sents = [i[0] for i in sorted([(i, text.find(i)) for i in limited_sents], key=lambda x: x[1])]
    return best_sents

def summarize(text, limit=3):
    summary = get_summary(text, limit)
    print(title)
    print()
    print(' '.join(summary))
In [16]:
summarize(text, 5)
As I have shown, there are many ways to summarize articles with extraction methods, and many other ideas could improve these algorithms further. It is also difficult to measure the quality of a summary: often there are many "meaningful" sentences, and choosing a single best combination of them isn't possible. So in practice we try several approaches and choose the implementation that works best for a particular case - and keep developing abstraction methods, since extraction methods are inherently limited.