---
layout: post
author: csiu
date: 2017-03-03
title: "Day07:"
categories: update
tags:
---
DAY 07 - Mar 3, 2017
Yesterday, the Flesch reading ease score got me thinking ...
Flesch reading ease is a measure of how difficult a passage in English is to understand. The formula for the readability ease measure is calculated as follows:
$RE = 206.835 - \left(1.015 \times \frac{total\ words}{total\ sentences}\right) - \left(84.6 \times \frac{total\ syllables}{total\ words}\right)$
where $\frac{total\ words}{total\ sentences}$ refers to the average sentence length (ASL) and $\frac{total\ syllables}{total\ words}$ refers to the average number of syllables per word (ASW).
In [1]:
def readability_ease(num_sentences, num_words, num_syllables):
    asl = num_words / num_sentences
    asw = num_syllables / num_words
    return 206.835 - (1.015 * asl) - (84.6 * asw)
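As a quick sanity check of the formula, we can plug in some made-up counts (one sentence of five words and seven syllables, not from any real passage) and watch how each term contributes:

```python
# Hypothetical counts, just to exercise the formula
asl = 5 / 1    # average sentence length
asw = 7 / 5    # average syllables per word
score = 206.835 - (1.015 * asl) - (84.6 * asw)
print(round(score, 2))  # 83.32
```

The ASW term dominates: 84.6 × 1.4 ≈ 118 of the 206.835 budget, which is why polysyllabic words drag the score down so quickly.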
The readability ease (RE) score nominally ranges from 0 to 100, and higher scores indicate material that is easier to read.
In [2]:
def readability_ease_interpretation(x):
    if 90 <= x:
        res = "5th grade] "
        res += "Very easy to read. Easily understood by an average 11-year-old student."
    elif 80 <= x < 90:
        res = "6th grade] "
        res += "Easy to read. Conversational English for consumers."
    elif 70 <= x < 80:
        res = "7th grade] "
        res += "Fairly easy to read."
    elif 60 <= x < 70:
        res = "8th & 9th grade] "
        res += "Plain English. Easily understood by 13- to 15-year-old students."
    elif 50 <= x < 60:
        res = "10th to 12th grade] "
        res += "Fairly difficult to read."
    elif 30 <= x < 50:
        res = "College] "
        res += "Difficult to read."
    else:  # x < 30; the formula can also dip below 0 for very dense text
        res = "College Graduate] "
        res += "Very difficult to read. Best understood by university graduates."
    print("[{:.1f}|{}".format(x, res))
In [3]:
text = "Hello world, how are you? I am great. Thank you for asking!"
In this test case, we have 12 words, 14 syllables, and 3 sentences.
In [4]:
import nltk
import re

text = text.lower()
words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '', text))
num_words = len(words)

print(words)
print(num_words)
Counting syllables is a bit more tricky. According to Using Python and the NLTK to Find Haikus in the Public Twitter Stream by Brandon Wood (2013), the Carnegie Mellon University (CMU) Pronouncing Dictionary corpus contains the syllable counts for over 125,000 (English) words and can thus be used to count syllables.
In [5]:
from nltk.corpus import cmudict
from curses.ascii import isdigit

d = cmudict.dict()

def count_syllables(word):
    # In cmudict, vowel phonemes end in a stress digit (0/1/2),
    # so counting them gives the syllable count.
    # Take the first listed pronunciation of the word.
    return [len(list(y for y in x if isdigit(y[-1]))) for x in d[word.lower()]][0]
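One caveat (my note, not from the original post): `d[word.lower()]` raises a `KeyError` for words outside the CMU dictionary, so a fallback is handy for arbitrary text. A rough heuristic is to count runs of consecutive vowels; the helper name below is made up for illustration.

```python
import re

def count_syllables_fallback(word):
    # Rough estimate for out-of-vocabulary words:
    # each run of consecutive vowels (treating "y" as a vowel)
    # is counted as one syllable; every word gets at least one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

print(count_syllables_fallback("entrances"))  # 3
print(count_syllables_fallback("rhythm"))     # 1
```

It over- or under-counts on words like "quiet" or silent-e endings, but it keeps the pipeline from crashing on unknown tokens.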
In [6]:
print("Number of syllables per word", "="*28, sep="\n")
for word in words:
    num_syllables = count_syllables(word)
    print("{}: {}".format(word, num_syllables))
In [7]:
sentences = nltk.tokenize.sent_tokenize(text)
num_sentences = len(sentences)

print("Number of sentences: {}".format(num_sentences), "="*25, sep="\n")
for sentence in sentences:
    print(sentence)
In [8]:
def flesch_reading_ease(text):
    ## Preprocessing
    text = text.lower()
    sentences = nltk.tokenize.sent_tokenize(text)
    words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '', text))

    ## Count
    num_sentences = len(sentences)
    num_words = len(words)
    num_syllables = sum(count_syllables(word) for word in words)

    ## Calculate
    fre = readability_ease(num_sentences, num_words, num_syllables)
    return fre
In [9]:
fre = flesch_reading_ease(text)
readability_ease_interpretation(fre)
In this example, the sentence was constructed at a 5th grade level. The score above 100 is not a bug: the formula is not actually bounded to 0-100, and very short sentences with few syllables per word can push RE past 100.
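Working through the arithmetic with the counts from the test case (3 sentences, 12 words, 14 syllables) shows where the score lands:

```python
asl = 12 / 3     # average sentence length = 4.0
asw = 14 / 12    # average syllables per word ~ 1.17
re_score = 206.835 - (1.015 * asl) - (84.6 * asw)
print(re_score)  # just over 104, past the nominal 0-100 range
```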
In [10]:
# (As You Like it Act 2, Scene 7)
text = """
All the world's a stage,
and all the men and women merely players.
They have their exits and their entrances;
And one man in his time plays many parts
"""
fre = flesch_reading_ease(text)
readability_ease_interpretation(fre)
... Thus, Shakespeare should be doable in high school.