layout: post author: csiu date: 2017-03-03 title: "Day07:" categories: update tags:

100daysofcode
text-mining excerpt:

DAY 07 - Mar 3, 2017

Yesterday, the Flesch reading ease score got me thinking ...

Flesch reading ease

Flesch reading ease is a measure of how difficult a passage in English is to understand. The formula for the readability ease measure is calculated as follows:

$RE = 206.835 – (1.015 x \frac{total\ words}{total\ sentences}) – (84.6 x \frac{total\ syllables}{total\ words})$

where $\frac{total\ words}{total\ sentences}$ refers to the average sentence length (ASL) and $\frac{total\ syllables}{total\ words}$ refers to the average number of syllables per word (ASW).



In [1]:

    
def readability_ease(num_sentences, num_words, num_syllables):
    asl = num_words / num_sentences
    asw = num_syllables / num_words
    
    return(206.835 - (1.015 * asl) - (84.6 * asw))

The readability ease (RE) score ranges from 0 to 100 and a higher scores indicate material that is easier to read.



In [2]:

    
def readability_ease_interpretation(x):
    if 90 <= x:
        res = "5th grade] "
        res += "Very easy to read. Easily understood by an average 11-year-old student."
        
    elif 80 <= x < 90:
        res = "6th grade] "
        res += "Easy to read. Conversational English for consumers."
        
    elif 70 <= x < 80:
        res = "7th grade] "
        res += "Fairly easy to read."
        
    elif 60 <= x < 70:
        res = "8th & 9th grade] "
        res += "Plain English. Easily understood by 13- to 15-year-old students."
        
    elif 50 <= x < 60:
        res = "10th to 12th grade] "
        res += "Fairly difficult to read."
        
    elif 30 <= x < 50:
        res = "College] "
        res += "Difficult to read."
        
    elif 0 <= x < 30:
        res = "College Graduate] "
        res += "Very difficult to read. Best understood by university graduates."
        
    print("[{:.1f}|{}".format(x, res))

Test case



In [3]:

    
text = "Hello world, how are you? I am great. Thank you for asking!"

In this test case, we have 12 words, 14 syllables, and 3 sentences.

Counting words

Counting words is easy.



In [4]:

    
import nltk
import re

text = text.lower()

words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '',text))
num_words = len(words)

print(words)
print(num_words)









    



['hello', 'world', 'how', 'are', 'you', 'i', 'am', 'great', 'thank', 'you', 'for', 'asking']
12

Counting syllables

Counting syllables is a bit more tricky. According to Using Python and the NLTK to Find Haikus in the Public Twitter Stream by Brandon Wood (2013), the Carnegie Mellon University (CMU) Pronouncing Dictionary corpora contain the syllable count for over 125,000 (English) words and thus could be used to count syllables.



In [5]:

    
from nltk.corpus import cmudict
from curses.ascii import isdigit

d = cmudict.dict()

def count_syllables(word):
    return([len(list(y for y in x if isdigit(y[-1]))) for x in d[word.lower()]][0])



In [6]:

    
print("Number of syllables per word", "="*28, sep="\n")
for word in words:
    num_syllables = count_syllables(word)
    print("{}: {}".format(word, num_syllables))









    



Number of syllables per word
============================
hello: 2
world: 1
how: 1
are: 1
you: 1
i: 1
am: 1
great: 1
thank: 1
you: 1
for: 1
asking: 2

Counting sentences

This was already done in Day03.



In [7]:

    
sentences = nltk.tokenize.sent_tokenize(text)
num_sentences = len(sentences)

print("Number of sentences: {}".format(num_sentences), "="*25, sep="\n")
for sentence in sentences:
    print(sentence)









    



Number of sentences: 3
=========================
hello world, how are you?
i am great.
thank you for asking!

Putting it all together



In [8]:

    
def flesch_reading_ease(text):
    ## Preprocessing
    text = text.lower()
    
    sentences = nltk.tokenize.sent_tokenize(text)
    words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '',text))

    ## Count
    num_sentences = len(sentences)
    num_words = len(words)
    num_syllables = sum([count_syllables(word) for word in words])

    ## Calculate
    fre = readability_ease(num_sentences, num_words, num_syllables)
    return(fre)



In [9]:

    
fre = flesch_reading_ease(text)

readability_ease_interpretation(fre)









    



[104.1|5th grade] Very easy to read. Easily understood by an average 11-year-old student.

In the example, the sentence was constructed at a 5th grade level. It's strange the score is above 100.

What about Shakespeare?



In [10]:

    
# (As You Like it Act 2, Scene 7)
text = """
All the world's a stage, 
and all the men and women merely players. 
They have their exits and their entrances; 
And one man in his time plays many parts
"""

fre = flesch_reading_ease(text)
readability_ease_interpretation(fre)









    



[87.1|6th grade] Easy to read. Conversational English for consumers.

... Thus, Shakespeare is be doable in high school.