Author: Daniel Lowd lowd@cs.uoregon.edu, May 2017.
What is a 'covfefe'? And how would anyone (or anything) come up with such a word?
In this notebook, we'll estimate the probability of generating unusual words such as 'covfefe' with a simple Markov chain. In this Markov chain, the first letter is chosen at random with probability $P(X_1)$, and each subsequent letter is chosen randomly based on the previous letter, with probability $P(X_i | X_{i-1})$ where $X_i$ is the $i$th letter and $X_{i-1}$ is the previous letter. Thus, the total probability of a $k$-letter word is: $P(x_1, \ldots, x_k) = P(x_1) \prod_{i=2}^k P(x_i | x_{i-1})$.
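For example, under this model the probability of the three-letter word 'the' factors as $P(\text{t}, \text{h}, \text{e}) = P(X_1{=}\text{t}) \, P(X_2{=}\text{h} \mid X_1{=}\text{t}) \, P(X_3{=}\text{e} \mid X_2{=}\text{h})$.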
First, we read in a list of bigram counts (source: https://gist.github.com/lydell/c439049abac2c9226e53) and use them to generate unigram counts.
In [33]:
import json

# Read in bigram (letter-pair) counts
with open('bigrams.json') as bigram_file:
    bigrams = dict(json.load(bigram_file))

# Generate unigram (single-letter) counts from bigrams
unigrams = dict()
for k in bigrams:
    first_letter = k[0]
    if first_letter not in unigrams:
        unigrams[first_letter] = 0
    unigrams[first_letter] += bigrams[k]
unigrams
Out[33]:
We use the unigram counts to estimate the probability of each letter coming first. (This could be improved by using the actual distribution for the first letter of a word, since some letters may be more or less common at the beginning of a word; a sketch of that refinement follows the output below.)
In [34]:
# Estimate P(X_1)
total_counts = sum(unigrams.values())
p_first_letter = {c: unigrams[c] / total_counts for c in unigrams}
p_first_letter
Out[34]:
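For illustration (this cell is not part of the original analysis), that refinement could estimate $P(X_1)$ from the first letters of the words in english.json, the same word list used later in the notebook:
In [ ]:
import json
from collections import Counter
# Sketch: estimate P(X_1) from the first letter of each word rather than
# from overall letter frequencies (skips letters missing from the bigram data)
with open('english.json') as f:
    words = json.load(f)
first_counts = Counter(w[0] for w in words if w and w[0] in unigrams)
total_first = sum(first_counts.values())
p_first_letter_words = {c: first_counts[c] / total_first for c in first_counts}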
We can similarly obtain the probability of the next letter by dividing the counts for each bigram by the counts for its first letter. For example, the probability that 't' is followed by 'h' can be estimated as the number of times 'th' appears divided by the number of times 't' appears. These are maximum likelihood parameter estimates.
In [35]:
# Estimate P(X_i | X_{i-1})
p_next_letter = {k: bigrams[k] / unigrams[k[0]] for k in bigrams}
print(p_next_letter['aa'])
print(p_next_letter['ab'])
print(p_next_letter['ac'])
print(p_next_letter['ad'])
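As an aside (not part of the original notebook), the same estimates can be run generatively, which is one way 'anything' might come up with a word like 'covfefe'. Here is a minimal sketch that samples a $k$-letter word from the Markov chain:
In [ ]:
import random

def sample_word(k):
    # Draw the first letter from P(X_1), then each later letter from P(X_i | X_{i-1})
    letters = list(p_first_letter)
    word = random.choices(letters, weights=[p_first_letter[c] for c in letters])[0]
    while len(word) < k:
        followers = [b for b in p_next_letter if b[0] == word[-1]]
        word += random.choices([b[1] for b in followers],
                               weights=[p_next_letter[b] for b in followers])[0]
    return word

[sample_word(7) for _ in range(5)]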
Now we're ready to compute the probability of a particular character sequence. Since the raw probability is likely to be very small (and could underflow), we work with log probabilities instead.
In [36]:
# Compute log P(x_1, x_2, ..., x_k) given P(X_1), P(X_i | X_{i-1}), and a particular word w
from math import log

def word_log_prob(w):
    # Make a list of all bigrams appearing in the word
    w_bigrams = [w[i:(i+2)] for i in range(len(w)-1)]
    # Add up the log probability of the first letter, then of each letter given the previous
    return log(p_first_letter[w[0]]) + sum(log(p_next_letter[b]) for b in w_bigrams)
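As a quick sanity check (not in the original notebook), the function should agree with the chain-rule factorization applied by hand:
In [ ]:
# log P('hi') = log P(h) + log P(i | h)
manual = log(p_first_letter['h']) + log(p_next_letter['hi'])
assert abs(manual - word_log_prob('hi')) < 1e-12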
Let's test it out on a few words and see how likely they are:
In [37]:
print "log P('hi') = ",word_log_prob('hi')
print "log P('hello') = ",word_log_prob('hello')
print "log P('goodbye') = ",word_log_prob('goodbye')
print "log P('thththt') = ",word_log_prob('thththt')
print "log P('ooooooo') = ",word_log_prob('ooooooo')
print "log P('covfefe') = ",word_log_prob('covfefe')
Note that, since these are log probabilities, they are all negative; a more negative number means a smaller (but still positive) probability. Interestingly, 'thththt' is more probable than some actual words, like 'goodbye'. Also, longer words will generally be less probable than shorter ones, since each additional character multiplies in another probability less than one.
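To make the scale concrete (a quick illustration, not in the original notebook), a log probability can be converted back to a raw probability with exp:
In [ ]:
from math import exp
# A log probability of -20 corresponds to exp(-20), about 2e-9
exp(word_log_prob('covfefe'))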
Let's go ahead and compute the probability of every word now. First, I filter the words to remove those that contain unusual Unicode characters or character sequences with probability zero. (This is a hack and could be improved upon; one standard fix, add-one smoothing, is sketched below.)
Word list taken from here: https://github.com/jacksonrayhamilton/wordlist-english/blob/master/english-words.json
In [38]:
def word_ok(w):
    if len(w) == 0:
        return False
    for i in range(len(w)-1):
        b = w[i:(i+2)]
        if b not in p_next_letter or p_next_letter[b] <= 0.0:
            return False
    return True

with open('english.json') as english_words:
    all_words = json.load(english_words)
filtered_words = list(filter(word_ok, all_words))
filtered_words[0:10]
Out[38]:
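As promised above, here is a minimal sketch of add-one (Laplace) smoothing, a standard alternative to filtering out zero-probability words; it is not part of the original analysis. The idea is to pretend every possible bigram was seen one extra time, so every conditional probability is nonzero:
In [ ]:
# Add-one (Laplace) smoothed estimate of P(X_i | X_{i-1})
alphabet = sorted(unigrams)
p_next_letter_smoothed = {
    a + b: (bigrams.get(a + b, 0) + 1) / (unigrams[a] + len(alphabet))
    for a in alphabet for b in alphabet
}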
We can now order words by their log probability. Here are the 20 least likely words, according to this simple probabilistic model:
In [39]:
for w in sorted(filtered_words, key=word_log_prob)[0:20]:
    print(word_log_prob(w), w)
We can normalize by length to find the least likely words *per character*:
In [40]:
for w in sorted(filtered_words, key=lambda k: word_log_prob(k)/len(k))[0:20]:
    print(word_log_prob(w)/len(w), w)
For comparison, here are the 20 most likely words:
In [41]:
for w in sorted(filtered_words, key=word_log_prob, reverse=True)[0:20]:
    print(word_log_prob(w), w)
And the most likely words *per character*:
In [42]:
for w in sorted(filtered_words, key=lambda k: word_log_prob(k)/len(k), reverse=True)[0:20]:
    print(word_log_prob(w)/len(w), w)
How unlikely is 'covfefe' compared to other 7-letter words?
In [43]:
seven_letter_words = list(filter(lambda w: len(w) == 7, filtered_words)) + ['covfefe']
words_by_prob = sorted(seven_letter_words, key=word_log_prob)
rank = words_by_prob.index('covfefe')   # 0-based position, counting from least likely
print(rank, "out of", len(seven_letter_words))
Here are the words near 'covfefe' in terms of probability:
In [44]:
words_by_prob[(rank-10):(rank+11)]
Out[44]:
So there we have it! The word 'covfefe' is an unusual sequence of characters, but a few other words have even more unusual character sequences, at least according to a simple Markov chain.