In [ ]:from __future__ import print_function, division from collections import Counter import numpy as np
In [ ]:def is_anagram(word1, word2): """Checks whether the words are anagrams. word1: string word2: string returns: boolean """ return Counter(word1) == Counter(word2)
In [ ]:is_anagram('tachymetric', 'mccarthyite')
In [ ]:is_anagram('banana', 'peach')
Counter class inherits from
dict so all methods and functions that work with a dictionary will also work with a
Read the documentation of Counter, then use a
Counter to find the three most common letters in the word "pneumonoultramicroscopicsilicovolcanoconiosis".
In [ ]:# Solution Counter('pneumonoultramicroscopicsilicovolcanoconiosis').most_common(3)
Counter is a natural representation of a multiset, which is a set where the elements can appear more than once.
You could use multisets for a game like Scrabble to see if a given set of tiles can be used to spell a given word.
Exercise: Write a definition for a class called
Multiset that inherits from
Counter and defines an additional method called
is_subset, which should take
other as parameters, where
other is another
It should check whether
self is a subset of
other; for multisets, that means that every element of
self appears in
other with at least the same frequency. For example,
aa is a subset of
aabb is not.
In [ ]:# Solution class Multiset(Counter): """A multiset is a set where elements can appear more than once.""" def is_subset(self, other): """Checks whether self is a subset of other. other: Multiset returns: boolean """ for char, count in self.items(): if other[char] < count: return False return True
The following function uses
Multiset.is_subset to check whether a particular word can be spelled using a particular set of tiles.
In [ ]:def can_spell(word, tiles): """Checks whether a set of tiles can spell a word. word: string tiles: string returns: boolean """ return Multiset(word).is_subset(Multiset(tiles))
In [ ]:can_spell('SYZYGY', 'AGSYYYZ')
In [ ]:can_spell('omelette', 'breaking a few eggs')
Optional Exercise: If you change the name of
__le__, you can use the
<= operator to test whether one
Multiset is a subset of another.
You can also extend
Counter to represent a probability mass function (PMF). A PMF is a map from possible outcomes to their probabilities. The probabilities in a PMF are "normalized" if they add up to 1 (and they are all non-negative).
PMF class inherits from
Counter and adds the following methods:
normalize computes the total of the frequencies and divides through, yielding probabilities that add to 1.
__add__ enumerates all pairs of outcomes and returns a new Pmf that represents the distribution of the sum.
render returns the outcomes and probabilities in a form ready for plotting.
In [ ]:class Pmf(Counter): """A Counter with probabilities.""" def normalize(self): """Normalizes the PMF so the probabilities add to 1.""" total = sum(self.values()) for key in self: self[key] /= total def __add__(self, other): """Adds two distributions. The result is the distribution of sums of outcomes from the two distributions. Note that this method is only correct if the selections from the two distributions are independent; that is, if the outcome of the first selection does not affect the probabilities of the outcomes for the second selection. other: Pmf returns: new Pmf """ pmf = Pmf() for key1, prob1 in self.items(): for key2, prob2 in other.items(): pmf[key1 + key2] += prob1 * prob2 return pmf def render(self): """Returns outcomes and their probabilities, suitable for plotting.""" return zip(*sorted(self.items()))
As an example, we can make a Pmf object that represents a 6-sided die.
In [ ]:d6 = Pmf([1,2,3,4,5,6]) d6.normalize() d6.name = 'one die' print(d6)
Using the add operator, we can compute the distribution for the sum of two dice.
In [ ]:d6_twice = d6 + d6 d6_twice.name = 'two dice' for key, prob in d6_twice.items(): print(key, prob)
np.sum, we can compute the distribution for the sum of three dice.
In [ ]:# if we use the built-in sum we have to provide a Pmf additive identity pmf_ident = Pmf() d6_thrice = sum([d6]*3, pmf_ident)
In [ ]:# with np.sum, we don't need an identity d6_thrice = np.sum([d6]*3) d6_thrice.name = 'three dice'
And then plot the results (using Pmf.render)
In [ ]:import matplotlib.pyplot as plt %matplotlib inline
In [ ]:for die in [d6, d6_twice, d6_thrice]: xs, ys = die.render() plt.plot(xs, ys, label=die.name, linewidth=3, alpha=0.5) plt.xlabel('Total') plt.ylabel('Probability') plt.legend() plt.show()
In [ ]:# Solution d4 = Pmf([1,2,3,4]) d4.normalize() hp = d6 + d4 hp
In [ ]:# Solution hp + hp
Suite is a
Pmf that represents a set of hypotheses and their probabilities; it provides
bayesian_update, which updates the probability of the hypotheses based on new data.
Suite is an abstract parent class; child classes should provide a
likelihood method that evaluates the likelihood of the data under a given hypothesis.
update_bayesian loops through the hypotheses, evaluates the likelihood of the data under each hypothesis, and updates the probabilities accordingly. Then it re-normalizes the PMF.
In [ ]:class Suite(Pmf): """Map from hypothesis to probability.""" def bayesian_update(self, data): """Performs a Bayesian update. Note: called bayesian_update to avoid overriding dict.update data: result of a die roll """ for hypo in self: like = self.likelihood(data, hypo) self[hypo] *= like self.normalize() def print_probs(self): for hypo in sorted(self): print(hypo, self[hypo])
As an example, I'll use
Suite to solve the "Dice Problem," from Chapter 3 of Think Bayes.
"Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about. Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?"
I'll start by defining
DiceSuite, which inherits
bayesian_update from Suite and provides
data is the observed die roll, 6 in the example.
hypo is the hypothetical number of sides on the die.
data > hypo, that means the outcome exceeds the number of sides on the die; that's impossible, so it has probability 0.
Otherwise, the probability of any outcome is
hypo is the number of sides on the die.
In [ ]:class DiceSuite(Suite): def likelihood(self, data, hypo): """Computes the likelihood of the data under the hypothesis. data: result of a die roll hypo: integer number of sides on the die """ if data > hypo: return 0 else: return 1/hypo
Now we can make a
DiceSuite object that represents the possible number of sides on the die. By default, all dice have the same prior probability.
Then I update the distribution with the given outcome and print the results:
In [ ]:dice_suite = DiceSuite([4, 6, 8, 12, 20]) dice_suite.normalize() dice_suite.print_probs()
In [ ]:dice_suite.bayesian_update(6) dice_suite.print_probs()
As expected, the 4-sided die has been eliminated; it now has 0 probability. The 6-sided die is the most likely, but the 8-sided die is still quite possible.
Now suppose I roll the die again and get an 8. We can update the Suite again with the new data.
In [ ]:dice_suite.bayesian_update(8) dice_suite.print_probs()
Now the 6-sided die has been eliminated, the 8-sided die is most likely, and there is less than a 10% chance that I am rolling a 20-sided die.
Exercise: Draw a UML class diagram that shows the relationships among all classes in this notebook, plus
Exercise: Suppose you know that up to 100 sequentially-numbered raffle tickets have been sold, and you think it is equally likely that the number sold is anywhere from 1 to 100. You find a randomly discarded ticket that is number 37. What is the probability that it is the winning ticket?
Hint: Define a class named
TicketSuite that inherits from
Suite and provides an appropriate
likelihood function. It should also define a method that computes the average probability of winning, weighted by the probability of each outcome.
Note: This exercise is pretty challenging if you are not familiar with Bayesian statistics. But if it intrigues you, consider taking Computational Bayesian Statistics.
In [ ]:# Solution class TicketSuite(Suite): def likelihood(self, data, hypo): """Computes the likelihood of the data under the hypothesis. data: result of a die roll hypo: integer number of sides on the die """ if data > hypo: return 0 else: return 1/hypo def weighted_average(self): """Compute the probability of winning. If the number of tickets is hypo, the probability of winning is 1/hypo. """ total = 0 for hypo, prob in self.items(): total += prob / hypo return total
In [ ]:# Solution ts = TicketSuite(range(1, 101)) ts.bayesian_update(37) xs, ys = ts.render() plt.plot(xs, ys, linewidth=3, alpha=0.5) plt.xlabel('Number of tickets') plt.ylabel('Probability') plt.show()
In [ ]:# Solution ts.weighted_average()