This notebook contains my solution to this puzzle, written by John Bohannon for the Boston Python User Group.
This solution is copyright 2014 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
The first step is to define OPERATORS
and DIGITS
. I am putting the sequence of digits into a NumPy array so I can take advantage of numpy.insert()
, which can do multiple inserts with one function call.
In [210]:
from itertools import *
import numpy
OPERATORS = '+-'
DIGITS = numpy.array(list('123456789'))
NUM_DIGITS = len(DIGITS)
DIGITS
Out[210]:
Assuming that we know how many operators we're going to insert in an expression, we want to enumerate all possible combinations of operators. itertools.product
does what we want.
In [211]:
def operator_iter(n):
"""Iterates all sequences of n operators."""
return product(OPERATORS, repeat=n)
list(operator_iter(2))
Out[211]:
Given the operators, we have to figure out where to put them. itertools.combinations
enumerates the possible sequences of indices.
In [212]:
def index_iter(n):
"""Iterates all the places we could put n operators."""
return combinations(range(1, NUM_DIGITS), n)
list(index_iter(2))
Out[212]:
Now we can use numpy.insert
to combine the digits and operators.
In [213]:
def make_expr(index, operator):
"""Inserts the operators at the given locations and make a string."""
return ''.join(numpy.insert(DIGITS, index, operator))
make_expr((2, 4), ('+', '-'))
Out[213]:
For a given number of operators, n
, we map make_expr
onto the product of the indices and the operators, yielding all possible expressions with n
operators.
In [214]:
def expression_iter(n):
"""Iterates expressions with n operators."""
return starmap(make_expr, product(index_iter(n), operator_iter(n)))
list(expression_iter(1))
Out[214]:
Finally, map expression_iter
onto the range 0..NUMDIGITS-1
and chain together the results.
In [215]:
def all_expression_iter():
"""Iterates all expressions."""
return chain(*imap(expression_iter, range(NUM_DIGITS)))
list(islice(all_expression_iter(), 10))
Out[215]:
forward
maps from each string expression to its numerical value.
In [216]:
forward = dict(imap(lambda s: (s, eval(s)), all_expression_iter()))
min(forward.itervalues())
Out[216]:
counter
maps from each numerical value to the number of times it appears.
In [217]:
from collections import Counter
counter = Counter(forward.itervalues())
counter[100]
Out[217]:
most_frequent
is a list of value, frequency pairs in descending order of frequency
In [218]:
from operator import itemgetter
most_frequent = sorted(counter.iteritems(), key=itemgetter(1), reverse=True)
for value, freq in most_frequent[:10]:
print value, freq
Now we want to invert forward
.
In [219]:
def invert_dict(forward):
"""Inverts a dictionary using a set comprehension."""
reverse = {}
set(reverse.setdefault(v, []).append(k) for k, v in forward.iteritems())
return reverse
reverse
maps from each numerical value to a list of expressions.
In [220]:
reverse = invert_dict(forward)
reverse[100]
Out[220]:
A cumulative distribution function (CDF) is represented by a sorted list of values and a list of the corresponding cumulative probabilities.
In [221]:
def make_cdf(counter):
"""Makes a cumulative distribution function."""
vals, freqs = zip(*sorted(counter.iteritems()))
probs = numpy.cumsum(freqs, dtype=numpy.float)
probs /= probs[-1]
return vals, probs
abs_counter
is a map from the magnitude of the values (abs) to frequence. cdf
is the distribution of the absolute values.
In [222]:
abs_counter = Counter(imap(abs, forward.itervalues()))
cdf = make_cdf(abs_counter)
zip(*cdf)[:11]
Out[222]:
You can evaluate a CDF forward or backward in log time using bisect
.
In [223]:
from bisect import bisect_left
def eval_cdf(cdf, x):
"""Evaluates a cumulative distribution function."""
vals, probs = cdf
return probs[bisect_left(vals, x)]
print 'x CDF(x)'
for x in [10, 100, 1000, 10000, 100000]:
print x, eval_cdf(cdf, x) * 100
Let's try plotting the CDF.
In [224]:
import matplotlib.pyplot as pyplot
def plot_cdf(cdf, xscale='linear'):
pyplot.plot(*cdf, linewidth=4, alpha=0.5)
pyplot.xlabel('abs(value)')
pyplot.ylabel('CDF')
pyplot.xscale(xscale)
pyplot.show()
plot_cdf(cdf)
On a linear scale, we can't see much. Let's try a log-x scale:
In [225]:
plot_cdf(cdf, xscale='log')
That looks like a long-tailed distribution. To see if it has a power-law tail, we can plot the complementary CDF (CCDF) on a log-log scale.
In [226]:
def plot_ccdf(cdf):
ccdf = cdf[0], 1-cdf[1]
pyplot.plot(*ccdf, linewidth=4, alpha=0.5)
pyplot.xlabel('abs(value)')
pyplot.ylabel('CCDF')
pyplot.xscale('log')
pyplot.yscale('log')
pyplot.show()
plot_ccdf(cdf)
A straight line in this plot indicates a power-law tail. In this example is spans about 6 orders of magnitude.
So you might wonder whether this pattern persists. Let's try it with hexadecimal numbers.
In [40]:
DIGITS = numpy.array(list('123456789abcdef'))
NUM_DIGITS = len(DIGITS)
def hex_eval(expr):
"""Evaluates an expression with hexadecimal numbers."""
expr = '0x' + expr.replace('+', '+0x').replace('-', '-0x')
return eval(expr)
forward2 = dict(imap(lambda s: (s, hex_eval(s)), all_expression_iter()))
In [166]:
def num_and_log(n):
"""Returns a number and its number of digits (actually log base 10)."""
return n, numpy.log10(abs(n))
num_and_log(len(forward2))
Out[166]:
In [167]:
num_and_log(min(forward2.itervalues()))
Out[167]:
In [168]:
num_and_log(max(forward2.itervalues()))
Out[168]:
In [169]:
abs_counter2 = Counter(imap(abs, forward2.itervalues()))
cdf2 = make_cdf(abs_counter2)
plot_ccdf(cdf2)
Yup, that looks like a power tail over about 14 orders of magnitude! The ripple pattern suggests that the tail is a mixture of distributions with geometrically increasing central tendency and decreasing probability. And it looks like the width of each ripple is a factor of 16 (there are 12 ripples; if each is a hexadecade, they should span 14.4 decades).
In [170]:
12 * numpy.log10(16)
Out[170]:
That gives us a hint about the cause of the ripples. The value of each expression will tend to be dominated by the largest term. Here's a function that takes an expression and finds the number of digits in the dominant term.
In [171]:
import re
def dom_term_len(expr, val):
return abs(val), max(imap(len, re.split('\+|-', expr)))
dom_term_len('1234+56789', 1234+56789)
Out[171]:
dom_term_map
maps from the number of digits in the largest term to a list of values.
In [172]:
dom_term_map = invert_dict(dict(starmap(dom_term_len, forward.iteritems())))
In [174]:
dom_term_map[6][:10]
Out[174]:
Now we can compute the distribution of values within each decade (or hexadecade).
In [175]:
plot_cdf(make_cdf(Counter(dom_term_map[2])))
In [176]:
plot_cdf(make_cdf(Counter(dom_term_map[3])))
In [177]:
plot_cdf(make_cdf(Counter(dom_term_map[4])))
In [178]:
plot_cdf(make_cdf(Counter(dom_term_map[5])))
In [179]:
plot_cdf(make_cdf(Counter(dom_term_map[6])))
In [180]:
plot_cdf(make_cdf(Counter(dom_term_map[7])))
In [181]:
plot_cdf(make_cdf(Counter(dom_term_map[8])))
Within each decade (or hexadecade) the distribution of values is roughly uniform.
It looks like the distribution of values is a mixture of uniform distributions where each element of the mixture spans a decade. So why does the resulting mixture have a power-law tail? Because the number of values in each element of the mixture decreases geometrically. To see that, let's plot the number of values as a function of the size of the largest term.
In [190]:
pairs = sorted(starmap(lambda x, t: (x, len(t)), dom_term_map.iteritems()))
xs, ys = zip(*pairs)
pyplot.plot(xs, ys, linewidth=4, alpha=0.5)
pyplot.xlabel('Digits in largest term')
pyplot.ylabel('Number of values')
pyplot.show()
From 4 on up, that looks like a geometric decay. To confirm, let's look at the log ratio of successive elements.
In [192]:
diffs = numpy.diff(numpy.log10(ys))
pyplot.plot(xs[1:], diffs, linewidth=4, alpha=0.5)
pyplot.show()
That looks like it's converging at about -0.6, so we expect the mixture distribution, plotted on a log scale, to drop by 0.6 per decade. And it does.